ADSP

From Bradwiki
Jump to: navigation, search

The Alzheimer's Disease Sequencing Project (ADSP) has made available WES data in vcf format. The following is a step-by-step analysis walkthrough of this dataset. Here the ultimate goal is to develop a platform for Alzheimer's Disease diagnosis based on genome sequencing information (i.e. leaning heavily on machine learning methods). The data for this analysis can be downloaded here:

ADSP_WES_VCF_LATEST_RELEASE.mat.

Both the dataset and the following analysis code are in MATLAB format. Note however there is an R package for reading .mat files. All MATLAB code and custom functions used in the demo walkthrough below can be downloaded here:

GENOS GITHUB CODE REPO
Other Analyses




ADSP Data Compilation


ADSP has combined sequencing data from 25 different consortium studies. Because of this, major stratification issues lurk in this dataset. Further, many of these cohorts provide very unequal numbers of Alzheimer's patients (CASE) and controls (CTRL). So to get started, let's first look at the distribution of data among the different cohorts, and then actually load the mat-file dataset into a fresh MATLAB workspace.

The table below summarizes what- and how much data each cohorts/consortium/study group has contributed to this data compilation.

COHID CONSORT STUDYNAME STUDY FAM PEOPLE ENR CASE CTRL TOTAL PCTCASE PCTCTRL CACOEQ CULL NOTES
1 DGC Adult_Chng ACT 0 1277 0 323 945 1268 25 75 323 -622 KEEP
2 ADGC AD_Centers ADC 0 3263 0 2438 817 3255 75 25 817 1621 KEEP
3 CHARGE Athrosclro ARIC 0 57 0 39 18 57 68 32 18 21 REMOVE
4 CHARGE Aus_Stroke ASPS 0 126 0 121 5 126 96 4 5 116 REMOVE
5 ADGC ChiT_Aging CHAP 0 231 0 27 204 231 12 88 27 -177 REMOVE
6 CHARGE CardioHlth CHS 0 840 0 250 583 833 30 70 250 -333 KEEP
7 ADGC Hispanic_S CUHS 361 0 331 160 171 331 48 52 160 -11 HISP
8 CHARGE Erasmus_Er ERF 30 45 0 45 0 45 100 0 0 45 REMOVE
9 CHARGE Framingham FHS 0 581 0 157 424 581 27 73 157 -267 KEEP
10 ADGC Gene_Diffs GDF 0 207 0 111 96 207 54 46 96 15 KEEP
11 ADGC NIA_LOAD LOAD 80 113 363 367 109 476 77 23 109 258 KEEP
12 ADGC Aging_Proj MAP 0 415 0 138 277 415 33 67 138 -139 KEEP
13 ADGC Mayo_Clini MAYO 0 349 0 250 99 349 72 28 99 151 KEEP
14 ADGC Miami_Univ MIA 61 181 19 186 14 200 93 7 14 172 REMOVE
15 ADGC AD_Genetic MIR 0 284 47 316 15 331 95 5 15 301 REMOVE
16 ADGC Mayo_cl_PD MPD 0 20 0 0 20 20 0 100 0 -20 REMOVE
17 ADGC NationC_AD NCRD 18 108 52 160 0 160 100 0 0 160 REMOVE
18 ADGC Wash_Unive RAS 0 0 46 46 0 46 100 0 0 46 REMOVE
19 ADGC Relig_Ordr ROS 0 351 0 154 197 351 44 56 154 -43 KEEP
20 CHARGE RotterdamS RS 0 1089 0 276 813 1089 25 75 276 -537 KEEP
21 ADGC Texas_AD_S TARC 0 144 0 132 12 144 92 8 12 120 REMOVE
22 ADGC Un_Toronto TOR 0 0 9 9 0 9 100 0 0 9 REMOVE
23 ADGC Vanderbilt VAN 6 235 1 210 26 236 89 11 26 184 REMOVE
24 ADGC WashNY_Age WCAP 0 147 3 34 116 150 23 77 34 -82 REMOVE
25 ADGC Univ_Penns UPN 40 0 3 0 0 0 0 0 0 0 REMOVE

In particular, inspect the CASE and CTRL columns. Everything else is provided in white-paper by ADSP, but the CASE and CTRL columns contain values that I've tallied using the actual data. That is, I went through every single variant loci in the vcf files and made a list of all the participant ID numbers found anywhere in the dataset. I then found which study each of those ID numbers belonged to, and tallied the number of cases and controls for each of the 25 study-cohorts.


Load Dataset Into Workspace


This first code segment will load the ADSPdata.mat file into the workspace. There is just one prerequisite here: this walkthrough assumes the script you are working in is called GENOS.m, and is saved somewhere in your MATLAB path, preferably in the same directory as the ADSPdata.mat file.

 1 %% LOAD ADSP DATASET
 2 
 3 clc; close all; clear; rng('shuffle')
 4 cd(fileparts(which('GENOS.m')));
 5 
 6 load('ADSPdata.mat');
 7 
 8 clearvars -except ADSP
 9 
10 
11 %% CARBON COPY VARIABLES FROM ADSP.STRUCT
12 
13 LOCI = ADSP.LOCI;
14 CASE = ADSP.CASE;
15 CTRL = ADSP.CTRL;
16 PHEN = ADSP.PHEN;
17 
18 clearvars -except ADSP LOCI CASE CTRL PHEN


The variables that we copied from the ADSP data struct include:

  • LOCI
  • CASE
  • CTRL
  • PHEN

LOCI is a summary table for each variant loci in the ADSP dataset. Depending on whether this particular dataset has been pre-processed to remove some variant loci, there could be over 1 million rows in this table. Columns of this table include chromosome, position, reference allele, alternate allele, and annotation information like gene, coding effect, etc.

CASE & CTRL are cell arrays. Each should have exactly the same number of cells as there are rows in LOCI. This is because they contain the SRR ID numbers of each person who has a variant at each given locus. Along with SRR, each cell also indicates whether the person is heterozygous alternate (1) or homozygous alternate (2).

PHEN is a table with one row per person, for all known people in the ADSP dataset. Columns of this table include phenotype information; and perhaps most importantly indicate whether a given person (SRR) is a case (AD==1) or a control (AD==0).


Sort LOCI Table

Let's sort the LOCI table by a Fisher's Exact Test P-value that was computed for all participants in the entire ADSP dataset. Importantly, any time we manipulate LOCI we should also make sure to perform the same operation on the CASE and CTRL, since later operations will expect that those cell arrays hold the SRR info for each row of the LOCI variant table.

19 %% SORT VARIANT LOCI TABLE BY FISHER P-VALUE
20 
21 [X,i] = sort(LOCI.FISHP);
22 
23 LOCI  = LOCI(i,:);
24 CASE  = CASE(i);
25 CTRL  = CTRL(i);
26 LOCI.VID = (1:size(LOCI,1))';
27 
28 LOCI.GENE = string(LOCI.GENE);
29 LOCI = LOCI(:,[1:7 12:17]);
30 
31 
32 clc; clearvars -except ADSP LOCI CASE CTRL PHEN
33 disp(LOCI(1:9,:))

ADSP Top Variants.png


Neural Net Training and Testing Cohorts


In this section we will begin to build the list of people to be used to train and test a neural network classifier. These lists will simply be the PHEN table split up into a list of CASE and CTRL individuals for training the classifier, and another list of CASE and CTRL individuals reserved for testing the performance of the classifier after supervised machine learning has been conducted.

 1 %% IDENTIFY KEEPER PHEN AND MAKE CASE:CTRL EQUAL SIZE
 2 % 1. Identify cohorts in PHEN with a reasonable number of CASE and CTRL
 3 % 2. Take a random selection of people from those cohorts so that
 4 %    there are the same number of CASE and CTRL.
 5 % 3. The rest of the people from the keeper cohorts can be used to
 6 %    increase the size of the NN test group.
 7 
 8 
 9 
10 % TAG GOOD PHEN
11 GOODCOH = [1 2 6 9 10 11 12 13 19 20]';
12 
13 [ai,bi] = ismember(PHEN.COHORTNUM,GOODCOH);
14 
15 PHEN.GOODCOH = ai;
16 
17 
18 % PUT GOOD PHEN INTO SEPARATE SET
19 COHSET = PHEN(PHEN.GOODCOH==1,:);     % 8824 TOTAL PEOPLE
20 
21 
22 
23 
24 % COUNT NUMBER OF CASE AND CTRL IN GOOD PHEN
25 cohNums = unique(COHSET.COHORTNUM);
26 
27 nCA=[];nCO=[];
28 for nn = 1:numel(cohNums)
29     nCA(nn) = sum(  (COHSET.COHORTNUM==cohNums(nn)) & (COHSET.AD==1)  );
30     nCO(nn) = sum(  (COHSET.COHORTNUM==cohNums(nn)) & (COHSET.AD==0)  );
31 end
32 
33 nCACO = [nCA' nCO'];
34 minCACO = min(nCACO,[],2);
35 
36 
37 
38 % CREATE COHORT TABLE FOR EACH COHORT
39 COHCASE={};COHCTRL={};
40 for nn = 1:numel(cohNums)
41     COHCASE{nn} = COHSET( (COHSET.COHORTNUM==cohNums(nn)) & (COHSET.AD==1) ,:);
42     COHCTRL{nn} = COHSET( (COHSET.COHORTNUM==cohNums(nn)) & (COHSET.AD==0) ,:);
43 end
44 
45 
46 
47 % GET RANDOM ROW NUMBER FOR EACH COHORT, EQUAL CASE AND CTRL COUNT PER COHORT
48 % Get random permutation of M values between 1:N-5
49 %    where  M = min CASE\CTRL group size per cohort
50 %           N = total cohort size
51 rca={};rco={};
52 for nn = 1:numel(cohNums)
53 rca{nn} = randperm( nCACO(nn,1)  , minCACO(nn)-5);
54 rco{nn} = randperm( nCACO(nn,2)  , minCACO(nn)-5);
55 end
56 
57 
58 
59 % GET ROW INDEX NUMBER FOR PEOPLE NOT CHOSEN ABOVE
60 % Get index locations for people not part of the training subgroup
61 ica={};ico={};
62 for nn = 1:numel(cohNums)
63 [ia,ib] = ismember(  (1:nCACO(nn,1)) , rca{nn});
64 [ic,id] = ismember(  (1:nCACO(nn,2)) , rco{nn});
65 ica{nn} = find(~ia);
66 ico{nn} = find(~ic);
67 end
68 
69 
70 
71 % Make the final set of people in the training and testing groups
72 COHTRANCA={};COHTRANCO={};
73 COHTESTCA={};COHTESTCO={};
74 for nn = 1:numel(cohNums)
75     COHTRANCA{nn} = COHCASE{nn}(rca{nn},:);
76     COHTRANCO{nn} = COHCTRL{nn}(rco{nn},:);
77 
78     COHTESTCA{nn} = COHCASE{nn}(ica{nn},:);
79     COHTESTCO{nn} = COHCTRL{nn}(ico{nn},:);
80 end
81 
82 
83 
84 % Turn the cell array back into a single table
85 COHTRANCASE = vertcat(COHTRANCA{:});
86 COHTRANCTRL = vertcat(COHTRANCO{:});
87 
88 COHTESTCASE = vertcat(COHTESTCA{:});
89 COHTESTCTRL = vertcat(COHTESTCO{:});
90 
91 
92 
93 clearvars -except ADSP LOCI CASE CTRL PHEN...
94 COHTRANCASE COHTRANCTRL COHTESTCASE COHTESTCTRL
95 disp(COHTRANCASE(1:9,:))

900PX

Per-Person Variant Counts


The total number of variants has already been counted for each person in the dataset and listed in the 'PHEN' table under 'TOTvars'. If you ever need to recount, a custom function has been coded to do this for you...

   [VNUM] = countvperper(SRR, AD, COHORTNUM, CASE, CTRL);

Here we will use the per-person counts already performed and dismiss the individuals on either extreme.

 1 %%  USE PER-PERSON VARIANT COUNTS TO IDENTIFY OUTLIERS
 2 
 3 Qlo = quantile(COHTRANCASE.TOTvars,.01);
 4 Qhi = quantile(COHTRANCASE.TOTvars,.99);
 5 CTRCASE  = COHTRANCASE(((COHTRANCASE.TOTvars > Qlo) & (COHTRANCASE.TOTvars < Qhi)),:);
 6 
 7 
 8 Qlo = quantile(COHTRANCTRL.TOTvars,.01);
 9 Qhi = quantile(COHTRANCTRL.TOTvars,.99);
10 CTRCTRL  = COHTRANCTRL(((COHTRANCTRL.TOTvars > Qlo) & (COHTRANCTRL.TOTvars < Qhi)),:);
11 
12 
13 Qlo = quantile(COHTESTCASE.TOTvars,.01);
14 Qhi = quantile(COHTESTCASE.TOTvars,.99);
15 CTECASE  = COHTESTCASE(((COHTESTCASE.TOTvars > Qlo) & (COHTESTCASE.TOTvars < Qhi)),:);
16 
17 
18 Qlo = quantile(COHTESTCTRL.TOTvars,.01);
19 Qhi = quantile(COHTESTCTRL.TOTvars,.99);
20 CTECTRL  = COHTESTCTRL(((COHTESTCTRL.TOTvars > Qlo) & (COHTESTCTRL.TOTvars < Qhi)),:);
21 
22 
23 
24 clc; close all;
25 subplot(2,2,1), histogram(CTRCASE.TOTvars); title('TRAIN CASE')
26 subplot(2,2,2), histogram(CTRCTRL.TOTvars); title('TRAIN CTRL')
27 subplot(2,2,3), histogram(CTECASE.TOTvars); title('TEST  CASE')
28 subplot(2,2,4), histogram(CTECTRL.TOTvars); title('TEST  CTRL')
29 disp('------------------------------------')
30 disp('TOTAL VARIANTS PER-PERSON')
31 disp('                MIN    MAX')
32 fprintf('TRAINING CASE: %.0f  %.0f \n',min(CTRCASE.TOTvars),  max(CTRCASE.TOTvars))
33 fprintf('TRAINING CTRL: %.0f  %.0f \n',min(CTRCTRL.TOTvars),  max(CTRCTRL.TOTvars))
34 fprintf('TESTING  CASE: %.0f  %.0f \n',min(CTECASE.TOTvars),  max(CTECASE.TOTvars))
35 fprintf('TESTING  CTRL: %.0f  %.0f \n',min(CTECTRL.TOTvars),  max(CTECTRL.TOTvars))
36 disp('------------------------------------')
37 
38 clearvars -except ADSP GENB LOCI CASE CTRL PHEN...
39 COHTRANCASE COHTRANCTRL COHTESTCASE COHTESTCTRL...
40 CTRCASE CTRCTRL CTECASE CTECTRL

Variant Count Histograms.png


Count Variants Per Loci


Now that we have created a set of individuals for training and a set for testing the neural net classifier, we will want to tally the number of variants at each loci separately for the training group and the testing group. Using these tallies we will be able to compute a Fisher's Exact P-value and Odds Ratio, to help us identify which subset of variants to use during classifier training.

 1 %%          COUNT NUMBER OF VARIANTS PER LOCI
 2 % The varsum() function will go through each known variant loci
 3 % and check whether anyone's SRR ID from your subset of IDs match
 4 % all known SRR IDs for that loci. It will then sum the total
 5 % number of alleles (+1 for hetzy-alt, +2 for homzy-alt) for each
 6 % loci and return the totals.
 7 
 8 
 9 [TRCASEN, TRCTRLN] = varsum(CASE, CTRCASE.SRR, CTRL, CTRCTRL.SRR);
10 [TECASEN, TECTRLN] = varsum(CASE, CTECASE.SRR, CTRL, CTECTRL.SRR);
11 
12 
13 % SAVE COUNTS AS NEW TABLE COLUMNS
14 LOCI.TRCASEREF = numel(CTRCASE.SRR)*2-TRCASEN;
15 LOCI.TRCTRLREF = numel(CTRCTRL.SRR)*2-TRCTRLN;
16 LOCI.TRCASEALT = TRCASEN;
17 LOCI.TRCTRLALT = TRCTRLN;
18 
19 
20 LOCI.TECASEREF = numel(CTECASE.SRR)*2-TECASEN;
21 LOCI.TECTRLREF = numel(CTECTRL.SRR)*2-TECTRLN;
22 LOCI.TECASEALT = TECASEN;
23 LOCI.TECTRLALT = TECTRLN;
24 
25 
26 close all
27 fh01 = figure('Units','normalized','OuterPosition',[.01 .25 .48 .62],'Color','w');
28 ax01 = axes('Position',[.06 .06 .9 .9],'Color','none');
29 ph01 = histogram(TRCASEN(TRCASEN>5), 40,'DisplayStyle','stairs',...
30     'LineWidth',3,'Normalization','count'); hold on;
31 ph02 = histogram(TRCTRLN(TRCTRLN>5), 40,'DisplayStyle','stairs',...
32     'LineWidth',3,'Normalization','count'); hold on;
33 ph03 = histogram(TECASEN(TECASEN>5), 40,'DisplayStyle','stairs',...
34     'LineWidth',3,'Normalization','count'); hold on;
35 ph04 = histogram(TECTRLN(TECTRLN>5), 40,'DisplayStyle','stairs',...
36     'LineWidth',3,'Normalization','count'); hold on;
37 title('DISTRIBUTION OF ALT ALLELES')
38 
39 fh21 = figure('Units','normalized','OuterPosition',[.5 .25 .48 .62],'Color','w');
40 ax21 = axes('Position',[.06 .06 .9 .9],'Color','none');
41 ph21 = histogram(LOCI.TRCASEREF(LOCI.TRCASEREF<max(LOCI.TRCASEREF)-10),...
42     40,'DisplayStyle','stairs','LineWidth',3,'Normalization','count'); hold on;
43 ph22 = histogram(LOCI.TRCTRLREF(LOCI.TRCTRLREF<max(LOCI.TRCTRLREF)-10),...
44     40,'DisplayStyle','stairs','LineWidth',3,'Normalization','count'); hold on;
45 ph23 = histogram(LOCI.TECASEREF(LOCI.TECASEREF<max(LOCI.TECASEREF)-10),...
46     40,'DisplayStyle','stairs','LineWidth',3,'Normalization','count'); hold on;
47 ph24 = histogram(LOCI.TECTRLREF(LOCI.TECTRLREF<max(LOCI.TECTRLREF)-10),...
48     40,'DisplayStyle','stairs','LineWidth',3,'Normalization','count'); hold on;
49 title('DISTRIBUTION OF REF ALLELES')
50 pause(2)
51 
52 clearvars -except ADSP GENB LOCI CASE CTRL PHEN...
53 COHTRANCASE COHTRANCTRL COHTESTCASE COHTESTCTRL...
54 CTRCASE CTRCTRL CTECASE CTECTRL TRCASEN TRCTRLN TECASEN TECTRLN


Distribution of alt alleles.png Distribution of ref alleles.png


Compute Fisher's P-value


Next we will use the variant counts tallied in the section above to compute a Fisher's Exact Test P-value. This will be done separately for the training cohort and testing cohort.

 1 % COMPUTE FISHERS STATISTICS FOR THE TRAINING GROUP
 2 [FISHP, FISHOR] = fishp_mex(LOCI.TRCASEREF,LOCI.TRCASEALT,...
 3                             LOCI.TRCTRLREF,LOCI.TRCTRLALT);
 4 
 5 LOCI.TRFISHP  = FISHP;
 6 LOCI.TRFISHOR = FISHOR;
 7 
 8 
 9 % COMPUTE FISHERS STATISTICS FOR THE TRAINING GROUP
10 [FISHP, FISHOR] = fishp_mex(LOCI.TECASEREF, LOCI.TECASEALT,...
11                             LOCI.TECTRLREF, LOCI.TECTRLALT);
12 
13 LOCI.TEFISHP  = FISHP;
14 LOCI.TEFISHOR = FISHOR;
15 
16 
17 close all
18 x = -log(LOCI.TRFISHP);
19 y = -log(LOCI.TEFISHP);
20 histogram(x(x>6.9)); hold on
21 histogram(y(y>6.9));
22 
23 clearvars -except ADSP LOCI CASE CTRL PHEN...
24 COHTRANCASE COHTRANCTRL COHTESTCASE COHTESTCTRL...
25 CTRCASE CTRCTRL CTECASE CTECTRL TRCASEN TRCTRLN TECASEN TECTRLN


This histogram displays the distribution of -log(P) values for all p<.001


Target Variants for Machine Learning


This next code sequence is all aimed at whittling-down the number of variants to only those that carry meaningful information about whether someone is in the CASE or CTRL group. In the end, we will want a list of approximately 1000 variants.

  1 %% STORE VARIABLES FOR NEURAL NETWORK TRAINING AS 'AMX'
  2 
  3 AMX         = LOCI;
  4 AMXCASE     = CASE;
  5 AMXCTRL     = CTRL;
  6 AMXTRCASE   = CTRCASE;
  7 AMXTRCTRL   = CTRCTRL;
  8 AMXTECASE   = CTECASE;
  9 AMXTECTRL   = CTECTRL;
 10 
 11 
 12 clearvars -except ADSP GENB LOCI CASE CTRL PHEN...
 13 CTRCASE CTRCTRL CTECASE CTECTRL TRCASEN TRCTRLN TECASEN TECTRLN...
 14 AMX AMXCASE AMXCTRL AMXTRCASE AMXTRCTRL AMXTECASE AMXTECTRL
 15 
 16 
 17 
 18 
 19 
 20 
 21 
 22 
 23 %% FILTER VARIANTS BASED ALT > REF
 24 
 25 PASS = (AMX.TRCASEREF > AMX.TRCASEALT./1.5) | (AMX.TRCTRLREF > AMX.TRCTRLALT./1.5);
 26 sum(~PASS)
 27 
 28 AMX      = AMX(PASS,:);
 29 AMXCASE  = AMXCASE(PASS);
 30 AMXCTRL  = AMXCTRL(PASS);
 31 AMX.VID  = (1:size(AMX,1))';
 32 
 33 
 34 
 35 
 36 
 37 
 38 
 39 %% SORT AMX VARIANT LOCI TABLE BY FISHER'S P-VALUE FOR TRAINING DATA
 40 
 41 
 42 [X,i] = sort(AMX.TRFISHP);
 43 
 44 AMX      = AMX(i,:);
 45 AMXCASE  = AMXCASE(i);
 46 AMXCTRL  = AMXCTRL(i);
 47 AMX.VID  = (1:size(AMX,1))';
 48 
 49 disp(AMX(1:9,:))
 50 
 51 
 52 clearvars -except ADSP GENB LOCI CASE CTRL PHEN...
 53 CTRCASE CTRCTRL CTECASE CTECTRL TRCASEN TRCTRLN TECASEN TECTRLN...
 54 AMX AMXCASE AMXCTRL AMXTRCASE AMXTRCTRL AMXTECASE AMXTECTRL
 55 
 56 
 57 
 58 
 59 
 60 
 61 
 62 
 63 
 64 
 65 %% FILTER VARIANTS BASED ON FISHP
 66 PASS = (AMX.TRFISHP < .05);
 67 sum(PASS)
 68 
 69 AMX      = AMX(PASS,:);
 70 AMXCASE  = AMXCASE(PASS);
 71 AMXCTRL  = AMXCTRL(PASS);
 72 AMX.VID  = (1:size(AMX,1))';
 73 
 74 
 75 
 76 
 77 
 78 
 79 
 80 
 81 
 82 
 83 %% FILTER VARIANTS BASED ON LOW ALT
 84 %------------------------------------------------
 85 PASS = ((AMX.TRCASEALT > 15) | (AMX.TRCTRLALT > 15)) &...
 86        ((AMX.TRCASEALT >  0) & (AMX.TRCTRLALT >  0));
 87 sum(PASS)
 88 
 89 AMX      = AMX(PASS,:);
 90 AMXCASE  = AMXCASE(PASS);
 91 AMXCTRL  = AMXCTRL(PASS);
 92 AMX.VID  = (1:size(AMX,1))';
 93 
 94 
 95 
 96 
 97 
 98 %% FILTER VARIANTS BASED ON TRANSLATIONAL EFFECT
 99 %------------------------------------------------
100 
101 %   [SYN MIS STG STL SPA SPD INT NCE UTR CUK]
102 L = [ 1   1   1   1   1   1   1   1   1   1];
103 
104 PASS  = keepmutant(AMX,L);
105 fprintf('\n %.0f loci remain post-filtering by effect (syn,mis,etc) \n',sum(PASS))
106 
107 AMX      = AMX(PASS,:);
108 AMXCASE  = AMXCASE(PASS);
109 AMXCTRL  = AMXCTRL(PASS);
110 AMX.VID  = (1:size(AMX,1))';
111 
112 
113 
114 
115 
116 
117 %% TAKE THE TOP N NUMBER OF VARIANTS
118 %------------------------------------------------
119 N = 800;
120 
121 if size(AMX,1)>N
122 AMX      = AMX(1:N,:);
123 AMXCASE  = AMXCASE(1:N);
124 AMXCTRL  = AMXCTRL(1:N);
125 AMX.VID  = (1:size(AMX,1))';
126 else
127 N=size(AMX,1);
128 end
129 fprintf('\n %.0f final loci count \n\n',size(AMX,1))
130 
131 
132 
133 clearvars -except ADSP GENB LOCI CASE CTRL PHEN...
134 CTRCASE CTRCTRL CTECASE CTECTRL TRCASEN TRCTRLN TECASEN TECTRLN...
135 AMX AMXCASE AMXCTRL AMXTRCASE AMXTRCTRL AMXTECASE AMXTECTRL


Visualize Relationship Between Training and Testing P-values


To have any hope that the neural net classifier will generalize to the testing dataset, there should be some correlation between the the P-value of variants in the training dataset and the testing dataset. This next code segment plots this relationship.

 1 %% VISUALIZE CORRELATION BETWEEN TRAINING AND TESTING FISHER'S P-VALUES
 2 
 3 clc; close all;
 4 fh1 = figure('Units','normalized','OuterPosition',[.1 .05 .7 .92],'Color','w','MenuBar','none');
 5 ax1 = axes('Position',[.06 .06 .9 .9],'Color','none');
 6 
 7 
 8 ph1 = loglog(-log(AMX.TRFISHP(1:end))+1 , -log(AMX.TEFISHP(1:end))+1,'-mo',...
 9     'LineStyle','none',...
10     'MarkerEdgeColor','k',...
11     'MarkerFaceColor',[.49 1 .63],...
12     'MarkerSize',10);
13 grid on; axis tight
14 
15 
16 clearvars -except ADSP GENB LOCI CASE CTRL PHEN...
17 CTRCASE CTRCTRL CTECASE CTECTRL TRCASEN TRCTRLN TECASEN TECTRLN...
18 AMX AMXCASE AMXCTRL AMXTRCASE AMXTRCTRL AMXTECASE AMXTECTRL
19 
20 
21 disp(AMX(1:50,:))
22 [RHO,PVAL] = corr(-log(AMX.TRFISHP),-log(AMX.TEFISHP))

Training Testing P corr.png





Format Matrices for Neural Net Training and Testing


This is the penultimate step before training the neural net classifier. Here we will take our list of target variants along with our training and testing cohorts, and construct a PERSON-by-VARIANT matrix for each.

 1 %% MAKE  RECTANGLE  NN VARIANT MATRIX
 2 
 3 AMX.FISHP      = AMX.TRFISHP;
 4 AMX.FISHOR     = AMX.TRFISHOR;
 5 AMX.CASEREF    = AMX.TRCASEREF;
 6 AMX.CASEALT    = AMX.TRCASEALT;
 7 AMX.CTRLREF    = AMX.TRCTRLREF;
 8 AMX.CTRLALT    = AMX.TRCTRLALT;
 9 
10 AMXTRCACO = [AMXTRCASE; AMXTRCTRL];
11 AMXTECACO = [AMXTECASE; AMXTECTRL];
12 
13 
14 
15 [TRNN, TRCA, TRCO] = varmx(AMX,AMXCASE,AMXCTRL,AMXTRCACO);
16 
17 [TENN, TECA, TECO] = varmx(AMX,AMXCASE,AMXCTRL,AMXTECACO);
18 
19 
20 
21 clearvars -except ADSP GENB LOCI CASE CTRL PHEN...
22 CTRCASE CTRCTRL CTECASE CTECTRL TRCASEN TRCTRLN TECASEN TECTRLN...
23 AMX AMXCASE AMXCTRL AMXTRCASE AMXTRCTRL AMXTECASE AMXTECTRL...
24 AMXTRCACO AMXTECACO TRNN TRCO TRCA TENN TECO TECA
25 
26 
27 
28 
29 
30 %% MAKE  TRAINING AND TESTING SETS
31 
32 % TRAIN DATA MATRIX needs to have:
33 %   variables (variants) in rows
34 %   examples  (people)   in columns
35 %
36 % TRAIN LABEL MATRIX will have as many rows as there are classes. 
37 % Here there are just two classes, CASE and CTRL. If a person is a CASE,
38 % put a '1' in the first row, and '0' in all other rows; if a person is
39 % a CTRL put a '1' in the second row, and '0' everywhere else. This label
40 % matrix will have as many columns as there are people; and will have
41 % the same number of columns as the training data matrix.
42 %
43 % For example if there are 1000 variants, and 300 cases and 200 ctrls,
44 % the training data matrix should be 1000x500 and the training label
45 % matrix should be 2x500.
46 % 
47 % TEST DATA MATRIX should have the same number of rows (variants)
48 % as the TRAIN DATA MATRIX, and those rows should be in the same
49 % exact order in terms of variable (variant) they represent. The
50 % columns do not need to be the same length, since those represent
51 % people, and are treated as independent examples.
52 
53 
54 [TRAINX,TRAINL,TESTX,TESTL] = traintestmx(TRNN, TENN);
55 
56 TRAINMX  = TRAINX';
57 TRAINLAB = TRAINL';
58 
59 TESTMX  = TESTX';
60 TESTLAB = TESTL';
61 
62 
63 clearvars -except ADSP GENB LOCI CASE CTRL PHEN...
64 CTRCASE CTRCTRL CTECASE CTECTRL TRCASEN TRCTRLN TECASEN TECTRLN...
65 AMX AMXCASE AMXCTRL AMXTRCASE AMXTRCTRL AMXTECASE AMXTECTRL...
66 TRAINMX TRAINLAB TESTMX TESTLAB TRNN TENN


Machine Learning


Finally we will train a 3-layer neural network pattern classifier. The neural net classifier requires a 2D rectangular matrix for training. This PERSON x LOCI matrix, constructed above, will indicate for each person at each locus (chr:pos) whether they are homozygous reference ('0'), heterozygous ('1'), or homozygous alternate ('2').


1 %% MACHINE LEARNING USING ARTIFICIAL NEURAL NETWORKS
2 
3 NN = patternnet([55 25]);
4 
5 net = train(NN,TRAINMX,TRAINLAB);
6 
7 [NNPerf] = netstats(net,TRAINMX,TRAINLAB,TESTMX,TESTLAB,.05,.85,.15);

NN Tool.png NN ROC.png


Results Figures


The plots below are based on multiple runs of the script above. Each run reduced the number of people in the training group as indicated in the x-axis figure legend. Raw data was imported into Graphpad Prism to make these nice looking figures:

This figure illustrates the drop in classifier performance seen when reducing the number of people in the training group, prior to establishing a set of variants to use during classifier training. Hi Conf Correct Using N Persons.png


Starting with 2300 people and a fixed set of variants, here is the result of serially reducing the number of people in the training group. PreP Serial Reduction.png









Additional Genomics Analyses




Other Analyses
















Notes


Category:ADSP