Difference between revisions of "ADSP t-SNE"

From Bradwiki
Jump to navigation Jump to search
m
 
(2 intermediate revisions by the same user not shown)
Line 1: Line 1:
The Alzheimer's Disease Sequencing Project ([https://www.niagads.org/adsp/content/home ADSP]) has made available WES data in vcf format. The following is a step-by-step analysis walkthrough of this dataset. Here the ultimate goal is to develop a platform for Alzheimer's Disease diagnosis based on genome sequencing information. The data for this analysis can be downloaded here:
+
t-Distributed Stochastic Neighbor Embedding (tSNE) is a technique like PCA that allows one perform dimensionality reduction for visualization purposes. Supposedly tSNE does better than PCA at revealing clusters in high-dimensional data. Whereas PCA only allows you to visualize two or three components directly against each at the same time -- tSNE uses math magic to coerce a high-dimensional dataset into either a 2D or 3D array.
 
 
: [https://drive.google.com/open?id=1e3tIbQhcDUF1vofAwf4oYVOl8XriC44v ADSP_WES_VCF_LATEST_RELEASE.mat].
 
 
 
Both the dataset and the following analysis code are in MATLAB format. Note however there is an [https://cran.r-project.org/web/packages/R.matlab/index.html R package] for reading .mat files. All MATLAB code and custom functions used in the demo walkthrough below can be downloaded here:
 
 
 
: [https://github.com/subroutines/genos GENOS GITHUB CODE REPO]
 
  
 
{{SmallBox|float=right|clear=none|margin=0px 0px 8px 18px|width=170px|font-size=13px|Other Analyses|txt-size=11px|
 
{{SmallBox|float=right|clear=none|margin=0px 0px 8px 18px|width=170px|font-size=13px|Other Analyses|txt-size=11px|
1. [[ADSP|Intro]]<br>
+
1. [[AD|Intro]]<br>
4. [[ADSP Neural Nets|More Neural Nets]]<br>
+
4. [[AD Neural Nets|More Neural Nets]]<br>
2. [[ADSP PCA|PCA]]<br>
+
2. [[AD PCA|PCA]]<br>
3. [[ADSP t-SNE|t-SNE]]<br>
+
3. [[AD t-SNE|t-SNE]]<br>
5. [[ADSP Stats|Descriptive Statistics]]<br>
+
5. [[AD Stats|Descriptive Statistics]]<br>
 
}}
 
}}
  
<br> <br> <br> <br> <br> <br> <br>
+
t-SNE models each multi-dim object against a point on a euclidean surface in such a way that similar features are modeled by nearby point functions and dissimilar features are modeled by distant point functions. It then projects these points onto the plane allowing you visualize, what would effectively be, all the interesting principal component combinations - the ones that yield unique clusters - simultaneously.
 
 
  
 +
<br> <br> <br>
  
==tSNE : t-Distributed Stochastic Neighbor Embedding==
+
==t-SNE Code==
 
----
 
----
 
<br><br>
 
<br><br>
Line 31: Line 25:
  
  
MATDATA = 'ADSPdata.mat';
+
MATDATA = 'ADdata.mat';
 
which(MATDATA)
 
which(MATDATA)
 
load(MATDATA)
 
load(MATDATA)
  
clearvars -except ADSP
+
clearvars -except AD
  
  
  
%% CARBON COPY MAIN VARIABLES FROM ADSP.STRUCT
+
%% CARBON COPY MAIN VARIABLES FROM AD.STRUCT
  
LOCI = ADSP.LOCI(:,1:17);
+
LOCI = AD.LOCI(:,1:17);
CASE = ADSP.CASE;
+
CASE = AD.CASE;
CTRL = ADSP.CTRL;
+
CTRL = AD.CTRL;
PHEN = ADSP.PHEN;
+
PHEN = AD.PHEN;
  
clearvars -except ADSP LOCI CASE CTRL PHEN
+
clearvars -except AD LOCI CASE CTRL PHEN
  
  
Line 65: Line 59:
  
  
clearvars -except ADSP LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
+
clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
  
  
Line 96: Line 90:
  
  
clearvars -except ADSP LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
+
clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
  
  
Line 117: Line 111:
  
  
clearvars -except ADSP LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
+
clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
  
  
Line 150: Line 144:
  
  
clc; clearvars -except ADSP LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
+
clc; clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
 
disp(LOCI(1:9,:))
 
disp(LOCI(1:9,:))
  
Line 164: Line 158:
  
  
clearvars -except ADSP LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
+
clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
 
AMX AMXCASE AMXCTRL  
 
AMX AMXCASE AMXCTRL  
  
Line 184: Line 178:
  
  
clearvars -except ADSP LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
+
clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
 
AMX AMXCASE AMXCTRL  
 
AMX AMXCASE AMXCTRL  
  
Line 202: Line 196:
 
fprintf('\n %.0f final loci count \n\n',size(AMX,1))
 
fprintf('\n %.0f final loci count \n\n',size(AMX,1))
  
clearvars -except ADSP LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
+
clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
 
AMX AMXCASE AMXCTRL  
 
AMX AMXCASE AMXCTRL  
  
Line 217: Line 211:
 
[ADNN, caMX, coMX] = varmx(AMX,AMXCASE,AMXCTRL,PHE);
 
[ADNN, caMX, coMX] = varmx(AMX,AMXCASE,AMXCTRL,PHE);
  
clearvars -except ADSP LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
+
clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
 
AMX AMXCASE AMXCTRL ADNN  
 
AMX AMXCASE AMXCTRL ADNN  
  
Line 248: Line 242:
  
  
clearvars -except ADSP LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
+
clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
 
AMX AMXCASE AMXCTRL ADNN PCAMX   
 
AMX AMXCASE AMXCTRL ADNN PCAMX   
  
Line 269: Line 263:
 
% tSN = tsne(PCAC(:,1:10),'NumDimensions',2,'Theta',.6,'NumPCAComponents',0);
 
% tSN = tsne(PCAC(:,1:10),'NumDimensions',2,'Theta',.6,'NumPCAComponents',0);
 
%  
 
%  
% clearvars -except ADSP GENB LOCI CASE CTRL PHEN AMX AMXCASE AMXCTRL...
+
% clearvars -except AD GENB LOCI CASE CTRL PHEN AMX AMXCASE AMXCTRL...
 
% PHE ADNN PCAMX tSN PCAC PCAS
 
% PHE ADNN PCAMX tSN PCAC PCAS
  
Line 287: Line 281:
  
 
disp('done')
 
disp('done')
clearvars -except ADSP GENB LOCI CASE CTRL PHEN AMX AMXCASE AMXCTRL...
+
clearvars -except AD GENB LOCI CASE CTRL PHEN AMX AMXCASE AMXCTRL...
 
PHE ADNN PCAMX tSN PCAC PCAS
 
PHE ADNN PCAMX tSN PCAC PCAS
 
</syntaxhighlight>
 
</syntaxhighlight>
Line 501: Line 495:
 
<br><br>
 
<br><br>
 
{{SmallBox|float=left|clear=none|margin=0px 0px 8px 18px|width=50%|font-size=18px|Other Analyses|txt-size=14px|
 
{{SmallBox|float=left|clear=none|margin=0px 0px 8px 18px|width=50%|font-size=18px|Other Analyses|txt-size=14px|
1. [[ADSP|Intro]]<br>
+
1. [[AD|Intro]]<br>
4. [[ADSP Neural Nets|More Neural Nets]]<br>
+
4. [[AD Neural Nets|More Neural Nets]]<br>
2. [[ADSP PCA|PCA]]<br>
+
2. [[AD PCA|PCA]]<br>
3. [[ADSP t-SNE|t-SNE]]<br>
+
3. [[AD t-SNE|t-SNE]]<br>
5. [[ADSP Stats|Descriptive Statistics]]<br>
+
5. [[AD Stats|Descriptive Statistics]]<br>
 
}}
 
}}
  
Line 516: Line 510:
 
----
 
----
  
[http://www.bradleymonk.com/Category:ADSP Category:ADSP]
+
[http://www.bradleymonk.com/Category:AD Category:AD]
[[Category:ADSP]]
+
[[Category:AD]]

Latest revision as of 15:08, 11 June 2018

t-Distributed Stochastic Neighbor Embedding (tSNE) is a technique like PCA that allows one perform dimensionality reduction for visualization purposes. Supposedly tSNE does better than PCA at revealing clusters in high-dimensional data. Whereas PCA only allows you to visualize two or three components directly against each at the same time -- tSNE uses math magic to coerce a high-dimensional dataset into either a 2D or 3D array.

Other Analyses


t-SNE models each multi-dim object against a point on a euclidean surface in such a way that similar features are modeled by nearby point functions and dissimilar features are modeled by distant point functions. It then projects these points onto the plane allowing you visualize, what would effectively be, all the interesting principal component combinations - the ones that yield unique clusters - simultaneously.




t-SNE Code




  1 % ######################################################################
  2 %%       tSNE : t-Distributed Stochastic Neighbor Embedding
  3 % ######################################################################
  4 clc; close all; clear; rng('shuffle')
  5 cd(fileparts(which('GENOS.m')));
  6 
  7 
  8 MATDATA = 'ADdata.mat';
  9 which(MATDATA)
 10 load(MATDATA)
 11 
 12 clearvars -except AD
 13 
 14 
 15 
 16 %% CARBON COPY MAIN VARIABLES FROM AD.STRUCT
 17 
 18 LOCI = AD.LOCI(:,1:17);
 19 CASE = AD.CASE;
 20 CTRL = AD.CTRL;
 21 PHEN = AD.PHEN;
 22 
 23 clearvars -except AD LOCI CASE CTRL PHEN
 24 
 25 
 26 
 27 
 28 
 29 %###############################################################
 30 %%       DETERMINE WHICH PARTICIPANTS TO KEEP
 31 %###############################################################
 32 
 33 
 34 
 35 PHE = PHEN(PHEN.TOTvars>14000,:);
 36 
 37 
 38 PHECASE = PHE(PHE.AD==1,:);
 39 PHECTRL = PHE(PHE.AD==0,:);
 40 
 41 
 42 clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
 43 
 44 
 45 
 46 
 47 
 48 
 49 
 50 
 51 
 52 %###############################################################
 53 %%          COUNT NUMBER OF VARIANTS PER LOCI
 54 %###############################################################
 55 
 56 % The varsum() function will go through each known variant loci
 57 % and check whether anyone's SRR ID from your subset of IDs match
 58 % all known SRR IDs for that loci. It will then sum the total
 59 % number of alleles (+1 for hetzy-alt, +2 for homzy-alt) for each
 60 % loci and return the totals.
 61 
 62 
 63 [CASEN, CTRLN] = varsum(CASE, PHECASE.SRR, CTRL, PHECTRL.SRR);
 64 
 65 
 66 % SAVE COUNTS AS NEW TABLE COLUMNS
 67 LOCI.CASEREFS = numel(PHECASE.SRR)*2-CASEN;
 68 LOCI.CTRLREFS = numel(PHECTRL.SRR)*2-CTRLN;
 69 LOCI.CASEALTS = CASEN;
 70 LOCI.CTRLALTS = CTRLN;
 71 
 72 
 73 clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
 74 
 75 
 76 
 77 
 78 
 79 
 80 
 81 %###############################################################
 82 %%               COMPUTE FISHER'S P-VALUE
 83 %###############################################################
 84 
 85 
 86 % COMPUTE FISHERS STATISTICS FOR THE TRAINING GROUP
 87 [FISHP, FISHOR] = fishp_mex(LOCI.CASEREFS,LOCI.CASEALTS,...
 88                             LOCI.CTRLREFS,LOCI.CTRLALTS);
 89 
 90 LOCI.FISHPS  = FISHP;
 91 LOCI.FISHORS = FISHOR;
 92 
 93 
 94 clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
 95 
 96 
 97 
 98 
 99 
100 %% MAKE LATEST COUNTS THE MAIN TABLE STATS
101 
102 LOCI.CASEREF = LOCI.CASEREFS;
103 LOCI.CTRLREF = LOCI.CTRLREFS;
104 LOCI.CASEALT = LOCI.CASEALTS;
105 LOCI.CTRLALT = LOCI.CTRLALTS;
106 LOCI.FISHP   = LOCI.FISHPS;
107 LOCI.FISHOR  = LOCI.FISHORS;
108 
109 
110 
111 
112 
113 
114 %% SORT VARIANT LOCI TABLE BY FISHER P-VALUE
115 
116 [X,i] = sort(LOCI.FISHP);
117 
118 LOCI  = LOCI(i,:);
119 CASE  = CASE(i);
120 CTRL  = CTRL(i);
121 LOCI.VID = (1:size(LOCI,1))';
122 
123 LOCI.GENE = string(LOCI.GENE);
124 
125 
126 
127 clc; clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL
128 disp(LOCI(1:9,:))
129 
130 
131 
132 
133 
134 %% STORE VARIABLES FOR PCA/TSNE AS 'AMX'
135 
136 AMX         = LOCI;
137 AMXCASE     = CASE;
138 AMXCTRL     = CTRL;
139 
140 
141 clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
142 AMX AMXCASE AMXCTRL 
143 
144 
145 
146 
147 
148 %% FILTER VARIANTS BASED ALT > REF
149 
150 PASS = (AMX.CASEREF > AMX.CASEALT./1.5) | (AMX.CTRLREF > AMX.CTRLALT./1.5);
151 sum(~PASS)
152 
153 AMX      = AMX(PASS,:);
154 AMXCASE  = AMXCASE(PASS);
155 AMXCTRL  = AMXCTRL(PASS);
156 AMX.VID  = (1:size(AMX,1))';
157 
158 
159 
160 
161 clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
162 AMX AMXCASE AMXCTRL 
163 
164 
165 
166 
167 
168 %% TAKE THE TOP N NUMBER OF VARIANTS
169 N = 100;
170 
171 
172 AMX      = AMX(1:N,:);
173 AMXCASE  = AMXCASE(1:N);
174 AMXCTRL  = AMXCTRL(1:N);
175 AMX.VID  = (1:size(AMX,1))';
176 
177 fprintf('\n %.0f final loci count \n\n',size(AMX,1))
178 
179 clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
180 AMX AMXCASE AMXCTRL 
181 
182 
183 
184 
185 
186 
187 
188 
189 %% MAKE  RECTANGLE  NN VARIANT MATRIX
190 
191 
192 [ADNN, caMX, coMX] = varmx(AMX,AMXCASE,AMXCTRL,PHE);
193 
194 clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
195 AMX AMXCASE AMXCTRL ADNN 
196 
197 
198 
199 
200 
201 
202 
203 
204 
205 %% RANDOMIZE ADNN AND REORDER PHE TO MATCH ADNN
206 
207 ADL = ADNN(1,:);
208 ADN = ADNN(2:end,:);
209 i = randperm(size(ADN,1));
210 ADN = ADN(i,:);
211 ADNN = [ADL;ADN];
212 
213 
214 [i,j] = ismember(PHE.SRR, ADN(:,1) );
215 PHE.USED = i;
216 PHE.ORDER = j;
217 PHE = PHE(PHE.USED,:);
218 PHE = sortrows(PHE,'ORDER');
219 
220 
221 
222 PCAMX = ADNN(2:end,4:end);
223 
224 
225 clearvars -except AD LOCI CASE CTRL PHEN PHE PHECASE PHECTRL...
226 AMX AMXCASE AMXCTRL ADNN PCAMX  
227 
228 
229 
230 
231 %% (OPTIONAL) PRE-PERFORM PCA BEFORE TSNE 
232 
233 % ss = statset('pca');
234 % ss.Display = 'iter';
235 % ss.MaxIter = 100;
236 % ss.TolFun = 1e4;
237 % ss.TolX = 1e4;
238 % ss.UseParallel = true;
239 % 
240 % [PCAC,PCAS,~,~,~] = pca(  PCAMX' , 'Options',ss);
241 % clc; close all; scatter(PCAC(:,1),PCAC(:,2))
242 %
243 % % ...,'NumPCAComponents',0,...  means don't use PCA
244 % tSN = tsne(PCAC(:,1:10),'NumDimensions',2,'Theta',.6,'NumPCAComponents',0);
245 % 
246 % clearvars -except AD GENB LOCI CASE CTRL PHEN AMX AMXCASE AMXCTRL...
247 % PHE ADNN PCAMX tSN PCAC PCAS
248 
249 
250 
251 
252 
253 
254 %######################################################################
255 %%       tSNE : t-Distributed Stochastic Neighbor Embedding
256 %######################################################################
257 
258 
259 
260 tSN = tsne(PCAMX,'NumDimensions',2,'Theta',.6,'NumPCAComponents',8);
261 
262 
263 disp('done')
264 clearvars -except AD GENB LOCI CASE CTRL PHEN AMX AMXCASE AMXCTRL...
265 PHE ADNN PCAMX tSN PCAC PCAS




t-SNE Plots






ALZHEIMER'S STATUS

 1 %% PLOT TSNE --- ALZHEIMER'S STATUS (CASE/CTRL) --------------------------
 2 close all; 
 3 fh1=figure('Units','normalized','Position',[.05 .05 .70 .84],'Color','w');
 4 ax1=axes('Position',[.05 .02 .9 .9],'Color','none');
 5 
 6 ph1 = gscatter(tSN(:,1),tSN(:,2),  PHE.AD, [],'.',15);
 7 
 8 title({'\fontsize{16} t-SNE : CASE vs CTRL',' '})
 9 legend(ph1,{'CTRL','CASE'},'FontSize',12,'Box','off','Location','NorthWest');
10 axis off

Top 100 variants TSNE Case Control.png

Top 2000 variants TSNE Case Control 2kvars.png




STUDY COHORT

 1 %% PLOT TSNE --- CONSORTIUM STUDY COHORT (1:24) -------------------------
 2 close all; 
 3 fh1=figure('Units','normalized','Position',[.05 .05 .70 .84],'Color','w');
 4 ax1=axes('Position',[.05 .02 .9 .9],'Color','none');
 5 
 6 ph1 = gscatter(tSN(:,1),tSN(:,2),  PHE.COHORT, [],'.',15);
 7 
 8 
 9 title({'\fontsize{16} t-SNE : STUDY COHORT',' '})
10 % legend(ph1,{'CTRL','CASE'},'FontSize',12,'Box','off','Location','NorthWest');
11 axis off

Top 100 variants TSNE Study Cohort.png

Top 2000 variants TSNE Study Cohort 2kvars.png



SEX

 1 %% PLOT TSNE --- SEX (M/F) ----------------------------------------------
 2 close all; 
 3 fh1=figure('Units','normalized','Position',[.05 .05 .70 .84],'Color','w');
 4 ax1=axes('Position',[.05 .02 .9 .9],'Color','none');
 5 
 6 ph1 = gscatter(tSN(:,1),tSN(:,2),  PHE.SEX, [],'.',15);
 7 
 8 
 9 title({'\fontsize{16} t-SNE : SEX',' '})
10 legend(ph1,{'Male','Female'},'FontSize',12,'Box','off','Location','NorthWest');
11 axis off

Top 100 variants TSNE Sex.png

Top 2000 variants TSNE Sex 2kvars.png



AGE

 1 %% PLOT TSNE --- AGE (BINNED AGE) ---------------------------------------
 2 close all; 
 3 fh1=figure('Units','normalized','Position',[.05 .05 .70 .84],'Color','w');
 4 ax1=axes('Position',[.05 .02 .9 .9],'Color','none');
 5 
 6 AGE = round(PHE.AGE);
 7 ofAGE = AGE>60;
 8 A = AGE(ofAGE);
 9 
10 histogram(AGE)
11 
12 [Y,E] = discretize(A,[60 80 90 91]);
13 % [Y,E] = discretize(A,[60 75 85 90 91]);
14 for nn = 1:numel(E)
15 A(Y==nn) = E(nn);
16 end
17 
18 ph1 = gscatter(tSN(ofAGE,1),tSN(ofAGE,2),  A, [],'.',15);
19 
20 
21 title({'\fontsize{16} t-SNE : AGE',' '})
22 % legend(ph1,{'CTRL','CASE'},'FontSize',12,'Box','off','Location','NorthWest');
23 axis off

Top 100 variants TSNE Age.png

Top 2000 variants TSNE Age 2kvars.png



APOE STATUS

 1 %% PLOT TSNE --- APOE STATUS (22,23,24,33,34,44) ------------------------
 2 close all; 
 3 fh1=figure('Units','normalized','Position',[.05 .05 .70 .84],'Color','w');
 4 ax1=axes('Position',[.05 .02 .9 .9],'Color','none');
 5 
 6 
 7 ph1 = gscatter(tSN(:,1),tSN(:,2),  PHE.APOE, [],'.',15);
 8 
 9 
10 ph1(1).MarkerSize = 35;
11 ph1(2).MarkerSize = 25;
12 ph1(2).Color = [.20 .20 .99];
13 ph1(3).MarkerSize = 35;
14 ph1(4).Color = [.99 .50 .10];
15 ph1(5).Color = [.30 .70 .80];
16 ph1(6).MarkerSize = 25;
17 
18 title({'\fontsize{16} t-SNE : APOE',' '})
19 % legend(ph1,{'CTRL','CASE'},'FontSize',12,'Box','off','Location','NorthWest');
20 axis off

Top 100 variants TSNE APOE.png

Top 2000 variants TSNE APOE 2kvars.png



CONSENT GROUP

 1 %% PLOT TSNE --- CONSENT GROUP ------------------------------------------
 2 close all; 
 3 fh1=figure('Units','normalized','Position',[.05 .05 .70 .84],'Color','w');
 4 ax1=axes('Position',[.05 .02 .9 .9],'Color','none');
 5 
 6 
 7 ph1 = gscatter(tSN(:,1),tSN(:,2),  PHE.RD, [],'.',15);
 8 
 9 
10 title({'\fontsize{16} t-SNE : CONSENT GROUP',' '})
11 % legend(ph1,{'CTRL','CASE'},'FontSize',12,'Box','off','Location','NorthWest');
12 axis off

Top 2000 variants TSNE Consent 2kvars.png






















Additional Genomics Analyses




Other Analyses










Notes


Category:AD