Model-based clustering and validation techniques for gene expression data
Ka Yee Yeung, Department of Microbiology, University of Washington, Seattle, WA
Lipari 2004

Overview
- Introduction to microarrays
- Clustering 101
- Validating clustering results on microarray data
- Model-based clustering using microarray data
- Co-expression == co-regulation??

Introduction to Microarrays

DNA arrays measure the concentration of 1000s of genes simultaneously
- Probes for each gene are fixed on the surface; labeled transcripts in solution hybridize to them, so the signal at each spot reflects the abundance of that gene (e.g. 4 copies of gene A vs. 1 copy of gene B in solution).

Two-color analysis allows for comparative studies to be done
- Solution #1: 4 copies of gene A, 1 copy of gene B. Solution #2: 4 copies of gene A, 4 copies of gene B. After hybridization, the ratio of the two dyes at each spot compares the two samples.
- Workflow: extract mRNA from cell population #1 and cell population #2; make cDNA; label one with green fluor and the other with red fluor; co-hybridize on a slide spotted with DNA from different genes; scan.

The GeneChip from Affymetrix
- Affymetrix chips are oligonucleotide arrays with PM (perfect match) vs. MM (mismatch) probes.

A gene expression data set
- A snapshot of activities in the cell: an n x p matrix X with n genes (rows) and p experiments (columns); entry Xij is the expression level of gene i in experiment j.
- Each chip represents an experiment: time courses, tissue samples (normal/cancer), etc.

What is clustering?
- Group similar objects together. We can cluster genes (rows) or cluster experiments (columns):

          E1    E2    E3    E4
  Gene 1  -2    +2    +2    -1
  Gene 2  +8    +3     0    +4
  Gene 3  -4    +5    +4    -2
  Gene 4  -1    +4    +3    -1

Applications of clustering gene expression data
- Guilt by association: cluster the genes -> functionally related genes.
- Class discovery: cluster the experiments -> discover new subtypes of tissue samples.

Clustering 101

What is clustering?
- Group similar objects together: objects in the same cluster (group) are more similar to each other than objects in different clusters.
- A data-exploratory tool.
- Unsupervised method: does not assume any knowledge of the genes or experiments.

How to define similarity?
- Each gene is a profile of length p (one value per experiment). From the raw n x p data matrix, compute an n x n similarity matrix.
- A similarity measure quantifies pairwise similarity or dissimilarity. Examples: correlation coefficient, Euclidean distance.

Similarity measures
- Euclidean distance:
  d(X, Y) = sqrt( sum_{j=1..p} (X[j] - Y[j])^2 )
- Correlation coefficient:
  r(X, Y) = sum_{j=1..p} (X[j] - Xbar)(Y[j] - Ybar) / sqrt( sum_{j=1..p} (X[j] - Xbar)^2 * sum_{j=1..p} (Y[j] - Ybar)^2 ),
  where Xbar = (1/p) sum_{j=1..p} X[j].

Example
Four gene profiles X, Y, Z, W plotted over the experiments:
- Correlation(X, Y) = 1, Distance(X, Y) = 4
- Correlation(X, Z) = -1, Distance(X, Z) = 2.83
- Correlation(X, W) = 1, Distance(X, W) = 1.41

Lessons from the example
- Correlation captures direction only.
- Euclidean distance captures magnitude and direction.
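The two measures are easy to compute directly. A small sketch (the profiles below are illustrative stand-ins, not the slide's exact vectors) makes the lesson concrete: a shifted profile keeps correlation 1 while its Euclidean distance grows.

```python
import math

def euclidean(x, y):
    # Magnitude AND direction matter.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation(x, y):
    # Direction (shape of the profile) only: invariant to shift and scale.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Illustrative profiles (assumed for this sketch):
X = [1, 2, 3, 4]
Y = [5, 6, 7, 8]        # X shifted up by 4
Z = [4, 3, 2, 1]        # X reversed

print(correlation(X, Y))  # 1.0: same shape, despite a large distance
print(euclidean(X, Y))    # 8.0
print(correlation(X, Z))  # -1.0: opposite direction
```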

Clustering algorithms
- Inputs: raw data matrix or similarity matrix; number of clusters or some other parameters.
- Hierarchical vs. partitional algorithms.

Hierarchical clustering [Hartigan 1975]
- Agglomerative (bottom-up). Algorithm:
  - Initialize: each item is its own cluster.
  - Iterate: select the two most similar clusters and merge them (building a dendrogram).
  - Halt: when the required number of clusters is reached.

Hierarchical: single link
- Cluster similarity = similarity of the two MOST similar members.
- (-) Potentially long and skinny clusters. (+) Fast.

Example: single link. Initial distance matrix over items 1-5:

      1   2   3   4   5
  1   0
  2   2   0
  3   6   3   0
  4  10   9   7   0
  5   9   8   5   4   0

Merge 1 and 2 at distance 2, then update:
  d((1,2),3) = min{d(1,3), d(2,3)} = min{6,3} = 3
  d((1,2),4) = min{d(1,4), d(2,4)} = min{10,9} = 9
  d((1,2),5) = min{d(1,5), d(2,5)} = min{9,8} = 8

        (1,2)  3   4   5
  (1,2)   0
  3       3    0
  4       9    7   0
  5       8    5   4   0

Merge (1,2) and 3 at distance 3, then update:
  d((1,2,3),4) = min{d((1,2),4), d(3,4)} = min{9,7} = 7
  d((1,2,3),5) = min{d((1,2),5), d(3,5)} = min{8,5} = 5

          (1,2,3)  4   5
  (1,2,3)   0
  4         7      0
  5         5      4   0

Merge 4 and 5 at distance 4; finally
  d((1,2,3),(4,5)) = min{d((1,2,3),4), d((1,2,3),5)} = 5,
so the last merge is at distance 5.

Hierarchical: complete link
- Cluster similarity = similarity of the two LEAST similar members.

- (+) Tight clusters. (-) Slow.

Example: complete link. Same initial distance matrix. Merge 1 and 2 at distance 2, then update with max instead of min:
  d((1,2),3) = max{d(1,3), d(2,3)} = max{6,3} = 6
  d((1,2),4) = max{d(1,4), d(2,4)} = max{10,9} = 10
  d((1,2),5) = max{d(1,5), d(2,5)} = max{9,8} = 9

        (1,2)  3   4   5
  (1,2)   0
  3       6    0
  4      10    7   0
  5       9    5   4   0

Merge 4 and 5 at distance 4, then update:
  d((1,2),(4,5)) = max{d((1,2),4), d((1,2),5)} = max{10,9} = 10
  d(3,(4,5)) = max{d(3,4), d(3,5)} = max{7,5} = 7

        (1,2)  3  (4,5)
  (1,2)   0
  3       6    0
  (4,5)  10    7   0

Merge (1,2) and 3 at distance 6; finally
  d((1,2,3),(4,5)) = max{d((1,2),(4,5)), d(3,(4,5))} = 10.

Hierarchical: average link
- Cluster similarity = average similarity of all pairs.
- (+) Tight clusters. (-) Slow.

Example: average link. Same initial distance matrix. Merge 1 and 2 at distance 2, then update with the average:
  d((1,2),3) = (d(1,3) + d(2,3))/2 = (6+3)/2 = 4.5
  d((1,2),4) = (d(1,4) + d(2,4))/2 = (10+9)/2 = 9.5
  d((1,2),5) = (d(1,5) + d(2,5))/2 = (9+8)/2 = 8.5

        (1,2)  3    4   5
  (1,2)   0
  3       4.5  0
  4       9.5  7    0
  5       8.5  5    4   0

Merge 4 and 5 at distance 4, then update:
  d((1,2),(4,5)) = (d(1,4) + d(1,5) + d(2,4) + d(2,5))/4 = (10+9+9+8)/4 = 9
  d(3,(4,5)) = (d(3,4) + d(3,5))/2 = (7+5)/2 = 6

        (1,2)  3  (4,5)
  (1,2)   0
  3       4.5  0
  (4,5)   9    6   0

Merge (1,2) and 3 at distance 4.5; finally
  d((1,2,3),(4,5)) = (d(1,4) + d(1,5) + d(2,4) + d(2,5) + d(3,4) + d(3,5))/6 = (10+9+9+8+7+5)/6 = 8.
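The three linkage rules can be cross-checked against the worked examples with scipy (assuming scipy is available): `linkage` takes the condensed form of the distance matrix above and returns the heights at which merges occur.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed distance matrix for items 1..5 from the examples:
# d(1,2)=2, d(1,3)=6, d(1,4)=10, d(1,5)=9, d(2,3)=3,
# d(2,4)=9, d(2,5)=8, d(3,4)=7, d(3,5)=5, d(4,5)=4
d = np.array([2, 6, 10, 9, 3, 9, 8, 7, 5, 4], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)
    print(method, Z[:, 2])   # heights at which merges happen
# single:   [2. 3. 4. 5.]   -- matches the min-based merges above
# complete: [2. 4. 6. 10.]  -- matches the max-based merges
# average:  [2. 4. 4.5 8.]  -- matches the mean-of-all-pairs merges
```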

Partitional: k-means [MacQueen 1965]

Details of k-means
- Iterate until convergence: assign each data point to the closest centroid; compute new centroids.
- Properties: converges to a local optimum; in practice, converges quickly; tends to produce spherical, equal-sized clusters.

2D-clustering
- Cluster both genes and experiments.

Summary
- Definition of clustering.
- Pairwise similarity: correlation, Euclidean distance.
- Clustering algorithms: hierarchical (single-link, complete-link, average-link) and k-means.

Clustering microarray data: what has been done?
Many clustering algorithms have been proposed for gene expression data. For example:
- Hierarchical clustering algorithms, e.g. [Eisen et al. 1998]
- K-means, e.g. [Tavazoie et al. 1999]
- Self-organizing maps (SOM), e.g. [Tamayo et al. 1999]
- Cluster Affinity Search Technique (CAST) [Ben-Dor, Yakhini 1999]
- and many others

Common questions
1. How can I choose between all these clustering methods?
2. Is there a clustering algorithm that works better than the others?
3. How to choose the number of clusters?
4. How often do I get biologically meaningful clusters?
5. How many microarray experiments do I need?

Validating clustering results [Yeung, Haynor, Ruzzo 2001]
- FOM idea, data sets, results.
- ISI most cited paper in Computer Science (Dec 2002).

Validation techniques [Jain and Dubes 1988]
- External validation: requires external knowledge.
- Internal validation: does not require external knowledge; measures the goodness of fit between the data and the clusters.

Comparing different heuristic-based algorithms
- Apply a clustering algorithm to all but one experiment.
- Use the left-out experiment to check the predictive power of the algorithm.

Figure of Merit (FOM)
- FOM estimates predictive power: it measures the uniformity of gene expression levels in the left-out condition within the clusters formed from the remaining conditions.
- Low FOM => high predictive power.
- With n genes, m experiments, clusters C1..Ck, left-out experiment e, expression R(g,e) of gene g in e, and mu_Ci(e) the mean of cluster Ci in e:
  FOM(e, k) = sqrt( (1/n) * sum_{i=1..k} sum_{g in Ci} (R(g,e) - mu_Ci(e))^2 )
  FOM(k) = sum_{e=1..m} FOM(e, k)
  adjusted FOM(e, k) = FOM(e, k) / sqrt((n - k)/n)

Clustering algorithms compared
- Partitional: CAST (Ben-Dor et al. 1999), k-means (Hartigan 1975).
- Hierarchical: single-link, average-link, complete-link.
- Random (as a control): randomly assign genes to clusters.
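A minimal sketch of the FOM computation from the formulas above, assuming `clusters` lists the row indices of each cluster, computed without using experiment `e` (the data and clustering here are toy assumptions):

```python
import numpy as np

def fom(data, clusters, e):
    """FOM(e, k): root mean square deviation, in the left-out experiment e,
    of each gene from the mean of its cluster (clusters computed WITHOUT e)."""
    n = data.shape[0]
    total = 0.0
    for genes in clusters:
        vals = data[genes, e]
        total += ((vals - vals.mean()) ** 2).sum()
    return float(np.sqrt(total / n))

def adjusted_fom(data, clusters, e):
    # Correct for FOM shrinking as the number of clusters k grows.
    n, k = data.shape[0], len(clusters)
    return fom(data, clusters, e) / float(np.sqrt((n - k) / n))

# Aggregate over left-out experiments: FOM(k) = sum over e of FOM(e, k).
data = np.array([[0.0, 0.0, 0.0],
                 [0.1, 0.0, 0.0],
                 [5.0, 5.0, 5.0],
                 [5.0, 5.0, 5.2]])
print(fom(data, [[0, 1], [2, 3]], e=2))   # tight clusters -> low FOM
```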

Gene expression data sets
- Ovary data (Michel Schummer, Institute of Systems Biology): subset of 235 clones, 24 experiments (cancer/normal tissue samples); the 235 clones correspond to 4 genes (classes).
- Yeast cell cycle data (Cho et al. 1998): 17 time points; subset of 384 genes corresponding to 5 phases of the cell cycle.

Synthetic data sets
- Mixture of normal distributions based on the ovary data: generate a multivariate normal distribution with the sample covariance matrix and mean vector of each class in the ovary data.
- Randomly resampled ovary data: for each class, randomly sample the expression levels within each experiment; yields a near-diagonal covariance matrix.
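Both synthetic variants can be sketched with numpy. `resampled` reads "randomly sample the expression levels in each experiment" as an independent permutation of each column within a class; that exact resampling scheme is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_of_normals(class_data, n_per_class):
    """Variant 1: per class, draw from a multivariate normal with that
    class's sample mean vector and sample covariance matrix."""
    samples = []
    for X, n in zip(class_data, n_per_class):
        mean = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        samples.append(rng.multivariate_normal(mean, cov, size=n))
    return np.vstack(samples)

def resampled(class_data):
    """Variant 2 (assumed scheme): per class, independently permute the
    expression levels within each experiment (column), which breaks
    cross-experiment correlation -> near-diagonal covariance."""
    out = []
    for X in class_data:
        out.append(np.column_stack([rng.permutation(X[:, j])
                                    for j in range(X.shape[1])]))
    return np.vstack(out)

classes = [rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (30, 4))]
print(mixture_of_normals(classes, [50, 30]).shape)  # (80, 4)
print(resampled(classes).shape)                     # (80, 4)
```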

Randomly resampled synthetic data set: genes within each class of the ovary data are resampled within each experiment (column).

Results (FOM vs. number of clusters, 1-50; algorithms compared: CAST, random, average-link, single-link, complete-link, and k-means initialized with average-link)
- Ovary data: CAST, k-means and complete-link show the best performance.
- Yeast cell cycle data: CAST and k-means show the best performance.
- Mixture of normals based on the ovary data: at 4 clusters, CAST has the lowest FOM.
- Randomly resampled ovary data: at 4 clusters, CAST has the lowest FOM.

Summary of results
- CAST and k-means produce higher-quality clusters than the hierarchical algorithms.
- Single-link has the worst performance among the hierarchical algorithms.

Software implementation
- Command-line C code: not very user-friendly at the moment.
- Third-party implementation: MEV from TIGR, http://www.tigr.org/software/tm4/mev.html

Thank-yous
- FOM work: David Haynor (Radiology, UW), Larry Ruzzo (Computer Science, UW).
- Ovary data: Michel Schummer (Institute of Systems Biology).

Ready for a break? (Recap of the overview and the five common questions above.)

Model-based clustering [Yeung, Fraley, Murua, Raftery, Ruzzo 2001]
- Overview of model-based clustering, data sets, results, summary and future work.
- ISI most cited paper in Computer Science (Jan 2004).

Model-based clustering
- Gaussian mixture model: assume each cluster is generated by a multivariate normal distribution.
- Each cluster k has parameters: mean vector mu_k and covariance matrix Sigma_k.
- Likelihood for the mixture model (mixing proportions tau_k, component densities f_k):
  L_mix(theta_1, ..., theta_G | y) = prod_{i=1..n} sum_{k=1..G} tau_k * f_k(y_i | theta_k)
- Data transformations relate to the normality assumption.

EM algorithm
- General approach to maximum likelihood. Iterate between E and M steps:
  - E step: compute the probability of each observation belonging to each cluster using the current parameter estimates.
  - M step: estimate model parameters using the current group-membership probabilities.

Parameterization of the covariance matrix: Sigma_k = lambda_k * D_k * A_k * D_k^T (Banfield & Raftery 1993)
- lambda_k controls volume (variance), D_k orientation, A_k shape.
- Equal-variance spherical model (EI): Sigma_k = lambda * I (~ k-means).
- Unequal-variance spherical model (VI): Sigma_k = lambda_k * I.
- Unconstrained model (VVV): Sigma_k = lambda_k * D_k * A_k * D_k^T.
- EEE elliptical model: Sigma_k = D * A * D^T (same for all clusters).
- Diagonal model: Sigma_k = lambda_k * B_k, where B_k is diagonal with |B_k| = 1.

Key advantage of the model-based approach: choose the model and the number of clusters with the Bayesian Information Criterion (BIC). A large BIC score indicates strong evidence for the corresponding model.

Definition of the BIC score
  2 log p(D | M_k) ~ 2 log p(D | theta_hat_k, M_k) - nu_k log(n) = BIC_k
- The integrated likelihood p(D | M_k), where D is the data and M_k the model, is hard to evaluate; BIC_k approximates 2 log p(D | M_k).
- nu_k: number of parameters to be estimated in model M_k.

Overall clustering approach
- For a given range of numbers of clusters G, and for each model: apply EM to estimate model parameters and cluster memberships, then compute BIC.

Our approach
- Goal: show that the model-based approach has superior performance on quality of clusters and on the number of clusters and model chosen (BIC).
- To compare clusters with known classes: adjusted Rand index (Hubert and Arabie 1985); a high adjusted Rand index means high agreement.
- Compare the quality of clusters with a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999).
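The EM-plus-BIC loop can be sketched with scikit-learn's GaussianMixture (an assumption of this sketch, not the software used in the paper); its `covariance_type` options only roughly parallel the parameterizations above (spherical ~ VI, diag ~ diagonal, tied ~ EEE, full ~ VVV), and scikit-learn defines BIC with the opposite sign convention, so LOWER is better.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters as stand-ins for gene classes.
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(8, 1, (100, 3))])

best = None
for g in range(1, 7):  # candidate numbers of clusters G
    # covariance_type roughly parallels the parameterizations above.
    for cov in ("spherical", "diag", "tied", "full"):
        gm = GaussianMixture(n_components=g, covariance_type=cov,
                             n_init=3, random_state=0).fit(X)  # EM
        bic = gm.bic(X)  # scikit-learn convention: LOWER is better
        if best is None or bic < best[0]:
            best = (bic, g, cov)

print("chosen:", best[1], "clusters,", best[2], "model")
```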

Adjusted Rand index
- Compare clusters to classes by considering pairs of objects:

                     Same cluster   Different cluster
  Same class              a                c
  Different class         b                d

Example (adjusted Rand). Contingency table of 20 objects: classes (rows, sizes in parentheses) vs. clusters (columns):

              c#1(4)  c#2(5)  c#3(7)  c#4(4)
  class#1(2)    2       0       0       0
  class#2(3)    0       0       0       3
  class#3(5)    1       4       0       0
  class#4(10)   1       1       7       1

With C(m,2) the number of pairs among m objects:
  a = C(2,2) + C(3,2) + C(4,2) + C(7,2) = 1 + 3 + 6 + 21 = 31
  b = [C(4,2) + C(5,2) + C(7,2) + C(4,2)] - a = 43 - 31 = 12
  c = [C(2,2) + C(3,2) + C(5,2) + C(10,2)] - a = 59 - 31 = 28
  d = C(20,2) - a - b - c = 190 - 71 = 119
  Rand, R = (a + d) / (a + b + c + d) = 150/190 = 0.789
  Adjusted Rand = (R - E(R)) / (1 - E(R)) = 0.469

Methodology
- For users: BIC does not require classes; it chooses the number of clusters and the model.
- For our evaluation: the adjusted Rand index needs classes; it assesses the agreement of the clusters with the classes.

Gene expression and synthetic data sets: the same ovary data (Michel Schummer, Institute of Systems Biology), yeast cell cycle data (Cho et al. 1998), mixture-of-normals and randomly resampled synthetic data sets as in the FOM study above.
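The pair counts and the adjusted Rand value of the worked example above can be verified directly from the contingency table:

```python
from math import comb

# Contingency table from the example: rows = classes, columns = clusters.
table = [[2, 0, 0, 0],
         [0, 0, 0, 3],
         [1, 4, 0, 0],
         [1, 1, 7, 1]]

n = sum(map(sum, table))                                   # 20 objects
a = sum(comb(x, 2) for row in table for x in row)          # same class & same cluster
sum_rows = sum(comb(sum(row), 2) for row in table)         # same-class pairs
sum_cols = sum(comb(sum(col), 2) for col in zip(*table))   # same-cluster pairs
b = sum_cols - a   # same cluster, different class
c = sum_rows - a   # same class, different cluster
d = comb(n, 2) - a - b - c

rand = (a + d) / comb(n, 2)
expected = sum_rows * sum_cols / comb(n, 2)   # E[a] under random labelling
ari = (a - expected) / ((sum_rows + sum_cols) / 2 - expected)

print(a, b, c, d)        # 31 12 28 119
print(round(rand, 3))    # 0.789
print(round(ari, 3))     # 0.469
```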

Results: mixture of normal distributions based on ovary data (2350 genes)
- At 4 clusters: VVV, diagonal, and CAST achieve high adjusted Rand indices.
- BIC selects VVV at 4 clusters.

Results: randomly resampled ovary data
- The diagonal model achieves the maximum adjusted Rand index and BIC score (higher than CAST).
- BIC is maximized at 4 clusters, confirming the expected result.

Results: square-root ovary data
- Adjusted Rand: maximum at EEE with 4 clusters (> CAST).
- BIC analysis: the EEE and diagonal models reach a first local maximum at 4 clusters; the global maximum is VI at 8 clusters.

Results: standardized yeast cell cycle data
- Adjusted Rand: EI slightly > CAST at 5 clusters.
- BIC selects EEE at 5 clusters.

Results: log yeast cell cycle data
- CAST achieves much higher adjusted Rand indices than most model-based approaches (except EEE).
- BIC scores of EEE are much higher than those of the other models.

Summary
- Synthetic data sets: with the correct model, the model-based approach excels over CAST, and BIC selects the right model at the correct number of clusters.
- Real expression data sets: quality of clusters comparable to CAST; advantage: BIC gives a hint to the number of clusters.

Software implementation
- Mclust (Chris Fraley and Adrian Raftery), available in S-PLUS, in R (Ron Wehrens), and in Matlab (Angel Martinez and Wendy Martinez): http://www.stat.washington.edu/mclust/

Future work
- Custom refinements to the model-based implementation: models that incorporate specific information about the experiments (e.g. a block-diagonal covariance matrix), missing data, outliers.

Thank-yous
- Model-based work: Chris Fraley (Statistics, UW), Alejandro Murua (Statistics, UW), Adrian Raftery (Statistics, UW), Larry Ruzzo (Computer Science, UW).
- Ovary data: Michel Schummer (Institute of Systems Biology).

(Recap of the five common questions above.)

From co-expression to co-regulation: how many microarray experiments do we need?
[Yeung, Medvedovic, Bumgarner: to appear in Genome Biology 2004]

Motivation
- Genes sharing the same transcriptional modules are expected to produce similar expression patterns, and cluster analysis is often used to identify genes with similar expression patterns.

Questions
1. How likely are co-expressed genes to be regulated by the same transcription factors?
2. What is the effect on this likelihood of (a) the number of microarray experiments and (b) the clustering algorithm used?

Pipeline
- From transcription factor databases: yeast genes regulated by the same TFs.
- From yeast microarray data: pre-process, randomly sample subsets with E experiments, and cluster.
- For each pair of genes in the same cluster, evaluate whether they share the same TFs.

Yeast transcription factors
- SCPD (Saccharomyces cerevisiae Promoter Database) [Zhang et al. 1999]: lists ~235 genes regulated by 90 transcription factors (TFs).
- YPD (Yeast Protein Database): commercial; UW does not have access. The appendix of [Lee et al. 2002] lists the genes regulated by each TF from the literature as of Nov 2001: ~584 genes regulated by 120 TFs.

Comparing YPD and SCPD

                           SCPD   YPD   Common
  # distinct ORFs           235   584     156
  # distinct TFs            108   120      34
  # gene-TF interactions    473  1056     119

- SCPD: 41/90 = 46% of TFs regulate only 1 gene; YPD: 17/120 = 14% of TFs regulate only 1 gene.
- In general, the YPD list contains TFs that regulate a higher number of genes.

Yeast microarray data
- Rosetta's yeast compendium data [Hughes et al. 2000]: 300 knockout two-color experiments.
- Stanford: Gasch et al. data [2000 and 2001]: cDNA array data under a variety of environmental stresses (e.g. heat shock); in total 225 concatenated time-course experiments.

Evaluation
- For each clustering result, count the number of pairs of genes that belong to the same cluster and share a common TF (true positives, TP):
  TP rate = (# gene pairs from the same clusters that share at least 1 common TF) / (# gene pairs from the same clusters)
- The TP rate may change as a function of the number of clusters, so we compare it to the TP rate of random partitions: randomly partition the set of genes 1000 times; the distribution of the TP rate is approximately normal with mean mu and standard deviation sigma, and
  z-score = (TP rate - mu) / sigma.
- A high z-score means the TP rate is significantly higher than those of random partitions.

Results: compendium data using all experiments (comparing the clustering algorithms)
- Compendium data & SCPD (273 experiments), over 5-100 clusters: MCLUST and complete-link (correlation) produced relatively high z-scores.

IMM (Infinite Mixture Model-based clustering)
- Infinite mixture model: each cluster is assumed to follow a multivariate normal distribution; the number of clusters is NOT assumed.
- A Gibbs sampler estimates the pairwise probabilities P_ij that two genes (i, j) belong to the same cluster.
- To form clusters: cluster the P_ij with a heuristic clustering algorithm (e.g. complete-link).
- Built-in error model: the repeated measurements are assumed to be generated from another multivariate Gaussian distribution.
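The TP-rate z-score evaluation above can be sketched as follows, assuming genes are simple ids and `tfs` maps each gene to its set of regulating TFs (both names are hypothetical):

```python
import random
from itertools import combinations

def tp_rate(clusters, tfs):
    """Fraction of same-cluster gene pairs sharing at least one TF."""
    same = shared = 0
    for cl in clusters:
        for g1, g2 in combinations(cl, 2):
            same += 1
            if tfs.get(g1, set()) & tfs.get(g2, set()):
                shared += 1
    return shared / same if same else 0.0

def z_score(clusters, tfs, trials=1000, seed=0):
    """Compare the TP rate against random partitions with the same sizes."""
    rnd = random.Random(seed)
    genes = [g for cl in clusters for g in cl]
    sizes = [len(cl) for cl in clusters]
    rates = []
    for _ in range(trials):
        rnd.shuffle(genes)
        parts, i = [], 0
        for s in sizes:
            parts.append(genes[i:i + s])
            i += s
        rates.append(tp_rate(parts, tfs))
    mu = sum(rates) / trials
    sigma = (sum((r - mu) ** 2 for r in rates) / trials) ** 0.5
    return (tp_rate(clusters, tfs) - mu) / sigma

# Toy check: a clustering that perfectly groups co-regulated genes.
tfs = {0: {"A"}, 1: {"A"}, 2: {"B"}, 3: {"B"}}
print(z_score([[0, 1], [2, 3]], tfs, trials=200))  # clearly positive
```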

Results: compendium data, effect of the number of experiments
- Compendium data & SCPD, hierarchical complete-link over a range of numbers of clusters (5-95), with subsets of 5, 10, 20, 50, 100, 200, and 273 experiments: the median z-score increases as the number of experiments increases, across different numbers of clusters.
- Compendium data & SCPD, different clustering algorithms (average-link and complete-link with correlation, Mclust, IMM) at 25 clusters: the proportion of co-regulated genes increases as the number of experiments increases; Mclust attains the highest proportions of co-regulated genes.
- Compendium data & YPD (537 genes), the same algorithms at 40 clusters: a similar trend.

Summary of results
- More microarray experiments -> more likely to find co-regulated genes!
- SCPD and YPD produce similar results.
- Euclidean distance tends to produce relatively low z-scores compared to correlation with the same algorithm.
- Standardization greatly improves the performance of model-based methods.
- Mclust (EI model) produces relatively high z-scores.
- IMM doesn't work as well as Mclust. Why?

ChIP-chip: the methodology
- Transcription factors are crosslinked to genomic DNA; the DNA is sheared; antibodies immunoprecipitate a specific transcription factor; the DNA is un-linked, labeled, and used to interrogate arrays.

ChIP data: a third gold standard [Lee et al., Science 2002]
- Chromatin immunoprecipitation (ChIP): detects the binding of TFs of interest to intergenic sequences in yeast in vivo.
- 106 TFs from YPD (113 TFs in their raw data); error model adopted from [Hughes et al. 2000].
- Raw data (log ratios and p-values for genes/intergenic regions vs. TFs) available; p-value cutoff = 0.001.

Comparing ChIP data and YPD
- 791 gene-TF interactions from YPD have a common gene and TF in the ChIP data. Distribution of their ChIP p-values:

  p-value range        # gene-TF interactions   % of the 791
  0 < x <= 0.001               159                 20.10
  0.001 < x <= 0.005            37                  4.68
  0.005 < x <= 0.01             22                  2.78
  0.01 < x <= 0.05              60                  7.59
  0.05 < x <= 0.1               40                  5.06
  0.1 < x <= 0.5               232                 29.33
  x > 0.5                      241                 30.47

- Effect of the p-value cutoff (TP = YPD interactions detected by ChIP at that cutoff; FN = YPD interactions missed):

  p-value cutoff                        0.001   0.005    0.01    0.05     0.1
  TP                                     159     196     218     278     318
  FN                                     632     595     573     513     473
  TP %                                 20.10   24.78   27.56   35.15   40.20
  FN %                                 79.90   75.22   72.44   64.85   59.80
  total # ChIP interactions              642    1101    1517    3655    6226
  # detected by ChIP but not in YPD      421     775    1112    2759    4722

Results: compendium data & ChIP (215 genes)
- Median z-score vs. number of experiments (5-273) for average-link (corr), complete-link (corr), Mclust and IMM: very similar results on the other gold standards as well.

Take-home message
- In order to reliably infer co-regulation from cluster analysis, we need lots of data.

Limitations
- Very naive assumption of co-regulation: genes sharing at least one common transcription factor.
- Yeast data only.
- Does not take the information limit of microarray data sets into consideration.
- Considers only clustering algorithms in which each gene is assigned to exactly one cluster.
- The current study does not provide completely quantitative results: how many experiments are sufficient to achieve x% co-regulation?

Thank-yous
- Roger Bumgarner (Microbiology, UW) and the Bumgarner Lab, UW: Tanveer Haider, Tao Peng, Mette Peters, Kyle Serikawa, Caimiao Wei.
- Ted Young (Biochemistry, UW).
- IMM: Mario Medvedovic (University of Cincinnati).
- Mclust: Adrian Raftery and Chris Fraley (Statistics, UW).

Summary
1. How can I choose between different clustering methods? -> FOM: compare any clustering algorithms on any dataset.
2. Is there a clustering algorithm that works better than the others? / 3. How to choose the number of clusters? -> Model-based clustering: high cluster quality and an estimated number of clusters.
4. How often do I get biologically meaningful clusters? / 5. How many experiments do I need? -> More experiments -> more likely to find co-regulated genes; with a model-based method in yeast, ~50 experiments.

Meet my collaborators and mentors: David Haynor, Roger Bumgarner, Adrian Raftery, Mario Medvedovic, Larry Ruzzo.

Key references
- [Yeung, Haynor, Ruzzo 2001] Validating clustering for gene expression data. Bioinformatics 17:309-318.
- [Yeung, Fraley, Murua, Raftery, Ruzzo 2001] Model-based clustering and data transformations for gene expression data. Bioinformatics 17:977-987.
- [Yeung, Medvedovic, Bumgarner 2004] From co-expression to co-regulation: how many microarray experiments do we need? To appear in Genome Biology.
- http://faculty.washington.edu/kayee/

Common questions, revisited
6. How to best take advantage of repeated measurements from microarray data?

Clustering microarray data with repeated measurements

[Yeung, Medvedovic, Bumgarner 2003] [Medvedovic, Yeung, Bumgarner 2004]

Array data is noisy
- (Figure: the SD of log(R/G), and the number of data points, plotted against average intensity (log R + log G) over the range 2.0-10.0.)

Observations
- There is variability in all measurements, including gene expression measurements.
- Repeated measurements allow one to estimate the variability.

Our hypothesis
- Incorporating variability estimates or repeated measurements into clustering algorithms -> better results.

Illustration
- The earlier example, now with the variability of each measurement (clustering genes or experiments):

            E1        E2       E3       E4
  Gene 1  -2±0.2    +2±0.3   +2±0.3   -1±0.1
  Gene 2  +8±0.8    +3±0.4    0±0.2   +4±0.5
  Gene 3  -4±0.5    +5±20    +4±10    -2±0.8
  Gene 4  -1±0.1    +4±0.2   +3±0.3   -1±0.1

How to cluster microarray data with repeated measurements?
- Average over repeated measurements.
- Variability-weighted similarity measures [Hughes et al. 2000]: down-weight noisy measurements when computing pairwise similarities.
- Infinite mixture model (IMM) [Medvedovic et al. 2002]: each cluster is assumed to follow a multivariate normal distribution; built-in error model for repeated measurements; does NOT assume the number of clusters.
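One plausible form of a variability-weighted dissimilarity, in the spirit of [Hughes et al. 2000] though not necessarily their exact formula, divides each squared difference by the combined measurement variances so that noisy experiments count less:

```python
import math

def weighted_distance(x, y, sx, sy):
    """Error-weighted Euclidean-style distance: experiments with large
    error bars (sx, sy) contribute less. A sketch, not the exact
    published variability-weighted measure."""
    return math.sqrt(sum((a - b) ** 2 / (sa ** 2 + sb ** 2)
                         for a, b, sa, sb in zip(x, y, sx, sy)))

# Gene 1 and Gene 3 from the illustration: similar direction, but Gene 3's
# E2/E3 measurements carry huge error bars, so they are down-weighted.
g1, s1 = [-2, 2, 2, -1], [0.2, 0.3, 0.3, 0.1]
g3, s3 = [-4, 5, 4, -2], [0.5, 20, 10, 0.8]

print(weighted_distance(g1, g3, s1, s3))
```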

IMM (Infinite Mixture Model-based clustering)
- Infinite mixture model: a Gibbs sampler estimates the pairwise probabilities P_ij that two genes (i, j) belong to the same cluster.
- To form clusters: cluster the P_ij with a heuristic clustering algorithm (e.g. complete-link).
- Auto complete-link: clusters are groups of genes for which there exists at least one pair of genes s.t. the probability of them being co-expressed is 0 --> cluster distance = 1.
- Built-in error model: the repeated measurements are assumed to be generated from another multivariate Gaussian distribution.

Assessing cluster quality
Generate synthetic data sets with realistic error distributions, for which the true clusters (classes) are known, at a low noise level and a high noise level.

How to define true/false positives/negatives?
Problem: it is difficult to assign clusters to classes, especially when the cluster quality is poor.
Pairwise approach:

                  Same cluster     Different cluster
Same class        True positives   False negatives
Different class   False positives  True negatives

Typical Results (synthetic data)

                 average-link       IMM             IMM
                 with correlation   (4 repeated     (no repeated
                 (4 repeated        measurements)   measurements)
                 measurements)
True positive    15%                16%             13%
False negative    2%                 0%              4%
False positive   12%                 0%             18%
True negative    71%                84%             65%

Summary of Results
Repeated measurements significantly improve the quality of clustering results.
IMM with built-in error model: relatively high cluster quality.
Auto IMM: reasonable estimates of the number of clusters.

Software Implementation
Command-line C++ code for now.
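The pairwise evaluation described above can be sketched as follows (the toy labels are made up for illustration): for every pair of genes, compare whether they share a true class and whether they share a cluster.

```python
from itertools import combinations

def pairwise_confusion(classes, clusters):
    """Count gene pairs by (same class?, same cluster?).

    classes  -- true class label per gene
    clusters -- assigned cluster label per gene (same order)
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            tp += 1      # true positive: pair correctly kept together
        elif same_class:
            fn += 1      # false negative: a true class was split
        elif same_cluster:
            fp += 1      # false positive: different classes were merged
        else:
            tn += 1      # true negative: pair correctly separated
    return tp, fp, fn, tn

# Toy example: 6 genes, two true classes, an imperfect clustering.
classes  = ["A", "A", "A", "B", "B", "B"]
clusters = [1, 1, 2, 2, 2, 2]
print(pairwise_confusion(classes, clusters))  # -> (4, 3, 2, 6)
```

Because the comparison is over pairs, no explicit mapping from clusters to classes is needed, which is the point of the pairwise approach.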

IMM tutorial available:
http://expression.microslu.washington.edu/expression/kayee/cluster2003/yeunggb2003.html

Thank-yous
Roger Bumgarner
Mario Medvedovic

Summary: clustering
1. How can I choose between different clustering methods?
   Answer: FOM can compare any clustering algorithms on any dataset.

2. Is there a clustering algorithm that works better than the others?
3. How to choose the number of clusters?
   Answer: model-based clustering gives high cluster quality and an estimated number of clusters.
4. How to best take advantage of repeated measurements?
   Answer: IMM, with its built-in probabilistic error model.
5. How often do I get biologically meaningful clusters?
6. How many experiments do I need?
   Answer: more experiments --> more likely to find co-regulated genes; with the model-based method in yeast, ~50 experiments.

DNA Hybridization

Background: Transcription 101
Transcription: DNA -> RNA
Transcription factors: proteins
Promoters (or transcription factor binding sites)
