Kernel Methods on Genomics Data - Purdue University
Kernel Methods for Large-Scale Genomic Data Analysis
Wang et al.

Background
Machine learning (ML) has been shown to be a promising tool for dealing with the challenges of data growth in genomics.
ML methods can be used to learn how a very large number of genetic variants (SNPs) are associated with complex phenotypes (diseases, disorders, etc.).
This study highlights the potential roles that ML, particularly kernel methods, will have in modern genomics.

Kernels for Genomic Data
Kernel methods are based on mathematical functions that smooth data.
They allow us to use linear classifiers to solve non-linear problems by transforming data that are not linearly separable into a higher-dimensional feature space.
Some advantages of kernel methods over traditional regression methods are the following:
o Allowance for high-dimensional genomic data
o Allowance for nonlinear relations between outcomes and the genomic data
o Flexibility to include structural information
A key component of a kernel method is the kernel function.
The function converts the information for a pair of subjects into a quantitative measure of their similarity with respect to the genetic data.
For GWAS, the weighted linear kernel is popular.
For the weighted kernel function, each SNP genotype is coded as G, which takes the value 0, 1, or 2 according to the number of copies of the minor allele (0 and 2 are the two homozygous genotypes, 1 is the heterozygous genotype).
For q SNPs, the weighted kernel function for subjects i and j can be expressed as:
K(G_i, G_j) = \sum_{k=1}^{q} w_k G_{ik} G_{jk}
Here w_k weights SNP k and is expressed in terms of the standard error of the estimated minor allele frequency (MAF).
Other types of weights can be used, and higher-order polynomial functions can be used to capture higher-order interactions.
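The weighted linear kernel described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the default weight 1 / sqrt(MAF * (1 - MAF)) is an assumption (one commonly used choice), since the exact weight formula is not reproduced in these slides. The polynomial variant is likewise only one possible way to model higher-order interactions.

```python
import numpy as np

def weighted_linear_kernel(G, weights=None):
    """K[i, j] = sum_k w_k * G[i, k] * G[j, k] for an (n subjects x q SNPs) matrix G."""
    G = np.asarray(G, dtype=float)
    if weights is None:
        # Assumed default weight: 1 / sqrt(MAF * (1 - MAF)), with the MAF
        # estimated from the sample. Monomorphic SNPs (MAF of 0 or 1) would
        # divide by zero and should be filtered out beforehand.
        maf = G.mean(axis=0) / 2.0
        weights = 1.0 / np.sqrt(maf * (1.0 - maf))
    return (G * weights) @ G.T

def polynomial_kernel(K, degree=2):
    """Element-wise polynomial transform (1 + K)^degree of a kernel matrix."""
    return (1.0 + K) ** degree

# Toy genotypes: 4 subjects, 3 SNPs, coded as minor-allele counts 0/1/2.
G = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 0, 1],
              [0, 1, 1]])
K = weighted_linear_kernel(G)
K2 = polynomial_kernel(K, degree=2)
```

Each entry K[i, j] is the similarity between subjects i and j; the resulting n x n matrix is what a kernel machine consumes, regardless of how many SNPs went into it.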
Ultimately, other types of kernels besides the weighted kernel can also be used.

Building Predictive Models
The goal is to be able to predict phenotypes for different individuals based on known genomic data (supervised ML).
A common practice is to build the prediction model from top-ranked markers from GWAS and a few experimentally known susceptibility markers (cherry picking).
Another strategy is to train the model using all the available markers as well as all other available information, such as epigenetic markers.
For disease risk prediction, support vector machines (SVMs) may be used.
An SVM is a well-developed method that seeks the optimal hyperplane separating the data into two classes while maximizing the margin.

Kernel Trick
Nonlinear classification is attained by using the kernel trick: the data set, which is not linearly separable, is mapped into a higher-dimensional space where a hyperplane that separates the samples can be found.

Building Predictive Models contd.
SVMs are advantageous for high-dimensional genomic data:
o Ability to deal with all markers without any pre-pruning or selection
o Accounts for complex relationships among markers
However, SVMs are black-box approaches that only provide a classification, and it is difficult to extract more information from them.
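The kernel trick mentioned for SVMs can be made concrete with a small worked example. The sketch below is an illustration only (the feature map phi, the quadratic kernel, and the toy points are assumptions chosen for clarity, not taken from the paper): it shows that a quadratic kernel evaluated in the original 2-D space equals an ordinary inner product in a 3-D feature space, and that XOR-style points, which no line separates in 2-D, are split by a single coordinate in that space.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Homogeneous quadratic kernel k(x, z) = (x . z)^2, computed in 2-D."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

# The kernel value equals the inner product of the mapped points.
x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
implicit = poly2_kernel(x, z)

# XOR-style data: not linearly separable in 2-D, but the middle feature
# sqrt(2) * x1 * x2 is positive for one class and negative for the other.
positives = [(1.0, 1.0), (-1.0, -1.0)]
negatives = [(1.0, -1.0), (-1.0, 1.0)]
pos_feature = [phi(p)[1] for p in positives]
neg_feature = [phi(p)[1] for p in negatives]
```

A kernel machine only ever needs values such as `implicit`, so the higher-dimensional mapping never has to be computed explicitly.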
Another potential classifier is kernel logistic regression (KLR).
KLR offers a natural estimate of probability and adapts to other probabilistic approaches.
The loss functions of KLR (logistic loss) and SVM (hinge loss) are actually very similar, and the methods have similar expected performance, but significant differences lie in their applications.
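The similarity between the two losses can be checked numerically. The sketch below uses the standard textbook definitions of the hinge and logistic losses as functions of the margin y * f(x); the function names are chosen here for illustration. It also shows the probability estimate that KLR provides and an SVM does not.

```python
import math

def hinge_loss(margin):
    """SVM loss: zero once the margin y * f(x) reaches 1, linear below that."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """KLR loss (negative log-likelihood): log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))

def klr_probability(score):
    """KLR turns a decision score f(x) into an estimate of P(y = +1 | x)."""
    return 1.0 / (1.0 + math.exp(-score))

# Tabulate both losses over a range of margins.
for margin in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0):
    print(f"{margin:5.1f}  hinge={hinge_loss(margin):.3f}  "
          f"logistic={logistic_loss(margin):.3f}")
```

Both losses grow roughly linearly for badly misclassified points. The main difference is that the hinge loss is exactly zero beyond a margin of 1, whereas the logistic loss stays positive everywhere.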
Many strategies can be adopted to improve whole-genome risk prediction.
One strategy is to exploit the block structure underlying genomic data.
Using kernel-based methods does not necessarily improve prediction drastically.

Multiple Kernel Learning (MKL)
Instead of selecting a single fixed kernel, multiple kernel learning (MKL) uses multiple candidate kernels to map the data into another space.
MKL can achieve better performance by finding optimal weights for each base kernel.
MKL seeks to build a composite kernel as a linear combination of different kernels, where model complexity is controlled by regularization.
Applications of MKL to genomic data are currently limited but will increase alongside large-scale genomics data.
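The composite kernel at the heart of MKL can be sketched as a weighted sum of base kernel matrices. In the sketch below the weights are fixed by hand purely for illustration; an actual MKL method would learn them by optimizing a regularized objective. The helper name, the toy data, and the choice of a linear and a Gaussian base kernel are all assumptions.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Composite kernel: non-negative weighted sum of base kernel matrices."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("kernel weights must be non-negative")
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return sum(w * K for w, K in zip(weights, kernels))

# Toy data: 5 subjects, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Base kernel 1: linear kernel.
K_lin = X @ X.T

# Base kernel 2: Gaussian (RBF) kernel built from squared distances.
sq_norms = np.sum(X ** 2, axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
K_rbf = np.exp(-0.5 * sq_dists)

K_mkl = combine_kernels([K_lin, K_rbf], [0.3, 0.7])

# A valid kernel matrix is symmetric with no negative eigenvalues
# (up to floating-point error).
min_eigenvalue = np.linalg.eigvalsh(K_mkl).min()
```

Because the weights are non-negative, the combination is itself a valid kernel, so it can be passed to any kernel machine in place of a single kernel.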

Genomic Data Fusion
A concept closely related to MKL is kernel-based data fusion.
Both data fusion and MKL are facilitated by the closure property of kernels (a sum or non-negative weighted sum of kernels is another valid kernel).
Kernel fusion methods allow data of different types (gene expression, DNA methylation, CNV, etc.) and structures to be integrated in function prediction.

Structured Association Mapping
Structured association mapping leverages structural information, unlike traditional test-statistics-based or PCA-based methods.
It uses the various kinds of structural information present in the genome (phenome and transcriptome) to improve the accuracy of identifying causal variants.
An important source of genome structural information is genome annotations (known binding sites, exon regions, etc.).
These data can be considered prior knowledge about SNPs to be used in the search for disease susceptibility markers.
For example, SNPs in highly conserved regions are more likely to have true associations, since conserved regions are functionally important.

Discussion
Kernel machine learning methods offer potential tools for large-scale and high-dimensional data analysis, for genomics in particular.
Kernel ML methods can be integrated with classical ML techniques.
Future work involves improving scalability with respect to sample size, dimensionality, and data heterogeneity.

Reference
Wang, Xuefeng, Eric P. Xing, and Daniel J. Schaid. "Kernel Methods for Large-Scale Genomic Data Analysis." Briefings in Bioinformatics 16, no. 2 (March 2015): 183–92. https://doi.org/10.1093/bib/bbu024.