Kernel Methods on Genomics Data - Purdue University

Kernel Methods for Large-Scale Genomics Data Analysis
Wang et al.

Background

Machine learning (ML) has been shown to be a promising tool for dealing with the challenges posed by data growth in genomics. ML methods can be used to learn how a very large number of genetic variants (SNPs) are associated with complex phenotypes (diseases, disorders, etc.). This study highlights the potential roles that ML, particularly kernel methods, will have in modern genomics.

Kernels for Genomic Data

Kernel methods are based on mathematical functions that smooth data. They allow us to use linear classifiers to solve non-linear problems by transforming data that are not linearly separable.

Some advantages of kernel methods over traditional regression methods are the following:
o Allowance for high-dimensional genomic data
o Allowance for nonlinear relations between outcomes and the genomic data
o Flexibility to include structural information

A key component of a kernel method is the kernel function. The function converts the information for a pair of subjects into a quantitative measure representing their similarity with respect to genetic data. For GWAS, the weighted linear kernel is popular.

For the weighted kernel function, SNPs are coded as G, where G takes the value 0, 1, or 2 according to the number of copies of the minor allele, essentially encoding whether a subject is homozygous or heterozygous at that SNP. For q SNPs, the weighted function for subjects i and j can be expressed as:

[equation not recovered]

The weight applied to each SNP is expressed as the standard error of the estimated minor allele frequency (MAF):

[equation not recovered]

Other types of weights can be used, and higher-order polynomial functions can be used to capture higher-order interactions.
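As a rough sketch of the weighted linear kernel described above, the snippet below assumes the usual form K(G_i, G_j) = sum over k of w_k * G_ik * G_jk, with genotypes coded 0, 1, or 2. The function names, the toy genotype matrix, and the particular MAF-based weight (the inverse of sqrt(MAF * (1 - MAF))) are illustrative assumptions and are not taken from the slides. Any other non-negative per-SNP weights could be passed in instead.

```python
from math import sqrt


def minor_allele_frequencies(genotypes):
    """Estimate the MAF of each SNP from a subjects-by-SNPs matrix of 0/1/2 codes."""
    n_subjects = len(genotypes)
    n_snps = len(genotypes[0])
    # Each subject carries two alleles, so divide the allele count by 2n.
    return [
        sum(row[k] for row in genotypes) / (2 * n_subjects)
        for k in range(n_snps)
    ]


def weighted_linear_kernel(g_i, g_j, weights):
    """Assumed form: K(G_i, G_j) = sum_k w_k * G_ik * G_jk."""
    return sum(w * a * b for w, a, b in zip(weights, g_i, g_j))


def kernel_matrix(genotypes, weights):
    """Pairwise similarity matrix over all subjects."""
    return [
        [weighted_linear_kernel(g_i, g_j, weights) for g_j in genotypes]
        for g_i in genotypes
    ]


# Toy data (hypothetical): 4 subjects, 3 SNPs, minor-allele counts.
genotypes = [
    [0, 1, 2],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
]

maf = minor_allele_frequencies(genotypes)
# Illustrative weight only; up-weights SNPs with low MAF.
weights = [1 / sqrt(p * (1 - p)) for p in maf]
K = kernel_matrix(genotypes, weights)
```

The resulting matrix K is symmetric, with one row and one column per subject, and can be handed to any method that accepts a precomputed kernel.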

Ultimately, other types of kernels besides the weighted kernel can also be used.

Building Predictive Models

The goal is to predict phenotypes for different individuals based on known genomic data (supervised ML). A common practice is to build the prediction model from the top-ranked markers from GWAS and a few experimentally known susceptibility markers (cherry picking).

Another strategy is to train the model using all the available markers as well as all other available information, such as epigenetic markers.

For disease risk prediction, support vector machines (SVMs) may be used. SVMs are a well-developed method that seeks an optimal hyperplane separating the data into two classes while maximizing the margin.

Nonlinear classification is attained by using the kernel trick: implicitly mapping the non-linearly separable data set into a higher-dimensional space, where a hyperplane that separates the samples can be found.
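The kernel trick can be checked numerically on a small case. The sketch below uses a degree-2 polynomial kernel on 2-D inputs, chosen here purely for illustration (the slides do not specify a kernel for this step), and shows that evaluating the kernel gives the same value as an inner product in an explicit 3-D feature space. All function names and the XOR-style toy points are hypothetical.

```python
from math import isclose, sqrt


def poly2_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel on 2-D inputs: (x . z) ** 2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit 3-D map whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly2_kernel(x, z)                    # never builds the 3-D vectors
k_explicit = dot(feature_map(x), feature_map(z))   # builds them explicitly
same_value = isclose(k_implicit, k_explicit)

# XOR-style labels cannot be split by a line in the original 2-D space,
# but the sign of the middle mapped coordinate separates them.
xor_points = {
    (1.0, 1.0): +1,
    (-1.0, -1.0): +1,
    (1.0, -1.0): -1,
    (-1.0, 1.0): -1,
}
separable = all(
    (feature_map(p)[1] > 0) == (label > 0) for p, label in xor_points.items()
)
```

In practice the explicit map is never computed; a kernel-based classifier only needs the pairwise kernel values.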

SVMs are advantageous for high-dimensional genomic data:
o Ability to deal with all markers without any pre-pruning or selection
o Ability to account for complex relationships among markers

However, SVMs are black-box approaches that only provide a classification, and it is difficult to extract more information from them.

Another potential classifier is kernel logistic regression (KLR). KLR offers a natural estimate of the class probability and fits naturally with other probabilistic approaches. The loss functions of KLR (logistic loss) and SVM (hinge loss) are actually very similar, and the methods have similar expected performance; the significant differences lie in their applications.
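To make the comparison between the two classifiers concrete, the sketch below evaluates the standard textbook losses as a function of the margin y * f(x), with labels y in {-1, +1}: the hinge loss normally associated with SVMs and the logistic loss normally associated with logistic regression. The function names are hypothetical, and the snippet illustrates only the loss shapes and the probability output, not a full KLR or SVM fit.

```python
from math import exp, log


def hinge_loss(margin):
    """SVM-style loss: zero once the margin reaches 1, linear below that."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Logistic-regression-style loss: smooth and never exactly zero."""
    return log(1.0 + exp(-margin))


def klr_probability(score):
    """Logistic link turning a decision score f(x) into P(y = +1 | x)."""
    return 1.0 / (1.0 + exp(-score))


# Both losses penalize negative margins roughly linearly and decay
# toward zero as the margin grows.
for m in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  logistic={logistic_loss(m):.3f}")
```

The probability function is what distinguishes the two in use: a logistic model yields a risk estimate directly, whereas a plain SVM yields only a class label or a signed score.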

Many strategies can be adopted to improve whole-genome risk prediction. One strategy is to exploit the block structure underlying genomic data. Using kernel-based methods does not necessarily improve prediction drastically.

Multiple Kernel Learning (MKL)

Instead of selecting a single fixed kernel, multiple kernel learning (MKL) uses multiple candidate kernels to map the data into another space. MKL can achieve better performance by finding optimal weights for each base kernel.

MKL seeks to build a composite kernel as a linear combination of different kernels, where model complexity is controlled by regularization. Applications of MKL to genomic data are currently limited but are expected to increase alongside large-scale genomics data.
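The composite kernel can be sketched as a non-negatively weighted sum of base kernel matrices. In the snippet below the two base matrices and the weights are hypothetical and fixed by hand. An actual MKL procedure would learn the weights under a regularization constraint, which is not shown here.

```python
def combine_kernels(kernel_matrices, weights):
    """Return sum_m weights[m] * kernel_matrices[m] for same-sized square matrices."""
    if len(kernel_matrices) != len(weights):
        raise ValueError("need exactly one weight per base kernel")
    if any(w < 0 for w in weights):
        # Negative weights can break positive semidefiniteness.
        raise ValueError("weights must be non-negative")
    n = len(kernel_matrices[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernel_matrices)) for j in range(n)]
        for i in range(n)
    ]


def quadratic_form(K, v):
    """Compute v^T K v, which is >= 0 for every v when K is a valid kernel matrix."""
    n = len(K)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))


# Hypothetical base kernels over the same 3 subjects, e.g. one built from
# SNP data and one from gene expression data.
K_snp = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
K_expr = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

K_combined = combine_kernels([K_snp, K_expr], [0.7, 0.3])

# Spot check only, not a proof of positive semidefiniteness.
check = quadratic_form(K_combined, [1.0, -1.0, 1.0])
```

Because each base kernel can come from a different data type, the same combination step also serves as a simple form of kernel-based data fusion.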

Genomic Data Fusion

A concept closely related to MKL is kernel-based data fusion. Both data fusion and MKL are facilitated by the closure property of kernels (a sum or non-negatively weighted sum of kernels is another valid kernel). Kernel fusion methods allow for the integration of data of different types (gene expression, DNA methylation, CNV, etc.) and structures in function prediction.

Structured Association Mapping

Structured association mapping leverages structural information, unlike traditional test-statistic-based or PCA-based methods. It uses the various structural information present in the genome (phenome and transcriptome) to improve the accuracy of identifying causal variants.

An important source of genome structural information is genome annotations (known binding sites, exon regions, etc.). These data can be treated as prior knowledge about SNPs when searching for disease susceptibility markers. For example, SNPs in highly conserved regions are more likely to have true associations, since conserved regions are functionally important.

Discussion

Kernel machine learning methods offer potential tools for large-scale and high-dimensional data analysis, for genomics in particular. Kernel ML methods can be integrated with classical ML techniques. Future work involves improving scalability with respect to sample size, dimensionality, and data heterogeneity.

Reference

Wang, Xuefeng, Eric P. Xing, and Daniel J. Schaid. "Kernel Methods for Large-Scale Genomic Data Analysis." Briefings in Bioinformatics 16, no. 2 (March 2015): 183-92. https://doi.org/10.1093/bib/bbu024.
