IMPACT: Interval-based Multipass Proteomic Alignment with Constant Traceback

IMPACT: Interval-based Multipass Proteomic Alignment with Constant Traceback

IMPACT: Interval-based Multipass Proteomic Alignment with Constant Traceback Sahand Kashani, Stuart Byma, James Larus 2019/02/16 Phylogenetics The study of evolutionary history between species Image credit: http://www.oceanographerschoice.com/2009/06/wicked-cool-phylogenetic-tree/ AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 2 Evolutionary biology through protein analysis Find relationships between species proteins Cluster similar proteins together Similarity determined through sequence alignment DNA

Proteins Nucleotide chains Amino-acid chains A, C, G, T AACBB'19, 2019/02/16 A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y Sahand Kashani, Stuart Byma, James Larus 3 Alignment runtime extremes Smith-Waterman in sequence length Wide variance in protein lengths Need an algorithm that Works well on short and long proteins

Can be accelerated in hardware 4 orders of magnitude AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 4 Outline Background (optimal vs. heuristic sequence alignment) Hardware-friendly heuristic alignment for DNA sequences [1] Adapting hardware-friendly heuristic alignment to protein sequences Challenges around protein characteristics Algorithmic alignment improvements Quality of results & discussion Future work [1] Darwin: A Genomics Co-processor Provides Up to 15,000x Acceleration on Long Read Assembly [Turakhia, Bejerano, Dally], ASPLOS18 AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus

5 Optimal alignment A Given 2 sequences (query sequence) (reference sequence) 2 C -1 -1 -1 -1 2 G -1 -1 T Cell dependencies: -1 -1 2

-1 -1 -1 -1 2 A T C G Insert gaps in sequences such that and they align Subject to some scoring mechanism A A C G T

C 2 -1 -1 -1 G -1 2 -1 -1 T -1 -1 2 -1 -1 -1 -1 2

Expensive in time & space Compute matrix Store traceback pointers Perform traceback AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 6 Heuristic alignment Left extension Prune search space Take seeds of fixed size from Find seed hits in

Extend hits to the left (and to the right) Perform traceback Good acceleration requires Traceback done in hardware Keep all traceback state in on-chip memory Restricts size of the extension Cannot find long extensions AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 7 Outline Background (optimal vs. heuristic sequence alignment) Hardware-friendly heuristic alignment for DNA sequences [1] Adapting hardware-friendly heuristic alignment to protein sequences Challenges around protein characteristics Algorithmic alignment improvements Quality of results & discussion Future work [1] Darwin: A Genomics Co-processor Provides Up to 15,000x Acceleration on Long Read Assembly

[Turakhia, Bejerano, Dally], ASPLOS18 AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 8 Heuristic hardware-friendly alignment ( , )=(5 , 2) GACT algorithm [1] Overlapping tile-based alignment Tile size, Overlap amount, Algorithm Compute single tile Traceback to within cells of tile border Place next tile at traceback endpoint Constant on-chip traceback state memory Traceback can be done in hardware [1] Darwin: A Genomics Co-processor Provides Up to 15,000x Acceleration on Long Read Assembly [Turakhia, Bejerano, Dally], ASPLOS18 AACBB'19, 2019/02/16

Sahand Kashani, Stuart Byma, James Larus 9 GACT performance Developed for DNA long read assembly Heuristic But finds optimal extensions for Open questions Can GACT be adapted to obtain optimal protein extensions? Are values of applicable to other datasets? AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 10 Outline Background (optimal vs. heuristic sequence alignment) Hardware-friendly heuristic alignment for DNA sequences [1] Adapting hardware-friendly heuristic alignment to protein sequences

Challenges around protein characteristics Algorithmic alignment improvements Quality of results & discussion Future work [1] Darwin: A Genomics Co-processor Provides Up to 15,000x Acceleration on Long Read Assembly [Turakhia, Bejerano, Dally], ASPLOS18 AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 11 Design space exploration DNA Better alignments by increasing only / only / both Proteins 250K alignments

Never reach 100% optimal alignments QoR decreases with high overlap AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 12 Classifying GACT behavior Obtains optimal alignment Premature traceback termination Sensitivity to GACTs placement of current tile Traceback divergence Traceback in previous tile drives placement of current tile Erroneous traceback in previous tile current tile will be misplaced AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 13

Premature traceback termination Proteins are sensitive to tile placement More tile overlap better alignment Occasionally results in worse alignments AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus ( , )=(32, 6) 7) 14 DNA vs. protein substitution matrices A 2 -1 -1 -1 C -1 2 -1 -1 DNA

Homogeneous matrix E.g. +2 for matches E.g. -1 for mismatches G -1 -1 2 -1 T -1 -1 -1 2 A C G T Proteins Heterogeneous matrix All Some matches weigh mismatches weigh High magnitude variations Some amino-acid pairs weigh others

AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 15 GACT hardware resource requirements GACT can be efficiently implemented as a systolic array of Processing Elements (PE) Each PE handles 1 cell in scoring matrix DNA Proteins Homogeneous substitution matrix PE hardware resources Heterogeneous substitution matrix PE hardware resources 1 comparator, 2 registers, 1 multiplexor 1 on-chip memory to store matrix

Large number of PEs can fit on a device Limits number of PEs that can fit on a device Increased GACT alignment throughput Decreased GACT alignment throughput AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 16 Traceback divergence ( , )=(32, 3) Minor traceback error in previous tile Large traceback error in current tile Can solve issue by Increasing

But we want to avoid increasing on-chip memory constraints Tile misplaced What we would like Algorithm that can achieve better alignments with small Quadratically less resources AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 17 Improving the heuristic Multi-pass GACT Idea Traceback drives placement of tiles Increase confidence in previous tiles traceback Reduce probability of divergence in current tile Multiple passes over 2 consecutive tiles

Compute tile 1 Traceback until border Compute tentative tile 2 Recompute tile 1 (using elevated overlap region) Final traceback of tile 1 AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 18 Quality of results & discussion Dataset 11 bacterial genomes ( 25K proteins) A few outliers with length > 30K 250K protein alignments

Results Alignment scores improve between 0% and 29000% On average 14% better (few alignments are very long) Many cases where multi-pass GACT does not work well Large vertical/horizontal stretches in the traceback Tiles placed diagonally cannot cross a large vertical/horizontal stretch AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 19 Outline Background (optimal vs. heuristic sequence alignment) Hardware-friendly heuristic alignment for DNA sequences [1] Adapting hardware-friendly heuristic alignment to protein sequences Challenges around protein characteristics Algorithmic alignment improvements Quality of results & discussion Future work [1] Darwin: A Genomics Co-processor Provides Up to 15,000x Acceleration on Long Read Assembly [Turakhia, Bejerano, Dally], ASPLOS18

AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 20 Seeding a protein-based heuristic aligner Observations Large protein alphabet size Long seed hits improbable Multi-pass GACT works well on relatively diagonal tracebacks Idea Use piecewise alignment algorithm Seeding algorithm can help with this Tile alignment matrix

Compute small local alignments in parallel Find stretches of high-scoring tiles Merge stretches AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 21 Conclusion Protein alignment is compute-intensive hardware-acceleration Long alignments (lengths span 4 orders of magnitude) High on-chip memory resource requirements (large alphabet) Skewed substitution matrices existing heuristic aligners miss alignments Proposed algorithmic improvements for a hardware-friendly aligner Trades linearly more work for quadratically less resources Increases protein sequence alignment scores Avg. 14% Max. 29000% (long proteins) AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus

22

Recently Viewed Presentations

  • Prezentace aplikace PowerPoint - EURION

    Prezentace aplikace PowerPoint - EURION

    Běžné oddělené schéma realizace Public Private Partnership metoda PPP - varianta 1 - projektová společnost Public Private Partnership metoda PPP - varianta 2 - Hlavní dodavatel „fronter" Je privátní sektor ochoten a připraven realizovat PPP projekty ?
  • Spain Builds an American Empire - Weebly

    Spain Builds an American Empire - Weebly

    Spain Builds an American Empire Mr. C's 7th grade social studies. Main Idea-empire building The Voyages of Christopher Columbus prompted the Spanish to establish colonies in The Americas Why it matters now Throughout the Americas, Spanish culture and language, and...
  • A Raisin in the Sun - Ms. Mertz's Class Web Page

    A Raisin in the Sun - Ms. Mertz's Class Web Page

    A Raisin in the Sun By Lorraine Hansberry HISTORICAL CONTEXT Racism and Segregation Jim Crow Laws The Main Idea Transformations in the African American community contributed to a blossoming of black culture centered in Harlem, New York.
  • AspenCat user group - oclc.org

    AspenCat user group - oclc.org

    Zportal & Z39.50. ILLiad. Innovative Integrated Library System . Millennium/Sierra. INN-REACH. Various other ILS. Colorado. VDX (standing for Virtual Document . eXchange
  • Electricity & Magnetism

    Electricity & Magnetism

    The pathway that an electric current follows is called an electric circuit. A circuit is a closed pathway. This closed, or complete pathway, has no gaps or openings. If there is an opening, we call that an open circuit, or...
  • The Most Dangerous Game by Richard Connell Bell

    The Most Dangerous Game by Richard Connell Bell

    Bell Ringer: Freshman Reflection. Turn in "The Interlopers" worksheet - due today Grab a highlighter (to be used later) Since we read "The Rules of High School", what (so far) has been true?
  • #IAS2017 | @IAS_conferenc #FastTrackCities | @IAPAC NAIROBIS FAST-TRACK

    #IAS2017 | @IAS_conferenc #FastTrackCities | @IAPAC NAIROBIS FAST-TRACK

    Highly mobile and unstable population with unclear settlement areas thus making it difficult to follow-up. Data data/reports timeliness and completeness; We still have data quality gaps. Low uptake of EMR. About 60% of population seek health care in both formal...
  • Leksičko prepoznavanje u obradi prirodnih jezika

    Leksičko prepoznavanje u obradi prirodnih jezika

    Morfološki e-rečnik srpskog jezika (SrpMD)Srpski jezik je iz porodice slovenskih jezika - ima bogatu flektivnu i derivacionu morfologiju; Rečnici su u formatu koji je poznat kao LADL - nastao prvo za francuski pod rukovodstvom Morisa Grosa