IMPACT: Interval-based Multipass Proteomic Alignment with Constant Traceback Sahand Kashani, Stuart Byma, James Larus 2019/02/16 Phylogenetics The study of evolutionary history between species Image credit: http://www.oceanographerschoice.com/2009/06/wicked-cool-phylogenetic-tree/ AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 2 Evolutionary biology through protein analysis Find relationships between species proteins Cluster similar proteins together Similarity determined through sequence alignment DNA
Proteins Nucleotide chains Amino-acid chains A, C, G, T AACBB'19, 2019/02/16 A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y Sahand Kashani, Stuart Byma, James Larus 3 Alignment runtime extremes Smith-Waterman in sequence length Wide variance in protein lengths Need an algorithm that Works well on short and long proteins
Can be accelerated in hardware 4 orders of magnitude AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 4 Outline Background (optimal vs. heuristic sequence alignment) Hardware-friendly heuristic alignment for DNA sequences [1] Adapting hardware-friendly heuristic alignment to protein sequences Challenges around protein characteristics Algorithmic alignment improvements Quality of results & discussion Future work [1] Darwin: A Genomics Co-processor Provides Up to 15,000x Acceleration on Long Read Assembly [Turakhia, Bejerano, Dally], ASPLOS18 AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus
5 Optimal alignment A Given 2 sequences (query sequence) (reference sequence) 2 C -1 -1 -1 -1 2 G -1 -1 T Cell dependencies: -1 -1 2
-1 -1 -1 -1 2 A T C G Insert gaps in sequences such that and they align Subject to some scoring mechanism A A C G T
C 2 -1 -1 -1 G -1 2 -1 -1 T -1 -1 2 -1 -1 -1 -1 2
Expensive in time & space Compute matrix Store traceback pointers Perform traceback AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 6 Heuristic alignment Left extension Prune search space Take seeds of fixed size from Find seed hits in
Extend hits to the left (and to the right) Perform traceback Good acceleration requires Traceback done in hardware Keep all traceback state in on-chip memory Restricts size of the extension Cannot find long extensions AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 7 Outline Background (optimal vs. heuristic sequence alignment) Hardware-friendly heuristic alignment for DNA sequences [1] Adapting hardware-friendly heuristic alignment to protein sequences Challenges around protein characteristics Algorithmic alignment improvements Quality of results & discussion Future work [1] Darwin: A Genomics Co-processor Provides Up to 15,000x Acceleration on Long Read Assembly
[Turakhia, Bejerano, Dally], ASPLOS18 AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 8 Heuristic hardware-friendly alignment ( , )=(5 , 2) GACT algorithm [1] Overlapping tile-based alignment Tile size, Overlap amount, Algorithm Compute single tile Traceback to within cells of tile border Place next tile at traceback endpoint Constant on-chip traceback state memory Traceback can be done in hardware [1] Darwin: A Genomics Co-processor Provides Up to 15,000x Acceleration on Long Read Assembly [Turakhia, Bejerano, Dally], ASPLOS18 AACBB'19, 2019/02/16
Sahand Kashani, Stuart Byma, James Larus 9 GACT performance Developed for DNA long read assembly Heuristic But finds optimal extensions for Open questions Can GACT be adapted to obtain optimal protein extensions? Are values of applicable to other datasets? AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 10 Outline Background (optimal vs. heuristic sequence alignment) Hardware-friendly heuristic alignment for DNA sequences [1] Adapting hardware-friendly heuristic alignment to protein sequences
Challenges around protein characteristics Algorithmic alignment improvements Quality of results & discussion Future work [1] Darwin: A Genomics Co-processor Provides Up to 15,000x Acceleration on Long Read Assembly [Turakhia, Bejerano, Dally], ASPLOS18 AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 11 Design space exploration DNA Better alignments by increasing only / only / both Proteins 250K alignments
Never reach 100% optimal alignments QoR decreases with high overlap AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 12 Classifying GACT behavior Obtains optimal alignment Premature traceback termination Sensitivity to GACTs placement of current tile Traceback divergence Traceback in previous tile drives placement of current tile Erroneous traceback in previous tile current tile will be misplaced AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 13
Premature traceback termination Proteins are sensitive to tile placement More tile overlap better alignment Occasionally results in worse alignments AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus ( , )=(32, 6) 7) 14 DNA vs. protein substitution matrices A 2 -1 -1 -1 C -1 2 -1 -1 DNA
Homogeneous matrix E.g. +2 for matches E.g. -1 for mismatches G -1 -1 2 -1 T -1 -1 -1 2 A C G T Proteins Heterogeneous matrix All Some matches weigh mismatches weigh High magnitude variations Some amino-acid pairs weigh others
AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 15 GACT hardware resource requirements GACT can be efficiently implemented as a systolic array of Processing Elements (PE) Each PE handles 1 cell in scoring matrix DNA Proteins Homogeneous substitution matrix PE hardware resources Heterogeneous substitution matrix PE hardware resources 1 comparator, 2 registers, 1 multiplexor 1 on-chip memory to store matrix
Large number of PEs can fit on a device Limits number of PEs that can fit on a device Increased GACT alignment throughput Decreased GACT alignment throughput AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 16 Traceback divergence ( , )=(32, 3) Minor traceback error in previous tile Large traceback error in current tile Can solve issue by Increasing
But we want to avoid increasing on-chip memory constraints Tile misplaced What we would like Algorithm that can achieve better alignments with small Quadratically less resources AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 17 Improving the heuristic Multi-pass GACT Idea Traceback drives placement of tiles Increase confidence in previous tiles traceback Reduce probability of divergence in current tile Multiple passes over 2 consecutive tiles
Compute tile 1 Traceback until border Compute tentative tile 2 Recompute tile 1 (using elevated overlap region) Final traceback of tile 1 AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 18 Quality of results & discussion Dataset 11 bacterial genomes ( 25K proteins) A few outliers with length > 30K 250K protein alignments
Results Alignment scores improve between 0% and 29000% On average 14% better (few alignments are very long) Many cases where multi-pass GACT does not work well Large vertical/horizontal stretches in the traceback Tiles placed diagonally cannot cross a large vertical/horizontal stretch AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 19 Outline Background (optimal vs. heuristic sequence alignment) Hardware-friendly heuristic alignment for DNA sequences [1] Adapting hardware-friendly heuristic alignment to protein sequences Challenges around protein characteristics Algorithmic alignment improvements Quality of results & discussion Future work [1] Darwin: A Genomics Co-processor Provides Up to 15,000x Acceleration on Long Read Assembly [Turakhia, Bejerano, Dally], ASPLOS18
AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 20 Seeding a protein-based heuristic aligner Observations Large protein alphabet size Long seed hits improbable Multi-pass GACT works well on relatively diagonal tracebacks Idea Use piecewise alignment algorithm Seeding algorithm can help with this Tile alignment matrix
Compute small local alignments in parallel Find stretches of high-scoring tiles Merge stretches AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus 21 Conclusion Protein alignment is compute-intensive hardware-acceleration Long alignments (lengths span 4 orders of magnitude) High on-chip memory resource requirements (large alphabet) Skewed substitution matrices existing heuristic aligners miss alignments Proposed algorithmic improvements for a hardware-friendly aligner Trades linearly more work for quadratically less resources Increases protein sequence alignment scores Avg. 14% Max. 29000% (long proteins) AACBB'19, 2019/02/16 Sahand Kashani, Stuart Byma, James Larus
22