Protein Sequence Analysis - Bioinformatics


Protein Sequence Motifs
Aalt-Jan van Dijk
Plant Research International, Wageningen UR
Biometris, Wageningen UR
[email protected]
www.bioinformatics.nl

Plant Bioinformatics
  Genomics
  Next Generation Sequencing
  Genome assembly & annotation
  (Comparative) genome analysis
  SNP analysis, marker development
  Integrated analysis of omics datasets
  Transcriptomics
  Computational infrastructure
  Database development
  Web-based analysis tools
  Software development
  Workflow management systems
  Machine learning
  Proteomics
  Technology
  Data (pre-)processing pipelining
  Alternative splicing
  Protein interactions networks
  Metabolomics
  EST analysis
  Metabolite and pathway identification
  Systems biology
  Network modelling (bottom-up)

My research
  Protein complex structures
  Protein-protein docking
  Correlated mutations
  Interaction site prediction/analysis
  Protein-protein interactions
  Protein-DNA interactions
  Motif search
  Enzyme active sites

Overview
  Protein Motif Searching
  Hydrophobicity & Transmembrane Domains
  Protein Interactions
  Sequence-motifs to predict interaction sites
  Secondary Structure Prediction

Protein Motif Searching

What is a motif?
  A motif is a description of a particular element of a protein that contains a specific sequence pattern.
  Motifs are identified by
    3D structural alignment
    Multiple sequence alignment
    Pattern searching programs

Protein Motif Searching
  Strict consensus pattern: use only strictly conserved residues

    C--QASCDGIPLKMNDC
    C---VTCEGLPMRMDQC
    CERTLGCQPMPVH---C

    CxxxxxCxxxPxxxxxC    (strictly conserved: C, C, P, C)

  But what about:
    variable residues?
    gaps?

  Strict consensus patterns contain
    no alternative residues
    no flexible regions
    no mismatches
    no gaps

Protein Motif Searching
  Most motifs defined as regular expressions
  Motifs can contain
    alternative residues
    flexible regions

    C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C

      CXXXCXGXPXXXXXC
      |   | | |     |
    FGCAKLCAGFPLRRLPCFYG

The PROSITE Syntax
  A-[BC]-X-D(2,5)-{EFG}-H

    A         A
    [BC]      B or C
    X         anything
    D(2,5)    2-5 Ds
    {EFG}     not E, F, or G
    H         H
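The PROSITE syntax maps almost one-to-one onto regular expressions. The following is a minimal sketch of that translation in Python, not the PROSITE implementation: the function name `prosite_to_regex` is hypothetical, and only the elements listed on the slide (single residues, `[...]`, `{...}`, `x` and repeat counts) are handled; sequence anchors and other extensions are ignored.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split an element such as "x(2,5)" into its core "x" and repeat range.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core in ("x", "X"):
            regex = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            regex = core                      # single residue or [ABC] set
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


# The example pattern and sequence from the slide.
regex = prosite_to_regex("C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C")
match = re.search(regex, "FGCAKLCAGFPLRRLPCFYG")
print(regex)                         # C.{2,5}C.[GP].P.{2,5}C
print(match.start(), match.group())  # 2 CAKLCAGFPLRRLPC
```

Note that `re.search` reports only the first, non-overlapping match; scanning a sequence for every occurrence of a short motif would need a loop over start positions or a lookahead.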

Mandatory motifs characterise a protein (super-)family. PROSITE entries for the subtilase family:

  ID   SUBTILASE_ASP; PATTERN.
  DE   Serine proteases, subtilase family, aspartic acid active site.
  PA   [STAIV]-x-[LIVMF]-[LIVM]-D-[DSTA]-G-[LIVMFC]-x(2,3)-[DNH].

  ID   SUBTILASE_HIS; PATTERN.
  DE   Serine proteases, subtilase family, histidine active site.
  PA   H-G-[STM]-x-[VIC]-[STAGC]-[GS]-x-[LIVMA]-[STAGCLV]-[SAGM].

  ID   SUBTILASE_SER; PATTERN.
  DE   Serine proteases, subtilase family, serine active site.
  PA   G-T-S-x-[SA]-x-P-x(2)-[STAVC]-[AG].

Exercise
  Find the three subtilase motifs in PROSITE (prosite.expasy.org).
  Compare the lists of proteins in which the motifs occur: what does this tell you?
  Similarly, compare protein structures in which the motifs occur.
  Have a look at the sequence logo.

Protein Motif Searching
  Some motifs occur frequently in proteins, but the feature they describe may not actually be present, such as post-translational modification sites:

  ID   ASN_GLYCOSYLATION; PATTERN.
  DE   N-glycosylation site.
  PA   N-{P}-[ST]-{P}.

Exercise
  Use a glycosylation site predictor such as http://www.cbs.dtu.dk/services/NetNGlyc/
  Input: your favorite set of sequences
  Do you observe that some N-{P}-[ST] sites are likely to be glycosylated and others not?

Profiles
  Many motifs cannot be easily defined using simple patterns.
  Such motifs can be defined using profiles.
  A profile is constructed from a multiple sequence alignment.
  For each position, each amino acid is given a score depending on how likely it is to occur.

Calculating a Profile
  For each alignment position: take the (weighted) average of the appropriate rows from the scoring matrix.

  An (extremely simple) example:

    seq_01  AAAAAAAAAAW
    seq_02  AAAAAAAAAWW
    seq_03  AAAAAAAAWWW
    seq_04  AAAAAAAWWWW
    seq_05  AAAAAAWWWWW
    seq_06  AAAAAWWWWWW
    seq_07  AAAAWWWWWWW
    seq_08  AAAWWWWWWWW
    seq_09  AAWWWWWWWWW
    seq_10  AWWWWWWWWWW

  Excerpt from the EBLOSUM62 matrix:

        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
    W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3

  Profile scores for three alignment columns (10A, 5A+5W, 10W):

    Residue    10A   5A+5W     10W
    A          4.0     1.0    -3.0
    C          0.0    -2.0    -2.0
    D         -2.0    -6.0    -4.0
    E         -1.0    -4.0    -3.0
    F         -2.0    -1.0     1.0
    G          0.0    -2.0    -2.0
    H         -2.0    -4.0    -2.0
    I         -1.0    -4.0    -3.0
    K         -1.0    -4.0    -3.0
    L         -1.0    -3.0    -2.0
    M         -1.0    -2.0    -1.0
    N         -2.0    -6.0    -4.0
    P         -1.0    -5.0    -4.0
    Q         -1.0    -3.0    -2.0
    R         -1.0    -4.0    -3.0
    S          1.0    -2.0    -3.0
    T          0.0    -2.0    -2.0
    V          0.0    -3.0    -3.0
    W         -3.0     8.0    11.0
    Y         -2.0     0.0     2.0

  prophecy (EMBOSS), using Henikoff profile type, and BLOSUM62 matrix

Pattern Searching
  Short linear motifs: e.g. http://dilimot.russelllab.org/
  Profiles: meme http://meme.sdsc.edu/meme/cgi-bin/meme.cgi

Exercise
  Use a number of sequences which contain the PROSITE subtilase motif and find motifs in those sequences with MEME.
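Returning to the profile calculation above: the sketch below takes one alignment column and computes, for every amino acid, the frequency-weighted average of the matching BLOSUM62 rows. It is an illustration under stated assumptions, not a reimplementation of prophecy: the function name `profile_column` is hypothetical, only the A and W rows quoted in the excerpt are included, and no Henikoff sequence weighting is applied. The scaling also differs from the prophecy output quoted above: a plain average gives 0.5 for A in the 5A+5W column, whereas the listed value 1.0 equals the unscaled sum of the two rows (4 + -3).

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Only the two BLOSUM62 rows quoted in the excerpt; a real profile
# calculation would need all twenty rows.
BLOSUM62_ROWS = {
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "W": [-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1, 1, -4, -3, -2, 11, 2, -3],
}


def profile_column(column: str) -> dict:
    """Score every amino acid for one alignment column.

    Each score is the average of the scoring-matrix rows of the residues
    observed in the column, weighted by how often each residue occurs.
    """
    counts = Counter(column)
    total = sum(counts.values())
    scores = {}
    for idx, aa in enumerate(AMINO_ACIDS):
        weighted = sum(n * BLOSUM62_ROWS[res][idx] for res, n in counts.items())
        scores[aa] = weighted / total
    return scores


all_a = profile_column("AAAAAAAAAA")   # the 10A column
mixed = profile_column("AAAAAWWWWW")   # the 5A+5W column
print(all_a["A"], mixed["A"], mixed["W"])  # 4.0 0.5 4.0
```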

Hydropathy Plot
  Prediction of hydrophobic and hydrophilic regions in a protein

Partition Coefficients
  [Figure: partitioning between oil and water; labels: Hydrophilic, Hydrophobic, Oil, Water]

Hydrophobicity/Hydrophilicity Values
  (residues ordered from hydrophilic to hydrophobic)

    Residue  Fauchere & Pliska  Kyte & Doolittle  Hopp & Woods  Eisenberg
    R             -1.37              -4.50             3.00        -2.53
    K             -1.35              -3.90             3.00        -1.50
    D             -1.05              -3.50             3.00        -0.90
    Q             -0.78              -3.50             0.20        -0.85
    N             -0.85              -3.50             0.20        -0.78
    E             -0.87              -3.50             3.00        -0.74
    H             -0.40              -3.20            -0.50        -0.40
    S             -0.18              -0.80             0.30        -0.18
    T             -0.05              -0.70            -0.40        -0.05
    P              0.12              -1.60             0.00         0.12
    Y              0.26              -1.30            -2.30         0.26
    C              0.29               2.50            -1.00         0.29
    G              0.48              -0.40             0.00         0.48
    A              0.62               1.80            -0.50         0.62
    M              0.64               1.90            -1.30         0.64
    W              0.81              -0.90            -3.40         0.81
    L              1.06               3.80            -1.80         1.06
    V              1.08               4.20            -1.50         1.08
    F              1.19               2.80            -2.50         1.19
    I              1.38               4.50            -1.80         1.38

Hydrophobicity Plot
  Sum the amino acid hydrophobicity values in a given window (and divide by the window length)
  Plot the value in the middle of the window
  Shift the window one position

  $$H_i = \frac{1}{2k+1} \sum_{n=i-k}^{i+k} H_n$$

Sliding Window Approach
  Calculate property for first sub-sequence
  Use the result (plot/print/store)
  Move to next residue position, and repeat

Hydrophobicity Plot
  Example sequence: MEZCALTASTESVERYNICE
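The sliding-window average can be sketched in a few lines of Python. This is a minimal illustration, assuming the Kyte & Doolittle values from the table above; the function name `hydropathy_profile` and the default `k=2` (window of 5) are arbitrary choices, not taken from the slides. The example sequence contains the ambiguity code Z (Glu or Gln); both have the value -3.5 on this scale, so Z is assigned -3.5 here as an assumption.

```python
import math

# Kyte & Doolittle hydropathy values (from the table in the slides).
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "Q": -3.5, "N": -3.5, "E": -3.5,
    "H": -3.2, "S": -0.8, "T": -0.7, "P": -1.6, "Y": -1.3, "C": 2.5,
    "G": -0.4, "A": 1.8, "M": 1.9, "W": -0.9, "L": 3.8, "V": 4.2,
    "F": 2.8, "I": 4.5,
    "Z": -3.5,  # assumption: Z = Glu or Gln, both -3.5 on this scale
}


def hydropathy_profile(seq, k=2, scale=KYTE_DOOLITTLE):
    """Return (position, mean hydropathy) for every full window of 2k+1 residues.

    Positions are 1-based and refer to the residue in the middle of the window.
    """
    values = [scale[aa] for aa in seq]
    width = 2 * k + 1
    profile = []
    for i in range(k, len(seq) - k):
        window = values[i - k:i + k + 1]
        profile.append((i + 1, sum(window) / width))
    return profile


profile = hydropathy_profile("MEZCALTASTESVERYNICE", k=2)
for position, value in profile:
    print(position, round(value, 2))
# The first line printed is "3 -0.16": the window M, E, Z, C, A centred on residue 3.
assert math.isclose(profile[0][1], -0.16)
```

Only full windows are scored, so the first and last k residues get no value; padding or shrinking the window at the ends are alternative design choices.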

  [Figure: hydrophobicity plot for MEZCALTASTESVERYNICE, built up one window position at a time; x-axis: residue position (0 to 21); y-axis: hydrophobicity (-2 to 1.5)]

Transmembrane Regions
  Rotation is 100 degrees per amino acid
  Climb is 1.5 Angstrom per amino acid residue

Transmembrane Regions
  Membrane: 30 Angstrom
  So we need approx. 30 / 1.5 = 20 amino acids to span the membrane

  Adapting the window size to the size of the membrane spanning segment makes the picture easier to interpret

  [Figure: hydrophobicity plots with window = 1, window = 9, window = 19, window = 121]
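The helix arithmetic behind the transmembrane estimate can be written out as a short worked calculation. Only the rotation (100 degrees), the rise (1.5 Angstrom) and the 30 Angstrom span come from the slides; the residues-per-turn and rise-per-turn lines are derived from them.

```latex
\begin{align*}
\text{residues per turn} &= \frac{360^\circ}{100^\circ} = 3.6 \\
\text{rise per turn} &= 3.6 \times 1.5\,\text{\AA} = 5.4\,\text{\AA} \\
\text{residues to span the membrane} &= \frac{30\,\text{\AA}}{1.5\,\text{\AA}} = 20
\end{align*}
```

A window of 19 is an odd window length close to this estimate, so that the plotted value sits on a middle residue.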

Protein Interactions
  Obligatory: hemoglobin
  Transient: mitochondrial Cu transporters

Experimental approaches (1)
  Yeast two-hybrid (Y2H)

Experimental approaches (2)
  Affinity Purification + mass spectrometry (APMS)

Interaction Databases
  STRING http://string.embl.de/
  HPRD http://www.hprd.org/
  InteroPorc http://biodev.extra.cea.fr/interoporc/Default.aspx
  Many others. E.g. see http://nar.oxfordjournals.org/content/39/suppl_1.toc

Yeast protein interaction network
  [Figure: yeast protein interaction network]

Sequence-based Protein Binding Site Prediction

Binding site
  [Figure: binding site]

Predefined motifs
  [Figure: predefined motifs]

Motif search in groups of proteins
  Group proteins which have same interaction partner
  Use motif search, e.g. find PWMs
  Neduva, PLoS Biol 2005

Correlated Motif Search

    Interactors        Non-interactors
    AARLL  PLTEQ       AARLL  MARLT
    MARLT  DLTEP       VVRLM  MARLT
    VVRLM  MMTER       PLTEQ  DLTEP

  Correlated Motif Pair: (RL,TE)

Experimental validation
  Van Dijk et al, PLoS Comp Biol 2010

New approach: slider
  Faster approach: genome-wide searching for interaction motifs
  Improve mining algorithm with a priori biological knowledge (conservation score, surface accessibility)
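The idea of a correlated motif pair can be illustrated with a brute-force sketch on the toy example above. This is not the published algorithm: the function names are hypothetical, motifs are restricted to exact 2-mers, and a pair is reported only if it occurs in every interacting pair and in no non-interacting pair. The grouping of the toy sequences into pairs is itself an assumption read off the slide layout.

```python
from itertools import product


def kmers(seq, k=2):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def support(motif_pair, protein_pairs, k=2):
    """Count protein pairs in which one protein has motif a and the other motif b."""
    a, b = motif_pair
    count = 0
    for p, q in protein_pairs:
        kp, kq = kmers(p, k), kmers(q, k)
        if (a in kp and b in kq) or (a in kq and b in kp):
            count += 1
    return count


def correlated_motif_pairs(interacting, non_interacting, k=2):
    """Motif pairs found in all interacting pairs and in no non-interacting pair."""
    candidates = set()
    for p, q in interacting:
        for a, b in product(kmers(p, k), kmers(q, k)):
            candidates.add(tuple(sorted((a, b))))
    return [
        pair for pair in sorted(candidates)
        if support(pair, interacting, k) == len(interacting)
        and support(pair, non_interacting, k) == 0
    ]


# Toy data from the slide (pairing assumed from the layout).
interactors = [("AARLL", "PLTEQ"), ("MARLT", "DLTEP"), ("VVRLM", "MMTER")]
non_interactors = [("AARLL", "MARLT"), ("VVRLM", "MARLT"), ("PLTEQ", "DLTEP")]

print(correlated_motif_pairs(interactors, non_interactors))  # [('RL', 'TE')]
```

Exhaustive enumeration like this scales poorly with motif length and network size, which is the motivation for faster, heuristic searches.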

  Reference for slider: Boyen et al, IEEE/ACM Trans Comput Biol Bioinform. 2011

THE END.. Questions?

Secondary Structure Prediction
  Traditional methods (statistical and/or rule-based), e.g.

  Garnier, Osguthorpe & Robson (GOR)
  Statistical method
  Accuracy ~ 60%

GOR Helix Parameters

         i-8  i-7  i-6  i-5  i-4  i-3  i-2  i-1    i  i+1  i+2  i+3  i+4  i+5  i+6  i+7  i+8
  Gly     -5  -10  -15  -20  -30  -40  -50  -60  -86  -60  -50  -40  -30  -20  -15  -10   -5
  Ala      5   10   15   20   30   40   50   60   65   60   50   40   30   20   15   10    5
  Val      0    0    0    0    0    0    5   10   14   10    5    0    0    0    0    0    0
  Leu      0    5   10   15   20   25   28   30   32   30   28   25   20   15   10    5    0
  Ile      5   10   15   20   25   20   15   10    6    0  -10  -15  -20  -25  -20  -10   -5
  Ser      0   -5  -10  -15  -20  -25  -30  -35  -39  -35  -30  -25  -20  -15  -10   -5    0
  Thr      0    0    0   -5  -10  -15  -20  -25  -26  -25  -20  -15  -10   -5    0    0    0
  Asp      0   -5  -10  -15  -20  -15  -10    0    5   10   15   20   20   20   15   10    5
  Glu      0    0    0    0   10   20   60   70   78   78   78   78   78   70   60   40   20
  Asn      0    0    0    0  -10  -20  -30  -40  -51  -40  -30  -20  -10    0    0    0    0
  Gln      0    0    0    0    5   10   20   20   10  -10  -20  -20  -10   -5    0    0    0
  Lys     20   40   50   55   60   60   50   30   23   10    5    0    0    0    0    0    0
  His     10   20   30   40   50   50   50   30   12  -20  -10    0    0    0    0    0    0
  Arg      0    0    0    0    0    0    0    0   -9  -15  -20  -30  -40  -50  -50  -30  -10
  Phe      0    0    0    0    0    5   10   15   16   15   10    5    0    0    0    0    0
  Tyr     -5  -10  -15  -20  -25  -30  -35  -40  -45  -40  -35  -30  -25  -20  -15  -10   -5
  Trp    -10  -20  -40  -50  -50  -10    0   10   12   10    0  -10  -50  -50  -40  -20  -10
  Cys      0    0    0    0    0    0   -5  -10  -13  -10   -5    0    0    0    0    0    0
  Met     10   20   25   30   35   40   45   50   53   50   45   40   35   30   25   20   10
  Pro    -10  -20  -40  -60  -80 -100 -120 -140  -77  -60  -30  -20  -10    0    0    0    0

  [Slide: the same table laid over the example sequence I S G A R N I E R H E L I X P R E D I C T]

GOR Prediction
  [Figure: GOR prediction output; labels: beta sheet, helix]
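How the helix parameter table is applied can be sketched as follows. This is a minimal illustration and not a full GOR predictor: it assumes the usual reading of the table, in which the helix score of position i is the sum of the values contributed by the residues at positions i-8 to i+8, each taken from the column for its offset. The function name `helix_scores` is hypothetical, residues not in the table (such as X) contribute 0, and no sheet, turn or coil tables are compared, since only the helix table is given in the slides.

```python
# GOR helix parameters from the table above; columns run from i-8 to i+8.
GOR_HELIX = {
    "G": [-5, -10, -15, -20, -30, -40, -50, -60, -86, -60, -50, -40, -30, -20, -15, -10, -5],
    "A": [5, 10, 15, 20, 30, 40, 50, 60, 65, 60, 50, 40, 30, 20, 15, 10, 5],
    "V": [0, 0, 0, 0, 0, 0, 5, 10, 14, 10, 5, 0, 0, 0, 0, 0, 0],
    "L": [0, 5, 10, 15, 20, 25, 28, 30, 32, 30, 28, 25, 20, 15, 10, 5, 0],
    "I": [5, 10, 15, 20, 25, 20, 15, 10, 6, 0, -10, -15, -20, -25, -20, -10, -5],
    "S": [0, -5, -10, -15, -20, -25, -30, -35, -39, -35, -30, -25, -20, -15, -10, -5, 0],
    "T": [0, 0, 0, -5, -10, -15, -20, -25, -26, -25, -20, -15, -10, -5, 0, 0, 0],
    "D": [0, -5, -10, -15, -20, -15, -10, 0, 5, 10, 15, 20, 20, 20, 15, 10, 5],
    "E": [0, 0, 0, 0, 10, 20, 60, 70, 78, 78, 78, 78, 78, 70, 60, 40, 20],
    "N": [0, 0, 0, 0, -10, -20, -30, -40, -51, -40, -30, -20, -10, 0, 0, 0, 0],
    "Q": [0, 0, 0, 0, 5, 10, 20, 20, 10, -10, -20, -20, -10, -5, 0, 0, 0],
    "K": [20, 40, 50, 55, 60, 60, 50, 30, 23, 10, 5, 0, 0, 0, 0, 0, 0],
    "H": [10, 20, 30, 40, 50, 50, 50, 30, 12, -20, -10, 0, 0, 0, 0, 0, 0],
    "R": [0, 0, 0, 0, 0, 0, 0, 0, -9, -15, -20, -30, -40, -50, -50, -30, -10],
    "F": [0, 0, 0, 0, 0, 5, 10, 15, 16, 15, 10, 5, 0, 0, 0, 0, 0],
    "Y": [-5, -10, -15, -20, -25, -30, -35, -40, -45, -40, -35, -30, -25, -20, -15, -10, -5],
    "W": [-10, -20, -40, -50, -50, -10, 0, 10, 12, 10, 0, -10, -50, -50, -40, -20, -10],
    "C": [0, 0, 0, 0, 0, 0, -5, -10, -13, -10, -5, 0, 0, 0, 0, 0, 0],
    "M": [10, 20, 25, 30, 35, 40, 45, 50, 53, 50, 45, 40, 35, 30, 25, 20, 10],
    "P": [-10, -20, -40, -60, -80, -100, -120, -140, -77, -60, -30, -20, -10, 0, 0, 0, 0],
}
NO_INFORMATION = [0] * 17  # used for residues missing from the table, e.g. X


def helix_scores(seq):
    """Sum, for each position i, the helix parameters of residues i-8 .. i+8."""
    scores = []
    for i in range(len(seq)):
        total = 0
        # Neighbours beyond the ends of the sequence simply contribute nothing.
        for j in range(max(0, i - 8), min(len(seq), i + 9)):
            offset = j - i                      # -8 .. +8
            total += GOR_HELIX.get(seq[j], NO_INFORMATION)[offset + 8]
        scores.append(total)
    return scores


for residue, score in zip("ISGARNIERHELIXPREDICT", helix_scores("ISGARNIERHELIXPREDICT")):
    print(residue, score)
```

In a complete predictor, the helix score at each position would be compared with the corresponding scores for the other conformational states, and the highest one chosen.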

Secondary Structure Prediction
  Recent methods
    Neural networks = flexible statistics
    Multiple alignments = variability
    Heuristics = common sense
    Or a combination of the above
  Accuracy ~ 70%

Heuristics
  Conserved parts are structurally and/or functionally important
  Segments with many gaps must be in loop regions

Secondary Structure Prediction Strategy
  Use as many methods as possible
  Use homologous sequences
  Combine predictions into consensus prediction
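Combining predictions into a consensus can be as simple as a per-position majority vote. The sketch below assumes that each method returns a string of the same length over the states H (helix), E (strand) and C (coil); the function name `consensus_prediction` and the rule of falling back to coil on a tie are illustrative assumptions, not part of the slides.

```python
from collections import Counter


def consensus_prediction(predictions, tie_state="C"):
    """Majority vote per position over equally long prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    consensus = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        best_state, best_count = ranked[0]
        tied = [state for state, count in ranked if count == best_count]
        # A clear winner is used as is; ties fall back to tie_state.
        consensus.append(best_state if len(tied) == 1 else tie_state)
    return "".join(consensus)


methods = [
    "CHHHHHHCCEEEC",
    "CCHHHHHCCEEEC",
    "CHHHHHCCEEEEC",
]
print(consensus_prediction(methods))  # CHHHHHHCCEEEC
```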

Why can't it be 100% correct?
  All current 2D prediction schemes are based upon observation of occurrence of 2D elements in 3D structures
  Deduction of 2D elements from structures is ambiguous!
  DSSP, Stride, and the PDB (human) annotation do not always agree upon the assigned elements

  [Figure: Do these residues still belong to the helix?]
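The disagreement between assignment methods can be quantified as the fraction of residues that receive the same state. The sketch below is an illustration only: the two assignment strings are made up, the function names are hypothetical, and the reduction of DSSP-style letters to three states (H, G, I to helix; E, B to strand; everything else to coil) is one common convention, assumed here.

```python
# Assumed reduction of DSSP-style states to three classes.
THREE_STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_states(assignment):
    """Map a per-residue assignment string onto the states H, E and C."""
    return "".join(THREE_STATE.get(state, "C") for state in assignment)


def agreement(assignment_a, assignment_b):
    """Fraction of positions at which two reduced assignments agree."""
    a, b = reduce_states(assignment_a), reduce_states(assignment_b)
    if len(a) != len(b) or not a:
        raise ValueError("assignments must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical assignments of the same six residues by two methods:
# they differ only on whether the fourth residue still belongs to the helix.
print(agreement("HHHHTT", "HHHTTT"))
```

An agreement below 1.0 between reference assignments puts an upper bound on the accuracy any predictor can reach when evaluated against just one of them.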
