Automatic Extraction of Gene and Protein Synonyms from ...

ISMB 2003 presentation

Extracting Synonymous Gene and Protein Terms from Biological Literature
Hong Yu and Eugene Agichtein
Dept. of Computer Science, Columbia University, New York, USA
{hongyu, eugene} 212-939-7028

Significance and Introduction
  • Genes and proteins are often associated with multiple names, e.g. Apo3, DR3, TRAMP, LARD, and lymphocyte associated receptor of death
  • Authors often use different synonyms
  • Information extraction benefits from identifying those synonyms
  • Synonym knowledge sources are not complete
  • Goal: develop automated approaches for identifying gene/protein synonyms from literature

Background: synonym identification
  • Semantically related words
    - Distributional similarity [Lin 98] [Li and Abe 98] [Dagan et al 95]
    - Example: beer and wine (shared context words: drink, people, bottle, and make)
  • Mapping abbreviations to full forms
    - Example: map LARD to lymphocyte associated receptor of death
    - [Bowden et al. 98] [Hisamitsu and Niwa 98] [Liu and Friedman 03] [Pakhomov 02] [Park and Byrd 01] [Schwartz and Hearst 03] [Yoshida et al. 00] [Yu et al. 02]
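The abbreviation-mapping example above (LARD and lymphocyte associated receptor of death) can be illustrated with a deliberately naive initial-letter check. This is only a sketch: the function name `matches_initials` and the stopword list are hypothetical, and the cited abbreviation systems use considerably richer matching rules.

```python
# Hypothetical helper, not taken from any of the cited systems.
# Assumption: short function words do not contribute a letter.
STOPWORDS = {"of", "the", "and", "in", "for"}


def matches_initials(abbreviation: str, full_form: str) -> bool:
    """Return True if the abbreviation spells the initials of the content words."""
    words = [w for w in full_form.lower().split() if w not in STOPWORDS]
    initials = "".join(w[0] for w in words)
    return initials == abbreviation.lower()


lard_ok = matches_initials("LARD", "lymphocyte associated receptor of death")
tramp_ok = matches_initials("TRAMP", "lymphocyte associated receptor of death")
```

Here `lard_ok` is true and `tramp_ok` is false, which also shows the limit of the approach: TRAMP is a synonym of the same gene, but no abbreviation rule can recover that, which is why synonym extraction needs other evidence.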

Methods for detecting biomedical multiword synonyms
  • Sharing a word(s) [Hole 00]
    - cerebrospinal fluid
    - cerebrospinal fluid protein assay
  • Information retrieval approach
    - Trigram matching algorithm [Wilbur and Kim 01]
    - Vector space model
    - cerebrospinal fluid -> cer, ere, ..., uid
    - cerebrospinal fluid protein assay -> cer, ere, ..., say

Background: synonym identification
  • GPE [Yu et al 02]
    - A rule-based approach for detecting synonymous gene/protein terms
    - Manually recognize patterns authors use to list synonyms
    - Extract synonym candidates and apply heuristics to filter out unrelated terms
    - Example: Apo3/TRAMP/WSL/DR3/LARD (synonyms) versus ng/kg/min (unrelated terms)
  • Advantages and disadvantages
    - High precision (90%)
    - Recall might be low; expensive to build

Background: machine learning

  • Machine learning reduces manual effort by automatically acquiring rules from data
  • Unsupervised and supervised
  • Semi-supervised
    - Bootstrapping [Hearst 92, Yarowsky 95] [Agichtein and Gravano 00]
    - Hyponym detection [Hearst 92]: "The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." yields "A Bambara ndang is a kind of bow lute"
    - Co-training [Blum and Mitchell 98]

Method: outline
  • Machine learning
    - Unsupervised: similarity [Dagan et al 95]
    - Semi-supervised: bootstrapping, SNOWBALL [Agichtein and Gravano 02]
    - Supervised: Support Vector Machine
  • Comparison between machine learning and GPE
  • Combined approach

Method: unsupervised

  • Contextual similarity [Dagan et al 95]
  • Hypothesis: synonyms have similar surrounding words
  • Mutual information:
      I(t, w) = \log_2 \frac{N \cdot \mathrm{freq}(t, w)}{d \cdot \mathrm{freq}(t) \, \mathrm{freq}(w)}
  • Similarity:
      \mathrm{sim}(t_1, t_2) = \frac{\sum_{w \in \mathrm{lexicon}} \left[ \min(I(w, t_1), I(w, t_2)) + \min(I(t_1, w), I(t_2, w)) \right]}{\sum_{w \in \mathrm{lexicon}} \left[ \max(I(w, t_1), I(w, t_2)) + \max(I(t_1, w), I(t_2, w)) \right]}

Method: semi-supervised
  • SNOWBALL [Agichtein and Gravano 02]
  • Bootstrapping: starts with a small set of user-provided seed tuples for the relation, then automatically generates and evaluates patterns for extracting new tuples
    - Tuples: {Apo3, DR3} {LARD, Apo3} {DR3, LARD}
    - Sentences: "Apo3, also known as DR3"; "DR3, also called LARD"
    - Patterns: ", also called"; ", also known as"

Method: supervised
  • Support Vector Machine
    - State-of-the-art text classification method
    - SVMlight
  • Training sets: the same sets of positive and negative tuples as SNOWBALL
  • Features: the same terms and term weights used by SNOWBALL
  • Kernel function: radial basis function (RBF) kernel

Method: combined
  • Rationale
    - Machine-learning approaches increase recall
    - The manual rule-based approach GPE has high precision with lower recall
    - Combining them will boost both recall and precision
  • Method
    - Assume each system is an independent predictor
    - Prob = 1 - Prob(all systems extracted incorrectly)

Evaluation: data

  • GeneWays corpora [Friedman et al 01]
    - 52,000 full-text journal articles
    - Science, Nature, Cell, EMBO, Cell Biology, PNAS, Journal of Biochemistry
  • Preprocessing
    - Gene/protein named-entity tagging: Abgene [Tanabe and Wilbur 02]
    - Segmentation: SentenceSplitter
  • Training and testing
    - 20,000 articles for training (tuning SNOWBALL parameters such as the context window)
    - 32,000 articles for testing

Evaluation: metrics
  • Estimating precision
    - Randomly select 20 synonym pairs with confidence scores in the ranges 0.0-0.1, 0.1-0.2, ..., 0.9-1.0
    - Biological experts judged the correctness of the synonym pairs
  • Estimating recall
    - SWISSPROT gold standard
    - 989 pairs of SWISSPROT synonyms co-appear in at least one sentence in the test set
    - Biological experts judged that 588 pairs were indeed synonyms
    - "... and cdc47, cdc21, and mis5 form another complex, which relatively weakly associates with mcm2"

Results
  • Patterns SNOWBALL found
    [Table: columns Conf, Left, Middle, Right; pattern confidences 0.75, 0.54, 0.47]
  • Of 148 evaluated synonym pairs, 62 (42%) were not listed as synonyms in SWISSPROT

Results
  [Figure: recall versus score for Snowball, SVM, Similarity, GPE, and Combined]

Results
  [Figure: precision versus recall for Snowball, SVM, GPE, and Combined]

Results: system performance

    System       Time
    Tagging      35 min
    Similarity   7 h
    Snowball     40 min
    SVM          2 h
    GPE          1.5 h

Conclusions
  • Extraction techniques can be used as a valuable supplement to resources such as SWISSPROT
  • Extraction of synonym relations can be automated through machine-learning approaches
  • SNOWBALL can be applied successfully for recognizing the patterns
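The trigram matching idea from the multiword-synonym slide (cerebrospinal fluid -> cer, ere, ..., uid) can be sketched as character trigrams compared in a vector space. This is a minimal illustration, assuming raw trigram counts, cosine similarity, and trigrams that may span spaces; the function names are hypothetical and the weighting used by [Wilbur and Kim 01] is not reproduced.

```python
import math
from collections import Counter


def char_trigrams(term: str) -> Counter:
    """Count overlapping character trigrams (spaces included, an assumption)."""
    text = term.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram-count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


short = char_trigrams("cerebrospinal fluid")
long_ = char_trigrams("cerebrospinal fluid protein assay")
other = char_trigrams("lymphocyte associated receptor of death")

related_score = cosine(short, long_)
unrelated_score = cosine(short, other)
```

With these inputs `related_score` is higher than `unrelated_score`, because the two cerebrospinal terms share most of their trigrams.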
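The contextual similarity of the unsupervised method (mutual information between a term and its neighbouring words, then a min/max ratio over the lexicon) can be sketched as below. Several points are assumptions rather than details from the slides: N is taken as the corpus size in tokens, d as the context window size, I(x, y) is computed from ordered co-occurrences (x followed by y within d tokens), unseen or negatively associated pairs are floored at 0, and the token sequence is a made-up toy corpus.

```python
import math
from collections import Counter


def count_pairs(tokens, d):
    """Count term frequencies and ordered co-occurrences within d tokens."""
    freq = Counter(tokens)
    pair_freq = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + d]:
            pair_freq[(left, right)] += 1
    return freq, pair_freq


def mutual_information(x, y, freq, pair_freq, n, d):
    """I(x, y) = log2(N * freq(x, y) / (d * freq(x) * freq(y))), floored at 0."""
    joint = pair_freq.get((x, y), 0)
    if joint == 0:
        return 0.0  # assumption: unseen pairs contribute nothing
    return max(0.0, math.log2(n * joint / (d * freq[x] * freq[y])))


def similarity(t1, t2, lexicon, freq, pair_freq, n, d):
    """Ratio of summed minima to summed maxima of left and right context MI."""
    num = den = 0.0
    for w in lexicon:
        before = (mutual_information(w, t1, freq, pair_freq, n, d),
                  mutual_information(w, t2, freq, pair_freq, n, d))
        after = (mutual_information(t1, w, freq, pair_freq, n, d),
                 mutual_information(t2, w, freq, pair_freq, n, d))
        num += min(before) + min(after)
        den += max(before) + max(after)
    return num / den if den else 0.0


# Toy corpus (invented for illustration only).
tokens = "apo3 binds tradd . dr3 binds tradd . p53 activates p21 .".split()
d = 1
freq, pair_freq = count_pairs(tokens, d)
n = len(tokens)
lexicon = set(tokens)

sim_synonyms = similarity("apo3", "dr3", lexicon, freq, pair_freq, n, d)
sim_unrelated = similarity("apo3", "p53", lexicon, freq, pair_freq, n, d)
```

In the toy corpus apo3 and dr3 share the right-hand context word "binds", so `sim_synonyms` is positive, while apo3 and p53 share no context and `sim_unrelated` is 0.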
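The bootstrapping step of the semi-supervised method (seed tuples generate patterns, patterns extract new tuples) can be illustrated with a heavily simplified sketch. It is not the SNOWBALL implementation: it keeps only the text between the two terms as a pattern, skips pattern and tuple confidence scoring entirely, and assumes the gene names are already tagged. The function names, the 30-character limit, and the third sentence are hypothetical.

```python
import re
from itertools import permutations


def learn_middle_patterns(sentences, seed_pairs, max_gap=30):
    """Collect the text found between the two terms of each seed pair."""
    patterns = set()
    for a, b in seed_pairs:
        for x, y in ((a, b), (b, a)):
            regex = re.compile(
                re.escape(x) + r"(.{1,%d}?)" % max_gap + re.escape(y)
            )
            for sentence in sentences:
                match = regex.search(sentence)
                if match:
                    patterns.add(match.group(1))
    return patterns


def apply_patterns(sentences, patterns, gene_names):
    """Return unordered pairs of tagged names joined by a learned pattern."""
    found = set()
    for sentence in sentences:
        for x, y in permutations(gene_names, 2):
            if any(x + p + y in sentence for p in patterns):
                found.add(frozenset((x, y)))
    return found


# The first two sentences follow the slide; the third is invented.
sentences = [
    "Apo3, also known as DR3, is a receptor.",
    "DR3, also called LARD, is expressed in lymphocytes.",
    "TRAMP, also known as WSL, was cloned.",
]
seeds = [("Apo3", "DR3"), ("DR3", "LARD")]
gene_names = ["Apo3", "DR3", "LARD", "TRAMP", "WSL"]  # assumed tagger output

patterns = learn_middle_patterns(sentences, seeds)
new_pairs = apply_patterns(sentences, patterns, gene_names) - {
    frozenset(pair) for pair in seeds
}
```

The seeds yield the two middle patterns shown on the slide, and applying them recovers the pair {TRAMP, WSL} that was not among the seeds. A real system would iterate this loop and score patterns before trusting them.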
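The combination rule from the combined method (each system is an independent predictor, and Prob = 1 - Prob that all systems extracted incorrectly) amounts to a noisy-OR over the per-system confidences. A minimal sketch follows; the function name and the example scores are hypothetical, and treating a system that did not extract the pair as confidence 0 is an assumption.

```python
from math import prod


def combined_confidence(scores):
    """Noisy-OR: 1 minus the probability that every system is wrong."""
    return 1.0 - prod(1.0 - p for p in scores)


# Hypothetical confidences for one candidate synonym pair from
# Snowball, SVM, and GPE (GPE did not extract the pair, so 0.0).
example = combined_confidence([0.5, 0.5, 0.0])
```

For the example scores, the probability that both extracting systems are wrong is 0.5 × 0.5 = 0.25, so the combined confidence is 0.75. Because every factor is at most 1, adding a system can only raise the combined score, which matches the slide's expectation that combining boosts recall.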
