Bilingual Terminology Mining - IIT Bombay

Bilingual Terminology Mining
Guide: Prof. Pushpak Bhattacharyya
By: Munish Minia (07D05016), Priyank Sharma (07D05017)

Content
  • Introduction
  • Multilingual Terminology Mining Chain
  • Term Extraction
  • Term Alignment
  • Direct Context-Vector Method
  • Translation of Lexical Units
  • Linguistic Resources
  • Comparable Corpora
  • Bilingual Dictionary
  • Conclusion

Introduction
  • Text mining research generally adopts a "big is beautiful" approach, justified by the large amount of data that statistical or stochastic methods require [1].
  • Hypothesis: in terminology mining, the quality of the corpus matters more than its quantity.
  • The Web is used as a comparable corpus.

  • The comparability of the corpus should be based not only on the domain or sub-domain, but also on the type of discourse.

[1] Manning and Schütze, 1999

Multilingual Terminology Mining Chain
[Figure: architecture of the chain. Labels: Web; source language documents; target language documents; terminology extraction (one per language); lexical alignment process; terms to be translated; bilingual dictionary; translated terms.]

Architecture
  • A comparable corpus is taken as input.
  • The output is a list of single- and multi-word candidate terms along with their candidate translations.
  • Processes involved:
      • Term Extraction
      • Term Alignment
      • Direct Context-Vector Method
      • Translation of lexical units

Term Extraction
  • The terminological units extracted are multi-word terms (MWTs) whose syntactic patterns, expressed using part-of-speech (POS) tags, correspond to either a canonical structure or a variation structure.
  • Main patterns for French: N N, N Prep N, and N Adj.
  • Main patterns for Japanese: N N, N Suff, Adj N, and Pref N.
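The pattern-based extraction above can be illustrated with a minimal sketch. It assumes the text has already been tokenized and POS-tagged, and that the tags have been reduced to the simplified labels used on the slide (N, Prep, Adj, ...). The function name `extract_candidates` and the toy input are hypothetical, and variant handling is not covered.

```python
# Simplified POS patterns copied from the slide; the tag names are illustrative.
FRENCH_PATTERNS = [("N", "N"), ("N", "Prep", "N"), ("N", "Adj")]
JAPANESE_PATTERNS = [("N", "N"), ("N", "Suff"), ("Adj", "N"), ("Pref", "N")]


def extract_candidates(tagged, patterns):
    """Return every token span whose tag sequence equals one of the patterns.

    `tagged` is a list of (token, tag) pairs (hypothetical input format).
    """
    tags = [tag for _, tag in tagged]
    candidates = []
    for start in range(len(tagged)):
        for pattern in patterns:
            end = start + len(pattern)
            # Slices that run past the end are shorter than the pattern,
            # so they can never compare equal to it.
            if tuple(tags[start:end]) == tuple(pattern):
                candidates.append(" ".join(tok for tok, _ in tagged[start:end]))
    return candidates


# Toy, hand-tagged input (assumption: a real chain would use a POS tagger).
phrase = [("syndrome", "N"), ("de", "Prep"), ("fatigue", "N"), ("chronique", "Adj")]
french_terms = extract_candidates(phrase, FRENCH_PATTERNS)
print(french_terms)
```

On this toy phrase the sketch picks up both the N Prep N span and the N Adj span; a real extractor would also have to group variants with their base forms.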

  • Variants handled:
      • Morphological, for both French and Japanese
      • Syntactical, for French
      • Compounding, for Japanese

Variants handled in French
  • Morphological variant: morphological modification of one of the components of the base form.
  • Syntactical variant: insertion of another word into the components of the base form.
  • Compounding variant: agglutination of another word to one of the components of the base form.

Example: sécrétion d'insuline (insulin secretion)
  • Base form: N Prep N pattern
  • Morphological variant: sécrétions d'insuline (insulin secretions)
  • Syntactic variant: sécrétion pancréatique d'insuline (pancreatic insulin secretion)
  • Syntactic variant: sécrétions de peptide et d'insuline (insulin and peptide secretion)

Variants handled in Japanese
Example: the MWT for "insulin secretion" appears in the following forms:

  • Base form: N N pattern
  • Compounding variant: agglutination of a word at the end of the base form (insulin secretion ability)

Term Alignment
  • Aligns source MWTs with target single-word terms (SWTs) or MWTs.

Term Alignment: Direct Context-Vector Method
  1. For each lexical unit i of the source and target languages, collect all lexical units in the context of i, within a window of n words around i.
  2. Obtain a context vector Vi, which gathers the set of co-occurring units j, each associated with the number of times that j and i occur together.
  3. Normalize the context vectors, using mutual information or log-likelihood.
  4. Using the bilingual dictionary, translate the lexical units of the source context vector.
  5. For a word to be translated, compute the similarity between its translated context vector and all target vectors using a vector distance.
  6. The candidate translations of a lexical unit are the target lexical units closest to the translated context vector according to the vector distance.

Term Alignment: Translation of Lexical Units
  • Depends on the coverage of the bilingual dictionary.
  • If the bilingual dictionary provides several translations for a lexical unit, all of them are considered, weighted by their frequency in the target language.
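The direct context-vector method and the frequency weighting of dictionary translations can be sketched roughly as follows. This is a toy illustration under several assumptions: raw co-occurrence counts are used (the mutual information or log-likelihood normalization is left out), cosine similarity stands in for the unspecified "vector distance", and all function names, the tiny corpora and the dictionary are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Count, for each unit i, the units j seen within `window` words of i."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def translate_vector(vector, dictionary, target_freq):
    """Map a source vector into the target language.

    When a unit has several translations, its count is split between them
    in proportion to their frequency in the target corpus.
    """
    translated = Counter()
    for unit, count in vector.items():
        options = dictionary.get(unit, [])
        total = sum(target_freq.get(t, 0) for t in options)
        if total == 0:
            continue  # unit not covered by the dictionary or target corpus
        for t in options:
            translated[t] += count * target_freq.get(t, 0) / total
    return translated


def cosine(u, v):
    """Cosine similarity (an assumed choice of vector distance)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(word, src_vecs, tgt_vecs, dictionary, target_freq, top_n=3):
    """Return the target units closest to the translated vector of `word`."""
    translated = translate_vector(src_vecs[word], dictionary, target_freq)
    scored = [(cosine(translated, vec), cand) for cand, vec in tgt_vecs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(cand, score) for score, cand in scored[:top_n]]


# Toy comparable corpora and dictionary (all hypothetical).
source = [["insuline", "diminue", "glucose"], ["insuline", "stimule", "absorption"]]
target = [
    ["insulin", "lowers", "glucose"],
    ["insulin", "stimulates", "uptake"],
    ["exercise", "reduces", "weight"],
]
dictionary = {
    "diminue": ["lowers", "reduces"],
    "glucose": ["glucose"],
    "stimule": ["stimulates"],
    "absorption": ["uptake", "absorption"],
}
target_freq = Counter(tok for sent in target for tok in sent)
ranked = rank_candidates(
    "insuline", context_vectors(source), context_vectors(target),
    dictionary, target_freq,
)
print(ranked)
```

The word to be translated ("insuline") is deliberately absent from the toy dictionary: its candidate translation is found only through the translated context, which is the point of the method.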

  • For an MWT, possible translations are generated using a compositional method.
  • If not all components of an MWT can be translated, the MWT is not taken into account in the translation process.

Composition methods for French and Japanese

For Japanese
Example: fatigue chronique (chronic fatigue). Four translations are possible for fatigue and two for chronique. All combinations [2] of the translated elements are generated, and those that refer to an existing MWT in the target language are selected.
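The generate-and-filter step just described can be sketched as below. English glosses stand in for the Japanese translations, and trying every word order of each combination is an assumption made here so that an N Adj source term can match an Adj N target term. The function name, the dictionary and the set of target terms are hypothetical.

```python
from itertools import permutations, product


def compose_translations(components, dictionary, target_terms):
    """Combine component translations and keep those attested as target MWTs."""
    options = []
    for component in components:
        translations = dictionary.get(component)
        if not translations:
            # One component cannot be translated: the MWT is skipped.
            return []
        options.append(translations)

    kept = []
    for combo in product(*options):
        # Assumption: every word order is tried for each combination.
        for ordering in permutations(combo):
            candidate = " ".join(ordering)
            if candidate in target_terms and candidate not in kept:
                kept.append(candidate)
    return kept


# Placeholder data: four translations for "fatigue", two for "chronique".
toy_dictionary = {
    "fatigue": ["fatigue", "tiredness", "weariness", "exhaustion"],
    "chronique": ["chronic", "chronicle"],
}
toy_target_terms = {"chronic fatigue", "insulin secretion"}

matches = compose_translations(["fatigue", "chronique"], toy_dictionary, toy_target_terms)
print(matches)
```

Filtering against terms actually extracted from the target corpus is what keeps the number of spurious combinations manageable.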

For French
For a multi-word term of length n [3], all combinations of multi-word unit elements of length less than or equal to n are produced. For example, syndrome de fatigue chronique (chronic fatigue disease) yields four possible combinations:
  1. [syndrome de fatigue chronique]
  2. [syndrome de fatigue] [chronique]
  3. [syndrome] [fatigue chronique]
  4. [syndrome] [fatigue] [chronique]
A subpart of the MWT is translated directly if it is present in the bilingual dictionary.
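The four combinations listed above are the ways of cutting the term into contiguous sub-units. A minimal sketch, assuming the term has been reduced to its content words (the preposition "de" is dropped here) and using the hypothetical function name `segmentations`:

```python
def segmentations(words):
    """Return every way of cutting `words` into contiguous sub-units."""
    if not words:
        return [[]]
    results = []
    for i in range(1, len(words) + 1):
        head = words[:i]
        # Recurse on the remainder and prepend the current head unit.
        for rest in segmentations(words[i:]):
            results.append([head] + rest)
    return results


content_words = ["syndrome", "fatigue", "chronique"]
combos = segmentations(content_words)
for combo in combos:
    print(" ".join("[" + " ".join(unit) + "]" for unit in combo))
```

A term with n content words has 2^(n-1) such segmentations, so the count doubles with each added word.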

Since 90% of the candidate terms provided by term extraction are composed of only two content words, the method is limited to the fourth combination.

[2] Grefenstette, 1999
[3] Robitaille et al., 2006

Comparable Corpora
  • Sets of texts in different languages that are not translations of each other.
  • They share some characteristics or features: topic, period, media, author, discourse.
  • One of the clearest examples is ICE, the International Corpus of English (Greenbaum, 1991): corpora of around one million words each, for many varieties of English around the world.

Bilingual Dictionary

  • A bilingual dictionary, or translation dictionary, is a specialized dictionary used to translate words or phrases from one language to another.
  • Bilingual dictionaries can be:
      • Unidirectional: they list the meanings of words of one language in another.
      • Bidirectional: they allow translation to and from both languages.

Conclusion
  • The more frequent a term and its translation, the better the quality of the alignment.
  • Categorizing documents by discourse increases the precision of lexical acquisition.
  • Taking discourse into account yields candidate translations of better quality, even though the corpus size is reduced by half.
  • Reducing the corpus gives rise to a data sparsity problem.
  • The data sparsity problem can be partially solved by using comparable corpora of high quality.

References
  • http://en.wikipedia.org/wiki/Bilingual_dictionary
  • http://www.ilc.cnr.it/EAGLES/corpustyp/node21.html
  • Emmanuel Morin, Beatrice Daille, Koichi Takeuchi, and Kyo Kageura (2007). Bilingual terminology mining - using brain, not brawn comparable corpora.
  • X. Saralegi, I. San Vicente, and A. Gurrutxaga. Automatic Extraction of Bilingual Terms from Comparable Corpora in a Popular Science Domain.

Thank You
