NCMMSC01, 20-22 Nov. 2001, Shenzhen, China

Mandarin Pronunciation Variation Modeling

Thomas Fang Zheng
Center of Speech Technology
State Key Lab of Intelligent Technology and Systems
Department of Computer Science & Technology
Tsinghua University

Slide 2: Motivation
- In spontaneous speech, the pronunciations of individual words vary; there are often
  - sound changes, and
  - phone changes.
  Changes include insertion, deletion, and substitution.
- For Chinese, there is an additional accent problem: even when people are speaking Mandarin, pronunciation varies with their dialect backgrounds (Chinese has 7 major dialects), together with
  - colloquialism, grammar, and style.
- Goal: modeling the pronunciation variations

  - Establishing a corpus with spontaneous phenomena, because we need to know what the canonical phones change to
  - Finding solutions to pronunciation modeling, both theoretically and practically

Slide 3: Overview

Authors | Paper | Database | Method | WER (%)
T. Fukada, Y. Sagisaka (ATR, Japan) | Automatic generation of a pronunciation dictionary based on a pronunciation network (EuroSpeech97) | Japanese Spontaneous | ANN Prediction | 75.54 → 67.44
M-K Liu, Bo Xu (NLPR, China) | Mandarin accent adaptation based on CI/CD pronunciation modeling (ICASSP2000) | Shanghai Accent (Intel) | Confusion Matrix | 45.13 → 40.24
M. Saraclar (CLSP, JHU), H. Nock (CUED, Cam., UK) | Pronunciation modeling by sharing Gaussian densities across phonetic models (EuroSpeech99) | Switchboard | Gaussian Sharing | 50.10 → 48.70
K. Ma, G. Zavaliagkos (GTE / BBN, USA) | Pronunciation modeling for large vocabulary conversational speech recognition (ICSLP98) | Switchboard, Callhome | Lexical Adaptation | 54.60 → 53.49
M. Riley (AT&T Labs), W. Byrne (CLSP, JHU) | Stochastic pronunciation modelling from hand-labelled phonetic corpora (Speech Communication, 1999 (29)) | TIMIT + ICSI | Decision Tree | 44.66 → 44.05
D. Povey, P.C. Woodland (CUED, Cambridge, UK) | Improved discriminative training techniques for large vocabulary continuous speech recognition (ICASSP2001) | NAB, Switchboard | Discriminant Training | 46.60 → 44.30
T. Hain, P.C. Woodland (CUED, Cambridge, UK) | New features in the CU-HTK system for transcription of conversational telephone speech (ICASSP2001) | NIST Hub5E (Telephone) | VTLN, MMIE | 51.60 → 47.00

Slide 4: Necessity to establish a new annotated spontaneous speech corpus
- The existing databases (incl. Broadcast News, CallHome, CallFriend, ...) do not cover all the Chinese spoken language phenomena
  - Sound changes: voiced, unvoiced, nasalization, ...
  - Phone changes: retroflexed, OOV-phoneme, ...
- The existing databases do not contain pronunciation variation information for use in bootstrap training
- A Chinese Annotated Spontaneous Speech (CASS) Corpus was established before WS00 on LSP at JHU
  - Completely spontaneous (discourses, lectures, ...)
  - Remarkable background noise, accent background, ...
  - Recorded onto tapes and then digitized

Slide 5: Chinese Annotated Spontaneous Speech (CASS) Corpus

CASS w/ Five-Tier Transcription

- Character level: base form
- Syllable (or Pinyin) level (w/ tone): base form
- Initial/Final (IF) level: w/ time boundary for the base form
- SAMPA-C level: surface form
- Miscellaneous level: used for garbage modeling
  - Lengthening, breathing, laughing, coughing, disfluency, noise, silence, murmur (unclear), modal, smack, non-Chinese

Example
  Character:      (Chinese characters lost in extraction)
  Syllable:       wo3 men0 duo1 ren4 shi0 dian3 ren2
  CASS Syllable:  wo3 men0 duo1 ren4 shi0 dianr3 ren2
  IF:             uo m @_n t uo z` @_n s` i` t iE_n z` @_n
  GIF:            uo @_n t_v uo z` @_n s`_v t_v ia` z` @_n
  Misc:           noise< noise> mum< mum>

Slide 6: SAMPA-C: Machine Readable IPA
- Phonologic consonants: 23
- Phonologic vowels: 9
- Initials: 21
- Finals: 38
- Retroflexed finals: 38
- Tones and silences
- Sound changes
- Spontaneous phenomenon labels
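
The base-form and surface-form tiers of the CASS example can be held side by side per syllable, which makes the changed syllables easy to pick out. The following is a minimal sketch only: the `SyllableTiers` container and its field names are hypothetical (not a CASS file format), and the per-syllable grouping of the IF and GIF tokens is an assumption read off the slide example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SyllableTiers:
    """Tiers of one annotated syllable (hypothetical container, not a CASS format)."""
    syllable: str        # canonical Pinyin with tone (base form)
    cass_syllable: str   # Pinyin as annotated in CASS
    ifs: List[str]       # canonical Initial/Final units (base form)
    gifs: List[str]      # SAMPA-C units actually produced (surface form)

    def is_changed(self) -> bool:
        """True when the surface form differs from the base form."""
        return self.ifs != self.gifs


# Tokens from the slide example; the grouping per syllable is an assumption.
utterance = [
    SyllableTiers("wo3", "wo3", ["uo"], ["uo"]),
    SyllableTiers("men0", "men0", ["m", "@_n"], ["@_n"]),
    SyllableTiers("duo1", "duo1", ["t", "uo"], ["t_v", "uo"]),
    SyllableTiers("ren4", "ren4", ["z`", "@_n"], ["z`", "@_n"]),
    SyllableTiers("shi0", "shi0", ["s`", "i`"], ["s`_v"]),
    SyllableTiers("dian3", "dianr3", ["t", "iE_n"], ["t_v", "ia`"]),
    SyllableTiers("ren2", "ren2", ["z`", "@_n"], ["z`", "@_n"]),
]

# Syllables whose surface form deviates from the canonical IF sequence.
changed = [s.syllable for s in utterance if s.is_changed()]
print(changed)
```

Keeping both tiers in one record means the (base form, surface form) pairs needed for lexicon building can be read directly from the annotation.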

Slide 7: Key Points in PM (1)
- Choosing and generating the speech recognition unit (SRU) set
  - So as to describe the phone changes and sound changes well
  - Could be syllable, semi-syllable, or INITIAL/FINAL
- Constructing a multi-pronunciation lexicon (MPL)
  - A syllable-to-SRU lexicon to reflect the relation between the grammatical units and the acoustic models
- Acoustically modeling spontaneous speech
  - Theoretical framework
  - CD modeling; confusion matrix; data-driven

Slide 8: Key Points in PM (2)
- Customizing the decoding algorithm according to the new lexicon
  - Improved time-synchronous search algorithm to reduce the path expansion (caused by CD modeling)
  - A*-based tree-trellis search algorithm to score multiple pronunciation variations simultaneously in the path
- Modifying the statistical language model

$$W^* = \arg\max_{W} P(X \mid W)\, P(W)$$

$$W^* = \arg\max_{W = \mathrm{Baseform}(V)} P(X \mid V)\, P(V)$$

$$W^* = \arg\max_{W = \mathrm{Baseform}(V)} P(X \mid V)\, P(V \mid W)\, P(W)$$

Slide 9: Establishment of Multi-Pron. Lexicon
- Two major approaches
  - Defined by linguists and phoneticians
  - Data-driven: confusion matrix, rewrite rules, decision tree, ...
- Our method:
  - Find all possible pronunciations in SAMPA-C from the database
  - Reduce the size according to occurrence frequencies
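
The two steps of the lexicon method (collect every observed pronunciation, then prune by frequency) can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' implementation: the function name `build_mpl`, the input format of (base form, surface form) pairs, and the pruning threshold `min_prob` are all hypothetical.

```python
from collections import Counter, defaultdict


def build_mpl(pairs, min_prob=0.02):
    """Build a probabilistic multi-pronunciation lexicon (MPL).

    pairs:    iterable of (baseform, surface_form) observations.
    min_prob: hypothetical pruning threshold on relative frequency.
    Returns {baseform: {surface_form: probability}}, renormalized after pruning.
    """
    counts = defaultdict(Counter)
    for base, surface in pairs:
        counts[base][surface] += 1

    lexicon = {}
    for base, surf_counts in counts.items():
        total = sum(surf_counts.values())
        # Keep only pronunciations that occur often enough.
        kept = {s: c for s, c in surf_counts.items() if c / total >= min_prob}
        if not kept:
            # Threshold too strict: fall back to the most frequent form.
            best, best_count = surf_counts.most_common(1)[0]
            kept = {best: best_count}
        kept_total = sum(kept.values())
        lexicon[base] = {s: c / kept_total for s, c in kept.items()}
    return lexicon


# Toy observations (made-up counts, for illustration only).
pairs = [("ts`", "ts`")] * 7 + [("ts`", "ts`_v")] * 2 + [("ts`", "ts")]
lexicon = build_mpl(pairs, min_prob=0.15)
print(lexicon)
```

Renormalizing after pruning keeps each entry a proper distribution, so the lexicon probabilities can be used directly as pronunciation weights.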

Slide 10: Learning pronunciations
Surface form for IF and Syllable

- Definition of Generalized Initial-Finals (GIFs)

  IF   GIF      Comment
  z    ts       canonical
  z    ts_v     voiced
  z    ts`      changed to zh
  z    ts`_v    changed to voiced zh
  e    7        canonical
  e    7`       retroflexed or changed to er
  e    @        changed

  Collect all of them and choose the most frequent ones as GIFs.

- Definition of Generalized Syllables (GSs)
  - Define them according to the GIF set.
  - Probabilistic lexicon: P([GIF_i] GIF_f | Syllable)

  Syllable   Probability   Pronunciation
  chang      [0.7850]      ts`_h AN
  chang      [0.1215]      ts`_h_v AN
  chang      [0.0280]      ts`_v AN
  chang      [0.0187]      AN
  chang      [0.0187]      z` AN
  chang      [0.0093]      iAN
  chang      [0.0093]      ts_h AN
  chang      [0.0093]      ts`_h UN

Slide 11: Probabilistic Pronunciation Modeling Theory
- Recognizer goal, where P(A|K) is the AM and P(K) is the LM:

$$K^* = \arg\max_{K} P(K \mid A) = \arg\max_{K} P(A \mid K)\, P(K)$$

- Applying the independence assumption:

$$P(A \mid K) = \prod_{n} P(a_n \mid k_n)$$

- Pronunciation modeling part, via introducing the surface form s, where P(a|k, s) is the refined AM and P(s|k) is the output prob.:

$$P(a \mid k) = \sum_{s} P(a \mid k, s)\, P(s \mid k)$$

- Symbols
  - a: acoustic signal, k: IF, s: GIF
  - A, K, S: corresponding strings

Slide 12: Refined Acoustic Modeling (RAM)
- P(a|k, s): RAM
- It cannot be trained directly; the solutions could be:
  - Use P(a|k) instead: IF modeling
  - Use P(a|s) instead: GIF modeling
  - Adapt P(a|k) to P(a|k, s): B-GIF modeling
  - Adapt P(a|s) to P(a|k, s): S-GIF modeling
- The IF-GIF transcription should be generated from the IF and GIF transcriptions
- More data are needed, but the data amount is fixed: use adaptation

  IF (Syllable):           i1 f1 i2 f2 i3 f3
  GIF (SAMPA-C):           gi1 gf1 gi2 gi3 gif4 gf3
  Actual IF:               i1 f1 i2 i3 if4 f3
  IF-GIF transcriptions:   i1-gi1 f1-gf1 i2-gi2 i3-gi3 if4-gif4 f3-gf3
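
The decomposition P(a|k) = Σ_s P(a|k, s) P(s|k) combines the refined acoustic model with the surface-form output probability. A minimal numeric sketch follows, computed in the log domain as acoustic scores usually are. The function name, the dictionary inputs, and the toy numbers are assumptions made for illustration; they are not taken from the slides.

```python
import math


def log_marginal_likelihood(log_ram, sopm):
    """Return log P(a|k) = log sum_s P(a|k,s) * P(s|k) for one IF unit k.

    log_ram: {s: log P(a|k,s)}, refined acoustic log-likelihoods per GIF s.
    sopm:    {s: P(s|k)}, surface-form output probabilities.
    """
    terms = [log_ram[s] + math.log(p) for s, p in sopm.items() if p > 0.0]
    if not terms:
        raise ValueError("no surface form with non-zero probability")
    # Log-sum-exp: subtract the maximum before exponentiating for stability.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# Toy values (made up): two surface forms of one canonical initial.
log_ram = {"ts`": math.log(0.02), "ts`_v": math.log(0.05)}
sopm = {"ts`": 0.8, "ts`_v": 0.2}
log_p = log_marginal_likelihood(log_ram, sopm)
print(math.exp(log_p))
```

With these toy values the marginal is 0.02 × 0.8 + 0.05 × 0.2, so the less likely surface form still contributes to the score of the canonical unit.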

Slide 13: Generate RAM via adaptation technology
- Adapt P(a|k) to P(a|k, s): B-GIF scheme
- Adapt P(a|s) to P(a|k, s): S-GIF scheme

[Diagram labels: IF with GIF1, GIF2, GIF3; GIF with IF1, IF2, IF3]

Slide 14: Probabilistic Pronunciation Modeling Theory (2/2)
- Recognizer goal, where P(A|K) is the AM and P(K) is the LM:

$$K^* = \arg\max_{K} P(K \mid A) = \arg\max_{K} P(A \mid K)\, P(K)$$

- Applying the independence assumption:

$$P(A \mid K) = \prod_{n} P(a_n \mid k_n)$$

- Pronunciation modeling part, where P(a|k, s) is the refined AM and P(s|k) is the output prob.:

$$P(a \mid k) = \sum_{s} P(a \mid k, s)\, P(s \mid k)$$

- Symbols
  - a: acoustic signal, k: IF, s: GIF
  - A, K, S: corresponding strings

Slide 15: Surface-form Output Probability Modeling (SOPM)
- P(s|k): SOPM

- Solution: Direct Output Prob. (DOP) learned from CASS
  - Problem: data sparseness
- Idea: syllable-level data sparseness DOESN'T mean IF/GIF-level data sparseness
- New solution: Context-Dependent Weighting (CDW)

$$P(\mathrm{GIF} \mid \mathrm{IF}) = \sum_{\mathrm{IF}_L} P(\mathrm{GIF} \mid (\mathrm{IF}_L, \mathrm{IF}))\, P(\mathrm{IF}_L \mid \mathrm{IF})$$

  - P(GIF | (IF_L, IF)): GIF output prob. given context
  - P(IF_L | IF): IF transition prob.
  - The above two items can be learned from CASS

Slide 16: Generate SOPM via CDW
- P(S-Syl | B-Syl), with B-Syl = (i, f) and S-Syl = (gi, gf):

$$P(\text{S-Syl} \mid \text{B-Syl}) = P(gi \mid i)\, P(gf \mid f)$$

- CDW:

$$P(\mathrm{GIF} \mid \mathrm{IF}) = \sum_{\mathrm{IF}_L} P(\mathrm{GIF} \mid (\mathrm{IF}_L, \mathrm{IF}))\, P(\mathrm{IF}_L \mid \mathrm{IF})$$

$$Q(\mathrm{GIF} \mid \mathrm{IF}) = \max_{\mathrm{IF}_L} P(\mathrm{GIF} \mid (\mathrm{IF}_L, \mathrm{IF}))\, P(\mathrm{IF}_L \mid \mathrm{IF})$$

$$M_L(\mathrm{GIF} \mid \mathrm{IF}) = P(\mathrm{GIF} \mid (L, \mathrm{IF}))\, P(L \mid \mathrm{IF})$$

- Different estimations of P(S-Syl | B-Syl):
  - P(gi | i) P(gf | f)
  - P(gi | i) Q(gf | f)
  - P(gi | i) M_i(gf | f)

Slide 17: Can CDW be better (1)?
Pronunciation Lexicon's Intrinsic Confusion
- Introduction of the MPL
  - Useful for pronunciation variation modeling, but
  - Enlarges the among-syllable confusion

- The recognition target: IF string
- What we actually get: GIF string
- Even if the GIF recognizer achieves 100%, we cannot get a 100% correct IF string because of the MPL

Slide 18: Can CDW be better (2)?
Pronunciation Lexicon's Intrinsic Confusion (PLIC)
- Reflects the extent of syllable-level intrinsic confusion
- Is the lower bound of the syllable error rate

$$\mathrm{PLIC}(L, W) = \sum_{s \in S} P(s) \left[ 1 - \max_{b \in B} P(b \mid s) \right]$$

- CDW can reduce PLIC

Slide 19: Can CDW be better (3)?
[Figure: PLIC (%) plotted against CPPV (%)]

  CPPV (%)   75     80     85     90     95     100
  EOP        0.42   0.60   0.66   1.52   3.34   12.29
  DOP        0.41   0.51   0.54   0.83   1.16   1.73
  CDW-M      0.26   0.27   0.28   0.36   0.52   0.86

Slide 20: Experiment condition
- The CASS Corpus was used for the experiment
  - Training set: 3 hours of data
  - Testing set: 15 minutes of data
- Feature: MFCC + Δ + ΔΔ + E (with CMN)

- HTK
- Accuracy calculated based on syllable
  - %Cor = Hit / Num * 100%
  - %Acc = (Hit - Ins) / Num * 100%

Slide 21: Experimental results
[Diagram flattened in extraction. Labels and values in source order: CDW; GIF modeling P(a|s); Baseline; P(a|s) P(s|b); SER 5.1%; PronModeling; SER 2.8%; DOP + S-GIF; B-GIF P(a|b, s); SER 3.6%; SER 3.2%; P(a|k, s) P(s|b); SER 6.3%; Syl N-Gram (from BN/CASS) P(B); P(A|B) P(B); SER 10.7%]
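
The CDW estimate P(GIF|IF) = Σ P(GIF|(IF_L, IF)) P(IF_L|IF) and the PLIC measure Σ_s P(s) [1 - max_b P(b|s)] that underlie these comparisons can be sketched in a few lines. This is a sketch under assumptions rather than the authors' code: the function names, the dictionary layouts, the reading of IF_L as the left-context IF, and all toy probabilities are illustrative.

```python
from collections import defaultdict


def cdw_output_prob(gif_given_ctx, ctx_given_if):
    """CDW: P(GIF|IF) = sum over IF_L of P(GIF|(IF_L, IF)) * P(IF_L|IF).

    gif_given_ctx: {(if_left, if_unit): {gif: P(GIF|(IF_L, IF))}}
    ctx_given_if:  {if_unit: {if_left: P(IF_L|IF)}}
    Returns {if_unit: {gif: P(GIF|IF)}}.
    """
    out = defaultdict(lambda: defaultdict(float))
    for if_unit, lefts in ctx_given_if.items():
        for if_left, p_left in lefts.items():
            gifs = gif_given_ctx.get((if_left, if_unit), {})
            for gif, p_gif in gifs.items():
                out[if_unit][gif] += p_gif * p_left
    return {k: dict(v) for k, v in out.items()}


def plic(joint):
    """PLIC = sum_s P(s) * (1 - max_b P(b|s)).

    joint: {(baseform, surface): P(b, s)}, assumed to sum to 1.
    Uses P(s) * (1 - max_b P(b|s)) = P(s) - max_b P(b, s).
    """
    p_s = defaultdict(float)
    best = defaultdict(float)
    for (_, s), p in joint.items():
        p_s[s] += p
        best[s] = max(best[s], p)
    return sum(p_s[s] - best[s] for s in p_s)


# Toy numbers (made up): the final "@_n" after two different initials.
gif_given_ctx = {
    ("m", "@_n"): {"@_n": 0.9, "@": 0.1},
    ("z`", "@_n"): {"@_n": 0.6, "@": 0.4},
}
ctx_given_if = {"@_n": {"m": 0.5, "z`": 0.5}}
sopm = cdw_output_prob(gif_given_ctx, ctx_given_if)

# Toy joint distribution over (base syllable, surface syllable) pairs.
joint = {("a", "x"): 0.5, ("b", "x"): 0.1, ("b", "y"): 0.4}
confusion = plic(joint)
print(sopm, confusion)
```

In the toy lexicon only the surface form "x" is shared by two base forms, so it alone contributes to PLIC; a weighting that moves probability mass away from shared surface forms lowers the measure.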

Slide 22: Question?
- Does it work when more data without phonetic transcription is available?

Slide 23: Using more data w/o IF transcription
- A question: is the above method useful when only a small amount of data with IF transcription is available?
- The answer depends on how we use the data w/o IF transcription.
- Two parts of data:
  - Seed database: data w/ phonetic transcription
  - Extra database: data w/o phonetic transcription

Slide 24: What's the purpose of these two databases?
- Seed database
  - To define the SRU set (surface form)
  - To train initial acoustic models
  - To train initial CDW weights
- Extra database
  - To refine the existing acoustic models
  - To refine the CDW weights

Slide 25: How to use the extra database?
- The problem is that the extra database contains only higher-level transcriptions (say, syllable instead of IF)
- An algorithm is needed to generate the phonetic-level (IF-level) transcription
- Our solution is the iterative forced-alignment based transcription (IFABT) algorithm

Slide 26: Steps for IFABT (1)

- Use the forced-alignment technique and the MPL to decode both the seed database and the extra database
  - To generate the IF-GIF transcription under the constraints of the previous canonical syllable-level transcription
- Use these two databases with IF-GIF transcription
  - To redefine the MPL
  - To retrain the CDW weights
  - To retrain the IF-GIF models
- The above two steps are repeated until the result is satisfactory.

Slide 27: Steps for IFABT (2)
[Figure: forced alignment with HVite. Inputs: the canonical syllable transcription (..., b_{t-1}, b_t, b_{t+1}, ...), the multi-pronunciation lexicon (entries SYL_n, w_{n1}, g^i_{n1} g^f_{n1} through SYL_n, w_{nK_n}, g^i_{nK_n} g^f_{nK_n}), and the acoustic model. Output: the surface-form INITIAL/FINAL (GIF) transcription (..., g^i_{t-1} g^f_{t-1}, g^i_t g^f_t, g^i_{t+1} g^f_{t+1}, ...).]

Automatic generation of the GIF transcription for a database with only the syllable transcription.

Slide 28: Experiments done on CASS-II (1)
- Database
  - Enlarge the database from 3 hrs to 6 hrs, to cover more spontaneous phenomena and provide more training data
  - The additional 3 hrs of data are transcribed only at the canonical syllable level

Slide 29: Experiments done on CASS-II (2)


  Configuration   CASS-I   CASS-I/-II
  CI-IF           30.48    32.81
  CD-IF           42.61    44.98
  + Sharing       43.27    45.12
  IF-GIF          31.84    34.44
  CI-GIF          29.14    33.18
  CD-GIF          42.44    46.07
  + Sharing       43.11    46.30
  CDW-M           33.39    35.85
  CDW-M           32.76    34.28
  CDW-M           43.46    46.92

Slide 30: Summary

- An annotated spontaneous speech corpus is important.
- At the syllable level, the use of GIFs as acoustic models always achieves better results than IFs.
- Either context-dependent modeling or Gaussian density sharing is a good method for pronunciation variation modeling.
- Context-dependent weighting is more useful than Gaussian density sharing for pronunciation modeling, because it can reduce the MPL's PLIC value.
- The IFABT method is helpful when more data with higher-level transcription, yet without phonetic transcription, is available.

Slide 31: References
1. Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, "Mandarin Pronunciation Modeling Based on CASS Corpus," Sino-French Symposium on Speech and Language Processing, pp. 47-53, Oct. 16, 2000, Beijing.
2. Pascale Fung, William Byrne, ZHENG Fang Thomas, Terri Kamm, LIU Yi, SONG Zhanjiang, Veera Venkataramani, and Umar Ruhi, "Pronunciation modeling of Mandarin casual speech," Workshop 2000 on Speech and Language Processing: Final Report for MPM Group, http://www.clsp.jhu.edu/index.shtml
3. Zhanjiang Song, "Research on pronunciation modeling for spontaneous Chinese speech recognition," Ph.D. Dissertation, Tsinghua University, Beijing, China, Apr. 2001.
4. Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, "Modeling Pronunciation Variation Using Context-Dependent Weighting and B/S Refined Acoustic Modeling," EuroSpeech, 1:57-60, Sept. 3-7, 2001, Aalborg, Denmark.
5. Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, "Mandarin Pronunciation Modeling Based on CASS Corpus," to appear in J. Computer Science & Technology.

Slide 32: Announcement (1)
ISCA Tutorial and Research Workshop on Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology
September 14-15, 2002, Colorado, USA
http://www.clsp.jhu.edu/pmla2002/

Slide 33: Announcement (2)
International Joint Conference of SNLP-O-COCOSDA
May 9-11, 2002, Prachuapkirikhan, Thailand
http://kind.siit.tu.ac.th/snlp-o-cocosda2002/ or http://www.links.nectec.or.th/itech/snlp-o-cocosda2002/

Slide 34: Thanks for listening

Thomas Fang Zheng
Center of Speech Technology
State Key Lab of Intelligent Technology and Systems
Department of Computer Science & Technology
Tsinghua University
