Word Recognition of Indic Scripts
Naveen TS
CVIT, IIIT Hyderabad

Introduction
- 22 official languages.
- 100+ languages.
- Language-specific number systems.
- Two major groups: Indo-Aryan and Dravidian.

Optical Character Recognition

OCR Challenges

Challenges due to text editors
- Different editors render the same symbol in different ways.
- Multiple fonts.

Poor/cheap printing technology
- Can cause degradations such as cuts and merges.

Scanning quality

IL Script Complexity
- Matras and similar-looking characters.
- Samyuktakshar.
- UNICODE re-ordering.

Unicode re-ordering
[Figure: Unicode re-ordering; label: Final Output]

OCR Development Challenges
- Word -> symbol segmentation.
- Presence of cuts/merges.
- Development of a strong classifier.
- Efficient post-processor.
- Porting of the technology to develop an OCR for a new language.

Motivation for this Thesis
- Avoiding the tough word -> symbol segmentation.
- Automatic learning of the latent symbol -> UNICODE conversion.
- Common architecture for multiple languages.
- Post-processor development challenges for highly inflectional languages.

OCR DEVELOPMENT

Recognition Architecture
Word Recognizer
- Large # output classes.
- Huge training size.
- Degradation impact minimal.
Symbol Recognizer
- Small # output classes.
- Moderate training size.
- Degradation impact serious.

Limitations of a Character Recognition System
- Difficult to obtain annotated training samples.
- Extracting symbols from words is tough.
- Inability to utilize all available training data: it is extremely difficult to extract all symbols from 5000 pages and annotate them.
- Classifier output (character) -> required output (word) conversion.
- Issues due to degradations (cuts/merges) etc.

Holistic Recognition
[Figure labels: Word Annotation (Word Image, Word Text), Word Recognition System, Evaluation System, Final Output]

[Figure labels: input sequence, features, input layer, forward pass, time steps t and t+1]

Importance of Context
- Small context vs. larger context.
- For a given feature, BLSTM takes into account forward as well as backward context.

BLSTM for Devanagari
Motivation
- No zoning.
- Word recognition.
- Handles a large # of classes.
Naveen Sankaran and C. V. Jawahar. Recognition of Printed Devanagari Text Using BLSTM Neural Network. International Conference on Pattern Recognition (ICPR), 2012.

BLSTM for Devanagari
1. Input image
2. Feature extraction
3. BLSTM network
4. Output class labels (e.g. 35, 64, 55, 105)
5. Class label to Unicode conversion

BLSTM Results
- Trained on 90K words and tested on 67K words.
- Obtained more than 20% improvement in Word Error Rate.
[Table: Char. Error Rate and Word Error Rate for Devanagari; surviving values: 15.13, 43.15, 22.15]
1. D. Arya et al. Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts. ICDAR MOCR Workshop, 2011.

Qualitative Results

Limitations
- Symbol to UNICODE conversion rules are required to generate the final output.
- Huge training time of about 2 weeks.

Recognition as Transcription
- The network learns how to transcribe input features to output labels.
- Target labels are UNICODE.
- No symbol -> UNICODE output mapping.
- Easily scalable to other languages.

Recognition vs. Transcription

Challenges
- Segmentation-free training and testing.
- UNICODE (akshara) training and UNICODE (akshara) testing.
Practical issues:
- Learning with memory (symbol ordering in Unicode).
- Large output label space.
- Scalability to large data sets.
- Efficiency in testing.
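
To make "recognition as transcription" concrete, here is a minimal sketch of how per-frame network outputs could be collapsed into a Unicode string. It assumes a CTC output layer (the Final Network Architecture slide names one) and uses best-path decoding, which is one common choice; the slides do not say which decoding is used. The function name, the blank index, and the toy label table are all hypothetical.

```python
BLANK = 0  # assumed index of the CTC blank label


def ctc_best_path_decode(frame_scores, id_to_unicode):
    """Collapse per-frame label scores into a Unicode string.

    Best-path decoding: take the top label of every frame, merge
    consecutive repeats, then drop blanks.
    """
    decoded = []
    previous = BLANK
    for scores in frame_scores:
        # Index of the highest-scoring label for this frame.
        label = max(range(len(scores)), key=scores.__getitem__)
        if label != BLANK and label != previous:
            decoded.append(id_to_unicode[label])
        previous = label
    return "".join(decoded)


# Toy example: 6 frames over 3 labels (blank, KA, MA).
id_to_unicode = {1: "\u0915", 2: "\u092e"}
frame_scores = [
    [0.10, 0.80, 0.10],  # KA
    [0.10, 0.70, 0.20],  # KA again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank separates two distinct KA symbols
    [0.20, 0.70, 0.10],  # KA
    [0.10, 0.10, 0.80],  # MA
    [0.80, 0.10, 0.10],  # blank
]
word = ctc_best_path_decode(frame_scores, id_to_unicode)
print(word)
```

Because the labels are Unicode code points, the decoded sequence is already the final text and no separate symbol -> UNICODE rule table is needed.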

Training time
Training time increases when:
- # output classes increases
- # features decreases
- # training data increases

Training at Unicode level
- UNICODE training largely reduces the number of classes.

Language    # Unicode   # Symbols
Malayalam   163         215
Tamil       143         212
Telugu      138         359
Kannada     156         352

- UNICODE training can reduce the time taken.
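
As a quick check on the class counts in the table above, the relative reduction in output classes can be computed directly. This is only arithmetic on the listed numbers; how the reduction translates into training time is not quantified in the slides, and the names below are illustrative.

```python
# (number of Unicode classes, number of symbol classes) per language,
# copied from the table above.
CLASS_COUNTS = {
    "Malayalam": (163, 215),
    "Tamil": (143, 212),
    "Telugu": (138, 359),
    "Kannada": (156, 352),
}


def relative_reduction(unicode_classes, symbol_classes):
    """Fraction of output classes removed by training at the Unicode level."""
    return (symbol_classes - unicode_classes) / symbol_classes


reductions = {
    language: relative_reduction(u, s)
    for language, (u, s) in CLASS_COUNTS.items()
}

for language, reduction in reductions.items():
    print(f"{language}: {reduction:.1%} fewer output classes")
```

On these numbers the reduction is largest for Telugu and Kannada, where the symbol inventory is more than twice the Unicode inventory.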

Features
- Each word is split horizontally into two parts.
- 7 features are extracted from the top and bottom halves.
- A sliding window of size 5 pixels is used.
- Feature types listed: binary features, grey features, mean, std. deviation, variance.

Network Configuration
- Learning rate: 0.0009
- Momentum: 0.9
- Number of hidden layers: 1
- Number of nodes in the hidden layer: 100

Final Network Architecture
[Figure labels: input at t=0, input layer, hidden layer, output layer, CTC layer, UNICODE output]

Evaluation & Results

Dataset
- Annotated Multi-lingual Dataset (AMD)
- Annotated DLI Dataset (ADD): 1000 Hindi pages from DLI

Language    No. of Books   No. of Pages
Hindi       33             5000
Malayalam   31             5000

[Results table; surviving values: 5.21, 5.58, -, 13.65, 25.72, -]
1. D. Arya et al. Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts. ICDAR MOCR Workshop, 2011.
2. https://code.google.com/p/tesseract-ocr/

Qualitative Results
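
The sliding-window feature extraction described under Features can be sketched roughly as follows. The 5-pixel window and the top/bottom split come from the slide; everything else is an assumption: the step size, the function name, and the choice of three statistics per half (mean, standard deviation, variance) as a stand-in for the 7 features, which the slides do not enumerate.

```python
from statistics import mean, pstdev, pvariance

WINDOW = 5  # sliding-window width in pixels, as stated on the Features slide


def window_features(image, step=1):
    """Return one feature vector per window position.

    image: list of rows, each a list of grey values (0-255).
    Each window contributes (mean, std. deviation, variance) for the
    top half followed by the same three values for the bottom half.
    This is an illustrative subset, not the exact feature set.
    """
    height, width = len(image), len(image[0])
    middle = height // 2
    sequence = []
    for x in range(0, width - WINDOW + 1, step):
        vector = []
        for half in (image[:middle], image[middle:]):
            pixels = [p for row in half for p in row[x:x + WINDOW]]
            vector += [mean(pixels), pstdev(pixels), pvariance(pixels)]
        sequence.append(vector)
    return sequence


# Toy 4x8 "word image": dark top half, bright bottom half.
image = [[0] * 8, [0] * 8, [255] * 8, [255] * 8]
sequence = window_features(image)
print(len(sequence), "frames of", len(sequence[0]), "features")
```

The resulting list of vectors is the kind of frame sequence a BLSTM consumes, one vector per time step.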

Performance with Degradation
- Added synthetic degradation to words and evaluated them.
[Figure: Degradation Level 1, Degradation Level 2, Degradation Level 3]

Qualitative Results
[Figure: Unicode rearranging]

Error Detection for Indian Languages

Error Detection: Why is it hard?
- Highly inflectional.
- UNICODE vs. akshara.
- Words can be joined to form another valid new word.

Development Challenges
- Availability of a large corpus.
- Percentage of unique words.
[Table columns: Language, Total Words, Unique Words, Average Word Length]

Error Models for IL OCR
Two types of errors are generated by OCR:
- Non-word error: presence of impossible symbols between words. Caused by recognition issues, symbol -> UNICODE mapping issues, etc.
- Real-word error: caused when one valid symbol is recognized as another valid symbol. Mainly caused by confusion among symbols.
[Figure: percentage of words that get converted to another word for a given Hamming distance]

Error Detection Methods
Using a dictionary
- Create a dictionary based on the most frequently occurring words.
- Valid words are those which are present in the dictionary.
- Accuracy depends on dictionary coverage.
Using akshara nGrams
- Generate a symbol (akshara) nGram based dictionary.
- Every word is converted to its associated nGrams.
- The dictionary is generated using these nGrams.
- A word is valid if all of its nGrams are present in the dictionary.
Word and akshara dictionary combination
- First check if the word is present in the dictionary.
- If not, check in the nGram dictionary.
Detection through learning
- Use linear classification methods to classify a word as valid or invalid.
- nGram probabilities are chosen as features.
- An SVM-based binary classifier was trained.
- This model was used to predict whether a word was valid or not.

Evaluation Metrics
- True Positive (TP): our model detects a word as invalid and the annotation agrees.
- False Positive (FP): our model detects a word as invalid but it is actually a valid word.
- False Negative (FN): our model detects a word as valid but it is actually an invalid word.
- True Negative (TN): our model detects a word as valid and the annotation agrees.
- Precision, Recall and F-Score.
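
The combined word and nGram dictionary check, together with the evaluation metrics, can be sketched as follows. This is a minimal illustration under stated assumptions: each character of a string stands in for one akshara (real akshara segmentation of Indic text is more involved), bigrams are used for the nGrams, and all function names and the toy data are hypothetical. "Positive" means "detected as invalid", matching the definitions above.

```python
def ngrams(symbols, n=2):
    """Return the n-grams of a symbol sequence as tuples."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


def build_dictionaries(corpus_words, n=2):
    """Build the word dictionary and the nGram dictionary from a corpus."""
    word_dict = set(corpus_words)
    ngram_dict = {g for word in corpus_words for g in ngrams(word, n)}
    return word_dict, ngram_dict


def is_valid(word, word_dict, ngram_dict, n=2):
    """Word dictionary first; fall back to the nGram dictionary."""
    if word in word_dict:
        return True
    grams = ngrams(word, n)
    return bool(grams) and all(g in ngram_dict for g in grams)


def precision_recall_f(predicted_invalid, actually_invalid):
    """Precision, recall and F-score for the 'invalid' class."""
    pairs = list(zip(predicted_invalid, actually_invalid))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy corpus and toy OCR output with hand-assigned ground truth.
word_dict, ngram_dict = build_dictionaries(["cat", "car", "tar"])
ocr_words = ["cat", "tat", "cxt", "rat"]
actually_invalid = [False, True, True, False]
predicted_invalid = [not is_valid(w, word_dict, ngram_dict) for w in ocr_words]
precision, recall, f_score = precision_recall_f(predicted_invalid, actually_invalid)
print(predicted_invalid, precision, recall, f_score)
```

In the toy data, "tat" slips through because all of its bigrams occur in the corpus (a false negative), while "rat" is flagged only because the corpus does not cover it (a false positive). This is the coverage trade-off the slides describe.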

Dataset
- British National Corpus for English and CIIL corpus for Indian languages.
- Used OCR output from Arya et al. (J-MOCR, ICDAR 2011) for experiments.
- Took 50% of the wrong OCR outputs to train the SVM with negative samples.
- Malayalam dictionary of 670K words and Telugu dictionary of 700K words.

Results
[Table: Method, Malayalam, Telugu; header fragment: TP, FP; surviving values: 0.94 0.64 0.76; Word Dictionary + SVM: 0.69 0.63 0.76 0.95 0.67 0.78]
Table showing Precision, Recall and F-Score values for Malayalam and Telugu.

Conclusion
- A generic OCR framework for multiple Indic scripts.
- Recognition as transcription.
- Holistic recognition with UNICODE output.
- High accuracy without any post-processing.
- Understanding challenges in developing a post-processor for Indic scripts.
- Error detection using machine learning.

Thank You!