sifaka.cs.uiuc.edu

Analysis and Mining of Big Text Data: Opportunities, Challenges, and Applications
ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign, USA

Text data cover all kinds of topics
Topics: People, Events, Products, Services
Sources: Blogs, Microblogs, Forums, Reviews
45M reviews, 53M blogs, 65M msgs/day, 1307M posts, 115M users, 10M groups

Humans as Subjective & Intelligent Sensors
[Figure: sensors sense the real world and report data. Thermometer: weather (3C, 15F). Geo sensor: locations (41N and 120W). Network sensor: networks (01000100011100). Human sensor: perceive, express.]

Unique Value of Text Data
Useful to all big data applications
Especially useful for mining knowledge about people's behavior, attitude, and opinions
Directly express knowledge about our world
Small text data are also useful!
[Figure labels: Data, Information, Knowledge, Text Data]

Opportunities of Text Mining Applications
1. Mining knowledge about language
2. Mining content of text data
3. Mining knowledge about the observer
4. Infer other real-world variables (predictive analytics)
[Figure labels: Real World, Perceive (Perspective), Observed World, Express (English), Text Data, + Context, + Non-Text Data]

Challenges in Understanding Text Data (NLP)
Example sentence: A dog is chasing a boy on the playground
Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
Syntactic analysis (parsing): Noun Phrase, Complex Verb, Noun Phrase, Noun Phrase, Prep Phrase, Verb Phrase, Verb Phrase, Sentence
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: + Scared(x) if Chasing(_,x,_). Scared(b1)
Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back.

NLP is hard!
Natural language is designed to make human communication efficient. As a result,
- we omit a lot of common sense knowledge, which we assume the hearer/reader possesses.
- we keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve.
This makes EVERY step in NLP hard
- Ambiguity is a killer!
- Common sense reasoning is pre-required.

Examples of Challenges
Word-level ambiguity: "root" has multiple meanings (ambiguous sense)
Syntactic ambiguity: "natural language processing" (modification); "A man saw a boy with a telescope." (PP Attachment)
Anaphora resolution: "John persuaded Bill to buy a TV for himself." (himself = John or Bill?)
Presupposition: "He has quit smoking" implies that he smoked before.

The State of the Art: Mostly Relying on Machine Learning
Example sentence: A dog is chasing a boy on the playground
POS Tagging: 97%
Parsing: partial >90%(?)
Semantics: some aspects
- Entity/relation extraction
- Word sense disambiguation
- Sentiment analysis
Inference: ???
Speech act analysis: ???
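The inference step in the dog/boy example can be sketched in a few lines. This is only an illustration, assuming facts are stored as plain tuples; the names `FACTS` and `infer_scared` are hypothetical and not from the slides.

```python
# Facts produced by semantic analysis of
# "A dog is chasing a boy on the playground".
FACTS = {
    ("Dog", "d1"),
    ("Boy", "b1"),
    ("Playground", "p1"),
    ("Chasing", "d1", "b1", "p1"),
}


def infer_scared(facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) once; return derived facts."""
    derived = set()
    for fact in facts:
        if fact[0] == "Chasing" and len(fact) == 4:
            _, _chaser, chased, _place = fact
            derived.add(("Scared", chased))
    return derived


print(infer_scared(FACTS))  # {('Scared', 'b1')}
```

The rule itself is trivial to apply; the hard part, as the slides note, is getting reliable facts and common sense rules out of text in the first place.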

Robust and general NLP tends to be shallow, while deep understanding doesn't scale up.
Grand Challenge: How can we leverage imperfect NLP to build a perfect application?
Answer: Having humans in the loop!

TextScope to enhance human perception
Microscope, Telescope, TextScope

TextScope Interface & Major Text Mining Techniques
[Figure: TextScope interface. Labels: Task Panel, Topic Analyzer, Search Box, MyFilter1, MyFilter2, Opinion, Prediction, Event Radar, Select Time, Select Region, My WorkSpace, Project 1, Alert A, Alert B, ...]
Sample text shown in the interface: Microsoft (MSFT,) Google, IBM (IBM) and other cloud-computing rivals of Amazon Web Services are bracing for an AWS "partnership" announcement with VMware expected to be announced Thursday.

TextScope in Action: interactive decision support
[Figure labels: Real World, Sensor 1 ... Sensor k, Non-Text Data, Text Data, Joint Mining of Non-Text and Text, Multiple Predictors (Features), Predictive Model, Predicted Values of Real World Variables, Optimal Decision Making]

Application Example 1: Aviation Safety
[Figure labels: Real World (Aviation Safety), Sensor 1 ... Sensor k, Non-Text Data, Text Data, Multiple Predictors (Features), Predictive Model, Predicted Values of Real World Variables (Anomalous events & causes), Optimal Decision Making, Aviation Administrator]

Joint Mining of Non-Text and Text

Abundance of text data in the aviation domain
Collecting reports since 1976
>860,000 reports as of Dec. 2009
Monthly intake has been increasing (4k reports/month)
Slide source: http://asrs.arc.nasa.gov/overview/summary.html

Lots of useful knowledge buried in text
ASRS Report ACN: 928983 (Date: 201101, Time: 1801-2400)
We were delayed inbound for about 2 hours and 20 minutes. On the approach there was ice that accumulated on the aircraft. The Captain wrote up The flight crew [who picked up the plane] the following morning notified us of an incorrect remark section write up. I believe a few years ago, there was a different procedure for writing up aborted takeoffs. I think there was some confusion as to what the proper write-up for the aborted takeoff was. A contributing factor for this incorrect entry into the log may have been fatigue. I had personally been awake for about 14 hours and still had another leg to do. Also a contributing factor is that this event does not happen regularly. A more thorough review and adherence to the operations manual section regarding aircraft status would have prevented this, [as well as], a better recognition of the onset of fatigue. The manual is sometimes so large that finding pertinent data is difficult. Even after it was determined that the event had occurred, it took me 15 to 20 minutes to find the section regarding aborted takeoffs.

Event Cube
Analyst
Analysis Support: Multidimensional OLAP, Ranking, Cause Analysis, Topic Summarization/Comparison
Topic Event Cube Representation
[Figure: event cube over a multidimensional text database. Labels: topic (turbulence, birds, undershoot, overshoot; Encounter, Deviation), time (98.01, 98.02, 99.01, 99.02; 1998, 1999), location (LAX, SJC, MIA, AUS; CA, FL, TX), with drill-down and roll-up operations]
Duo Zhang, ChengXiang Zhai, Jiawei Han, Ashok Srivastava, Nikunj Oza. Topic Modeling for OLAP on Multidimensional Text Databases: Topic Cube and its Applications, Statistical Analysis and Data Mining, Vol. 2, pp. 378-395, 2009.

Sample Topic Coverage Comparison
Comparison of distributions of anomalies in FL, TX, and CA
Improper Documentation (Florida)

Turbulence (Texas)

Comparative Analysis of Shaping Factors
Texas, Florida

Application Example 2: Medical & Health
[Figure labels: Real World (Medical & Health), Sensor 1 ... Sensor k, Non-Text Data, Text Data, Joint Mining of Non-Text and Text, Multiple Predictors (Features), Predictive Model, Predicted Values of Real World Variables (Diagnosis, optimal treatment, side effects of drugs), Optimal Decision Making, Doctors, Nurses, Patients]

Overview of Text Mining for Medical/Health Applications
Medical Case Retrieval: How can we find similar medical cases in medical literature, in online forums?
Medical Knowledge Discovery: How can we analyze EHR to discover valuable medical knowledge (e.g., symptom evolution profile of a disease)?
[Figure labels: EHR (Patient Records), Similar Medical Cases, Medical Knowledge, Improved Health Care]

Medical Case Retrieval
Query: Female patient, 25 years old, with fatigue and a swallowing disorder (dysphagia worsening during a meal). The frontal chest X-ray shows opacity with clear contours in contact with the right heart border. Right hilar structures are visible through the mass. The lateral X-ray confirms the presence of a mass in the anterior mediastinum. On CT images, the mass has a relatively homogeneous tissue density.
Find all medical literature articles discussing a similar case
We developed techniques to leverage medical ontology and feedback to improve accuracy.
The UIUC-IBM team was ranked #1 in the ImageCLEF 2010 evaluation.
Parikshit Sondhi, Jimeng Sun, ChengXiang Zhai, Robert Sorrentino and Martin S. Kohn. Leveraging Medical Thesauri and Physician Feedback for Improving Medical Literature Retrieval for Case Queries, Journal of the American Medical Informatics Association, 19(5), 851-858 (2012). doi:10.1136/amiajnl-2011-000293.

Extraction of Symptom Graphs from EHR
[Figure labels: EHR (Patient Records), Multi-Level Symptom Graphs]
Discovery of symptom profiles of diseases
Predict the future onset of a disease (e.g., Congestive Heart Failure) for a patient
Discovered symptoms improve accuracy of prediction by +10%
Parikshit Sondhi, Jimeng Sun, Hanghang Tong, ChengXiang Zhai. SympGraph: A Mining Framework of Clinical Notes through Symptom Relation Graphs, Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'12), pp. 1167-1175, 2012.

Discovery of Adverse Drug Reactions from Forums
Green: Disease symptoms
Blue: Side effect symptoms
Red: Drug
Drug: Cefalexin
ADR: panic attack, faint
Sheng Wang et al. 2014. SideEffectPTM: an unsupervised topic model to mine adverse drug reactions from health forums. In ACM BCB 2014.

Sample ADRs Discovered
Drug (Freq) | Drug Use | Symptoms in Descending Order
Zoloft (84) | antidepressant | weigh gain, weight, depression, side effects, mgs, gain weight, anxiety, nausea, head, brain, pregnancy, pregnant, headaches, depressed, tired
Ativan (33) | anxiety disorders | Ativan, sleep, Seroquel, doc prescribed seroqual, raising blood sugar levels, anti-psychotic drug, diabetic, constipation, diabetes, 10mg, benzo, addicted
Topamax (20) | anticonvulsant | Topmax, liver, side effects, migraines, headaches, weight, Topamax, pdoc, neurologist, supplement, sleep, fatigue, seizures, liver problems, kidney stones
Ephedrine (2) | stimulant | dizziness, stomach, Benadryl, dizzy, tired, lethargic, tapering, tremors, panic attach, head
Unreported to FDA

Mining Traditional Chinese Medicine Patient Records
Collaboration with Beijing TCM Data Center
Clinical warehouse since 2007
More than 300,000 clinical cases from six hospitals
Each hospital has ~ 3 million patient visits
Two lines of work
- Subcategorization of patient records
- TCM knowledge discovery

Beijing TCM Data Center

Subcategorization of Patient Records
Edward W Huang, Sheng Wang, Runshun Zhang, Baoyan Liu, Xuezhong Zhou, and ChengXiang Zhai. PaReCat: Patient Record Subcategorization for Precision Traditional Chinese Medicine. ACM BCB, Oct. 2016.

TCM Knowledge Discovery
10,907 patients TCM records in digestive system treatment
3,000 symptoms, 97 diseases and 652 herbs
Most frequently occurring disease: chronic gastritis
Most frequently occurring symptoms: abdominal pain and chills
Ground truth: 27,285 manually curated herb-symptom relationships.
Sheng Wang, Edward Huang, Runshun Zhang, Xiaoping Zhang, Baoyan Liu, Xuezhong Zhou, and ChengXiang Zhai. A Conditional Probabilistic Model for Joint Analysis of Symptoms, Diagnoses, and Herbs in Traditional Chinese Medicine Patient Records, IEEE BIBM 2016.

Top 10 herb-symptoms relationships
Typical Symptoms of three Diseases
Typical Herbs for three Diseases

Application Example 3: Business intelligence
[Figure labels: Real World (Products), Sensor 1 ... Sensor k, Non-Text Data, Text Data, Joint Mining of Non-Text and Text, Multiple Predictors (Features), Predictive Model, Predicted Values of Real World Variables (Business intelligence, Consumer trends), Optimal Decision Making, Business analysts, Market researcher]

Motivation
How to infer aspect ratings?
How to infer aspect weights?
[Figure labels: Value, Location, Service]
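One simplified way to picture the two motivating questions is as a forward computation: an aspect rating is a weighted sum over the words in that aspect's text segment, and the overall rating is a weighted sum over aspect ratings. The sketch below only illustrates that structure. It is not the authors' model, which learns these quantities from data with a probabilistic regression, and every number and name in it (`TERM_WEIGHTS`, `ASPECT_WEIGHTS`, the functions) is hypothetical.

```python
# Hypothetical word counts per aspect segment of one hotel review.
SEGMENTS = {
    "location": {"location": 1, "amazing": 1, "walk": 1, "anywhere": 1},
    "room": {"room": 1, "nicely": 1, "appointed": 1, "comfortable": 1},
    "service": {"nice": 1, "accommodating": 1, "smile": 1,
                "friendliness": 1, "attentiveness": 1},
}

# Hypothetical per-aspect term weights (sentiment polarity of each word).
TERM_WEIGHTS = {
    "location": {"location": 0.0, "amazing": 2.0, "walk": 0.5, "anywhere": 0.5},
    "room": {"room": 0.0, "nicely": 1.0, "appointed": 1.0, "comfortable": 2.0},
    "service": {"nice": 1.0, "accommodating": 1.0, "smile": 1.0,
                "friendliness": 1.0, "attentiveness": 1.0},
}

# Hypothetical aspect weights: how much this reviewer cares about each aspect.
ASPECT_WEIGHTS = {"location": 0.2, "room": 0.2, "service": 0.6}


def aspect_rating(term_counts, term_weights):
    """Aspect rating as a weighted sum of word counts in the aspect segment."""
    return sum(count * term_weights.get(term, 0.0)
               for term, count in term_counts.items())


def overall_rating(aspect_ratings, aspect_weights):
    """Overall rating as a weighted sum of the aspect ratings."""
    return sum(aspect_weights[aspect] * rating
               for aspect, rating in aspect_ratings.items())


ratings = {aspect: aspect_rating(SEGMENTS[aspect], TERM_WEIGHTS[aspect])
           for aspect in SEGMENTS}
overall = overall_rating(ratings, ASPECT_WEIGHTS)
print(ratings)
print(round(overall, 2))
```

In the real setting only the review text and the overall rating are observed; the term weights, aspect ratings, and aspect weights are latent and have to be inferred.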

Hongning Wang, Yue Lu, ChengXiang Zhai. Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'10), pages 115-124, 2010.

Solving LARA in two stages: Aspect Segmentation + Rating Regression
[Figure: Aspect Segmentation takes reviews + overall ratings (observed) and produces aspect segments; Latent Rating Regression estimates term weights, aspect ratings, and aspect weights (latent!)]
Aspect segments (term:count):
location:1 amazing:1 walk:1 anywhere:1
room:1 nicely:1 appointed:1 comfortable:1
nice:1 accommodating:1 smile:1 friendliness:1 attentiveness:1
Values shown in the figure: 0.0 2.9 0.1 0.9 0.1 1.7 0.1 3.9 2.1 1.2 1.7 2.2 3.9 0.2 4.8 0.2 5.8 0.6

Sample Result 1: Rating Decomposition
Hotels with the same overall rating but different aspect ratings (all 5-star hotels, ground truth in parentheses)
Hotel | Value | Room | Location | Cleanliness
Grand Mirage Resort | 4.2(4.7) | 3.8(3.1) | 4.0(4.2) | 4.1(4.2)
Gold Coast Hotel | 4.3(4.0) | 3.9(3.3) | 3.7(3.1) | 4.2(4.7)
Eurostars Grand Marina Hotel | 3.7(3.8) | 4.4(3.8) | 4.1(4.9) | 4.5(4.8)
Reveal detailed opinions at the aspect level

Sample Result 2: Comparison of reviewers
Reviewer-level Hotel Analysis
Different reviewers' ratings on the same hotel (Hotel Riu Palace Punta Cana)
Reviewer | Value | Room | Location | Cleanliness
Mr.Saturday | 3.7(4.0) | 3.5(4.0) | 3.7(4.0) | 5.8(5.0)
Salsrug | 5.0(5.0) | 3.0(3.0) | 5.0(4.0) | 3.5(4.0)
Reveal differences in opinions of different reviewers

Sample Result 3: Aspect-Specific Sentiment Lexicon
Value | Rooms | Location | Cleanliness
resort 22.80 | view 28.05 | restaurant 24.47 | clean 55.35
value 19.64 | comfortable 23.15 | walk 18.89 | smell 14.38
excellent 19.54 | modern 15.82 | bus 14.32 | linen 14.25
worth 19.20 | quiet 15.37 | beach 14.11 | maintain 13.51
bad -24.09 | carpet -9.88 | wall -11.70 | smelly -0.53
money -11.02 | smell -8.83 | bad -5.40 | urine -0.43
terrible -10.01 | dirty -7.85 | road -2.90 | filthy -0.42
overprice -9.06 | stain -5.85 | website -1.67 | dingy -0.38
Uncover sentimental information directly from the data

Application 1: Discover consumer preferences
Amazon reviews: no guidance
Discovered aspects: battery life, accessory, service, file format, volume, video

Application 2: User Rating Behavior Analysis
Aspect | Expensive Hotel, 5 Stars | Expensive Hotel, 3 Stars | Cheap Hotel, 5 Stars | Cheap Hotel, 1 Star
Value | 0.134 | 0.148 | 0.171 | 0.093
Room | 0.098 | 0.162 | 0.126 | 0.121
Location | 0.171 | 0.074 | 0.161 | 0.082
Cleanliness | 0.081 | 0.163 | 0.116 | 0.294
Service | 0.251 | 0.101 | 0.101 | 0.049
People like expensive hotels because of good service
People like cheap hotels because of good value

Application 3: Personalized Recommendation of Entities
Query: 0.9 value, 0.1 others
Non-Personalized vs. Personalized

Application Example 4: Prediction of Stock Market
[Figure labels: Real World (Events in Real World), Sensor 1 ... Sensor k, Non-Text Data, Text Data, Joint Mining of Non-Text and Text, Multiple Predictors (Features), Predictive Model, Predicted Values of Real World Variables (Market volatility, Stock trends), Optimal Decision Making, Stock traders]

Text Mining for Understanding Time Series
What might have caused the stock market crash?

[Figure: Dow Jones Industrial Average over time, annotated "Sept 11 attack!" (Source: Yahoo Finance)]
Any clues in the companion news stream?

Stock-Correlated Topics in New York Times: June 2000 ~ Dec. 2011
AAMRQ (American Airlines):
russia russian putin
europe european germany
bush gore presidential
police court judge
airlines airport air
united trade terrorism
food foods cheese
nets scott basketball
tennis williams open
awards gay boy
moss minnesota chechnya
AAPL (Apple):
paid notice st
russia russian europe
olympic games olympics
she her ms
oil ford prices
black fashion blacks
computer technology software
internet com web
football giants jets
japan japanese plane
Topics are biased toward each time series
Hyun Duk Kim, Malu Castellanos, Meichun Hsu, ChengXiang Zhai, Thomas A. Rietz, Daniel Diermeier. Mining causal topics in text data: iterative topic modeling with time series feedback, Proceedings of the 22nd ACM international conference on Information and knowledge management (CIKM '13), pp. 885-890, 2013.

Causal Topics in 2000 Presidential Election
Top Three Words in Significant Topics from NY Times:
tax cut 1
screen pataki guiliani
enthusiasm door symbolic
oil energy prices
news w top
pres al vice
love tucker presented
partial abortion privatization
court supreme abortion
gun control nra
Text: NY Times (May 2000 - Oct. 2000)
Time Series: Iowa Electronic Market http://tippie.uiowa.edu/iem/
Issues known to be important in the 2000 presidential election

Information Retrieval with Time Series Query
[Figure: Apple stock price ($, axis 0 to 70) by date, 2000 to 2001, shown alongside news]
Rank | Date | Excerpt
1 | 9/29/2000 | Expect earning will be far below
2 | 12/8/2000 | $4 billion cash in company
3 | 10/19/2000 | Disappointing earning report
4 | 4/19/2001 | Dow and Nasdaq soar after rate cut by Federal Reserve
5 | 7/20/2001 | Apple's new retail store
Hyun Duk Kim, Danila Nikitin, ChengXiang Zhai, Malu Castellanos, and Meichun Hsu. 2013. Information Retrieval with Time Series Query. In Proceedings of the 2013 Conference on the Theory of Information Retrieval (ICTIR '13),
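The idea of using a time series as a query can be illustrated with a small sketch: score each term by how strongly its frequency over time correlates with the price series, then rank terms by the absolute correlation. This uses plain Pearson correlation as a stand-in and is not the ranking method of the cited papers; the data and the names `pearson` and `rank_terms` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_terms(price, term_counts):
    """Rank terms by absolute correlation between their counts and the price."""
    scores = {term: pearson(price, counts) for term, counts in term_counts.items()}
    return sorted(scores, key=lambda term: abs(scores[term]), reverse=True)


# Hypothetical data: a falling price and per-period counts of three terms.
price = [60, 55, 30, 25, 20]
term_counts = {
    "earnings": [0, 1, 5, 6, 7],    # rises as the price falls
    "store": [1, 0, 2, 1, 0],       # weakly related
    "basketball": [3, 3, 3, 3, 3],  # constant, so uncorrelated
}

ranked = rank_terms(price, term_counts)
print(ranked)  # ['earnings', 'store', 'basketball']
```

Correlation alone does not establish that a news topic caused a price move, which is why the slides frame these results as clues for a human analyst.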

Top ranked documents by American Airlines stock price
Rank | Date | Excerpt
1 | 10/22/2001 | Fleeing the war
2 | 12/11/2001 | Us and anti-Taliban forces in Afghanistan
3 | 11/18/2001 | Fate of Taliban Soldiers Under Discussion
4 | 11/12/2001 | Tally and dead and missing in Sep 11 terrorist attacks
5 | 9/25/2001 | Soldiers in Afghanistan
6 | 11/19/2001 | Recover operation at World Trade Center
7 | 11/3/2001 | 4343 died or missing as a result of the attacks on Sep 11
8 | 11/17/2001 | Dead and missing report of Sep 11 attack
All top ranked documents are related to the September 11 terrorist attack

Top ranked relevant documents by Apple stock price
Rank | Date | Excerpt
1 | 9/29/2000 | Fourth-quarter earning far below estimates
2 | 12/8/2000 | $4 billion reserve, not $11 billion
3 | 10/19/2000 | Announced earnings report
4 | 4/29/2001 | Dow and Nasdaq soar after rate cut by Federal Reserve
5 | 7/20/2001 | Apple's new retail stores
6 | 12/6/2000 | Apple warns it will record quarterly loss
7 | 3/24/2001 | Stocks perk up, with Nasdaq posting gain
8 | 8/10/2000 | Mixing Mac and Windows
Retrieved relevant events: Disappointing earning report, store open, etc.

Summary
Human as subjective, intelligent sensor
Maximization of combined intelligence of humans and computers
[Figure: TextScope interface. Labels: Task Panel, Topic Analyzer, Search Box, MyFilter1, MyFilter2, Opinion, Prediction, Event Radar, Select Time, Select Region, My WorkSpace, Project 1, Alert A, Alert B, ...; application areas: Aviation, Medical & Health, E-COM, Stocks; Many other users]
Sample text shown in the interface: Microsoft (MSFT,) Google, IBM (IBM) and other cloud-computing rivals of Amazon Web Services are bracing for an AWS "partnership" announcement with VMware expected to be announced Thursday.

Thank You!
Questions/Comments?
Looking forward to opportunities to collaborate!
