Word Co-occurrence

Chapter 3, Lin and Dyer

Why is co-occurrence important?
-- Read Chapter 3.
-- This will help you with Lab2 as well as the final exam.
-- It will also help with future projects.
-- It will help you in interviews for big-data analytics positions.
-- It is a simple method with big impact.
-- Co-occurrence of adjacent words is a 2-gram (bigram); n-grams are an extension (Google has supported this).
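The 2-gram and n-gram idea can be illustrated with a minimal sketch. The helper name `ngrams` and the whitespace tokenization are assumptions made for this example only; they do not come from the chapter.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy input, split on whitespace (an assumption; real text needs proper cleaning).
tokens = "to be or not to be".split()

bigram_counts = Counter(ngrams(tokens, 2))   # 2-grams: adjacent word pairs
trigram_counts = Counter(ngrams(tokens, 3))  # 3-grams: the same idea, one word longer

print(bigram_counts.most_common(1))
```

Here the bigram ("to", "be") occurs twice, while each trigram occurs only once.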

-- And of course, how you define co-occurrence is a domain-dependent issue:
   Text: sentence, paragraph, etc.
   Temporal: co-occurrence within a day, week, or month.
   More complex: Blei's LDA.

Intelligence and Scale of Data
-- Intelligence is a set of discoveries made by federating and processing information collected from diverse sources.
-- Information is a cleansed form of raw data.
-- For statistically significant information we need a reasonable amount of data.
-- For gathering good intelligence we need a large amount of information.
-- As pointed out by Jim Gray in the Fourth Paradigm book, an enormous amount of data is generated by millions of experiments and applications.
-- Thus intelligence applications are invariably data-heavy, data-driven, and data-intensive.
-- Data is gathered from the web (public or private, covert or overt) and generated by a large number of domain applications.

Intelligence (or origins of big-data computing?)
-- Search for Extra-Terrestrial Intelligence (SETI@home project)

-- The Wow! signal: http://www.bigear.org/wow.htm

Characteristics of intelligent applications
-- Google search: how is it different from the regular search that existed before it? It took advantage of the fact that the hyperlinks within web pages form an underlying structure that can be mined to determine the importance of various pages.
-- Restaurant and menu suggestions: instead of "Where would you like to go?", "Would you like to go to CityGrille?"
-- Learning capacity from previous data on habits, profiles, and other information gathered over time.
-- Collaborative and interconnected world, inference capable: Facebook friend suggestion.
-- Large-scale data requiring indexing.
-- Do you know Amazon is going to ship things before you order?

Review 1: MapReduce Algorithm Design
-- "Simplicity" is the theme.
-- Fast "simple operations" on a large set of data.
-- Most web, mobile, and internet application data yield to embarrassingly parallel processing.
-- General idea: you write the Mapper and Reducer (and optionally the Combiner and Partitioner); the execution framework takes care of the rest.
-- Of course, you configure the splits, the number of reducers, the input path, the output path, etc.

Review 2: The programmer has NO control over
-- where a mapper or reducer runs (which node in the cluster)
-- when a mapper or reducer begins or finishes
-- which input key-value pairs are processed by a specific mapper
-- which intermediate key-value pairs are processed by a specific reducer

Review 3: However, what control does a programmer have?

1. The ability to construct complex data structures as keys and values to store and communicate partial results.
2. The ability to execute user-specified code at the beginning of a map or reduce task, and termination code at the end.
3. The ability to preserve state in both mappers and reducers across multiple input or intermediate values (for example, counters).
4. The ability to control the sort order of intermediate keys and the order in which they are distributed to reducers.
5. The ability to partition the key space among reducers.

Let's move on to co-occurrence (Section 3.2)

-- Word counting is not the only example.
-- Another example: the co-occurrence matrix of a large corpus ("corpora" is the plural). It is an n x n matrix, where n is the number of unique words in the corpus.
-- Call the matrix m, with i and j as the row and column indices: cell m(i, j) holds the number of times w(i) co-occurred with w(j).
-- For example, the co-occurrence count of w(i) and w(j) on the Twitter feed today may be >1000, more than what it would have been in December.
-- Let's look at the algorithm. You need this for your Lab2.

Word Co-occurrence: Pairs version

class Mapper
    method Map(docid a, doc d)
        for all term w in doc d do
            for all term u in Neighbors(w) do
                Emit(pair (w, u), count 1)    // Emit count for each co-occurrence

class Reducer
    method Reduce(pair p, counts [c1, c2, ...])
        s <- 0
        for all count c in counts [c1, c2, ...] do
            s <- s + c    // Sum co-occurrence counts
        Emit(pair p, count s)
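The pairs pseudocode above can be tried out locally before moving to a cluster. What follows is a minimal single-process sketch in plain Python, not a Hadoop job. It makes several assumptions that are not in the slides: `Neighbors(w)` is taken to mean the terms within `window` positions of w, text is tokenized by whitespace, and the in-memory grouping in `run_pairs` stands in for the framework's shuffle-and-sort step. All function names are hypothetical.

```python
from collections import defaultdict

def neighbors(tokens, i, window):
    """Yield terms within `window` positions of tokens[i] (assumed definition)."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            yield tokens[j]

def pairs_map(doc_id, doc, window):
    """Mapper: emit ((w, u), 1) for every co-occurrence in the document."""
    tokens = doc.lower().split()
    for i, w in enumerate(tokens):
        for u in neighbors(tokens, i, window):
            yield (w, u), 1

def pairs_reduce(pair, counts):
    """Reducer: sum the counts collected for one pair."""
    return pair, sum(counts)

def run_pairs(docs, window=1):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for doc_id, doc in docs.items():
        for key, value in pairs_map(doc_id, doc, window):
            grouped[key].append(value)
    return dict(pairs_reduce(k, v) for k, v in sorted(grouped.items()))

docs = {1: "big data is big", 2: "data is data"}
pair_counts = run_pairs(docs, window=1)
for pair, count in pair_counts.items():
    print(pair, count)
```

With this toy input, ("data", "is") and ("is", "data") each get a count of 3. Every co-occurrence becomes its own intermediate key-value pair, which is what makes this version simple but heavy on the shuffle.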

Word Co-occurrence: Stripes version

class Mapper
    method Map(docid a, doc d)
        for all term w in doc d do
            H <- new AssociativeArray
            for all term u in Neighbors(w) do
                H{u} <- H{u} + 1    // Tally words co-occurring with w
            Emit(term w, stripe H)

class Reducer
    method Reduce(term w, stripes [H1, H2, H3, ...])
        Hf <- new AssociativeArray
        for all stripe H in stripes [H1, H2, H3, ...] do
            Sum(Hf, H)    // Element-wise sum: many small stripes into one big stripe
        Emit(term w, stripe Hf)

Run it on AWS and evaluate the two approaches.

Summary/Observation
1. Word co-occurrence is proposed as a solution for evaluating association.
2. Two methods are proposed: pairs and stripes.
3. An MR implementation is designed (pseudocode).
4. It is implemented on MR on the Amazon cloud.
5. It is evaluated and relative performance is studied (R2, runtime, scale).

Lab2 Discussion
-- Build an MR data pipeline. All big-data processing is done in MR.
-- Twitter: get tweets by keyword; cleaning done by MR (NOT by RStudio); analyze using MR.
-- NYTimes: get news by keyword; cleaning done by MR; analyze using MR.
-- Common Crawl: get data; filter by keyword using MR; clean using MR; analyze using MR.
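Before running on the cloud, the stripes version can be checked locally on a toy input. The sketch below is plain single-process Python, not a Hadoop job. Its assumptions are not from the slides: neighbors are the terms within `window` positions, text is tokenized by whitespace, a `Counter` plays the role of the associative array, and the in-memory grouping in `run_stripes` stands in for the shuffle step. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def stripes_map(doc_id, doc, window):
    """Mapper: emit (w, stripe) where the stripe tallies the neighbors of w."""
    tokens = doc.lower().split()
    for i, w in enumerate(tokens):
        stripe = Counter()
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                stripe[tokens[j]] += 1
        yield w, stripe

def stripes_reduce(term, stripe_list):
    """Reducer: element-wise sum of all stripes received for one term."""
    total = Counter()
    for stripe in stripe_list:
        total.update(stripe)  # Counter.update adds counts key by key
    return term, total

def run_stripes(docs, window=1):
    """Stand-in for the framework: group stripes by term, then reduce."""
    grouped = defaultdict(list)
    emitted = 0  # number of intermediate (term, stripe) records
    for doc_id, doc in docs.items():
        for term, stripe in stripes_map(doc_id, doc, window):
            grouped[term].append(stripe)
            emitted += 1
    return dict(stripes_reduce(t, s) for t, s in grouped.items()), emitted

docs = {1: "big data is big", 2: "data is data"}
stripes, stripe_records = run_stripes(docs, window=1)

# The pairs approach would emit one record per co-occurrence instead.
pair_records = sum(sum(s.values()) for s in stripes.values())
print(stripe_records, "stripe records vs", pair_records, "pair records")
```

On this input the stripes mapper emits 7 intermediate records where the pairs mapper would emit 10. That gap, traded against larger values per record, is the kind of difference to measure when evaluating the two approaches at scale.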
