MIS2502: Data Analytics Semi-structured Data Analytics


Zhe (Joe) Deng
[email protected]
http://community.mis.temple.edu/zdeng

Relational databases are highly structured
Tables have the same number of fields for every record.
Each field has a specified data type.
Data types have a specified length and precision.

Not all data is stored like that
Comma-separated value (CSV) file: each value is separated by a comma. Other than that, it is plain text.
No specified field lengths.
The first row is often the field/column names.
Recall the UCI Machine Learning Repository: http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data

Role of quotation marks
The quotes don't imply a data type. Notice that the ID is in quotes but the height and mass are not.
The quotes just allow commas to be considered part of the value, not a separator.
The CSV file format is not fully standardized. The only standardized rule is the basic idea of separating fields with a comma.
This is also a valid CSV file.
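The role of quotation marks can be checked with a short Python sketch. The sample rows below are hypothetical (they are not the contents of the course's Starwars.csv); they only mimic the situation described above, with a quoted ID, unquoted height and mass, and one value that contains a comma.

```python
import csv
import io

# Hypothetical sample data: the ID is quoted, height and mass are not,
# and one name contains a comma.
raw = '''id,name,height,mass
"1","Skywalker, Luke",172,77
"2",C-3PO,167,75
'''

# DictReader uses the first row as the field/column names.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    print(row)

first = rows[0]

# The quotes kept the comma inside the name instead of splitting the value.
print(first["name"])

# The quotes did not create a data type: quoted or not, every value
# comes back as a plain string.
print(type(first["id"]).__name__, type(first["height"]).__name__)
```

Any numeric work on height or mass would need an explicit conversion such as `int(first["height"])`, because the file itself carries no type information.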

Semi-Structured Data
Structured data: organized according to a formal data model (i.e., a relational schema).
Semi-structured data: no formal data model, but contains symbols to separate and label data elements.
Unstructured data: no data model and no predefined organization.

Data Examples
Unstructured: text documents, images
Semi-structured: XML and JSON, CSV
Structured: relational databases

Would you consider an Excel spreadsheet structured, semi-structured, or unstructured?

Why care about semi-structured and unstructured data?
Semi-structured data
A common way to transfer data between software applications.
Because plain text is universal, datasets are often posted using semi-structured formats.
Unstructured data
It's everywhere.
As much as 70% to 80% of an organization's data may be in unstructured forms (Wikipedia).

The CSV format is still quite structured
You can't skip values in a row. If the year is left out of Watkins's row, the next value slides into its place: this means the year for Watkins is 3.2 and she doesn't have a GPA.
You have to be careful when using commas as part of your data.
There's also no way to create data hierarchies: you can't make first and last part of name.

Alternatives to CSV for semi-structured data
XML: Extensible Markup Language
JSON: JavaScript Object Notation

Extensible Markup Language
Plain text file.
Uses tags for labels, with text for values between them, as in <height>172</height>.
Values can be of any length. Commas and quotes are valid.
Fields can be skipped: remove the mass of 75 from C-3PO and the skin color is still gold.
Starts and ends with a single enclosing tag.

Hierarchies in XML
We know we can break up name into first and last. But we are also nesting them under name, so first and last are now child elements of name.
Easier to find what you're looking for and organize your data.

<Character>
  <id>1</id>
  <name>
    <first>Luke</first>
    <last>Skywalker</last>
  </name>
  <height>172</height>
  <mass>77</mass>
  <hair_color>blond</hair_color>
  <skin_color>fair</skin_color>
  <eye_color>blue</eye_color>
  <birth_year>19</birth_year>
  <gender>male</gender>
  <homeworld>Tatooine</homeworld>
</Character>

And id, name, height, mass, etc., are all nested under Character.

Bottom line for XML
XML is better than CSV for semi-structured data:
Allows for hierarchies
More flexible
Easier to read
But XML takes up a lot more space with all of those tags.
Starwars.csv: 6,251 bytes
Starwars.xml: 28,521 bytes

JavaScript Object Notation
Plain text file.
Organized as objects within braces { }.

Uses key-value pairs written as "key": value, as in the JSON object { "name": "C-3PO" }.
Keys are field names; they are strings in quotes.
Values are the data: strings, numbers, or Booleans (quotes around strings are required).
A comma separates the key-value pairs.
Values can be any length.
Fields can be skipped: remove "mass": 75 from the JSON object and it is still valid.
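The key-value rules above can be tried out with Python's standard json module. This is a minimal sketch: the two records are hypothetical, and the "droid" key is invented here only to show a Boolean value.

```python
import json

# Two hypothetical character records. The second one skips "mass",
# and "droid" is an invented key used to show a Boolean value.
text = '''
[
  {"name": "Luke Skywalker", "height": 172, "mass": 77, "droid": false},
  {"name": "C-3PO", "height": 167, "skin_color": "gold", "droid": true}
]
'''

characters = json.loads(text)

for c in characters:
    # .get() returns None for a key that was skipped instead of raising an error.
    print(c["name"], c.get("mass"), c.get("skin_color"))

luke, threepio = characters

# Unlike CSV, JSON keeps basic types: numbers and Booleans are not strings.
print(type(luke["height"]).__name__, type(threepio["droid"]).__name__)
```

Because each value travels with its key, dropping "mass" from one record does not shift any other value into the wrong field, which is the problem a skipped value causes in a CSV row.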

Hierarchies in JSON
{
  "Character": {
    "id": "1",
    "name": {
      "first": "Luke",
      "last": "Skywalker"
    },
    "height": "172",
    "mass": "77",
    "hair_color": "blond",
    "skin_color": "fair",
    "eye_color": "blue",
    "birth_year": "19",
    "gender": "male",
    "homeworld": "Tatooine"
  }
}
We can have first and last nested under name, just like XML.
And all the fields (id, name, height) are nested under Character.

Bottom line for JSON
Best aspects of XML and CSV.
More lightweight than XML:
Starwars.csv: 6,251 bytes
Starwars.xml: 28,521 bytes
Starwars.json: 21,074 bytes
Supports hierarchies like XML.
JSON is becoming the standard for transferring data across the web.

Same data, four different ways

XML file (each value sits between tags)
Bob Smith Sophomore 3.4
Judy Jones Senior 3.9
Barbara Watkins Junior 3.2

Relational database table
first    last     year       GPA
Bob      Smith    Sophomore  3.4
Judy     Jones    Senior     3.9
Barbara  Watkins  Junior     3.2

CSV file
first,last,year,GPA
Bob,Smith,Sophomore,3.4
Judy,Jones,Senior,3.9
Barbara,Watkins,Junior,3.2

JSON file
[
  { "first": "Bob", "last": "Smith", "year": "Sophomore", "GPA": 3.4 },
  { "first": "Judy", "last": "Jones", "year": "Senior", "GPA": 3.9 },
  { "first": "Barbara", "last": "Watkins", "year": "Junior", "GPA": 3.2 }
]

JSON and Web APIs
Web Application Program Interface (API)

Software that exposes functionality through a web interface.
Uses the language of web software to send and receive messages (and exchange data).

Some examples
Requesting a web page
REQUEST: http://www.google.com
RESPONSE: the page you see, which is really HTML, JavaScript, and CSS sent by Google's web server. All you need to know is that the web server is sending back a lot of text that tells your web browser what to do.

JSON and web APIs

For us, Web APIs are just a way of getting data.
JSON is a popular format to package the data.

A MySQL database server answers a SQL query:
SELECT actor.first_name, actor.last_name FROM moviedb.actor;
PENELOPE GUINESS, NICK WAHLBERG, ED CHASE, JENNIFER DAVIS.

A database server with a Web API answers a URL request:
https://swapi.co/api/people/1/?format=json
{"name":"Luke Skywalker","height":"172","mass":"77", "hair_color":"blond","skin_color":"fair
This works. Try it!

Applications use APIs to communicate with each other by exchanging data
Your web browser: on your phone, laptop, or desktop computer.
Amazon.com web server: serves the familiar web interface you know.
Amazon.com application server: processes orders, maintains your cart, and makes recommendations.
Amazon.com database server: stores customer, product, and order data.
Request to http://api.paypal.com/payments/... Response: APPROVED! Note there is no web browser involved here!

JSON and data analytics
JSON is just another data format.
JSON files can be read by analytics software, including R.
So can CSV files, XML files, and Excel files.
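To illustrate that the format is only packaging, the sketch below reads the three student records from CSV, JSON, and XML text and checks that they end up as the same data. It uses Python's standard library rather than R, purely as an illustration. The XML tag names (students, student, and the field tags) are assumptions made for this sketch, since the tags of the original XML example are not shown.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

csv_text = """first,last,year,GPA
Bob,Smith,Sophomore,3.4
Judy,Jones,Senior,3.9
Barbara,Watkins,Junior,3.2
"""

json_text = """[
  {"first": "Bob", "last": "Smith", "year": "Sophomore", "GPA": 3.4},
  {"first": "Judy", "last": "Jones", "year": "Senior", "GPA": 3.9},
  {"first": "Barbara", "last": "Watkins", "year": "Junior", "GPA": 3.2}
]"""

# Assumption: <students>/<student> and the field tag names are invented here.
xml_text = """<students>
  <student><first>Bob</first><last>Smith</last><year>Sophomore</year><GPA>3.4</GPA></student>
  <student><first>Judy</first><last>Jones</last><year>Senior</year><GPA>3.9</GPA></student>
  <student><first>Barbara</first><last>Watkins</last><year>Junior</year><GPA>3.2</GPA></student>
</students>"""


def from_csv(text):
    # CSV values arrive as strings, so GPA is converted explicitly.
    reader = csv.DictReader(io.StringIO(text))
    return [{**row, "GPA": float(row["GPA"])} for row in reader]


def from_json(text):
    # JSON already distinguishes numbers from strings.
    return json.loads(text)


def from_xml(text):
    records = []
    for student in ET.fromstring(text):
        # Each child element becomes one field; element text is always a string.
        record = {child.tag: child.text for child in student}
        record["GPA"] = float(record["GPA"])
        records.append(record)
    return records


csv_records = from_csv(csv_text)
json_records = from_json(json_text)
xml_records = from_xml(xml_text)

# All three formats describe the same records.
print(csv_records == json_records == xml_records)

# Once loaded, the analysis no longer depends on the source format.
mean_gpa = sum(r["GPA"] for r in json_records) / len(json_records)
print(round(mean_gpa, 2))
```

Only the loading step differs between formats. CSV and XML need an explicit numeric conversion for GPA, while JSON delivers it as a number directly.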
