More HTML Parsing

Pragmatic Datafication - DSVIL 2018

John Little

2018-05-30

Scrape

scraping propolis

Using tools to gather data you can see on a webpage

Parse

Analyzing the strings and symbols to reveal only the data you need

strain honey from comb

HTML

Hypertext Markup Language

<!DOCTYPE html>
<html>
<!-- created 2010-01-01 -->
<head>
<title>sample</title>
</head>
<body>
<p>Voluptatem accusantium
totam rem aperiam.</p>
</body>
</html>

URL

elements of a URL

  • Server sends back marked-up content

    • HTML
    • CSS StyleSheets
    • JavaScript
  • Browser parses the HTML
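
The elements of a URL can also be pulled apart in code. The sketch below uses Python's standard `urllib.parse` module rather than OpenRefine, and the URL is a made-up example, not one from the workshop data.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, used only for illustration.
url = "https://www.example.com:8080/news/releases?year=2018&page=2#top"

parts = urlsplit(url)

print(parts.scheme)           # protocol, here "https"
print(parts.hostname)         # server name, here "www.example.com"
print(parts.port)             # port number, here 8080
print(parts.path)             # path on the server, here "/news/releases"
print(parse_qs(parts.query))  # query string as a dict of lists
print(parts.fragment)         # fragment after "#", here "top"
```

The server only sees the part of the URL before the `#`; the fragment is handled by the browser.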

OpenRefine (Fetch & Parse)

Process Outline

  1. Fetch HTML by URL from Web Server

    • Edit column > Add column by fetching URL...
  2. Identify CSS class and id values. Chrome's Inspect feature usually works best.

    • In Chrome: highlight text > right-click > Inspect (opens Chrome DevTools)
  3. Parse HTML

    • Edit column > Add column based on this column
      • parseHtml()
      • select("HTML/CSS handle")
      • choose an array element, typically [0]
      • convert element:
        • htmlText()
        • htmlAttr("attribute handle")
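
The parse step (parse, select, pick an element, convert) can be sketched outside OpenRefine too. This is a rough Python analogue built on the standard `html.parser` module, not OpenRefine's own implementation. `TagCollector`, `select`, and the sample page are hypothetical names made up for this sketch. It only selects by tag name, and it assumes simple, non-nested, non-void tags such as `title` or `a`. The fetch step is replaced by a literal string so the example runs offline.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect attributes and text for each element with the given tag name.

    Hypothetical helper for this sketch; assumes the tag is not nested
    inside itself and is not a void element such as <img>.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.matches = []  # one dict per matching element
        self._depth = 0    # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.matches.append({"attrs": dict(attrs), "text": ""})
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.matches[-1]["text"] += data


def select(html, tag):
    """Parse the HTML and return the list of matching elements."""
    collector = TagCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.matches


# Stand-in for the fetched cell value (step 1).
page = """<html><head><title>sample</title></head>
<body><p>See the <a href="/about.html">about page</a>.</p></body></html>"""

first_link = select(page, "a")[0]      # choose an array element
print(first_link["text"])              # like htmlText(): about page
print(first_link["attrs"]["href"])     # like htmlAttr("href"): /about.html
```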

Parsing by HTML elements and CSS

  • Edit column > Add column based on this column

    • value.parseHtml().select("title")[0].htmlText()

Parsing by Brute Force

  • Edit column > Add column based on this column

    1. value.split("</title>")[0]
    2. Edit cells > Transform > value.split("<title>")[1]
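
The two brute-force splits behave the same way on a plain Python string. The sketch below uses made-up sample HTML and also shows one way the approach can break, which is a reason to prefer the parser when the markup varies.

```python
# Made-up sample page for illustration.
html = "<html><head><title>sample</title></head><body><p>text</p></body></html>"

# Step 1: keep everything before the closing tag.
before_close = html.split("</title>")[0]

# Step 2: keep everything after the opening tag.
title = before_close.split("<title>")[1]
print(title)  # sample

# If the opening tag carries an attribute, the literal "<title>" never appears,
# so the second split returns a single piece and index [1] would fail.
variant = '<html><head><title lang="en">sample</title></head></html>'
pieces = variant.split("</title>")[0].split("<title>")
print(len(pieces))  # 1
```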

Now You Try It

  1. HTML Parsing

  2. If you finish early, complete as many of the other steps as you like

FYI

  • Steps 7 & 8 implement looping control

    • switch the display from rows to records
  • Inspect your work: see section 7.3 in the Workbook

Review

From price-press_releases.csv

date2
  value.parseHtml().select("div.pane-content")[0].htmlText()

dateline
  value.parseHtml().select("div.field-item.even p strong")[0].htmlText()

Links
  forEach(
    value.parseHtml().select("div#block-system-main")[0].select("a"),
    v,
    v.htmlAttr("href")
  ).join("|")

Link Text
  forEach(
    value.parseHtml().select("div#block-system-main")[0].select("a"),
    v,
    v.htmlText()
  ).join("|")
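
The Links and Link Text expressions loop over every anchor inside one block and join the results with a pipe. A rough Python analogue, again using the standard `html.parser` module, might look like the sketch below. `LinksInBlock` and the sample page are hypothetical and only imitate the shape of the press-release pages. Unlike `select(...)[0]`, this sketch collects from every `div` with the given id, which is equivalent as long as ids are unique.

```python
from html.parser import HTMLParser


class LinksInBlock(HTMLParser):
    """Collect href and text for each <a> inside the <div> with a given id."""

    def __init__(self, block_id):
        super().__init__()
        self.block_id = block_id
        self.div_depth = 0    # nesting depth of <div>s, counted from the target div
        self.in_link = False  # True while inside an <a> within the target div
        self.hrefs = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the target div, then count nested divs too.
            if self.div_depth > 0 or attrs.get("id") == self.block_id:
                self.div_depth += 1
        elif tag == "a" and self.div_depth > 0:
            self.in_link = True
            self.hrefs.append(attrs.get("href", ""))
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.texts[-1] += data


# Made-up page fragment for illustration.
page = """
<div id="header"><a href="/home">Home</a></div>
<div id="block-system-main">
  <div class="field-item even"><p><a href="/a.pdf">Report A</a></p></div>
  <p><a href="https://example.org/b">Site B</a></p>
</div>
<div id="footer"><a href="/contact">Contact</a></div>
"""

parser = LinksInBlock("block-system-main")
parser.feed(page)
parser.close()

links = "|".join(parser.hrefs)      # like forEach(..., v, v.htmlAttr("href")).join("|")
link_text = "|".join(parser.texts)  # like forEach(..., v, v.htmlText()).join("|")
print(links)      # /a.pdf|https://example.org/b
print(link_text)  # Report A|Site B
```

Joining with a separator keeps all matches in one cell, which can later be split into multiple rows or records.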

John Little

I am ...

John Little

Your Rfun host...

You can make Rfun with our resources for R and data science analytics. See the R we having fun yet‽ resource pages.

Duke University...

Data & Visualization Services

Shareable under CC BY-NC license

Data, presentation, and handouts are shareable under CC BY-NC license

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
