
Scraping It Together

November 28, 2010

I’m very excited to announce a new Dexy feature: you can now automatically fetch remote data and incorporate it into your Dexy document. Dexy will even cache the file for you and only download it again if it has changed on the remote server (HTTP only for now, and assuming either ETag or Last-Modified headers are present). One of the first things this makes possible is easily fetching and using data from remote APIs. In this blog post we’ll see how this works, using an example from ScraperWiki.
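To make the caching behaviour a little more concrete, here is a rough sketch of how a conditional fetch with ETag and Last-Modified headers works at the HTTP level. This is not Dexy’s actual implementation; the function name and return values are made up for illustration, and it uses Python 2’s urllib2 to match the era of the code later in this post.

import urllib2

# Sketch only, not Dexy's code: ask the server whether the file has changed
# since the last download, using the cached ETag / Last-Modified values.
def fetch_if_changed(url, etag=None, last_modified=None):
    request = urllib2.Request(url)
    if etag:
        request.add_header('If-None-Match', etag)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    try:
        response = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
        if e.code == 304:
            # 304 Not Modified: keep using the cached copy
            return None, etag, last_modified
        raise
    headers = response.info()
    return (response.read(),
            headers.getheader('ETag'),
            headers.getheader('Last-Modified'))

In Dexy itself none of this is visible: you just give the URL in your configuration and the cache is managed for you.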

ScraperWiki is a fantastic project which aims to make open data easier to use. It’s a library of scrapers which take public data sets (typically in HTML, XLS or PDF formats), clean them up and make them available via a standard API or a CSV download. Rather than one person writing a scraper, running it locally and doing something with the data, with ScraperWiki anyone can write a scraper, someone else (hopefully several people) can write an article or do research based on the scraped data, and if the scraper breaks anyone can come along and fix it. It’s a great platform for loose collaboration among “hacks and hackers”. I especially love that it’s so easy to get involved without making a huge commitment: if you have a few minutes, you can go to ScraperWiki and fix a broken scraper or write a new scraper that someone has requested.

While I look forward to exploring ways in which Dexy can help with ScraperWiki’s documentation (ScraperWiki supports scrapers written in Python, Ruby and PHP, and Dexy is ideal for such multi-language situations), ScraperWiki already has a very interesting live-code setup for its tutorials. When you click on a tutorial, you are brought into ScraperWiki’s code editor with a live example that you can play with, documented by extensive comments. For example, here is the first Python tutorial:

###############################################################################
# START HERE: Tutorial 1: Getting used to the ScraperWiki editing interface.
# Follow the actions listed with -- BLOCK CAPITALS below.
###############################################################################

# -----------------------------------------------------------------------------
# 1. Start by running a really simple Python script, just to make sure that 
# everything is working OK.
# -- CLICK THE 'RUN' BUTTON BELOW
# You should see some numbers print in the 'Console' tab below. If it doesn't work, 
# try reopening this page in a different browser - Chrome or the latest Firefox.
# -----------------------------------------------------------------------------

for i in range(10):
    print "Hello", i

# -----------------------------------------------------------------------------
# 2. Next, try scraping an actual web page and getting some raw HTML.
# -- UNCOMMENT THE THREE LINES BELOW (i.e. delete the # at the start of the lines)
# -- CLICK THE 'RUN' BUTTON AGAIN 
# You should see the raw HTML at the bottom of the 'Console' tab. 
# Click on the 'more' link to see it all, and the 'Sources' tab to see our URL - 
# you can click on the URL to see the original page. 
# -----------------------------------------------------------------------------

#import scraperwiki
#html = scraperwiki.scrape('http://scraperwiki.com/hello_world.html')
#print html

# -----------------------------------------------------------------------------
# In the next tutorial, you'll learn how to extract the useful parts
# from the raw HTML page.
# -----------------------------------------------------------------------------

And here is the first Ruby tutorial, along the same lines:

# Hi. Welcome to the Ruby editor window on ScraperWiki.

# To see if everything is working okay, click the RUN button on the bottom left
# to make the following four lines of code do its stuff

(1..10).each do |i|
    puts "Hello, #{i}\n" 
end

# Did it work? 10 lines should have been printed in the console window below
# If not, try using Google Chrome or the latest version of FireFox.

# The first job of any scraper is to download the text of a web-page.  
# Uncomment the next two lines of code (remove the # from the beginning of the line)
# and click RUN again to see how it works.

#html = ScraperWiki.scrape('http://scraperwiki.com/hello_world.html')
#puts html

# The text will appear in the console, and the URL that it downloaded from
# should appear under "Sources".

If you’d like to play with these yourselves, then here is the Python tutorial and here is the Ruby tutorial, or just click the Tutorials link in the sidebar from scraperwiki.com.

Now let me stop here for a moment to point out that I fetched that Python and Ruby code directly from scraperwiki.com using Dexy’s new remote file feature. Here is the .dexy file for this blog post:

{
  "@tutorial.py|pyg" : {
    "url" : "http://scraperwiki.com/editor/raw/tutorial-1"
  },
  "@tutorial.rb|pyg" : {
    "url" : "http://scraperwiki.com/editor/raw/ruby-tutorial-1"
  },
  "@scraper-source.py|pyg" : {
    "url" : "http://scraperwiki.com/editor/raw/foi_botanical_gardens"
  },
  "@scraper-data.csv|dexy" : {
    "url" : "http://api.scraperwiki.com/api/1.0/datastore/getdata?format=csv&name=foi_botanical_gardens&limit=500"
  },
  "scraper.R|jinja|r|pyg" : {
    "inputs" : ["@scraper-data.csv|dexy"]
  },
  ".dexy|dexy" : {},
  "scraping-it-together.dexy|dexy" : {}
}


Since JSON doesn’t have a comment character[1], I can’t use Idiopidae syntax to split this into manageable chunks, but I think we can manage. The file names beginning with an @ symbol are virtual files: they don’t actually exist on the file system, but Dexy is going to pretend that they are there. So we can do the usual Dexy things with these “files”, like run them through filters or use them as inputs to other documents. The ‘url’ property tells Dexy to fetch the contents at that URL, and these become the contents of the virtual file. Any type of text-based data (eventually binary data too) can be fetched in this way. In this example I have fetched Python and Ruby code, and CSV data. We specify the file extension ourselves when we name the virtual file, so later filters will treat the downloaded text correctly; for example, Pygments knows from the .py file extension that it’s getting Python code.

[1] Waaaaaaaahhhhhhhh! WHY didn’t they include a comment character!???!!!!
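To isolate the pattern from the full configuration above, a remote virtual-file entry boils down to something like the following sketch (the file names and URL here are hypothetical):

{
  "@remote-data.csv|dexy" : {
    "url" : "http://example.com/some-data.csv"
  },
  "analysis.R|jinja|r|pyg" : {
    "inputs" : ["@remote-data.csv|dexy"]
  }
}

The @remote-data.csv “file” never exists on disk; Dexy fetches the URL, caches the text, and makes it available to any document which lists it as an input.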

Next let’s take a look at the source code of a real scraper:

# Created at Python Northwest, Manchester, 2010-11-15


import scraperwiki
import datetime
import xlrd

# retrieve a page
starting_url = 'http://www.whatdotheyknow.com/request/49024/response/124689/attach/3/FOI%202010%20207%20Jones.xls'
book = xlrd.open_workbook(file_contents=scraperwiki.scrape(starting_url))
XL_EPOCH = datetime.date(1899, 12, 30)
scraperwiki.metadata.save('data_columns', ['Date', 'tmin', 'tmax', 'tmean'])


count = 0
for sheet in book.sheets():
    print 'Scraping sheet %s' % sheet.name
    for rownum in xrange(sheet.nrows):
        date_cell = sheet.cell(rownum, 0)
        min_cell = sheet.cell(rownum, 2)
        max_cell = sheet.cell(rownum, 1)
        if date_cell.ctype == xlrd.XL_CELL_DATE and xlrd.XL_CELL_NUMBER in (min_cell.ctype, max_cell.ctype):
            record = {}
            record['Date'] = str(XL_EPOCH + datetime.timedelta(days=date_cell.value))
            if min_cell.ctype == xlrd.XL_CELL_NUMBER:
                record['tmin'] = min_cell.value
            if max_cell.ctype == xlrd.XL_CELL_NUMBER:
                record['tmax'] = max_cell.value
            if min_cell.ctype == xlrd.XL_CELL_NUMBER and max_cell.ctype == xlrd.XL_CELL_NUMBER:
                record['tmean'] = (min_cell.value + max_cell.value) / 2.0
            scraperwiki.datastore.save(['Date'], record, silent=True)
            count += 1

print 'Scraped %d records' % count

This scraper scrapes a spreadsheet containing temperature observations from the Botanic Gardens at the University of Cambridge, obtained under a Freedom of Information Act request which you can view at whatdotheyknow.com. The file contains daily observations from 2000 to 2010; we will just look at the 500 most recent observations (doing more would require multiple calls to the ScraperWiki API, easy to do but not really needed for demonstration purposes).

Here is what the first few lines of the resulting CSV data look like:

Date,Maximum,Mean,Minimum,tmax,tmean,tmin
2010-09-30,6.9,11.9,16.9,16.9,11.9,6.9
2010-09-29,13.0,15.5,18.0,18.0,15.5,13.0
2010-09-28,13.0,15.15,17.3,17.3,15.15,13.0
2010-09-27,10.6,12.8,15.0,15.0,12.8,10.6
2010-09-26,8.6,11.2,13.8,13.8,11.2,8.6
2010-09-25,6.3,9.1,11.9,11.9,9.1,6.3
2010-09-24,9.5,11.7,13.9,13.9,11.7,9.5
2010-09-23,9.4,14.15,18.9,18.9,14.15,9.4
2010-09-22,9.4,16.7,24.0,24.0...

(Yes, the column names are a little wonky, but the tmax, tmean and tmin columns correspond to the correct data, so we’ll use those.)

We are going to use R to graph this data and to tell us a little more about it via R’s summary() function. Here is the graph of the daily data: the mean temperature is plotted in black, with daily maximums in red and daily minimums in blue.

Here is the R transcript which imports the CSV data, graphs it and produces some simple summary statistics:

> data <- read.table("0196faca90f48f84d40ff382512234b7.csv", sep=",", header=TRUE)
> 
> # Convert Y-m-d strings to dates, sort by date.
> data[,1] <- as.POSIXct(data[,1])
> sorted.data <- data[order(data[,1]),]
> 
> # Determine overall min and max for y axis range
> min.temp <- min(sorted.data$tmin, na.rm=TRUE)
> max.temp <- max(sorted.data$tmax, na.rm=TRUE)
> 
> png(file="111b990e-af30-4969-8212-67424545b3ae.png", width=500, height=500)
> plot(
+     sorted.data$Date, 
+     sorted.data$tmean, 
+     type="l", 
+     lwd=2, 
+     ylim=c(min.temp-2, max.temp+2),
+     ylab=expression("Temperature"*degree~C)
+ )
> points(sorted.data$Date, sorted.data$tmax, type="l", lty=3, col="red")
> points(sorted.data$Date, sorted.data$tmin, type="l", lty=3, col="blue")
> dev.off()
null device 
          1 
> 
> summary(data$tmin)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -5.700   3.500   8.200   7.534  11.700  19.100   3.000 
> summary(data$tmax)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  -0.30   12.00   18.00   16.77   22.10   30.30   23.00 
> summary(data$tmean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  -3.00    8.00   13.30   12.13   16.85   23.35   26.00 
> 

This example had a mixture of local scripts and remote data sources, which is typical when you are working on your own documents. When it comes time to share your documents with others, being able to specify remote files in Dexy opens up some really interesting possibilities.

Consider this configuration file:

{
  "@tutorial.py|pyg" : {
    "url" : "http://scraperwiki.com/editor/raw/tutorial-1"
  },
  "@tutorial.rb|pyg" : {
    "url" : "http://scraperwiki.com/editor/raw/ruby-tutorial-1"
  },
  "@scraper-source.py|pyg" : {
    "url" : "http://scraperwiki.com/editor/raw/foi_botanical_gardens"
  },
  "@scraper-data.csv|dexy" : {
    "url" : "http://api.scraperwiki.com/api/1.0/datastore/getdata?format=csv&name=foi_botanical_gardens&limit=500"
  },
  "@index.txt|jinja|textile" : {
    "url" : "http://bitbucket.org/ananelson/dexy-blog/raw/146d3429c753/2010/11/scraping-it-together/index.txt",
    "allinputs" : true
  },
  "@scraper.R|jinja|r|pyg" : {
    "url" : "http://bitbucket.org/ananelson/dexy-blog/raw/146d3429c753/2010/11/scraping-it-together/scraper.R",
    "inputs" : ["@scraper-data.csv|dexy"]
  },
  ".dexy|dexy" : {},
  "scraping-it-together.dexy|dexy" : {}
}


This is probably a good time to mention a new command-line argument available in Dexy: the -g or --config switch, which lets you specify a configuration file other than the default .dexy. If you have Dexy, RedCloth, Pygments and R installed, you should be able to build this blog post in a blank directory by saving the above configuration as scraping-it-together.dexy and running:

dexy --setup -g scraping-it-together.dexy .

I am considering adding a switch which would create copies of the virtual files in the local directory, so the Dexy configuration file format could be used as a way to distribute worked tutorial code as well as reproducible research documents.

I encourage you to take a look at the source of this blog post, especially if you are new to Dexy. The post’s directory is here on Bitbucket.