2010 November

Scraping It Together

I’m very excited to announce a new Dexy feature: you can now automatically fetch remote data and incorporate it into your Dexy document. Dexy will even cache the file for you and only download it again if it has changed on the remote server (HTTP only for now, and provided the server sends either an ETag or a Last-Modified header). One of the first things this makes possible is easily fetching and using data from remote APIs. In this blog post we’ll see how this works, using an example from ScraperWiki.
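For anyone curious how this kind of caching works over HTTP, here is a minimal sketch of a conditional GET in plain Python. This is not Dexy's actual code: the function names `conditional_headers` and `fetch_cached` and the `.meta.json` sidecar file are made up for illustration. The idea is to remember the ETag and Last-Modified values from the previous download and send them back as If-None-Match and If-Modified-Since, so the server can answer 304 Not Modified instead of sending the whole file again.

```python
import json
import os
import urllib.error
import urllib.request


def conditional_headers(meta):
    """Build request headers from validators saved on a previous download."""
    headers = {}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]
    return headers


def fetch_cached(url, cache_path):
    """Download url to cache_path unless the server reports it unchanged.

    Hypothetical helper for illustration; the validators are kept in a
    sidecar JSON file next to the cached copy.
    """
    meta_path = cache_path + ".meta.json"
    meta = {}
    # Only trust saved validators if the cached file itself still exists.
    if os.path.exists(cache_path) and os.path.exists(meta_path):
        with open(meta_path) as f:
            meta = json.load(f)

    request = urllib.request.Request(url, headers=conditional_headers(meta))
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()
            new_meta = {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code == 304:
            return cache_path  # cached copy is still current
        raise

    # The file is new or has changed: replace the cache and its validators.
    with open(cache_path, "wb") as f:
        f.write(body)
    with open(meta_path, "w") as f:
        json.dump(new_meta, f)
    return cache_path
```

If the server sends neither header, nothing is saved to validate against, so a sketch like this simply downloads the file every time.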

ScraperWiki is a fantastic project which aims to make open data easier to use. It’s a library of scrapers which take public data sets (typically in HTML, XLS or PDF formats), clean them up and make them available via a standard API or a CSV download. Rather than one person writing a scraper, running it locally and doing something with the data, with ScraperWiki anyone can write a scraper, someone else (hopefully several people) can write an article or do research based on the scraped data, and if the scraper breaks, anyone can come along and fix it. It’s a great platform for loose collaboration among “hacks and hackers”. I especially love that it’s so easy to get involved without making a huge commitment: if you have a few minutes, you can go to ScraperWiki and fix a broken scraper or write a new one that someone has requested.

