2007 is the year of the mashup. Standardized SOAP Web Services were supposed to make inter-application interoperability over the internet straightforward, but many companies are not motivated to provide public APIs. What happens when the site whose data you want to use in your mashup doesn’t provide an API? Screen scraping – using a program to read data from a web site intended for humans – is the icky but necessary solution.
Why is screen scraping icky?
- Web pages are intended for humans to read, and the site operators whose pages you are scraping may not appreciate you hitting them with a program. Automated requests put extra load on their servers and skew their server stats.
- Screen scraping robots are generally fragile and need to be updated whenever the pages you are scraping are redesigned. This means if you build an application using a screen scraping robot, it can break at any time.
- Robots become complex (and more fragile) quickly. Doing anything interesting with a screen scraper will require intimate knowledge of the HTML that you are scraping. Most interesting tasks will require multiple page requests. You may need to fake values in a form post, or even supply cookie values.
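To make the points above concrete, here is a minimal sketch of a hand-rolled robot in Python, using only the standard library. Everything specific in it is an assumption for illustration: the `td class="price"` markup, the function names, the form fields and the URLs you would pass in are all hypothetical and not taken from any real site. It shows the three chores mentioned above: picking data out of the HTML, faking a form post, and carrying cookies across multiple page requests.

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar


class PriceCellParser(HTMLParser):
    """Collect the text of every <td class="price"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # This is the fragile part: it breaks as soon as the site
        # renames the class or stops using table cells.
        if tag == "td" and dict(attrs).get("class") == "price":
            self.in_cell = True
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.prices[-1] += data


def extract_prices(html):
    """Return the stripped text of each price cell found in the page."""
    parser = PriceCellParser()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.prices]


def make_opener():
    """Build an opener that remembers cookies between requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Identify the robot honestly so the site operator can spot it in the logs.
    opener.addheaders = [("User-Agent", "example-scraper/0.1")]
    return opener


def build_form_post(url, fields):
    """Fake a form submission: supplying a body makes urllib send a POST."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=data)


def scrape(login_url, data_url, fields):
    """Two page requests: post the form, then fetch the page we want."""
    opener = make_opener()
    opener.open(build_form_post(login_url, fields)).close()
    with opener.open(data_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_prices(html)
```

Only `scrape` touches the network; `extract_prices` works on any HTML string, so the parsing can be checked against a saved copy of the page without hitting the site again. Keeping the markup-specific logic in one small class also means a redesign only breaks one place.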
A couple of companies have cropped up that are trying to build businesses providing scraping tools and services. Their existence legitimizes screen scraping. The two I am aware of are:
- Dapper, funded by Accel Partners (http://www.accel.com/)
- KapowTech, who make openKapow (a free tool) and also sell a robot server stack.
Both of these services use proprietary hosted code to actually run the robot you define. This is nice when you are getting started, because you don’t have to worry about uploading anything to a server. The downside is that you are locked in to a proprietary hosted framework (unless you opt to buy the Kapow server edition), and you can only make tweaks or bug fixes through their tools.
Having looked carefully at both these options, I think serious developers will still need to develop their own screen scraping libraries. Dapper is a good tool for a quick and simple mashup (though it didn’t work for a moderately complex page I threw at it). OpenKapow isn’t worth learning unless purchasing the full product is an option, because the more time you spend learning the tool, the more committed you are to the platform.
I would like to see open source screen scraping libraries (and may try to open source any that I write). I think screen scraping is an icky but inescapable technique for the growing Web service ecosystem.