Attack of the Screen Scrapers

2007 is the year of the mashup. Standardized SOAP Web Services were supposed to make inter-application interoperability over the internet straightforward, but many companies are not motivated to provide public APIs. What happens when the site whose data you want to use in your mashup doesn’t provide an API? Screen scraping – using a program to read data from a web site intended for humans – is the icky but necessary solution.
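
To make the idea concrete, here is a minimal sketch of a scraper written with only Python's standard library. The HTML fragment, the `prices` table, and the `name`/`price` class names are all made up for illustration; a real robot would download the page first and would need selectors matched to the actual markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML.
SAMPLE_HTML = """
<html><body>
  <table id="prices">
    <tr><td class="name">Widget</td><td class="price">$4.50</td></tr>
    <tr><td class="name">Gadget</td><td class="price">$12.00</td></tr>
  </table>
</body></html>
"""


class PriceScraper(HTMLParser):
    """Collects (name, price) pairs from <td class="name"> and <td class="price"> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None    # which cell we are currently inside, if any
        self._current = {}    # fields collected for the row in progress

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            cls = dict(attrs).get("class")
            if cls in ("name", "price"):
                self._field = cls

    def handle_data(self, data):
        if self._field:
            text = self._current.get(self._field, "") + data.strip()
            self._current[self._field] = text

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._current:
            # A missing field shows up as None rather than raising an error
            self.rows.append((self._current.get("name"), self._current.get("price")))
            self._current = {}


def scrape_prices(html):
    parser = PriceScraper()
    parser.feed(html)
    return parser.rows
```

Note how tightly the code is coupled to the markup: if the site renames the `price` class, `scrape_prices` silently starts returning `None` for every price.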

Why is screen scraping icky?

  • Web pages are intended for humans to read, and the site operators whose pages you are scraping may not appreciate you hitting them with a program. Doing so puts extra load on their servers and skews their server stats.
  • Screen scraping robots are generally fragile, and need to be updated after re-designs to the pages you are scraping. This means if you build an application using a screen scraping robot, it can break at any time.
  • Robots become complex (and more fragile) quickly. Doing anything interesting with a screen scraper will require intimate knowledge of the HTML that you are scraping. Most interesting tasks will require multiple page requests. You may need to fake values in a form post submission, or even supply cookie values.
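
As a rough illustration of the last point, the sketch below builds (but does not send) a request that imitates a browser form submission and carries a cookie, again using only Python's standard library. The URL, form field names, cookie name, and User-Agent string are placeholders I invented; a real robot would copy them from the form it is imitating.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint; substitute the form's real action URL.
SEARCH_URL = "http://example.com/search"


def build_form_post(url, fields, cookies=None):
    """Build (but do not send) a POST request that mimics a browser form submission."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    # Identify the robot honestly so the site operator can recognize it in their logs
    request.add_header("User-Agent", "example-mashup-bot/0.1")
    if cookies:
        # Cookies are sent as a single header of name=value pairs
        cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
        request.add_header("Cookie", cookie_header)
    return request


request = build_form_post(
    SEARCH_URL,
    {"q": "widgets", "page": "2"},
    cookies={"session": "abc123"},
)
```

Passing `request` to `urllib.request.urlopen` would actually submit it. For multi-page robots, the standard library's `http.cookiejar` can track cookies between requests rather than setting the header by hand.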

A couple of companies have cropped up that are trying to build businesses providing scraping tools and services. Their existence legitimizes screen scraping. The two I am aware of are:

  • Dapper, funded by Accel Partners (http://www.accel.com/)
  • KapowTech, who make openKapow (a free tool) and also sell a robot server stack.

Because screen scraping scripts can be fragile, having good tools is critical. The openKapow client is a deluxe tool. It is impressive at first, but its visual programming features are perhaps aimed at people who don’t want to learn programming. I would much rather have a tool that outputs an XML script or Java class.

Both of these services use proprietary hosted code to actually run the robot you define. This is nice when you are getting started, because you don’t have to worry about uploading anything to a server. The downside is that you are locked in to a proprietary hosted framework (unless you opt to buy the Kapow server edition), and won’t be able to make any tweaks or bug fixes except by using their tools.

Having looked carefully at both these options, I think serious developers will still need to develop their own screen scraping libraries. Dapper is a good tool for a quick and simple mashup (though it didn’t work for a moderately complex page I threw at it). OpenKapow isn’t worth learning unless purchasing the full product is an option. The more time you spend learning the tool, the more committed you are to the platform.

I would like to see open source screen scraping libraries (and may try to open source any that I write). I think screen scraping is an icky but inescapable technique for the growing Web service ecosystem.

  1. hi Adam, thanks for the interesting introductory article.

    Just a quick note to mention that while screen-scraping has gotten a lot of recent attention due to its role in mashups, there has been strong commercial interest in data extraction/integration for many years, originating from a flurry of academic research in the mid-90s.

    Readers interested in industrial-strength Web data
    extraction/integration should check out WebQL, QL2 Software’s powerful
    tool for converting unstructured Web content into actionable
    structured data.

    regards,
    Nicholas Kushmerick
    Chief Scientist
    QL2 Software, Inc.
    http://www.QL2.com

  2. I agree with Nicholas

    Kapow Technologies today has more than 200 customers, and a lot of them use our product for collecting data from the web. Several of these customers are very well known web 2.0 companies using Kapow behind the scenes for web scraping of data. With openkapow we wanted to give everyone free access to the power of our product.