Web Scraping

Taking out all the interesting bits from a Web page or site that was foolishly intended for humans rather than for exchange with other software.

Many programmers do a lot of Web scraping. Since I'm allergic to reinventing the wheel, is there some general framework or method out there for scraping Web pages?

I do NOT need it to reproduce wget's functionality. I DO need it to scan for declared patterns in an HTML file, extract all the data into a format of my choosing using semantic tags of my choosing, and either nuke the appropriate block in the same file or dump any leftover bits into a different file.

So far, I've figured out that I'm supposed to use an HTML-to-XML conversion tool like HtmlTidy, which unfortunately croaks on me after half a dozen files. Then I'm stuck with XML, which is supposedly an improvement. Then what?

Uh, HttpUnit? See HttpUnitTutorial for some Dilbert-scraping...

Not quite what I had in mind, which is more DataMining than unit testing. I'd like either a standalone tool that works on web pages or something available in Smalltalk. I don't intend to learn Java to scrape a dozen web sources. Make that, I don't intend to learn Java.

The ability to hit a Web page, then treat its elements as data objects and iterate them, looking for their payload, is not what you had in mind?

I don't need to hit a web page; I've got wget for that. Okay, time for a concrete example of what I'm doing.

Starting from a mirror of IMDB, I want to convert each unique movie and actor into a unique Smalltalk object, all of them properly referencing each other.

I want to do wholesale strip mining of several very large websites. To do that, I don't need to deal with HTTP (wget will do it for me). I do need some way to make sure that, after I've stripped out the "payload", there's nothing left behind. IOW, the scraper should produce distinct results and leftovers, which I can examine to see if there are any objects the rules have missed.
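
For what it's worth, here's a rough sketch of that payload-plus-leftovers idea in Python (not Smalltalk, sorry), assuming the pages have already been run through tidy so they're well-formed XML. The rule patterns and semantic tag names below (title, the class predicates) are invented for illustration, not taken from any real page.

 import xml.etree.ElementTree as ET

 # Semantic tag of your choosing -> ElementTree path pattern.
 # '{*}' ignores the XHTML namespace that tidy adds (Python 3.8+).
 RULES = {
     "title":     ".//{*}h1",
     "cast_name": ".//{*}td[@class='name']",    # hypothetical class
     "year":      ".//{*}span[@class='year']",  # hypothetical class
 }

 def scrape(in_path, payload_path, leftovers_path):
     tree = ET.parse(in_path)
     root = tree.getroot()
     # Elements don't know their parents, so build a map up front;
     # we need it to nuke the matched blocks.
     parents = {c: p for p in root.iter() for c in p}
     with open(payload_path, "w", encoding="utf-8") as payload:
         for tag, pattern in RULES.items():
             for el in root.findall(pattern):
                 text = " ".join("".join(el.itertext()).split())
                 payload.write(tag + "\t" + text + "\n")
                 parents[el].remove(el)      # strip the scraped block out
     # Whatever is left behind goes to a separate file, so it can be
     # examined for objects the rules missed.
     tree.write(leftovers_path, encoding="utf-8")

 scrape("movie.xml", "payload.tsv", "leftovers.xml")

The leftovers file ends up being the page minus everything the rules claimed, which is exactly what you want to eyeball for missed data.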

Here's what a couple of nice people found for me:

The first is much too interactive, and scripting the interactivity would require me to learn too much JavaScript. The second outputs in HTML, which is no good for me. The third seemed to be the closest to what I have in mind, but it's old, Windows-only, seems to be lacking a lot of functionality, and doesn't have any documentation.


Wrote my own as it turns out. Anyone who wants to see how bad my code is, or save themselves a day or two, can contact me. -- RichardKulisz


A Web search makes it look like most of the known galaxy uses HtmlTidy for this conversion, which suggests that your problem with it is unusual rather than par for the course, and that there might be a workaround.

The development site is http://tidy.sourceforge.net/, as you probably know. I would try stepping up to the latest version, and/or backing down to a previous major release, and/or trying it on Linux if you had run it on Windows before, or on Windows if you ran it on Linux before. Being tediously methodical like this often finds a workaround reasonably rapidly.

Also, it appears that development is very active, so if you reported your bug, it might get fixed very quickly. Including the HTML file that it crashed on would be absolutely necessary for the bug report; if I were them, I'd probably ignore bug reports that didn't include test cases to reproduce the bug.
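
In that spirit, here's a small Python sketch of the tediously-methodical batch run, assuming a tidy binary on the PATH; it converts a directory of saved pages and records which files tidy chokes on, which is exactly the test case a bug report would need. The flags shown are common HTML Tidy options, but check the man page of whatever version you end up with.

 import pathlib
 import subprocess

 def tidy_all(src_dir, out_dir):
     out = pathlib.Path(out_dir)
     out.mkdir(exist_ok=True)
     failures = []
     for page in sorted(pathlib.Path(src_dir).glob("*.html")):
         result = subprocess.run(
             ["tidy", "-asxhtml", "-numeric", "-quiet", "-utf8",
              "-o", str(out / (page.stem + ".xml")), str(page)],
             capture_output=True, text=True)
         # tidy exits 0 (clean) or 1 (warnings); 2 means errors, and a
         # crash gives something worse still. Those are the files to report.
         if result.returncode > 1:
             failures.append((page.name, result.stderr.strip()))
     return failures

 for name, err in tidy_all("mirror", "tidied"):
     print(name, err)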

Any subsequent conversion you need to do might well be handled completely by XSLT, if XPath alone doesn't cut it.
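
For example, with the third-party lxml package in Python (rather than xsltproc or anything Smalltalk-side), the XPath and XSLT steps look roughly like this; the stylesheet below is a toy that just pulls headings out of tidy's XHTML, not anything tied to a real page.

 from lxml import etree

 doc = etree.parse("movie.xml")

 # XPath alone: grab the text nodes you care about.
 titles = doc.xpath("//*[local-name()='h1']/text()")
 print(titles)

 # XSLT: declare the conversion instead of hand-coding it.
 xslt = etree.XML(b"""\
 <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:x="http://www.w3.org/1999/xhtml">
   <xsl:output method="text"/>
   <xsl:template match="/">
     <xsl:for-each select="//x:h1">
       <xsl:value-of select="."/>
       <xsl:text>&#10;</xsl:text>
     </xsl:for-each>
   </xsl:template>
 </xsl:stylesheet>
 """)
 transform = etree.XSLT(xslt)
 print(str(transform(doc)))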


http://www.imdb.com/interfaces has, among other things, a bunch of parsable gzipped text files with all the data on IMDb. And this quote from http://www.imdb.com/conditions is interesting:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
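
If those dumps are an acceptable substitute for scraping the site, there's no HTML to fight with at all. A minimal sketch in Python, assuming one of the gzipped plain-text lists from the interfaces page has been fetched locally; the file name here is just an example, and the per-line record format differs from list to list, so the actual field splitting is left as a stub.

 import gzip

 def records(path):
     # The lists are large, so stream them instead of slurping.
     with gzip.open(path, mode="rt", encoding="latin-1", errors="replace") as f:
         for line in f:
             line = line.rstrip("\n")
             if line:
                 yield line   # real code would split fields per this list's format

 for n, line in enumerate(records("genres.list.gz")):
     if n >= 5:
         break
     print(line)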


http://www.velocityscape.com makes a tool that does a pretty good job of extracting fields from a webpage and loading them into a spreadsheet or database that you define. It also has a built-in crawler and form-poster. I have actually used it to log in automatically to secure sites (hoovers.com) and extract data into an Excel spreadsheet. The first release I saw (2.0) was pretty buggy, but the new release is solid.


I just tried the latest release and it's still buggy (it took me about an hour and ~5 crashes to scrape a simple table). Also, no transformations are allowed, so it requires a bunch of post-processing. The architecture is nice, though -- hopefully the next release will be better.
