Web Link List Example

A WebLinkListExample simply shows what it takes, including installation and configuration, to provide a command-line program that, given the URL of a Web page, answers with a list of the URLs found in the HREF attributes on that page. For example:

 LinkList http://www.google.com

Answers (on 14-Apr-2004):

 /imghp?hl=en&tab=wi&ie=UTF-8
 /grphp?hl=en&tab=wi&ie=UTF-8
 /imghp?hl=en&tab=wi&ie=UTF-8
 /nwshp?hl=en&tab=wi&ie=UTF-8
 /froogle?hl=en&tab=wi&ie=UTF-8
 /froogle?hl=en&tab=wi&ie=UTF-8
 /options/index.html
 /ads/
 /services/

It is important that each example include download URLs, installation instructions, and step-by-step directions so that a novice developer can get the example working. We can assume the developer is experienced in some other language, but has never seen this platform and language before.

While discussing DotNetAsDistributedObjectSystem, it was claimed that communicating between systems using HTTP is now trivial, and that DotNet provides no significant advantage over other platforms. Hopefully the examples will advance the discussion by providing concrete working examples in various languages and environments for comparison.

Here are some examples (please add to this list):


Nailing down the spec

Actually, there are more links on Google's home page:

 /about.html
 /advanced_search?hl=en
 /preferences?hl=en

So a better specification would be an executable one. For lack of a better quick solution, let this be the spec:

 curl -s TARGET_URL | perl -e "while (\$line = <STDIN>) { while (\$line =~ /(href\\s*=\\s*\\\"([^\\\"]*))|(href\\s*=\\s*([^\\s\\\">]+)[>])|(href\\s*=\\s*\\'([^\\']*)\\')/gi) { print \$2 . \$4 . \$6 . \"\\n\"; } }"

This RegularExpression is further subject to false positives, as href=url may appear outside the context of a proper tag (a contrived demonstration follows). I don't believe the problem can be perfectly solved with RegularExpressions (I'd be delighted if someone can disprove my assertion), so a better solution would make use of HTMLTidy or some other validating HTML parser. In any case, the meat of the challenge was to download something from a URL and subject it to a simple RegularExpression-based search.
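To make the false positive concrete, here is a made-up example using just the double-quote branch of the regex above. The href sits inside an HTML comment, yet the pattern still reports it as a link:

 echo '<!-- href="phantom" -->' | perl -ne 'print "$1\n" while /href\s*=\s*"([^"]*)/gi'

This prints phantom.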

Well... you really should use the standard XML parser in Perl, but when screwing around on the command line, I've always had very good luck with exactly this task, over the years, just by arranging for every tag to be on its own line and then including "<a href=" in the pattern match.

Use tr -d '\012' to delete the newlines, then pipe to sed to insert a newline before every < and after every >. Spelled out as a pipeline, that might look like the sketch below.
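Here is one way to write it; the s/</\n</g style of inserting a newline assumes GNU sed (with a BSD sed you would embed a literal newline in the replacement instead):

 curl -s TARGET_URL | tr -d '\012' | sed -e 's/</\n</g' -e 's/>/>\n/g' | grep -i '<a href='

This prints the whole <a href=...> tags, one per line, rather than just the URLs, which is usually enough for command-line screwing around.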

Or put this fool thing in a script instead of pretending it's simple enough for a one-off command line, and then perl can do all the work, including the curl part. :-)

On the other hand, that would involve downloading two modules from CPAN, which would probably make Tom unhappy, since he wanted simplicity.
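For the record, here is roughly what such a script might look like. This is only a sketch, and the two CPAN modules in question are my guess: LWP::Simple to do the curl part, and HTML::LinkExtor (built on HTML::Parser, so it dodges the false-positive problem above) to find the links.

 #!/usr/bin/perl
 # linklist.pl -- print the HREF targets found on a page.
 use strict;
 use warnings;
 use LWP::Simple qw(get);   # CPAN module: does the curl part
 use HTML::LinkExtor;       # CPAN module: a real HTML parser

 my $url  = shift or die "usage: $0 URL\n";
 my $html = get($url) or die "could not fetch $url\n";

 # Collect the href attribute of every <a> tag the parser finds.
 my @links;
 my $extor = HTML::LinkExtor->new(sub {
     my ($tag, %attr) = @_;
     push @links, $attr{href} if $tag eq 'a' and defined $attr{href};
 });
 $extor->parse($html);
 $extor->eof;

 print "$_\n" for @links;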


CategoryComparisons

