Web Link List Example In Python

PythonLanguage implementations of web link lister. (See WebLinkListExample.)

Here's an implementation that lists the hrefs from a web document using a HyperTextMarkupLanguage parser. It uses standard modules HTMLParser, urllib, and sys. No special setup is required after Python is installed. The implementation is pretty idiot-proof and fails gracefully when an error is encountered.

 'linklister1.py'

 from HTMLParser import HTMLParser
 class LinkParser(HTMLParser):
 def reset(self):
 HTMLParser.reset(self)
 self.hrefs = []
 def handle_starttag(self, tag, attrs):
 href = [v for k,v in attrs if k.lower() == 'href']
 if href: self.hrefs.extend(href)

 def listLinks(url):
 import urllib
 print '*** links in', url, '***'
 parser = LinkParser()
 try:
 parser.feed(urllib.urlopen(url).read())
 except Exception, e:
 print '%s: %s\n' % (e, url)
 return
 parser.close()
 for href in parser.hrefs: print href
 if not parser.hrefs: print 'None!'
 print

 if __name__ == '__main__':
 import sys
 if len(sys.argv) > 1:
 for url in sys.argv[1:]: listLinks(url)
 else: print 'usage: linklister1 url [url2 ...]'

The program can parse both HTML and XHTML. It retrieves the values of all href attributes it finds in all the start tags. The LinkParser.handle_starttag method can be modified to process only the tags of interest, e.g., <a> and <area> but not <link> and <base>.

Results from www.google.com on 19-Apr-2004

 *** links in http://www.google.com/ ***
 /imghp?hl=en&tab=wi&ie=UTF-8
 /grphp?hl=en&tab=wg&ie=UTF-8
 /nwshp?hl=en&tab=wn&ie=UTF-8
 /froogle?hl=en&tab=wf&ie=UTF-8
 /froogle?hl=en&tab=wf&ie=UTF-8
 /options/index.html
 /advanced_search?hl=en
 /preferences?hl=en
 /language_tools?hl=en
 /ads/
 /services/
 /about.html

For something completely different, here's an implementation that lists all the hrefs using RegularExpressions. It uses standard modules urllib, re, and sys. No special setup is required after Python is installed. The implementation is pretty idiot-proof and fails gracefully when an error is encountered.

Tests with Python's timeit module show this regexp implementation is about 22% faster than the HTMLParser implementation on a local html file with 59 hrefs, using Python 2.3 on MacOs 9.

 'linklister2.py'

 def listHrefs(url):
 import urllib, re
 lookFor = r'href\s*=\s*\"([^\"]*)|href\s*=\s*\'([^\']*)|href\s*=\s*([^\s\"\'>]+)'
 clookFor = re.compile(lookFor, re.I)
 print '*** hrefs in', url, '***'
 try:
 document = urllib.urlopen(url).read()
 except Exception, e:
 print '%s: %s\n' % (e, url)
 return
 hrefs = clookFor.findall(document)
 for tup in hrefs:
 for href in tup:
 if href: print href
 if not hrefs: print 'None!'
 print

 if __name__ == '__main__':
 import sys
 if len(sys.argv) > 1:
 for url in sys.argv[1:]: listHrefs(url)
 else: print 'usage: linklister2 url [url2 ...]'

The program can search any document until an EOF marker is encountered. If there happens to exist 'href="blah"' in the text of a web document, this regexp implementation finds it in addition to the href attributes in the tags.

Results from www.google.com on 19-Apr-2004

 *** hrefs in http://www.google.com/ ***
 /imghp?hl=en&tab=wi&ie=UTF-8
 /grphp?hl=en&tab=wg&ie=UTF-8
 /nwshp?hl=en&tab=wn&ie=UTF-8
 /froogle?hl=en&tab=wf&ie=UTF-8
 /froogle?hl=en&tab=wf&ie=UTF-8
 /options/index.html
 /advanced_search?hl=en
 /preferences?hl=en
 /language_tools?hl=en
 /ads/
 /services/
 /about.html

Contributor(s): ElizabethWiethoff

CategoryPython