Web Link List Example In Python

PythonLanguage implementations of web link lister. (See WebLinkListExample.)

Here's an implementation that lists the hrefs from a web document using a HyperTextMarkupLanguage parser. It uses standard modules HTMLParser, urllib, and sys. No special setup is required after Python is installed. The implementation is pretty idiot-proof and fails gracefully when an error is encountered.

 'linklister1.py'

from HTMLParser import HTMLParser class LinkParser(HTMLParser): def reset(self): HTMLParser.reset(self) self.hrefs = [] def handle_starttag(self, tag, attrs): href = [v for k,v in attrs if k.lower() == 'href'] if href: self.hrefs.extend(href)

def listLinks(url): import urllib print '*** links in', url, '***' parser = LinkParser() try: parser.feed(urllib.urlopen(url).read()) except Exception, e: print '%s: %s\n' % (e, url) return parser.close() for href in parser.hrefs: print href if not parser.hrefs: print 'None!' print

if __name__ == '__main__': import sys if len(sys.argv) > 1: for url in sys.argv[1:]: listLinks(url) else: print 'usage: linklister1 url [url2 ...]'
The program can parse both HTML and XHTML. It retrieves the values of all href attributes it finds in all the start tags. The LinkParser.handle_starttag method can be modified to process only the tags of interest, e.g., <a> and <area> but not <link> and <base>.

Results from www.google.com on 19-Apr-2004

 *** links in http://www.google.com/ ***
 /imghp?hl=en&tab=wi&ie=UTF-8
 /grphp?hl=en&tab=wg&ie=UTF-8
 /nwshp?hl=en&tab=wn&ie=UTF-8
 /froogle?hl=en&tab=wf&ie=UTF-8
 /froogle?hl=en&tab=wf&ie=UTF-8
 /options/index.html
 /advanced_search?hl=en
 /preferences?hl=en
 /language_tools?hl=en
 /ads/
 /services/
 /about.html


For something completely different, here's an implementation that lists all the hrefs using RegularExpressions. It uses standard modules urllib, re, and sys. No special setup is required after Python is installed. The implementation is pretty idiot-proof and fails gracefully when an error is encountered.

Tests with Python's timeit module show this regexp implementation is about 22% faster than the HTMLParser implementation on a local html file with 59 hrefs, using Python 2.3 on MacOs 9.

 'linklister2.py'

def listHrefs(url): import urllib, re lookFor = r'href\s*=\s*\"([^\"]*)|href\s*=\s*\'([^\']*)|href\s*=\s*([^\s\"\'>]+)' clookFor = re.compile(lookFor, re.I) print '*** hrefs in', url, '***' try: document = urllib.urlopen(url).read() except Exception, e: print '%s: %s\n' % (e, url) return hrefs = clookFor.findall(document) for tup in hrefs: for href in tup: if href: print href if not hrefs: print 'None!' print

if __name__ == '__main__': import sys if len(sys.argv) > 1: for url in sys.argv[1:]: listHrefs(url) else: print 'usage: linklister2 url [url2 ...]'
The program can search any document until an EOF marker is encountered. If there happens to exist 'href="blah"' in the text of a web document, this regexp implementation finds it in addition to the href attributes in the tags.

Results from www.google.com on 19-Apr-2004

 *** hrefs in http://www.google.com/ ***
 /imghp?hl=en&tab=wi&ie=UTF-8
 /grphp?hl=en&tab=wg&ie=UTF-8
 /nwshp?hl=en&tab=wn&ie=UTF-8
 /froogle?hl=en&tab=wf&ie=UTF-8
 /froogle?hl=en&tab=wf&ie=UTF-8
 /options/index.html
 /advanced_search?hl=en
 /preferences?hl=en
 /language_tools?hl=en
 /ads/
 /services/
 /about.html


Contributor(s): ElizabethWiethoff


CategoryPython


EditText of this page (last edited January 4, 2012) or FindPage with title or text search