Reading HTML with Python

Today I read data from an HTML table using Python. To fetch the page, I used urllib:

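from urllib.request import urlopen  # Python 3; Python 2 spelled this urllib.urlopen

# Open the URL; the 'with' block closes the connection when done.
with urlopen(<URL_string>) as f:
    # Read from the object, storing the page's contents (bytes) in 's'.
    s = f.read()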

Then, to parse the tables in the HTML, I used lxml, which can read HTML and provides an ElementTree-like API.

from lxml import etree
html = etree.HTML(s)

With that, I could do everything I needed: calling list() on an element returns its children, and the .tag and .text attributes give each element's name and text, which was enough to pull the text out of tables and links.
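For illustration, here is a minimal sketch of that kind of traversal. It is not the code I actually ran: the sample page and the helper names (find_first, cell_text, table_rows) are made up for this example, and it assumes each cell holds either plain text or a single link.

```python
from lxml import etree

# Hypothetical sample page standing in for the downloaded contents 's'.
s = b"""
<html><body>
<table>
  <tr><th>Name</th><th>Link</th></tr>
  <tr><td>lxml</td><td><a href="https://lxml.de/">docs</a></td></tr>
  <tr><td>Python</td><td><a href="https://www.python.org/">home</a></td></tr>
</table>
</body></html>
"""

html = etree.HTML(s)


def find_first(element, tag):
    """Depth-first search for the first element with the given tag."""
    if element.tag == tag:
        return element
    for child in list(element):  # list() yields the element's children
        found = find_first(child, tag)
        if found is not None:
            return found
    return None


def cell_text(cell):
    """Return a cell's text, or its link's text if the cell wraps an <a>."""
    children = list(cell)
    if children and children[0].tag == "a":
        return children[0].text
    return cell.text


def table_rows(table):
    """Collect the cell text of every row, descending into thead/tbody/tfoot."""
    rows = []
    for child in list(table):
        if child.tag in ("thead", "tbody", "tfoot"):
            rows.extend(table_rows(child))
        elif child.tag == "tr":
            rows.append([cell_text(c) for c in list(child)
                         if c.tag in ("td", "th")])
    return rows


rows = table_rows(find_first(html, "table"))
for row in rows:
    print(row)
```

Real pages are usually messier than this (nested markup inside cells, stray whitespace), so the helpers would likely need adjusting for a given table.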

fromthepantothefire

"…we are not pans and barrows, nor even porters of the fire and torchbearers, but children of the fire, made of it…" — Ralph Waldo Emerson, The Poet
