Reading HTML with Python

Today I read some data out of an HTML table using Python. To fetch the page I used urllib:

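from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen

# <URL_string> stands for the address of the page, passed as a string.
f = urlopen(<URL_string>)
# Read from the object, storing the page's contents (as bytes) in 's'.
s = f.read()
f.close()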

Then, to parse the tables in the HTML, I used lxml, which can read HTML and provides an ElementTree-like API.

from lxml import etree
html = etree.HTML(s)
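
To get a feel for what etree.HTML hands back, here is a small sketch on an inline string. The sample markup is made up for illustration and is not from the page I was reading; the point is only that the parser wraps a fragment in html and body elements, so the table sits a couple of levels down from the root.

```python
from lxml import etree

# Hypothetical sample markup, standing in for the downloaded page.
sample = "<table><tr><td>a</td><td>b</td></tr></table>"

root = etree.HTML(sample)

# The HTML parser wraps fragments in <html><body>, so the root is <html>
# and its first child is <body>.
print(root.tag)
body = root[0]
print(body.tag)
```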

From there I could do everything I needed: calling list() on an element returns its children, .tag gives an element's name, and .text gives the text it contains, which was enough to pull the text out of table cells and links.
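
The traversal described above might look roughly like the following. This is a sketch, not the exact code I ran: the sample page, the find_first helper, and the rows variable are all assumptions made for the example. It only uses list(), .tag and .text.

```python
from lxml import etree

# Hypothetical sample page; in practice this would be the 's' read from the URL.
s = """
<html><body>
<table>
  <tr><th>Name</th><th>Link</th></tr>
  <tr><td>lxml</td><td><a href="https://lxml.de/">site</a></td></tr>
</table>
</body></html>
"""

html = etree.HTML(s)


def find_first(element, tag):
    """Depth-first search for the first descendant with the given tag."""
    for child in list(element):
        if child.tag == tag:
            return child
        found = find_first(child, tag)
        if found is not None:
            return found
    return None


table = find_first(html, "table")

rows = []
for tr in list(table):
    cells = []
    for cell in list(tr):
        # A cell's own .text is None when its content starts with a child
        # element such as <a>, so take the link's text in that case.
        links = [c for c in list(cell) if c.tag == "a"]
        cells.append(links[0].text if links else cell.text)
    rows.append(cells)

print(rows)
```

The recursive helper is just one way to reach the table; indexing down from the root works equally well when the page layout is known.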
