html5lib and ElementTree for scraping basketball schedules

Dan Connolly

You may have seen:

  A new Basketball season brings a new episode
  in the personal information disaster
  by connolly on Thu, 2006-11-16 12:39
  tags: calendar | GRDDL | microformats | RDF | XHTML
  http://dig.csail.mit.edu/breadcrumbs/node/172

A new schedule arrived this week, and it had an unexpected line break,
so I upgraded from tidy and regular expressions
to html5lib and ElementTree.
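
A minimal sketch of why an unexpected line break can trip up the
regular-expression approach but not a tree-based one. The markup, the
pattern, and the names one_line, wrapped, and cell_texts are invented for
illustration (the real schedule and the original expressions are not
shown in this message), and the stdlib xml.etree.ElementTree parser
stands in for html5lib so the example runs on well-formed input without
extra packages.

```python
import re
from xml.etree import ElementTree

# Hypothetical schedule rows; the second has a line break inside a cell.
one_line = "<tr><td>Nov 20</td><td>6:00 practice</td></tr>"
wrapped = "<tr><td>Nov 20</td><td>6:00\npractice</td></tr>"

# A line-oriented pattern: '.' does not match a newline by default.
pattern = re.compile(r"<td>(.*?)</td><td>(.*?)</td>")


def cell_texts(row_markup):
    """Return the whitespace-normalized text of each td in one row."""
    row = ElementTree.fromstring(row_markup)
    return [" ".join((td.text or "").split()) for td in row.iter("td")]


regex_one_line = pattern.search(one_line)   # matches
regex_wrapped = pattern.search(wrapped)     # no match: newline in the cell
tree_wrapped = cell_texts(wrapped)          # same cells either way
```

Once the document is a tree, where the whitespace falls inside a cell no
longer affects which cells are found; it only has to be normalized when
the text is read out.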

This message has most of the raw materials for another
breadcrumbs episode...


import html5lib # http://code.google.com/p/html5lib/
from html5lib import HTMLParser, treebuilders
from xml.etree import cElementTree


def parseHTML(fn="bball-practice.html"):
    """
    >>> e = parseHTML()
    >>> e.tag
    'html'
    >>> rows = e.getiterator('tr')
    >>> len(list(rows))
    24
    """
    # Build an ElementTree-backed html5lib parser, then parse the file.
    parser = HTMLParser(tree=treebuilders.getTreeBuilder("etree",
                                                         cElementTree))
    f = open(fn)
    try:
        return parser.parse(f)
    finally:
        f.close()

...

def eachEvent(...):
    ...

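    # html5lib adds a tbody element if the markup omits one,
    # hence the tbody/tr/td path.
    for t in elt.getiterator('table'):
        cell = t.find('tbody/tr/td')
        # An Element with no children is falsy, so test against None.
        if cell is None: continue
        hd = cell.findtext('b')
        if not hd: continue
        if 'First Name' in hd: break
    else:
        # The loop ended without a break: no table has the heading.
        raise ValueError(elt)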
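
The for/else search above can be tried on its own with a small inline
document. This is only a sketch under stated assumptions: PAGE and
find_roster_table are hypothetical names, the two-table markup is made
up, returning the table after the break is a guess at what the elided
code does next, and the stdlib ElementTree parser replaces html5lib so
the snippet has no dependencies (which is why the tbody elements are
written out explicitly).

```python
from xml.etree import ElementTree

# Hypothetical page with two tables; only the second is the roster.
PAGE = """<html><body>
<table><tbody><tr><td><b>Gym Rules</b></td></tr></tbody></table>
<table><tbody>
<tr><td><b>First Name</b></td><td><b>Last Name</b></td></tr>
<tr><td>Pat</td><td>Example</td></tr>
</tbody></table>
</body></html>"""


def find_roster_table(root):
    """Return the first table whose first cell has a bold 'First Name'."""
    for table in root.iter("table"):
        cell = table.find("tbody/tr/td")
        if cell is None:        # no cell at that path
            continue
        heading = cell.findtext("b")
        if not heading:         # no <b> child, or an empty one
            continue
        if "First Name" in heading:
            break
    else:
        # Runs only when the loop finished without hitting 'break'.
        raise ValueError("no table with a 'First Name' heading")
    return table


root = ElementTree.fromstring(PAGE)
roster = find_roster_table(root)
row_count = len(roster.findall("tbody/tr"))
```

The else clause belongs to the for statement, not to any of the if
statements, so the ValueError is raised only when every table has been
examined and none matched.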


For my reference, some hg logs:

16:ae65b101cf4c 2007-11-18 got html5lib talking with etree
17:5f81574c79fb 2007-11-18 - use html5lib and ElementTree rather than tidy and regular expressions



--
Dan Connolly, W3C http://www.w3.org/People/Connolly/
gpg D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E