I've been looking into HTML parsers recently and have explored two technologies. The first is Saxon XQL (XQuery run on the Saxon processor). The language is tricky to master, particularly its looping constructs and syntax. Here's an example Saxon XQL file.

declare namespace my = 'my:stuff';

declare variable $my:global as xs:integer := 1;

declare function my:mylen($n as xs:integer) as xs:integer {
  $n - 9
};

let $qry := doc("workarea/xfile.html")
return
<div>
{ for $qryItem at $pos in $qry//*:li[@class="g"]
  return
    if ($pos < 4) then
      <div>
        <h2>
          <a href="{string($qryItem//*:h3//*:a/@href)}">
            {string($qryItem//*:h3//*:a)}
          </a>
        </h2>
        {substring(
           substring-before(string($qryItem//*:div[@class="s"]), "SomeString"),
           1,
           my:mylen(string-length(
             substring-before(string($qryItem//*:div[@class="s"]), "SomeString"))))}
        <br/>
      </div>
    else ()
}
</div>
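
The densest part of that query is the nested substring expression: it takes the text that comes before "SomeString" and then drops the last 9 characters (the job of my:mylen). As a rough illustration of the same logic, here is a minimal Python 3 sketch. The function name snippet_before and its trim parameter are made up for this example and are not part of the query.

```python
def snippet_before(text, marker, trim=9):
    """Return the text before `marker`, minus its last `trim` characters.

    Hypothetical helper that mirrors the nested XQuery expression:
    substring(substring-before(text, marker), 1, length - 9).
    """
    # XQuery's substring-before() yields "" when the marker is absent
    # or empty, so do the same here.
    if not marker or marker not in text:
        return ""
    before = text.split(marker, 1)[0]
    # my:mylen() computes length - 9; a non-positive length selects nothing.
    keep = max(len(before) - trim, 0)
    return before[:keep]
```

For instance, snippet_before("Hello world, 123456789SomeString tail", "SomeString") keeps "Hello world, " and discards the nine digits that precede the marker.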

One of the problems with Saxon is that it requires XHTML-compliant input, so ordinary HTML needs to be preprocessed before it can be queried. However, if you use Python's Beautiful Soup, it copes with HTML that is not well-formed, and the code is object-oriented and quite readable.

# Python 2 with Beautiful Soup 3
from BeautifulSoup import BeautifulSoup

# filename is the path to the saved HTML page (set elsewhere)
data = open(filename)
soup = BeautifulSoup(data)
print soup.html.head.title.string

#for anchor in soup.findAll('a', href=True):
#    print anchor['href']

# The <ul> that wraps the list of results
firstnode = soup.find('ul', { "class" : "foo_results" } )
#print firstnode

hreflist = []
anchorcontent = []
bodycontent = []

# Each result title is an <h3> containing a link
secondnode = firstnode.findAll('h3')
#print secondnode
for thirdnode in secondnode:
    thelink = thirdnode.find('a')
    hreflist.append(thelink['href'])
    anchorcontent.append(thelink.renderContents())

# Each result body is a <p> inside the same list
thebody = firstnode.findAll('p')
for abodyitem in thebody:
    #print 'Body item:' + abodyitem.renderContents()
    bodycontent.append(abodyitem.renderContents())

print len(hreflist)
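
The preprocessing point mentioned earlier can be seen with a small experiment. The sketch below is written for Python 3 and uses only the standard library as stand-ins: xml.etree.ElementTree plays the role of a strict XML parser and html.parser the role of a lenient HTML parser. Neither is Saxon or Beautiful Soup, and the sample markup, the LinkCollector class, and the helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up sample: typical "tag soup" with an unclosed <br>, <p> and <li>.
SLOPPY_HTML = (
    '<ul class="foo_results"><li><h3><a href="/one">One</a></h3>'
    '<p>First<br>result</ul>'
)


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class LinkCollector(HTMLParser):
    """Lenient parser that records the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def collect_hrefs(markup):
    """Feed markup to the lenient parser and return the links it found."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs
```

With this sample, parses_as_xml(SLOPPY_HTML) is False because the tags are not balanced, while collect_hrefs(SLOPPY_HTML) still finds the one link. That is the same gap a tidy-up step has to close before an XML-based tool can query the page.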

Also, it's Python, which is very easy to read, and the libraries are well supported. So I'd recommend Beautiful Soup over Saxon. Get Beautiful Soup here.