(htmllib) How to capture text that includes tags?

jennyw

I'm trying to parse a product catalog written in HTML. Some of the
information I need is in tag attributes (like the product name, which
is in an anchor). Some (like the product description) is between tags
(in the case of the product description, the tag is font).

To capture product descriptions, I've been using the save_bgn() and
save_end() methods. But I've noticed that the result of save_end() only
includes text that isn't marked up. For example, this product
description:

<font size="1">
This rectangle measures 7&quot; x 3&quot;.
</font>

Drops the quotation marks, resulting in:

This rectangle measures 7 x 3.

I've been looking through Google Groups but haven't found a way to get
the markup in between the tags. Any suggestions?

This is the relevant portion of the class I'm using so far:

import htmllib, re

class myHTMLParser(htmllib.HTMLParser):

    def __init__(self, f):
        htmllib.HTMLParser.__init__(self, f)
        self

    def start_font(self, attrs):
        self.save_bgn()

    def end_font(self):
        text = self.save_end()
        if text:
            if re.search("\\.\\s*$", text):
                print "Probably a product description: " + text

    # I needed to override save_end because it was having trouble
    # when data was nothing.

    def save_end(self):
        """Ends buffering character data and returns all data saved since
        the preceding call to the save_bgn() method.

        If the nofill flag is false, whitespace is collapsed to single
        spaces. A call to this method without a preceding call to the
        save_bgn() method will raise a TypeError exception.

        """
        data = self.savedata
        self.savedata = None
        if data:
            if not self.nofill:
                data = ' '.join(data.split())
        return data

Thanks!

Jen
 
Mathias Waack

jennyw said:
I'm trying to parse a product catalog written in HTML. Some of the
information I need are attributes of tags (like the product name,
which is in an anchor). Some (like product description) are between
tags (in the case of product description, the tag is font).

To capture product descriptions, I've been using the save_bgn() and
save_end() methods. But I've noticed that the result of save_end() only
includes text that isn't marked up. For example, this product
description:

<font size="1">
This rectangle measures 7&quot; x 3&quot;.
</font>

Drops the quotation marks, resulting in:

This rectangle measures 7 x 3.

And what's the problem? HTML code produced by broken software like
Frontpage often contains unnecessary quotes - why do you want to
preserve this crap?

If you want to escape special characters you can use
xml.sax.saxutils.escape() or just write your own function (escape is
only a two-liner).
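
The escape() suggestion can be sketched as follows. This is a minimal illustration written for present-day Python 3, not code from the thread, and the helper name escape_with_quotes is made up for the example. By default escape() only rewrites &, < and >, so the double quote has to be passed in through the optional entities mapping.

```python
from xml.sax.saxutils import escape


def escape_with_quotes(text):
    """Escape &, < and > (the defaults) and additionally the double quote."""
    # The second argument maps extra characters to their replacement strings.
    return escape(text, {'"': "&quot;"})


print(escape_with_quotes('This rectangle measures 7" x 3".'))
```

This only helps once the quotes have survived parsing; it turns plain text back into markup, it does not recover characters the parser already dropped.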

Mathias
 
Peter Otten

I've found the parser in the HTMLParser module to be a lot easier to use.
Below is the rough equivalent of your posted code. In the general case you
will want to keep a stack of tags instead of the simple infont flag.

import HTMLParser, htmlentitydefs

class CatalogParser(HTMLParser.HTMLParser):
    entitydefs = htmlentitydefs.entitydefs

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.infont = False
        self.text = []

    def handle_starttag(self, tag, atts):
        if tag == "font":
            assert not self.infont
            self.infont = True

    def handle_entityref(self, name):
        if self.infont:
            self.handle_data(self.entitydefs.get(name, "?"))

    def handle_data(self, data):
        if self.infont:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "font":
            assert self.infont
            self.infont = False
            if self.text:
                print "".join(self.text)

data = """
<html>
<body>
<h1>&quot;Ignore me&quot;</h1>
<font size="1">
This &wuerg; rectangle measures 7&quot; x 3&quot;.
</font>
</body>
</html>
"""
p = CatalogParser()
p.feed(data)
p.close()
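
The "stack of tags" idea mentioned above might look roughly like this. It is a sketch under assumptions, not part of the posted code: it uses html.parser from Python 3 (the successor of the HTMLParser module), and the class name NestedFontParser and its attributes are invented for the example. It collects the text of each outermost font element, so nested font tags no longer trip an assert.

```python
from html.parser import HTMLParser


class NestedFontParser(HTMLParser):
    """Collect the text of each outermost <font> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode &quot; and friends itself.
        super().__init__(convert_charrefs=True)
        self.stack = []       # names of the tags that are currently open
        self.font_depth = 0   # how many of them are <font>
        self.text = []        # text pieces of the current outermost <font>
        self.results = []     # one string per finished outermost <font>

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "font":
            self.font_depth += 1

    def handle_data(self, data):
        if self.font_depth:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag, ignore it
        # Pop everything up to and including the matching opening tag.
        while True:
            closed = self.stack.pop()
            if closed == "font":
                self.font_depth -= 1
                if self.font_depth == 0:
                    self._flush()
            if closed == tag:
                break

    def _flush(self):
        # Join the pieces and collapse runs of whitespace.
        joined = " ".join("".join(self.text).split())
        if joined:
            self.results.append(joined)
        self.text = []


p = NestedFontParser()
p.feed('<font size="1">Outer <font color="red">inner 7&quot;</font> tail.</font>')
p.close()
print(p.results)
```

A font element that is never closed is not flushed by this sketch, so input with missing closing tags would still need cleaning up first.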

Peter
 
Paul Rubin

I've generally found that trying to parse the whole page with
regexps isn't appropriate. Here's a class that I use sometimes.
Basically you do something like

b = buf(urllib.urlopen(url).read())

and then search around for patterns you expect to find in the page:

b.search("name of the product")
b.rsearch('<a href="')
href = b.up_to('"')

Note that there's an esearch method that lets you do forward searches
for regexps (defaults to case independent since that's usually what
you want for html). But unfortunately, due to a deficiency in the Python
library, there's no simple way to implement backwards regexp searches.

Maybe I'll clean up the interface for this thing sometime.

================================================================

import re

class buf:
    def __init__(self, text=''):
        self.buf = text
        self.point = 0
        self.stack = []

    def seek(self, offset, whence='set'):
        if whence == 'set':
            self.point = offset
        elif whence == 'cur':
            self.point += offset
        elif whence == 'end':
            self.point = len(self.buf) - offset
        else:
            raise ValueError("whence must be one of ('set','cur','end')")

    def save(self):
        self.stack.append(self.point)

    def restore(self):
        self.point = self.stack.pop()

    def search(self, str):
        # move point to just past the next occurrence of str
        p = self.buf.index(str, self.point)
        self.point = p + len(str)
        return self.point

    def esearch(self, pat, *opts):
        # forward regexp search, case-insensitive unless flags are given
        opts = opts or [re.I]
        p = re.compile(pat, *opts)
        g = p.search(self.buf, self.point)
        self.point = g.end()
        return self.point

    def rsearch(self, str):
        # move point back to the start of the previous occurrence of str
        p = self.buf.rindex(str, 0, self.point)
        self.point = p
        return self.point

    def up_to(self, str):
        # return the text between point and the next occurrence of str;
        # the slice stops just before the delimiter
        a = self.point
        b = self.search(str)
        return self.buf[a:b - len(str)]
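
One possible workaround for the missing backwards regexp search is to scan forward over the text before the point and keep the last match. The function below is a sketch in Python 3, not part of the class above; the name rsearch_regex and its return convention (-1 when nothing matches) are assumptions for the example. Because re.finditer yields non-overlapping matches from the left, the result can differ from a true right-to-left search for patterns that overlap themselves.

```python
import re


def rsearch_regex(text, pattern, point, flags=re.I):
    """Return the start of the last match of pattern in text[:point], or -1."""
    last = -1
    # Scan forward over everything before point and remember the final match.
    for match in re.finditer(pattern, text[:point], flags):
        last = match.start()
    return last


page = '<A HREF="a.html">gadget</a> <a href="b.html">widget</a>'
point = page.index("widget")
start = rsearch_regex(page, r'<a\s+href="', point)
print(page[start:point])
```

The cost is linear in the amount of text before the point on every call, which is usually acceptable for a single page.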
 
jennyw

Peter Otten said:
I've found the parser in the HTMLParser module to be a lot easier to use.
Below is the rough equivalent of your posted code. In the general case you
will want to keep a stack of tags instead of the simple infont flag.

Thanks! What are the main advantages of HTMLParser over htmllib?

The code gives me something to think about ... it doesn't work right now
because it turns out there are nested font tags (which means the asserts
fail, and if I comment them out, it generates a 53 MB file from a < 1 MB
source file). I'll try playing with it and seeing if I can get it to do
what I want.

It would be easier if I could find a way to view the HTML as a tree ...
as a side note, are there any good utils to do this?

Thanks again!

Jen
 
Peter Otten

jennyw said:
Thanks! What are the main advantages of HTMLParser over htmllib?

Basically htmllib.HTMLParser feeds a formatter that I don't need with
information that I would rather disregard.
HTMLParser.HTMLParser, on the other hand, has a simple interface (you've
pretty much seen it all in my tiny example).

jennyw said:
The code gives me something to think about ... it doesn't work right now
because it turns out there are nested font tags (which means the asserts
fail, and if I comment them out, it generates a 53 MB file from a < 1 MB
source file). I'll try playing with it and seeing if I can get it to do
what I want.

I would suspect that there are <font> tags without a corresponding </font>.
You could fix that by preprocessing the HTML source with a tool like tidy.
As an aside, font tags as search criteria are as bad as you can get. Try to
find something more specific, e.g. the "second column in every row of the
first table". If this gets too complex for HTMLParser, you can instead
convert the HTML into XML (again via tidy) and then read it into a DOM
tree.

jennyw said:
It would be easier if I could find a way to view the HTML as a tree ...
as a side note, are there any good utils to do this?

I've never applied this primitive data extraction technique to large complex
html files, so for me a text editor has been sufficient so far.
(If you are on Linux, you could give Quanta Plus a try)
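
A more specific search criterion such as "the second column in every row of the first table" could be sketched like this. The code is an illustration in Python 3 using html.parser; the class name SecondColumnParser and the sample markup are invented, nothing is known about the real catalog's layout, and the sketch assumes the first table contains no nested tables.

```python
from html.parser import HTMLParser


class SecondColumnParser(HTMLParser):
    """Collect the text of the second cell of every row in the first table."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.table_count = 0         # number of <table> start tags seen so far
        self.in_first_table = False
        self.column = 0              # 1-based index of the current cell in its row
        self.in_cell = False
        self.cell_text = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
            self.in_first_table = self.table_count == 1
        elif self.in_first_table and tag == "tr":
            self.column = 0          # a new row starts counting cells again
        elif self.in_first_table and tag in ("td", "th"):
            self.column += 1
            self.in_cell = True
            self.cell_text = []

    def handle_data(self, data):
        if self.in_cell and self.column == 2:
            self.cell_text.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_first_table = False
            self.in_cell = False
        elif self.in_cell and tag in ("td", "th"):
            if self.column == 2:
                # Join the pieces and collapse runs of whitespace.
                self.values.append(" ".join("".join(self.cell_text).split()))
            self.in_cell = False


sample = (
    "<table>"
    "<tr><td>Rectangle</td><td>Measures 7&quot; x 3&quot;.</td></tr>"
    "<tr><td>Square</td><td><b>Measures</b> 4&quot; x 4&quot;.</td></tr>"
    "</table>"
    "<table><tr><td>ignored</td><td>ignored too</td></tr></table>"
)
p = SecondColumnParser()
p.feed(sample)
p.close()
print(p.values)
```

Inline markup inside a cell, like the b element in the second row, is skipped while its text is kept.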


Peter

PS: You could ask the company supplying the catalog for a copy in a more
accessible format, assuming you are a customer rather than a competitor.
 
John J. Lee

jennyw said:
On Wed, Nov 05, 2003 at 11:23:36AM +0100, Peter Otten wrote: [...]
Thanks! What are the main advantages of HTMLParser over htmllib?

It won't choke on XHTML.


[...]
It would be easier if I could find a way to view the HTML as a tree ...
as a side note, are there any good utils to do this?

Not that I know of (google for it), but DOM is probably the easiest
way to make one. DOM libraries often have a prettyprint function to
(textually) print DOM nodes (eg. 4DOM from PyXML), which I've found
quite useful -- but of course that's just a chunk of the HTML nicely
reformatted as XHTML. Alternatively, you could use something like
graphviz / dot and some DOM-traversing code to make graphical trees.
Unfortunately, if this is HTML 'as deployed' (ie. unparseable junk),
you may have to run it through HTMLTidy before it goes into your DOM
parser (use mxTidy or uTidylib).
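
As a rough illustration of the DOM route, the sketch below prints an indented outline of a document using xml.dom.minidom from the Python 3 standard library rather than 4DOM. The function name outline and the sample markup are made up, and the input is assumed to be well-formed XHTML already (for instance after a tidy pass); real-world HTML would fail to parse.

```python
from xml.dom import minidom


def outline(node, depth=0, lines=None):
    """Return an indented, line-per-node outline of a DOM subtree."""
    if lines is None:
        lines = []
    indent = "  " * depth
    if node.nodeType == node.ELEMENT_NODE:
        # Show attributes in a stable order next to the tag name.
        attrs = "".join(' %s="%s"' % (name, value)
                        for name, value in sorted(node.attributes.items()))
        lines.append(indent + "<" + node.tagName + attrs + ">")
        for child in node.childNodes:
            outline(child, depth + 1, lines)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        # Skip whitespace-only text and collapse runs of whitespace.
        lines.append(indent + "text: " + " ".join(node.data.split()))
    return lines


xhtml = """<html><body>
<font size="1">This rectangle measures 7&quot; x 3&quot;.</font>
</body></html>"""
doc = minidom.parseString(xhtml)
print("\n".join(outline(doc.documentElement)))
```

minidom documents also have a toprettyxml() method, which gives a reformatted copy of the markup instead of an outline.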


John
 
Dennis Lee Bieber

Mathias Waack fed this fish to the penguins on Wednesday 05 November
2003 00:21 am:
And what's the problem? HTML code produced by broken software like
Frontpage often contains unnecessary quotes - why do you want to
preserve this crap?

Those look to be intentional "s -- marker for "inches"

7" x 3" -> 7 inches by 3 inches
