Another BeautifulSoup crash on bad HTML

John Nagle

Can't really blame BeautifulSoup for this, but our crawler hit a page
("http://clagnut.com/privacy/") with an out-of-range character escape:



in this text:

If you provide a name, email address and/or website and choose ‘Remember
me, these details will be stored as a cookie on your computer.

The author clearly meant "’", which is a single close quote.

The traceback as BeautifulSoup aborts:

    SGMLParser.feed(self, markup or "")
  File "/usr/local/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.5/sgmllib.py", line 181, in goahead
    self.handle_charref(name)
  File "/var/www/vhosts/sitetruth.com/cgi-bin/sitetruth/BeautifulSoup.py", line 1250, in handle_charref
    data = unichr(int(ref))
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Another item in our ongoing saga of "What happens when you parse real-world
HTML".

A try-block in handle_charref would be appropriate.
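
A rough sketch of what such a try-block might look like. This is not BeautifulSoup's actual code: safe_charref is a hypothetical helper name, the U+FFFD fallback is an assumption about what a caller would want, and it is written for Python 3 (chr) rather than the Python 2.5 unichr shown in the traceback.

```python
REPLACEMENT = "\ufffd"  # assumed fallback: the Unicode replacement character


def safe_charref(ref, replacement=REPLACEMENT):
    """Turn the body of a numeric character reference into a character.

    `ref` is the text between "&#" and ";", for example "8217" or "x2019".
    A malformed or out-of-range reference yields `replacement` instead of
    raising, so one bad escape does not abort the whole parse.
    """
    try:
        if ref[:1] in ("x", "X"):
            code = int(ref[1:], 16)   # hexadecimal form
        else:
            code = int(ref)           # decimal form
        return chr(code)
    except (ValueError, OverflowError):
        # int() rejects non-numeric text; chr() rejects code points
        # outside the valid Unicode range.
        return replacement


if __name__ == "__main__":
    for ref in ("8217", "x2019", "99999999", "bogus"):
        print(ref, repr(safe_charref(ref)))
```

Whether to substitute a placeholder, drop the character, or pass the raw escape text through is a judgment call; the point is only that the conversion should not be allowed to take down the parse.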

John Nagle
SiteTruth