Codec lookup fails for bad codec name, blowing up BeautifulSoup


John Nagle

I just had our web page parser fail on "www.nasa.gov".
It seems that NASA returns an HTTP header with a charset of ".utf8", which
is non-standard. This goes into BeautifulSoup, which blows up trying to
find a suitable codec.

This happens because BeautifulSoup does this:

def _codec(self, charset):
    if not charset: return charset
    codec = None
    try:
        codecs.lookup(charset)
        codec = charset
    except LookupError:
        pass
    return codec

The documentation for codecs.lookup says:

lookup(encoding)
Looks up a codec tuple in the Python codec registry and returns
the function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned.
If no codecs tuple is found, a LookupError is raised.

So BeautifulSoup's lookup ought to be safe, right? Wrong.
What actually happens is a ValueError exception:

  File "./sitetruth/BeautifulSoup.py", line 1770, in _codec
    codecs.lookup(charset)
  File "/usr/local/lib/python2.5/encodings/__init__.py", line 97, in search_function
    globals(), locals(), _import_tail)
ValueError: Empty module name
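
To check which exception a given interpreter actually raises, a small probe along the following lines may help. This is only a sketch: the helper name probe_lookup is made up for illustration, and it is written for Python 3 syntax rather than the Python 2.5 shown in the traceback. Newer Python versions may well report LookupError for ".utf8" instead of ValueError, so no particular output is claimed here.

```python
import codecs

def probe_lookup(charset):
    """Return "ok" if codecs.lookup() finds the codec,
    otherwise the class name of whatever exception it raised."""
    try:
        codecs.lookup(charset)
    except Exception as exc:  # deliberately broad: the point is to see what gets raised
        return type(exc).__name__
    return "ok"

# A valid name, an unknown name, and the malformed name from the NASA header.
for name in ("utf-8", "no-such-codec", ".utf8"):
    print(name, "->", probe_lookup(name))
```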

This is a known bug. It's in the old tracker on SourceForge:
[ python-Bugs-960874 ] codecs.lookup can raise exceptions other
than LookupError
but not in the new tracker.

The "resolution" back in 2004 was "Won't Fix", without a change
to the documentation. Grrr.

I patched BeautifulSoup to work around the problem by catching ValueError as well:

def _codec(self, charset):
    if not charset: return charset
    codec = None
    try:
        codecs.lookup(charset)
        codec = charset
    except (LookupError, ValueError):
        pass
    return codec
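
The same idea can be sketched outside BeautifulSoup as a standalone helper. The function name usable_charset and its fallback parameter are hypothetical and not part of BeautifulSoup or the codecs module. The sketch assumes that treating ValueError like LookupError is acceptable for the caller, and that falling back to a default encoding is better than failing on a malformed header.

```python
import codecs

def usable_charset(charset, fallback="utf-8"):
    """Return charset if Python has a codec for it, else fallback.

    Catches ValueError as well as LookupError, because codecs.lookup()
    has been seen to raise ValueError for malformed names such as ".utf8".
    """
    if not charset:
        return fallback
    try:
        codecs.lookup(charset)
    except (LookupError, ValueError):
        return fallback
    return charset

# Example: a header-supplied charset that may be malformed.
raw = b"hello"
text = raw.decode(usable_charset(".utf8"))
```

With this shape the caller always gets a name that can be passed to decode(), at the cost of silently guessing the encoding when the server sends a bad one.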


John Nagle
 

Waldemar Osuch

Waldemar said:
This is a known bug. It's in the old tracker on SourceForge:
[ python-Bugs-960874 ] codecs.lookup can raise exceptions other
than LookupError
but not in the new tracker.

How did you find that? I put "codecs.lookup" into the tracker's
search box, and it returned five hits, but not that one.

John Nagle

I have seen this explained on this list once:
http://bugs.python.org/issues + <source forge bug id>
points to the converted ticket.
And yes the search could be better.
 
