Codec lookup fails for bad codec name, blowing up BeautifulSoup

Discussion in 'Python' started by John Nagle, Nov 9, 2007.

  1. John Nagle

    John Nagle Guest

    I just had our web page parser fail on "www.nasa.gov".
    It seems that NASA returns an HTTP header with a charset of ".utf8", which
    is non-standard. This goes into BeautifulSoup, which blows up trying to
    find a suitable codec.

    This happens because BeautifulSoup does this:

    def _codec(self, charset):
        if not charset: return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except LookupError:
            pass
        return codec

    The documentation for codecs.lookup says:

    lookup(encoding)
    Looks up a codec tuple in the Python codec registry and returns
    the function tuple as defined above.

    Encodings are first looked up in the registry's cache. If not found,
    the list of registered search functions is scanned.
    If no codecs tuple is found, a LookupError is raised.

    So BeautifulSoup's lookup ought to be safe, right? Wrong.
    What actually happens is a ValueError exception:

      File "./sitetruth/BeautifulSoup.py", line 1770, in _codec
        codecs.lookup(charset)
      File "/usr/local/lib/python2.5/encodings/__init__.py", line 97, in search_function
        globals(), locals(), _import_tail)
    ValueError: Empty module name

    This is a known bug. It's in the old tracker on SourceForge:
    [ python-Bugs-960874 ] codecs.lookup can raise exceptions other
    than LookupError
    but not in the new tracker.

    The "resolution" back in 2004 was "Won't Fix", without a change
    to the documentation. Grrr.

    Patched BeautifulSoup to work around the problem:

    def _codec(self, charset):
        if not charset: return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except (LookupError, ValueError):
            pass
        return codec
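
    The same workaround can be sketched as a standalone function, outside
    BeautifulSoup. This is only an illustration: the name usable_codec is
    hypothetical, and which exception a bad name such as ".utf8" raises may
    differ between Python versions, so both are caught.

```python
import codecs


def usable_codec(charset):
    """Return charset if Python has a codec registered for it, else None.

    Hypothetical helper mirroring the patched _codec method above.
    """
    if not charset:
        # Empty or missing charset: pass it through unchanged.
        return charset
    try:
        codecs.lookup(charset)
    except (LookupError, ValueError):
        # LookupError is the documented failure. ValueError covers the
        # undocumented case reported above for names such as ".utf8".
        return None
    return charset


if __name__ == "__main__":
    for name in ("utf-8", ".utf8", "no-such-codec"):
        print(repr(name), "->", repr(usable_codec(name)))
```

    A caller would then fall back to its next encoding guess whenever the
    function returns None.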


    John Nagle
     
    John Nagle, Nov 9, 2007
    #1


  2. >
    > This is a known bug. It's in the old tracker on SourceForge:
    > [ python-Bugs-960874 ] codecs.lookup can raise exceptions other
    > than LookupError
    > but not in the new tracker.


    The new tracker has it too.
    http://bugs.python.org/issue960874

    >
    > The "resolution" back in 2004 was "Won't Fix", without a change
    > to the documentation. Grrr.
    >
     
    Waldemar Osuch, Nov 9, 2007
    #2

  3. John Nagle

    John Nagle Guest

    Waldemar Osuch wrote:
    >> This is a known bug. It's in the old tracker on SourceForge:
    >> [ python-Bugs-960874 ] codecs.lookup can raise exceptions other
    >> than LookupError
    >> but not in the new tracker.

    >
    > The new tracker has it too.
    > http://bugs.python.org/issue960874


    How did you find that? I put "codecs.lookup" into the tracker's
    search box, and it returned five hits, but not that one.

    John Nagle
     
    John Nagle, Nov 9, 2007
    #3
  4. On Nov 9, 4:15 pm, John Nagle wrote:
    > Waldemar Osuch wrote:
    > >> This is a known bug. It's in the old tracker on SourceForge:
    > >> [ python-Bugs-960874 ] codecs.lookup can raise exceptions other
    > >> than LookupError
    > >> but not in the new tracker.

    >
    > > The new tracker has it too.
    > > http://bugs.python.org/issue960874

    >
    > How did you find that? I put "codecs.lookup" into the tracker's
    > search box, and it returned five hits, but not that one.
    >
    > John Nagle


    I have seen this explained on this list once:
    http://bugs.python.org/issue + <SourceForge bug id>
    points to the converted ticket.
    And yes the search could be better.
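
    As a small sketch of that rule, assuming the pattern is exactly the one
    shown by the issue960874 link earlier in the thread (the function name
    tracker_url is made up for illustration):

```python
def tracker_url(sf_bug_id):
    """Build the bugs.python.org URL for a converted SourceForge bug.

    Assumes the URL is the fixed prefix followed by the numeric bug id,
    as in the issue960874 link quoted in this thread.
    """
    return "http://bugs.python.org/issue%d" % int(sf_bug_id)


if __name__ == "__main__":
    print(tracker_url(960874))
```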
     
    Waldemar Osuch, Nov 10, 2007
    #4
