sgmllib bug in Python 2.5, works in 2.4.

Discussion in 'Python' started by John Nagle, Feb 5, 2007.

  1. John Nagle

    John Nagle Guest

    (Was previously posted as a followup to something else by accident.)

    I'm running a website page through BeautifulSoup. It parses OK
    with Python 2.4, but Python 2.5 fails with an exception:

    Traceback (most recent call last):
      File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
        self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
      File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
        BeautifulStoneSoup.__init__(self, *args, **kwargs)
      File "./sitetruth/BeautifulSoup.py", line 973, in __init__
        self._feed()
      File "./sitetruth/BeautifulSoup.py", line 998, in _feed
        SGMLParser.feed(self, markup or "")
      File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
        self.goahead(0)
      File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
        self.finish_starttag(tag, attrs)
      File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
        self.handle_starttag(tag, method, attrs)
      File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
        method(attrs)
      File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
        self._feed(self.declaredHTMLEncoding)
      File "./sitetruth/BeautifulSoup.py", line 998, in _feed
        SGMLParser.feed(self, markup or "")
      File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
        self.goahead(0)
      File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
        self._convert_ref, attrvalue)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)

    The code that's failing is in "_convert_ref", which is new in Python 2.5.
    That function wasn't present in 2.4. I think the code is trying to
    handle single quotes inside of double quotes, or something like that.

    To replicate, run

    http://www.bankofamerica.com
    or
    http://www.gm.com

    through BeautifulSoup.

    Something about this code doesn't like big companies. Web sites of smaller
    companies are going through OK.

    Also reported as a bug:

    [ 1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.5


    John Nagle
     
    John Nagle, Feb 5, 2007
    #1

  2. Stefan Rank

    Stefan Rank Guest

    Hi,

    I had a similar problem recently and did not have time to file a
    bug-report. Thanks for doing that.

    The problem is the code that handles entity and character references in
    SGMLParser.parse_starttag. Seems that it is not careful about
    unicode/str issues.

    My quick'n'dirty workaround was to remove the offending char-entity from
    the page text before feeding it to BeautifulSoup::

    text = text.replace('&reg;', '') # remove rights reserved sign entity
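
The one-line replacement above handles a single entity. A more general pre-filter might look like the sketch below. It is only an illustration, not code from this thread: the function name `strip_high_refs` and the regular expression are assumptions, and it is written for Python 3, where the entity table lives in `html.entities` (in Python 2 it was `htmlentitydefs`). It drops any named or decimal character reference whose code point falls between 128 and 255 and leaves everything else alone.

```python
import re
from html.entities import name2codepoint  # Python 2: from htmlentitydefs import name2codepoint

# Matches decimal character references (&#167;) and named entities (&reg;).
_REF = re.compile(r'&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);')


def strip_high_refs(text):
    """Remove references that resolve to code points 128..255 (hypothetical helper)."""
    def _sub(match):
        body = match.group(1)
        if body.startswith('#'):
            codepoint = int(body[1:])
        else:
            # Unknown entity names yield None and are left in place.
            codepoint = name2codepoint.get(body)
        if codepoint is not None and 128 <= codepoint <= 255:
            return ''
        return match.group(0)
    return _REF.sub(_sub, text)


cleaned = strip_high_refs('a&reg;b&amp;c&#167;d&#65;e')
print(cleaned)
```

Like the original workaround, this throws information away, so it is only suitable when the stripped symbols do not matter for the later analysis.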

    cheers,
    stefan
     
    Stefan Rank, Feb 5, 2007
    #2

  3. John Nagle

    John Nagle Guest

    Found the problem and updated the bug report with a fix. But someone
    else will have to check it in.

    There's a place in SGMLParser where someone assumed that values 0..255
    were valid ASCII characters. But in fact the allowed range is 0..127.
    The effect is that Unicode strings containing values between 128 and 255
    will blow up SGMLParser.

    In fact, you can even make this happen with an ASCII source file by
    using an HTML entity which has a Unicode representation between 128
    and 255 (such as "&sect;"), then using something Unicode-oriented
    like BeautifulSoup on it.
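
The failure mode described here can be modelled in a few lines. This is a sketch under stated assumptions, not the patch attached to the bug report: all three function names are hypothetical, and it is written for Python 3, where the ASCII decode that Python 2 performs implicitly when mixing `str` and `unicode` has to be spelled out. One converter accepts 0..255, the other accepts only 0..127 and leaves anything else as an unexpanded reference.

```python
def convert_charref_unchecked(name):
    """Model of the faulty assumption: any value 0..255 becomes a single byte."""
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= 255:
        return None
    return bytes([n])


def convert_charref_checked(name):
    """Same conversion, but only for values that really are ASCII (0..127)."""
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= 127:
        return None
    return bytes([n])


def substitute(attrvalue, converter, name):
    """Append the converted reference to a text attribute value.

    The explicit .decode('ascii') stands in for the implicit decode that
    Python 2 performs when a byte string is combined with a unicode string.
    """
    raw = converter(name)
    if raw is None:
        return attrvalue + '&#' + name + ';'  # leave the reference unexpanded
    return attrvalue + raw.decode('ascii')


try:
    substitute(u'x', convert_charref_unchecked, '167')
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)

print(substitute(u'x', convert_charref_checked, '167'))
print(substitute(u'x', convert_charref_checked, '65'))
```

With the range check in place, byte 0xa7 never reaches the ASCII decoder, so the reference passes through untouched for a Unicode-aware layer to resolve later.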

    John Nagle
     
    John Nagle, Feb 7, 2007
    #3
