Py 2.5: Bug in sgmllib

Discussion in 'Python' started by Michael Butscher, Oct 22, 2006.

  1. Hi,

    if I execute the following two lines in Python 2.5 (to feed in a
    *unicode* string):

    import sgmllib
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')



    I get the exception:

    Traceback (most recent call last):
      File "<pyshell#10>", line 1, in <module>
        sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
      File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
        self.goahead(0)
      File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
        k = self.parse_starttag(i)
      File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in parse_starttag
        self._convert_ref, attrvalue)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)



    The reason is that the character reference &#223; is converted to the
    *byte* string "\xdf" by SGMLParser.convert_codepoint. Concatenating this
    byte string with the remaining unicode string fails.


    Workaround (not thoroughly tested): Override convert_codepoint in a
    derived class with:

    def convert_codepoint(self, codepoint):
        # Return a unicode character rather than a byte string
        return unichr(codepoint)



    Is this a bug or is SGMLParser not meant to be used for unicode strings
    (it should be documented then)?



    Michael
    Michael Butscher, Oct 22, 2006
    #1

  2. Michael Butscher wrote:


    > if I execute the following two lines in Python 2.5 (to feed in a
    > *unicode* string):
    >
    > import sgmllib
    > sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')


    source documents are encoded byte streams, not decoded Unicode
    sequences. I suggest reading up on how Python's Unicode string
    type works, and what a Unicode string represents. it's not the same
    thing as a byte string.

    </F>
    Fredrik Lundh, Oct 22, 2006
    #2

  3. Michael Butscher wrote:
    > Is this a bug or is SGMLParser not meant to be used for unicode strings
    > (it should be documented then)?


    In a sense, SGML itself is not meant to be used for Unicode. In SGML,
    the document character set is subject to the SGML application. So what
    specific character a character reference refers to is also subject to
    the SGML application.

    This entire issue is already documented; see the discussion of
    convert_charref and convert_codepoint in

    http://docs.python.org/lib/module-sgmllib.html

    Regards,
    Martin
    Martin v. Löwis, Oct 22, 2006
    #3
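
The failure and the workaround discussed in this thread can be sketched without sgmllib itself. The following is only an illustration, not sgmllib's actual code. It assumes Python 3, where sgmllib no longer exists and mixing bytes with text raises TypeError instead of Python 2's UnicodeDecodeError. The names CHARREF, expand_charrefs, convert_codepoint_bytes and convert_codepoint_text are hypothetical, and the pattern is a simplified stand-in for the parser's own.

```python
import re

# Simplified pattern for decimal character references such as &#223;
# (an assumption for this sketch, not the pattern sgmllib uses).
CHARREF = re.compile(r'&#([0-9]+);')


def convert_codepoint_bytes(codepoint):
    # Mimics the behaviour described in the first post:
    # the reference becomes a single *byte*, not a text character.
    return bytes([codepoint])


def convert_codepoint_text(codepoint):
    # The workaround from the first post: return a text character.
    # On Python 2 this would be unichr(codepoint).
    return chr(codepoint)


def expand_charrefs(value, convert):
    # Replace each character reference in a *text* attribute value
    # with whatever the given converter returns.
    return CHARREF.sub(lambda m: convert(int(m.group(1))), value)


# Text in, text out: the reference expands cleanly.
fixed = expand_charrefs('te&#223;t', convert_codepoint_text)

# Text in, bytes out: the pieces cannot be joined.
try:
    expand_charrefs('te&#223;t', convert_codepoint_bytes)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Whether returning text is the right choice depends on what is fed to the parser: as the replies point out, the parser is designed around encoded byte input, so a converter that returns text only makes sense when the caller has already decoded the document.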
