Py 2.5: Bug in sgmllib

Michael Butscher · Oct 22, 2006

Hi,

if I execute the following two lines in Python 2.5 (to feed in a
*unicode* string):

import sgmllib
sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')

I get the exception:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
  File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
    self.goahead(0)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)

The reason is that the character reference &#223; is converted to the
*byte* string "\xdf" by SGMLParser.convert_codepoint. Concatenating this
byte string with the rest of the unicode string fails.

Workaround (not thoroughly tested): Override convert_codepoint in a
derived class with:

def convert_codepoint(self, codepoint):
    return unichr(codepoint)
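
The effect of such an override can be illustrated without sgmllib itself. The following is only a sketch: CHARREF and expand_charrefs are hypothetical stand-ins for the parser's attribute handling, not the library's code, and the unichr fallback is there just so the snippet also runs where unichr does not exist. It suggests that returning a Unicode character for each codepoint keeps the whole attribute value a Unicode string.

```python
import re

try:
    unichr            # Python 2: unichr() returns a Unicode character
except NameError:
    unichr = chr      # Python 3: chr() already returns one

# Hypothetical stand-in for the parser's handling of decimal character
# references such as &#223; inside an attribute value.
CHARREF = re.compile(u"&#([0-9]+);")

def convert_codepoint(codepoint):
    # Same idea as the workaround: always return a Unicode character.
    return unichr(codepoint)

def expand_charrefs(attrvalue):
    # Replace every decimal character reference with its Unicode character.
    return CHARREF.sub(lambda m: convert_codepoint(int(m.group(1))), attrvalue)

value = expand_charrefs(u"te&#223;t")
```

Here value is a four-character Unicode string, so nothing has to be decoded implicitly when the pieces are joined.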

Is this a bug or is SGMLParser not meant to be used for unicode strings
(it should be documented then)?

Michael

Fredrik Lundh · Oct 22, 2006

Michael Butscher wrote:

if I execute the following two lines in Python 2.5 (to feed in a
*unicode* string):

import sgmllib
sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')

source documents are encoded byte streams, not decoded Unicode
sequences. I suggest reading up on how Python's Unicode string
type works, and what a Unicode string represents. it's not the same
thing as a byte string.
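
The distinction between the two kinds of string can be sketched as follows. This is an illustration only; the choice of ISO-8859-1 and UTF-8 is an assumption made for the example, not something stated in the thread.

```python
# A decoded Unicode string: a sequence of characters, no encoding attached.
text = u'<a title="te\xdft"></a>'

# The same document as encoded byte strings, one per encoding.
as_latin1 = text.encode("iso-8859-1")   # U+00DF becomes the single byte 0xDF
as_utf8 = text.encode("utf-8")          # U+00DF becomes the two bytes 0xC3 0x9F

# The byte strings differ in length, yet both decode back to the same text.
lengths = (len(text), len(as_latin1), len(as_utf8))
round_trip = (as_latin1.decode("iso-8859-1"), as_utf8.decode("utf-8"))
```

One possible reading of the advice is that a Unicode document would be encoded first and the resulting byte string passed to feed(), so that the parser only ever sees bytes.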

</F>

Martin v. Löwis · Oct 22, 2006

Michael said:
Is this a bug or is SGMLParser not meant to be used for unicode strings
(it should be documented then)?

In a sense, SGML itself is not meant to be used for Unicode. In SGML,
the document character set is subject to the SGML application. So what
specific character a character reference refers to is also subject to
the SGML application.

This entire issue is already documented; see the discussion of
convert_charref and convert_codepoint in

http://docs.python.org/lib/module-sgmllib.html
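
As a rough illustration of why the meaning of a character reference depends on the document character set, consider the sketch below. The two-argument convert_codepoint here is a hypothetical helper for this example (it is not the sgmllib method), and it assumes a single-byte character set and a codepoint below 256.

```python
def convert_codepoint(codepoint, charset):
    # Hypothetical helper: treat the codepoint as a position in the
    # document character set and return the matching Unicode character.
    # Only meaningful for single-byte charsets and codepoints below 256.
    return bytearray([codepoint]).decode(charset)

latin1_char = convert_codepoint(223, "iso-8859-1")   # Western European set
greek_char = convert_codepoint(223, "iso-8859-7")    # Greek set
same = (latin1_char == greek_char)
```

The same number 223 yields different characters under the two character sets, which is why an application that knows its document character set may want to override the conversion.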

Regards,
Martin
