Py 2.5: Bug in sgmllib

Michael Butscher

Hi,

if I execute the following two lines in Python 2.5 (to feed in a
*unicode* string):

import sgmllib
sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')



I get the exception:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
  File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
    self.goahead(0)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)



The reason is that the character reference &#223; is converted to the
*byte* string "\xdf" by SGMLParser.convert_codepoint. Joining this byte
string with the rest of the unicode attribute value makes Python decode
it as ASCII first, which fails for any byte above 127.
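
To illustrate the failure in isolation (a sketch, not code taken from sgmllib): when Python 2 joins a byte string to a unicode string, it first decodes the byte string with the ASCII codec. The snippet below performs that decode explicitly so that it also runs on Python 3; the variable names are made up for illustration.

```python
# The byte that corresponds to code point 223 (0xdf).
byte_value = b"\xdf"

error = None
try:
    # Python 2 performs this decode implicitly when a byte string
    # is combined with a unicode string.
    byte_value.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

# The message names byte 0xdf at position 0, as in the traceback above.
print(error)
```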


Workaround (not thoroughly tested): Override convert_codepoint in a
derived class with:

    def convert_codepoint(self, codepoint):
        return unichr(codepoint)



Is this a bug or is SGMLParser not meant to be used for unicode strings
(it should be documented then)?



Michael
 
Fredrik Lundh

Michael Butscher wrote:

if I execute the following two lines in Python 2.5 (to feed in a
*unicode* string):

import sgmllib
sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')

source documents are encoded byte streams, not decoded Unicode
sequences. I suggest reading up on how Python's Unicode string
type works, and what a Unicode string represents. it's not the same
thing as a byte string.

</F>
 
Martin v. Löwis

Michael said:
Is this a bug or is SGMLParser not meant to be used for unicode strings
(it should be documented then)?

In a sense, SGML itself is not meant to be used for Unicode. In SGML,
the document character set is subject to the SGML application. So what
specific character a character reference refers to is also subject to
the SGML application.
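
A rough illustration of that point (my own sketch, with Latin-1 and Windows-1251 picked arbitrarily as example character sets): the numeric value 223 names a different character depending on which character set is assumed.

```python
raw = b"\xdf"  # the single byte with value 223

as_latin1 = raw.decode("latin-1")  # U+00DF, LATIN SMALL LETTER SHARP S
as_cp1251 = raw.decode("cp1251")   # U+042F, CYRILLIC CAPITAL LETTER YA

# Same number, two different characters.
print(hex(ord(as_latin1)), hex(ord(as_cp1251)))
```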

This entire issue is already documented; see the discussion of
convert_charref and convert_codepoint in

http://docs.python.org/lib/module-sgmllib.html

Regards,
Martin
 
