Any reason why cStringIO in 2.5 behaves differently from 2.4?

Stefan Scholl · Jul 26, 2007

After an hour searching for a potential bug in XML parsing
(PyXML), after updating from 2.4 to 2.5, I found this one:

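$ python2.5
Python 2.5 (release25-maint, Dec 9 2006, 14:35:53)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-20)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import StringIO
>>> x = StringIO.StringIO(u"m\xf6p")
>>> import cStringIO
>>> x = cStringIO.StringIO(u"m\xf6p")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)
$ python
Python 2.4.4 (#2, Apr 5 2007, 20:11:18)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import StringIO
>>> x = StringIO.StringIO(u"m\xf6p")
>>> import cStringIO
>>> x = cStringIO.StringIO(u"m\xf6p")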

OK, that's why my code was fine with Python 2.4 and breaks with
2.5.

{sigh}

Stefan Behnel · Jul 26, 2007

Stefan said:
After an hour searching for a potential bug in XML parsing
(PyXML), after updating from 2.4 to 2.5, I found this one:

$ python2.5
Python 2.5 (release25-maint, Dec 9 2006, 14:35:53)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-20)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import StringIO
>>> x = StringIO.StringIO(u"m\xf6p")
>>> import cStringIO
>>> x = cStringIO.StringIO(u"m\xf6p")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)
$ python
Python 2.4.4 (#2, Apr 5 2007, 20:11:18)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import StringIO
>>> x = StringIO.StringIO(u"m\xf6p")
>>> import cStringIO
>>> x = cStringIO.StringIO(u"m\xf6p")

OK, that's why my code was fine with Python 2.4 and breaks with
2.5.

It wasn't fine with 2.4 either:

"""
Unlike the memory files implemented by the StringIO module, those provided by
this module are not able to accept Unicode strings that cannot be encoded as
plain ASCII strings.
"""

http://docs.python.org/lib/module-cStringIO.html

Read the docs...

Stefan

Stefan Scholl · Jul 26, 2007

Stefan Behnel said:
Stefan said:

After an hour searching for a potential bug in XML parsing
(PyXML), after updating from 2.4 to 2.5, I found this one:

$ python2.5
Python 2.5 (release25-maint, Dec 9 2006, 14:35:53)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-20)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import StringIO
>>> x = StringIO.StringIO(u"m\xf6p")
>>> import cStringIO
>>> x = cStringIO.StringIO(u"m\xf6p")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)
$ python
Python 2.4.4 (#2, Apr 5 2007, 20:11:18)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import StringIO
>>> x = StringIO.StringIO(u"m\xf6p")
>>> import cStringIO
>>> x = cStringIO.StringIO(u"m\xf6p")

OK, that's why my code was fine with Python 2.4 and breaks with
2.5.

It wasn't fine with 2.4 either:

Worked in my test, a few lines above ...

"""
Unlike the memory files implemented by the StringIO module, those provided by
this module are not able to accept Unicode strings that cannot be encoded as
plain ASCII strings.
"""

http://docs.python.org/lib/module-cStringIO.html

Read the docs...

Well, http://docs.python.org/lib/module-xml.sax.html is missing
the fact, that I can't use Unicode with parseString().

This parseString() uses cStringIO.

Stefan Behnel · Jul 26, 2007

Stefan said:
Well, http://docs.python.org/lib/module-xml.sax.html is missing
the fact, that I can't use Unicode with parseString().

This parseString() uses cStringIO.

Well, Python unicode is not a valid *byte* encoding for XML.

lxml.etree can parse unicode, if you really want, but otherwise, you should
maybe stick to well-formed XML.

Stefan

Stefan Scholl · Jul 26, 2007

Stefan Behnel said:
Well, Python unicode is not a valid *byte* encoding for XML.

lxml.etree can parse unicode, if you really want, but otherwise, you should
maybe stick to well-formed XML.

The XML is well-formed. Works perfect in Python 2.4 with Python
unicode and Python sax parser.

This stays all inside Python.

Stefan Behnel · Jul 26, 2007

Stefan said:
The XML is well-formed. Works perfect in Python 2.4 with Python
unicode and Python sax parser.

The XML is *not* well-formed if you pass Python unicode instead of a byte
encoded string. Read the XML spec.

It would be well-formed if you added the proper XML declaration, but that is
system specific (UCS-4 or UTF-16, BE or LE). So don't even try.

Stefan

Stefan Scholl · Jul 26, 2007

Stefan Behnel said:
The XML is *not* well-formed if you pass Python unicode instead of a byte
encoded string. Read the XML spec.

It would be well-formed if you added the proper XML declaration, but that is
system specific (UCS-4 or UTF-16, BE or LE). So don't even try.

Who cares? I'm not calling any external tools.

Python should know its own strings.

Stefan Behnel · Jul 26, 2007

Stefan said:
Who cares? I'm not calling any external tools.

XML cares. If you want to work with something that is not XML, do not expect
XML tools to help you do it. XML tools work with XML, and there is a spec that
says what XML is. Your string is not XML.

Stefan

Stefan Scholl · Jul 26, 2007

Stefan Behnel said:
XML cares. If you want to work with something that is not XML, do not expect
XML tools to help you do it. XML tools work with XML, and there is a spec that
says what XML is. Your string is not XML.

This isn't some sophisticated XML tool that tells me the string
is wrong. It's a changed behavior of cStringIO that throws an
exception. While I'm just using the method parseString() of
xml.sax.

We both repeat ourselves. I don't think this thread brings
something new.

I'm all for correct XML and hate XML bozos. But there are limits
you have to learn after a few years.

Stefan Behnel · Jul 26, 2007

Stefan said:
This isn't some sophisticated XML tool that tells me the string
is wrong. It's a changed behavior of cStringIO that throws an
exception. While I'm just using the method parseString() of
xml.sax.

All I'm saying is that parseString() is perfectly right in using cStringIO, as
cStringIO supports every possible incarnation of serialised XML.

It was documented that cStringIO does not support Unicode and it doesn't:

$ python2.4
Python 2.4.4 (#2, Apr 12 2007, 21:03:11)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\uf852' in position 0: ordinal not in range(128)

What a surprise.

Stefan
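
For readers on Python 3, the same split between text buffers and byte buffers is enforced by type rather than by an implicit ASCII encode. The snippet below is only a Python 3 analogy to the session above (`io.StringIO` and `io.BytesIO` standing in for `StringIO` and `cStringIO`); it does not reproduce `cStringIO`'s actual behavior.

```python
import io

text = "m\xf6p"

# A text buffer accepts any Unicode string.
text_buffer = io.StringIO(text)

# A byte buffer refuses text outright; there is no implicit encoding step.
try:
    io.BytesIO(text)
    rejected = False
except TypeError:
    rejected = True

# Encoding explicitly produces bytes that the byte buffer will accept.
byte_buffer = io.BytesIO(text.encode("utf-8"))
```

The point carried over from the thread is that a byte-oriented buffer needs an explicit encoding decision from the caller.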

Stefan Scholl · Jul 26, 2007

Stefan Behnel said:
The XML is *not* well-formed if you pass Python unicode instead of a byte
encoded string. Read the XML spec.

Pointers, please.

Last time I read that part of the spec was when a customer's
consulting company switched to ISO-8859-15 without saying
something beforehand. The old code (PHP) I have to maintain
couldn't deal with it.

It was wrong to switch encoding without telling somebody about
it. And a XML processor isn't required to support ISO-8859-15.
But I thought it was too embarrassing not to support this
encoding. I fixed that part without making a fuss.

A Python XML processor that can't handle the own encoding is
embarrassing. It isn't required to support it. It would be OK if
it wouldn't support UTF-7. But a parseString() method that
doesn't want Python strings? No way!

Chris Mellon · Jul 26, 2007

Stefan Scholl said:
Pointers, please.

Last time I read that part of the spec was when a customer's
consulting company switched to ISO-8859-15 without saying
something beforehand. The old code (PHP) I have to maintain
couldn't deal with it.

It was wrong to switch encoding without telling somebody about
it. And a XML processor isn't required to support ISO-8859-15.
But I thought it was too embarrassing not to support this
encoding. I fixed that part without making a fuss.

A Python XML processor that can't handle the own encoding is
embarrassing. It isn't required to support it. It would be OK if
it wouldn't support UTF-7. But a parseString() method that
doesn't want Python strings? No way!

Of course it can handle its own encoding. But you're passing incorrect
values to it, the same way that passing '10' to a function expecting
an int is going to fail.

cStringIO in python 2.4 is buggy - when passed a unicode object, it
silently uses the (platform and compilation dependent) internal buffer
of the unicode object. In 2.5 this was corrected to be consistent with
all other unicode/str conversions and encode it using the default
encoding, failing when that's not possible (as in your example).

It's not that your code worked on 2.4, and 2.5 broke it - the 2.4 code
was subtly buggy and 2.5 is preventing you from having that bug.

XML is not a string. It's a specific type of bytestream. If you want
to work with XML, then generate well-formed XML in the correct
encoding. There's no reason you should have an XML document (as
opposed to values extracted from that document) in unicode objects at
all.
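
To make "platform and compilation dependent" more concrete: a narrow (UCS-2) or wide (UCS-4) Python 2 build on a little- or big-endian machine stores the same characters as different bytes. The sketch below is written for Python 3 and uses explicit codecs to imitate those layouts; it does not read CPython's real internal buffer.

```python
text = "m\xf6p"

# The same three characters serialize to different byte sequences
# depending on the code unit width and byte order chosen.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
utf32_le = text.encode("utf-32-le")
utf8 = text.encode("utf-8")

# The ASCII codec cannot represent u'\xf6'; this is the kind of failure
# that 2.5 reports instead of silently using the raw buffer.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

A parser handed any of the first three byte strings without a BOM or declaration has no reliable way to know which layout it is looking at.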

Stefan Scholl · Jul 27, 2007

Chris Mellon said:
XML is not a string. It's a specific type of bytestream. If you want
to work with XML, then generate well-formed XML in the correct
encoding. There's no reason you should have an XML document (as
opposed to values extracted from that document) in unicode objects at
all.

The affected method in xml.sax is called parseString()

Stefan Behnel · Jul 27, 2007

Stefan said:
Pointers, please.

There you have it:

http://www.w3.org/TR/xml/#charencoding

"""
In the absence of information provided by an external transport protocol (e.g.
HTTP or MIME), it is a *fatal error* for an entity including an encoding
declaration to be presented to the XML processor in an encoding other than
that named in the declaration, or for an entity which begins with neither a
Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.
"""

"""
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with
the Byte Order Mark ...
"""

Python does not use BOMs internally (although that again may be platform
specific). You might argue that there is some kind of "external transportation
protocol" as it is a Python Unicode string (I used that excuse when I
implemented Unicode parsing support in lxml), but Python's Unicode objects are
strictly a character stream, not a byte stream. XML is only defined for
streams of bytes.

Also, there is no requirement for an XML processor to be able to parse
anything but UTF-8 and UTF-16. Especially if the encoding is *undefined* and
*platform-specific*, as that of a Python Unicode string.

Anything else I can help you understanding?

Stefan
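
The rule quoted from the spec can be tried out with the standard library. This is a small sketch for Python 3, where `xml.sax.parseString()` accepts `bytes`; `TextCollector` and `parse_text` are hypothetical helper names introduced only for this illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(data):
    """Parse serialized XML (bytes) and return its character data."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


# No declaration and no BOM: the bytes have to be UTF-8.
undeclared_utf8 = parse_text("<doc>m\xf6p</doc>".encode("utf-8"))

# Declared as ISO-8859-1 and really encoded that way: accepted.
declared = '<?xml version="1.0" encoding="ISO-8859-1"?><doc>m\xf6p</doc>'
declared_latin1 = parse_text(declared.encode("iso-8859-1"))

# ISO-8859-1 bytes without a declaration are not valid UTF-8: fatal error.
try:
    parse_text("<doc>m\xf6p</doc>".encode("iso-8859-1"))
    mismatch_rejected = False
except xml.sax.SAXParseException:
    mismatch_rejected = True
```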

Marc 'BlackJack' Rintsch · Jul 27, 2007

Stefan Scholl said:
The affected method in xml.sax is called parseString()

Exactly. It's *not* called `parseUnicode()`.

In Python you have the types `str` which is a bunch of bytes and `unicode`
which is a character string that can hold unicode characters. How those
are internally represented is an implementation detail. You shouldn't
know or depend on the internal representation, other modules like XML
parsers shouldn't do either.

XML, the serialized form, is about bytes in some encoding. So this can
only be stored in `str` objects. `unicode` is already decoded. If you
want to feed `unicode` objects to an XML parser, simply encode it before
passing it.

The question remains why you have "serialized XML" as `unicode` in the
first place as it is about bytes not unicode characters.

Ciao,
Marc 'BlackJack' Rintsch
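
The "encode it before passing it" advice might look like the following. It is a minimal sketch for Python 3, under the assumption that the text document either carries no encoding declaration or declares the same encoding used for the `encode()` call; `ElementNames` and `parse_unicode` are hypothetical names, not part of `xml.sax`.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementNames(ContentHandler):
    """Hypothetical handler that records element names in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_unicode(document, handler, encoding="utf-8"):
    """Encode a text document to bytes, then parse the bytes.

    Assumes the document has no encoding declaration, or declares the
    same encoding that is passed in here.
    """
    xml.sax.parseString(document.encode(encoding), handler)


handler = ElementNames()
parse_unicode("<root><item>m\xf6p</item><item/></root>", handler)
names = handler.names
```

If the text carries a declaration naming a different encoding, encoding it as UTF-8 would produce exactly the mismatch the spec calls a fatal error, so the declaration and the `encoding` argument have to agree.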

Stefan Scholl · Jul 27, 2007

Stefan Behnel said:
There you have it:

http://www.w3.org/TR/xml/#charencoding

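"""
In the absence of information provided by an external transport protocol (e.g.
HTTP or MIME), it is a *fatal error* for an entity including an encoding
declaration to be presented to the XML processor in an encoding other than
that named in the declaration, or for an entity which begins with neither a
Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.
"""

Python does not use BOMs internally (although that again may be platform
specific). You might argue that there is some kind of "external transportation
protocol" as it is a Python Unicode string (I used that excuse when I
implemented Unicode parsing support in lxml), ...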

Not the string itself. The method that is called: parseString()

Stefan Scholl · Jul 27, 2007

Marc 'BlackJack' Rintsch said:
Exactly. It's *not* called `parseUnicode()`.

Semantics.

Stefan Behnel · Jul 27, 2007

Stefan said:
<type 'basestring'>

And it's not called "parseBasestring" either.

Stefan

Chris Mellon · Jul 27, 2007

Stefan Scholl said:
The affected method in xml.sax is called parseString()

The imprecision of the English language has caused greater problems
than this. Since you've now had everything clarified for you, and the
imprecision is resolved, I'm sure that this won't be a problem again.

Stefan Scholl · Jul 28, 2007

Chris Mellon said:
The imprecision of the English language has caused greater problems
than this. Since you've now had everything clarified for you, and the
imprecision is resolved, I'm sure that this won't be a problem again.

Right. I now know that xml.sax's parseString() has undocumented
implementation dependent behavior. That there are libraries (not
included with Python) which can parse Unicode strings. And that
the reason to change cStringIO's behavior is acceptable.

But the style of the answers makes me wonder if I should report
the bug in xml.sax (or its documentation) or just ignore it.
