Any reason why cStringIO in 2.5 behaves differently from 2.4?

Stefan Behnel

Stefan said:
Right. I now know that xml.sax's parseString() has undocumented
implementation dependent behavior. That there are libraries (not
included with Python) which can parse Unicode strings. And that
the reason to change cStringIO's behavior is acceptable.

But the style of the answers makes me wonder if I should report
the bug in xml.sax (or its documentation) or just ignore it.

Note that PyXML is no longer actively maintained, so it's unlikely that
reporting the bug would get you a version that raises an exception when
passing a unicode string *independent of the Python version*.

Besides, the bug has been fixed in Python 2.5 already.

Stefan
 
Stefan Scholl

Stefan Behnel said:
Note that PyXML is no longer actively maintained, so it's unlikely that

Too bad it can still be found in at least 80 % of all XML
examples. I should burn the XML chapter of the cookbook.

By the way: Thanks for the tip regarding LXML. I'll try to
remember it for the next project.

reporting the bug would get you a version that raises an exception when
passing a unicode string *independent of the Python version*.

Besides, the bug has been fixed in Python 2.5 already.

Just checked on a system without PyXML: xml/sax/__init__.py
defines parseString() and uses cStringIO (when available).

Python 2.5.1
 
Chris Mellon

Too bad it can still be found in at least 80 % of all XML
examples. I should burn the XML chapter of the cookbook.

By the way: Thanks for the tip regarding LXML. I'll try to
remember it for the next project.



Just checked on a system without PyXML: xml/sax/__init__.py
defines parseString() and uses cStringIO (when available).

Python 2.5.1

Yes, that's the fixed bug. After all this you still do not seem to be
clear on what the bug is. The bug is not that your code fails in 2.5,
it's that it worked at all in 2.4.
 
Stefan Scholl

Chris Mellon said:
Yes, that's the fixed bug. After all this you still do not seem to be
clear on what the bug is. The bug is not that your code fails in 2.5,
it's that it worked at all in 2.4.

Don't let the subject line fool you. I'm OK with cStringIO. The
thread is now about xml.sax's parseString().
 
Stefan Behnel

Stefan said:
Don't let the subject line fool you. I'm OK with cStringIO. The
thread is now about xml.sax's parseString().

... which works correctly with Python 2.5 and was broken before.

Stefan
 
Michael L Torrie

Stefan said:
Don't let the subject line fool you. I'm OK with cStringIO. The
thread is now about xml.sax's parseString().

Giving you the benefit of the doubt here, despite the fact that Stefan
Behnel has stated this over and over again and you just haven't listened.

xml.sax's use of parseString() is exactly correct. xml.sax should
*never* parse python unicode strings as by definition XML must be
encoded as a *byte stream*, which is what a python string is.

A python /unicode/ string could be held internally in any number of
ways, 2, 3, 4, or even 8 bytes per character if the implementation
demanded it (a bit contrived, I admit). Since the xml parser is only
ever intended to parse *XML*, why should it ever know what to do with
python unicode strings, which could be stored any number of ways, making
byte-parsing impossible.

So your code is faulty in its assumptions, not xml.sax.
 
Stefan Scholl

Michael L Torrie said:
Giving you the benefit of the doubt here, despite the fact that Stefan
Behnel has stated this over and over again and you just haven't listened.

Speaking of over and over again ...

xml.sax's use of parseString() is exactly correct. xml.sax should
*never* parse python unicode strings as by definition XML must be
encoded as a *byte stream*, which is what a python string is.

I don't care about the definition of XML at this point of the
program. http://docs.python.org/lib/module-xml.sax.html calls
parseString() a convenience function.

This is Python. Python has a class named unicode. Its literals
look like strings. The base class is basestring.

xml.sax belongs to Python. Batteries included. parseString() is
in Python.

It's not parseString() that tells me something is wrong with the
parameter. It's cStringIO, which is used on platforms where it is
available. On other platforms no exceptions are thrown, because
then StringIO is used, which behaves in Python 2.4 and Python 2.5
the same, regarding unicode strings.

Other libraries like LXML (not included) parse unicode strings.


And these are two additional lines in my code now:

    if isinstance(string, unicode):
        string = string.encode("utf-8")

A python /unicode/ string could be held internally in any number of
ways, 2, 3, 4, or even 8 bytes per character if the implementation
demanded it (a bit contrived, I admit). Since the xml parser is only
ever intended to parse *XML*, why should it ever know what to do with
python unicode strings, which could be stored any number of ways, making
byte-parsing impossible.

xml.sax is no external parser. The program doesn't have to
communicate with the outside world at this point of execution.
The Python program calls a Python function of a Python class and
passes a Python unicode string as parameter.

XML parsers only have to support a few encodings. But nobody
objects when they support more than that.

A Python convenience function isn't broken when it allows Python
unicode strings.


The behavior of cStringIO (the original topic of this thread) is
correct and documented. parseString() uses the old idiom where
cStringIO is imported as StringIO, when available. Despite the
fact that they behave differently.

In my personal opinion: If parseString() shouldn't support
unicode strings, then it should check for it and throw a
meaningful exception.
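
A minimal sketch of such a check, written for a current Python 3 interpreter, where `bytes` plays the role of the 2.x `str` and `str` the role of 2.x `unicode`. The wrapper name `parse_byte_string` is hypothetical and not part of xml.sax; it only illustrates the idea of rejecting text strings up front with a clear message.

```python
import xml.sax


def parse_byte_string(data, handler):
    """Hypothetical wrapper: refuse text strings before they reach the parser."""
    if not isinstance(data, bytes):
        # Fail early with a message that names the actual problem,
        # instead of letting a lower layer raise something obscure.
        raise TypeError(
            "expected an encoded byte string, got %s; "
            "encode the text first" % type(data).__name__)
    xml.sax.parseString(data, handler)
```

Called with `b"<a>abc</a>"` this simply delegates to xml.sax.parseString(); called with a text string it raises TypeError regardless of which StringIO implementation sits underneath.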

At the moment the code just looks as if someone has overlooked
the fact that unicode strings (with non-ascii characters in them)
cause a problem. Missing test?

So your code is faulty in its assumptions, not xml.sax.

As I said in the conclusion, a few messages before: Undocumented,
implementation dependent behavior.

Or maybe just a bug, considering the following on
http://docs.python.org/lib/module-xml.sax.html

A typical SAX application uses three kinds of objects:
readers, handlers and input sources. ``Reader'' in this
context is another term for parser, i.e. some piece of
code that reads the bytes or characters from the input
source, and produces a sequence of events.


Bytes _or_ characters.
 
Stefan Behnel

Stefan said:
I don't care about the definition of XML at this point of the
program.

Maybe you should, as you're dealing with XML. But then, maybe you're lucky and
no-one will have to use your software.

It's not parseString() that tells me something is wrong with the
parameter. It's cStringIO, which is used on platforms where it is
available. On other platforms no exceptions are thrown, because
then StringIO is used, which behaves in Python 2.4 and Python 2.5
the same, regarding unicode strings.

Other libraries like LXML (not included) parse unicode strings.

But only in very well defined cases. lxml actually checks if the string uses
an XML declaration. And if it does, you will get an exception.

lxml's parser only accepts a unicode object as input if the unicode object is
the only way to determine the encoding. That's lxml's interpretation of
transport-provided encoding information, which is allowed by the spec.

If you want PyXML to behave the same, go ahead and send a patch to python-dev
or xml-sig.

And these are two additional lines in my code now:

    if isinstance(string, unicode):
        string = string.encode("utf-8")

This will only work as long as your XML does not have an encoding declaration.
Does your code guarantee that? Because otherwise it is broken. Have you
documented that?
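
To make the concern concrete, here is a rough sketch for a current Python 3 interpreter (where `str` is the text type and `bytes` the byte string). The helper `to_xml_bytes`, the regular expression and the `TextCollector` handler are illustrative assumptions, not anything shipped with xml.sax. The sketch encodes the text with the encoding named in the XML declaration, falling back to UTF-8 only when there is none, and compares that with the blind UTF-8 encode.

```python
import re
import xml.sax

# Hypothetical helper pattern: find the encoding named in an XML declaration.
_DECL_RE = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def to_xml_bytes(text):
    """Encode text so that the bytes agree with its own XML declaration."""
    if isinstance(text, bytes):
        return text  # already a byte string, leave it alone
    match = _DECL_RE.match(text)
    encoding = match.group(1) if match else "utf-8"
    return text.encode(encoding)


class TextCollector(xml.sax.ContentHandler):
    """Collects all character data reported by the parser."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(data):
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


doc = u'<?xml version="1.0" encoding="iso-8859-1"?><a>\xe9</a>'

# Blind UTF-8 encode: the bytes contradict the declaration.
naive = parse_text(doc.encode("utf-8"))
# Encode with the declared encoding: bytes and declaration agree.
careful = parse_text(to_xml_bytes(doc))
```

With the blind encode, the parser trusts the declaration and decodes the UTF-8 bytes as ISO-8859-1, so `naive` should no longer contain the original character, while `careful` should. A real application would also have to decide what to do with declarations naming an encoding the parser does not support.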

xml.sax is no external parser.

Right, it's a package. But it contains an *XML* parser.

The program doesn't have to
communicate with the outside world at this point of execution.
The Python program calls a Python function of a Python class and
passes a Python unicode string as parameter.

A sequence of unicode characters, right.

Why not just pass XML? Would make your life easier.

The behavior of cStringIO (the original topic of this thread) is
correct and documented. parseString() uses the old idiom where
cStringIO is imported as StringIO, when available. Despite the
fact that they behave differently.

In my personal opinion: If parseString() shouldn't support
unicode strings, then it should check for it and throw a
meaningful exception.

Well, it does. But not because you passed a unicode string but because you
passed a unicode string that does not map 1:1 to the standard XML 1-byte
encoding. A sequence of plain ASCII characters passed as a unicode string is
perfectly ok.

So, the API is even forgiving enough to accept unicode strings, it just obeys
the XML spec after that, that's all.
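
The "maps 1:1 to ASCII" condition can be illustrated with a small check. This is only a sketch of the condition itself, not of how cStringIO or xml.sax implement it, and the function name `fits_in_ascii` is made up for the example.

```python
def fits_in_ascii(text):
    """Return True if every character of the text string is plain ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


plain = fits_in_ascii(u"<a>abc</a>")       # only ASCII characters
accented = fits_in_ascii(u"<a>\xe9</a>")   # contains U+00E9
```

Here `plain` is True and `accented` is False, which matches the distinction drawn above: the first kind of string passes through unharmed, the second is where the exception shows up.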

At the moment the code just looks as if someone has overlooked
the fact that unicode strings (with non-ascii characters in them)
cause a problem. Missing test?

No, just wrong assumption on your side. Read the spec, learn, think, understand.

As I said in the conclusion, a few messages before: Undocumented,
implementation dependent behavior.

Well, the implementation was correct *under the assumption* that cStringIO
behaved as expected. But as cStringIO deviated from its documentation,
xml.sax.parseString could not work as expected.

Or maybe just a bug, considering the following on
http://docs.python.org/lib/module-xml.sax.html

A typical SAX application uses three kinds of objects:
readers, handlers and input sources. ``Reader'' in this
context is another term for parser, i.e. some piece of
code that reads the bytes or characters from the input
source, and produces a sequence of events.


Bytes _or_ characters.

I think they were just trying to have more people understand what they wanted
to say, not only those who know XML.

Stefan
 
