Mysterious xml.sax Encoding Exception

JKPeck · Feb 1, 2008

I have a module that uses xml.sax and feeds it a string of xml as in
xml.sax.parseString(dictfile,handler)

The xml is always encoded in utf-16, and the XML string always starts
with
<?xml version="1.0" encoding="UTF-16" standalone="no"?>

This almost always works fine, but two users of this module get an
exception whatever input they use it on. (The actual xml is generated
by an api in our application that returns an xml version of metadata
associated with the application's data.)

The exception is
xml.sax._exceptions.SAXParseException: <unknown>:1:30: encoding
specified in XML declaration is incorrect.

In both of these cases, there are only plain, 7-bit ascii characters
in the xml, and it really is valid utf-16 as far as I can tell.

Now here is the hard part: This never happens to me, and having gotten
the actual xml content from one of the users and fed it to the parser,
I don't get the exception.

What could be going on? We are all on Python 2.5 (and all on an
English locale).

Any suggestions would be appreciated.
-Jon Peck

Martin v. Löwis · Feb 1, 2008

In both of these cases, there are only plain, 7-bit ascii characters
in the xml, and it really is valid utf-16 as far as I can tell.

What do you mean by "7-bit ascii characters"? If it means what I think
it means (namely, a sequence of bytes whose values are between 1 and
127), then it is *not* valid utf-16.
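A minimal sketch of that point, not taken from the thread: the same ASCII-only text produces different bytes under the two encodings. The variable names are illustrative, and the explicit little-endian codec is chosen only so that no BOM gets in the way of the comparison.

```python
# Illustrative only: compare the ASCII and UTF-16 encodings of the same text.
text = u'<?xml version="1.0"?>'

ascii_bytes = text.encode("ascii")      # one byte per character, values 1..127
utf16_bytes = text.encode("utf-16-le")  # two bytes per character, no BOM

# Every ASCII-range character gains a zero high byte in UTF-16,
# so the two byte sequences can never be equal for non-empty text.
same_bytes = (ascii_bytes == utf16_bytes)
high_bytes = utf16_bytes[1::2]
```

So a byte sequence containing only values between 1 and 127 cannot be the UTF-16 form of this text, even though every character in it is plain ASCII.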

Now here is the hard part: This never happens to me, and having gotten
the actual xml content from one of the users and fed it to the parser,
I don't get the exception.

What could be going on? We are all on Python 2.5 (and all on an
English locale).

What operating system do they use, and how do they send you the file
for verification? Can you have them run

print repr(open(filename, "rb").read(10))

and send you its output?
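For reading that output, here is a rough sketch of a helper that names the byte-order mark at the start of the data. `describe_bom` is a hypothetical name introduced here, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

def describe_bom(head):
    """Name the byte-order mark at the start of head, if there is one.

    Hypothetical helper for illustration; UTF-32 BOMs are not checked.
    """
    if head.startswith(codecs.BOM_UTF8):
        return "UTF-8 BOM"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16 little-endian BOM"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16 big-endian BOM"
    return "no BOM"

# Example inputs: the start of a UTF-16 document and of a plain ASCII one.
utf16_head = describe_bom(b"\xff\xfe<\x00?\x00x\x00m\x00")
ascii_head = describe_bom(b"<?xml vers")
```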

Regards,
Martin

JKPeck · Feb 1, 2008

What do you mean by "7-bit ascii characters"? If it means what I think
it means (namely, a sequence of bytes whose values are between 1 and
127), then it is *not* valid utf-16.

What operating system do they use, and how do they send you the file
for verification? Can you have them run

print repr(open(filename, "rb").read(10))

and send you its output?

Regards,
Martin

They sent me the actual file, which was created on Windows, as an
email attachment. They had also sent the actual dataset from which
the XML was generated so that I could generate it myself using the
same version of our app as the user has. I did that but did not get
an exception.

Martin v. Löwis · Feb 1, 2008

They sent me the actual file, which was created on Windows, as an
email attachment. They had also sent the actual dataset from which
the XML was generated so that I could generate it myself using the
same version of our app as the user has. I did that but did not get
an exception.

So are you sure you open the file in binary mode on Windows?

Regards,
Martin

JKPeck · Feb 1, 2008

So are you sure you open the file in binary mode on Windows?

Regards,
Martin

In the real case, the xml never goes through a file but is handed
directly to the parser. The api returns a Python Unicode string
(utf-16). For the file the user sent, if I open it in binary mode, it
still has a BOM; otherwise the BOM is removed. But either version
works on my system.

The basic fact, though, remains, the same code works for me with the
same input but not for two particular users (out of hundreds).

Regards,
Jon

Martin v. Löwis · Feb 2, 2008

The basic fact, though, remains, the same code works for me with the
same input but not for two particular users (out of hundreds).

I see. That's mysterious.

Regards,
Martin

Jeroen Ruigrok van der Werven · Feb 2, 2008

-On [20080201 19:06], JKPeck wrote:
In both of these cases, there are only plain, 7-bit ascii characters
in the xml, and it really is valid utf-16 as far as I can tell.

Did you mean to say that the only characters they used in the UTF-16 encoded
file are characters from the Basic Latin Unicode block?

John Machin · Feb 2, 2008

In the real case, the xml never goes through a file but is handed
directly to the parser. The api returns a Python Unicode string
(utf-16).

A Python unicode object is *NOT* the UTF-16 that the SAX parser is
expecting. It is expecting a str object which is Unicode text encoded
as UTF-16.

At the end of this post is code using a str object (works) then
attempting to use a unicode object (reproduces your error message).

For the file the user sent, if I open it in binary mode, it
still has a BOM; otherwise the BOM is removed. But either version
works on my system.

The basic fact, though, remains, the same code works for me with the
same input but not for two particular users (out of hundreds).

If the real case doesn't involve a file, I can't imagine what you can
infer from a file that isn't used [strike 1] sent to you by a user
[strike 2].

Consider trapping the exception, writing
repr(the_xml_document_string[:80]) to the log file, and re-raising the
exception. Have the user run the app, then inspect the log file.
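One way this could look, as a sketch only: `parse_with_diagnostics` and the logger name are hypothetical, and the real module would presumably log through whatever mechanism the application already has.

```python
import logging
import xml.sax

log = logging.getLogger("xmlparse")  # hypothetical logger name

def parse_with_diagnostics(xml_document, handler):
    """Parse xml_document, logging its type and first 80 items on failure."""
    try:
        xml.sax.parseString(xml_document, handler)
    except xml.sax.SAXParseException:
        # Record exactly what the parser was given, then let the error propagate.
        log.error("parse failed: type=%s head=%r",
                  type(xml_document).__name__, xml_document[:80])
        raise
```

Logging the type alongside the repr matters here, because it shows at a glance whether the parser was handed a unicode object or an encoded byte string.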

Here's the promised code and results.

C:\junk>type utf16sax.py
import xml.sax, xml.sax.saxutils
import cStringIO
asciistr = 'qwertyuiop'
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
unicode_doc = (xml_template % ('UTF-16', asciistr)).decode('ascii')
utf16_doc = unicode_doc.encode('UTF-16')
for doc in (utf16_doc, unicode_doc):
    print
    print 'doc = ', repr(doc)
    print
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    f.close()
    start = result.find('<data>') + 6
    end = result.find('</data>')
    mydata = result[start:end]
    print "SAX output (UTF-8): %r" % mydata
C:\junk>utf16sax.py

doc = '\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00 \x00e\x00n\x00c\x00o\x00d\x00i\x00n\x00g\x00=\x00"\x00U\x00T\x00F\x00-\x001\x006\x00"\x00?\x00>\x00<\x00d\x00a\x00t\x00a\x00>\x00q\x00w\x00e\x00r\x00t\x00y\x00u\x00i\x00o\x00p\x00<\x00/\x00d\x00a\x00t\x00a\x00>\x00'

SAX output (UTF-8): 'qwertyuiop'

doc = u'<?xml version="1.0" encoding="UTF-16"?><data>qwertyuiop</data>'

Traceback (most recent call last):
  File "C:\junk\utf16sax.py", line 13, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: encoding specified in XML declaration is incorrect

My guess is that the unicode object is coerced to str using the
default encoding (ascii). The parser then reads the result, parses out
the "UTF-16" declaration, attempts to decode the bytes as utf-16,
fails, and complains.

HTH,
John

JKPeck · Feb 4, 2008

-On [20080201 19:06], JKPeck wrote:

In both of these cases, there are only plain, 7-bit ascii characters
in the xml, and it really is valid utf-16 as far as I can tell.

Did you mean to say that the only characters they used in the UTF-16 encoded
file are characters from the Basic Latin Unicode block?

It appears that the root cause of this problem is indeed passing a
Unicode XML string to xml.sax.parseString with an encoding declaration
in the XML of utf-16. This works with the standard distribution on
Windows. It does not work with ActiveState on Windows even though
both distributions report 64K for sys.maxunicode.

So I don't know why the results are different, but the problem is
solved by encoding the Unicode string into utf-16 before passing it to
the parser.
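A minimal sketch of that fix, assuming the API hands back a Unicode string whose declaration says UTF-16. `TextCollector` and `parse_unicode_xml` are hypothetical names for illustration, not part of the module discussed in this thread.

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers character data for inspection."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces.
        self.chunks.append(content)

def parse_unicode_xml(xml_text, handler):
    """Encode the text so the bytes match its UTF-16 declaration, then parse."""
    xml.sax.parseString(xml_text.encode("utf-16"), handler)

doc = u'<?xml version="1.0" encoding="UTF-16"?><data>qwertyuiop</data>'
collector = TextCollector()
parse_unicode_xml(doc, collector)
text = u"".join(collector.chunks)
```

The encoding step is what makes the bytes agree with the declaration: the "utf-16" codec writes a BOM followed by two bytes per character, which is the form the parser expects when the document declares UTF-16.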

Thanks to all for helping to track this down.

Regards,
Jon Peck

John Machin · Feb 4, 2008

-On [20080201 19:06], JKPeck wrote:

In both of these cases, there are only plain, 7-bit ascii characters
in the xml, and it really is valid utf-16 as far as I can tell.

Did you mean to say that the only characters they used in the UTF-16 encoded
file are characters from the Basic Latin Unicode block?

It appears that the root cause of this problem is indeed passing a
Unicode XML string to xml.sax.parseString with an encoding declaration
in the XML of utf-16. This works with the standard distribution on
Windows.

It did NOT work for me with the standard 2.5.1 Windows distribution --
see the code + output that I posted.

JKPeck · Feb 5, 2008

-On [20080201 19:06], JKPeck wrote:
In both of these cases, there are only plain, 7-bit ascii characters
in the xml, and it really is valid utf-16 as far as I can tell.
Did you mean to say that the only characters they used in the UTF-16 encoded
file are characters from the Basic Latin Unicode block?

It appears that the root cause of this problem is indeed passing a
Unicode XML string to xml.sax.parseString with an encoding declaration
in the XML of utf-16. This works with the standard distribution on
Windows.

It did NOT work for me with the standard 2.5.1 Windows distribution --
see the code + output that I posted.

It does not work with ActiveState on Windows even though
both distributions report 64K for sys.maxunicode.

So I don't know why the results are different, but the problem is
solved by encoding the Unicode string into utf-16 before passing it to
the parser.

Interesting. In the course of installing and testing with
ActiveState, I upgraded from the standard distribution 2.5.0 to
2.5.1. The former worked; the latter does not (with the original
code). So that .1 seems to matter here, and it probably accounts
for why ActiveState raised the exception and the standard 2.5.0 did
not.

-Jon
