sax barfs on unicode filenames

Edward K. Ream

Hi. Presumably this is an easy question, but anyone who understands the sax
docs thinks completely differently than I do :)



Following the usual cookbook examples, my app parses an open file as
follows::



    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_external_ges, 1)
    # Hopefully the content handler can figure out the encoding from the
    # <?xml> element.
    handler = saxContentHandler(c, inputFileName, silent)
    parser.setContentHandler(handler)
    parser.parse(theFile)



Here 'theFile' is an open file. Usually this works just fine, but when the
filename contains u'\u8116' I get the following exception:



    Traceback (most recent call last):
      File "c:\prog\tigris-cvs\leo\src\leoFileCommands.py", line 2159, in parse_leo_file
        parser.parse(theFile)
      File "c:\python25\lib\xml\sax\expatreader.py", line 107, in parse
        xmlreader.IncrementalParser.parse(self, source)
      File "c:\python25\lib\xml\sax\xmlreader.py", line 119, in parse
        self.prepareParser(source)
      File "c:\python25\lib\xml\sax\expatreader.py", line 111, in prepareParser
        self._parser.SetBase(source.getSystemId())
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u8116' in position 44: ordinal not in range(128)



Presumably the documentation at:



http://docs.python.org/lib/module-xml.sax.xmlreader.html



would be sufficient for a sax-head, but I have absolutely no idea of how to
create an InputSource that can handle non-ascii filenames.



Any help would be appreciated. Thanks!



Edward
 
Diez B. Roggisch

Edward said:
Hi. Presumably this is an easy question, but anyone who understands the
sax docs thinks completely differently than I do :)

Following the usual cookbook examples, my app parses an open file as
follows::

    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_external_ges, 1)
    # Hopefully the content handler can figure out the encoding from the
    # <?xml> element.
    handler = saxContentHandler(c, inputFileName, silent)
    parser.setContentHandler(handler)
    parser.parse(theFile)



Here 'theFile' is an open file. Usually this works just fine, but when

Filenames are expected to be bytestrings. So what happens is that the
unicode string you pass as filename gets implicitly converted using the
default encoding.

You have to encode the unicode string according to your filesystem
beforehand.
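
A minimal sketch of what encoding a filename for the filesystem could look like, written for current Python 3 rather than the Python 2.5 used in this thread. The helper name `encode_filename` is hypothetical; `sys.getfilesystemencoding()` is the standard-library call that reports the encoding Python uses for filenames. The sketch only shows the encoding step and does not demonstrate that it avoids the sax error.

```python
import sys


def encode_filename(name, encoding=None):
    """Return `name` as bytes, encoded for the filesystem.

    `encode_filename` is a hypothetical helper, not part of xml.sax.
    When no encoding is given, fall back to the one Python reports
    for filenames on this platform.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return name.encode(encoding)


# An explicit encoding keeps the example independent of the platform.
encoded = encode_filename(u"chinese\u8116test.leo", "utf-8")
print(repr(encoded))
```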

Diez
 
Fredrik Lundh

Diez said:
Filenames are expected to be bytestrings. So what happens is that the
unicode string you pass as filename gets implicitly converted using the
default encoding.

it is ?
'hello'

</F>
 
Edward K. Ream

Diez said:
Filenames are expected to be bytestrings.

The exception happens in a method to which no fileName is passed as an
argument.

parse_leo_file:
'C:\\prog\\tigris-cvs\\leo\\test\\unittest\\chinese?folder\\chinese?test.leo'
(trace of converted fileName)

Unexpected exception parsing
C:\prog\tigris-cvs\leo\test\unittest\chinese?folder\chinese?test.leo
Traceback (most recent call last):
  File "c:\prog\tigris-cvs\leo\src\leoFileCommands.py", line 2162, in parse_leo_file
    parser.parse(theFile)
  File "c:\python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "c:\python25\lib\xml\sax\xmlreader.py", line 119, in parse
    self.prepareParser(source)
  File "c:\python25\lib\xml\sax\expatreader.py", line 111, in prepareParser
    self._parser.SetBase(source.getSystemId())
UnicodeEncodeError: 'ascii' codec can't encode character u'\u8116' in position 44: ordinal not in range(128)

To repeat, theFile is an open file. I believe the actual filename is passed
nowhere as an argument to sax in my code. Just to make sure, I converted
the filename to ascii in my code, and got (no surprise) exactly the same
crash. I suppose a workaround would be to pass a file-like object to sax
instead of an open file, so that theFile.getSystemId won't crash. But this
looks like a bug to me.
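
For readers wondering where sax gets a filename when only an open file is passed: `xml.sax.saxutils.prepare_input_source` wraps a file-like object in an `InputSource` and, when the object has a `name` attribute, copies it into the system id. The following is a small sketch for current Python 3, not the Python 2.5 in the traceback, so details may differ between versions. The helper `system_id_for` and the sample XML are made up for illustration.

```python
import io
import os
import tempfile
from xml.sax.saxutils import prepare_input_source


def system_id_for(stream):
    """Return the system id sax derives from a file-like object."""
    return prepare_input_source(stream).getSystemId()


# A real file carries its path in the `name` attribute.
fd, path = tempfile.mkstemp(suffix=".leo")
os.close(fd)
try:
    with open(path, "rb") as real_file:
        file_id = system_id_for(real_file)
finally:
    os.remove(path)

# An in-memory buffer has no `name`, so no system id is set.
buffer_id = system_id_for(io.BytesIO(b"<leo_file/>"))

print(file_id == path, buffer_id)
```

This is consistent with the traceback: the failing call is `SetBase(source.getSystemId())`, and the system id is taken from the open file itself.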

BTW:

Python 2.5.0, Tk 8.4.12, Pmw 1.2
Windows 5, 1, 2600, 2, Service Pack 2

Edward
 
John Machin

Diez said:
Filenames are expected to be bytestrings. So what happens is that the
unicode string you pass as filename gets implicitly converted using the
default encoding.

You have to encode the unicode string according to your filesystem
beforehand.

Not if your filesystem supports Unicode names, as Windows does.
Edward's point is that something is (whether by accident or "design")
trying to coerce it to str, and failing.
 
Edward K. Ream

Happily, the workaround is easy. Replace theFile with:

# Use cStringIO to avoid a crash in sax when inputFileName has unicode
# characters.
s = theFile.read()
theFile = cStringIO.StringIO(s)

My first attempt at a workaround was to use:

s = theFile.read()
parser.parseString(s)

but the expat parser does not support parseString...
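
A rough equivalent of this workaround for current Python 3, where `io.BytesIO` takes the place of `cStringIO.StringIO`. This is only a sketch: the handler class `NameCollector`, the function `parse_via_buffer`, and the sample XML are hypothetical stand-ins, not Leo's actual code.

```python
import io
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records element names in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_via_buffer(data, handler):
    """Parse XML bytes through an in-memory buffer instead of an open file."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    # The buffer has no filename, so sax has no system id to pass to expat.
    parser.parse(io.BytesIO(data))


collector = NameCollector()
parse_via_buffer(b"<leo_file><vnodes/></leo_file>", collector)
print(collector.names)
```

In real use, `data` would come from `theFile.read()`, as in the snippet above.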

Edward
 
Martin v. Löwis

Fredrik said:

Yes. While you can pass Unicode strings as file names to many Python
functions, you can't pass them to Expat, as Expat requires the file name
as a byte string. Hence the error.

Regards,
Martin

P.S. and just to anticipate nit-picking: yes, you can pass a Unicode
string to Expat, too, as long as the Unicode string only contains
ASCII characters. And yes, it doesn't have to be ASCII, if you change
the system default encoding.
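
The distinction in the P.S. can be reproduced with an explicit encode, which is roughly what the implicit conversion amounts to when the default encoding is ASCII. A small sketch; the helper name `encodes_as_ascii` and the sample filenames are made up for illustration:

```python
def encodes_as_ascii(text):
    """Return True if `text` survives an explicit ASCII encode."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# A Unicode string holding only ASCII characters converts cleanly.
ascii_only = encodes_as_ascii(u"test.leo")
# The character from the traceback does not.
non_ascii = encodes_as_ascii(u"chinese\u8116test.leo")
print(ascii_only, non_ascii)
```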
 
Martin v. Löwis

Edward said:
Happily, the workaround is easy. Replace theFile with:

# Use cStringIO to avoid a crash in sax when inputFileName has unicode
# characters.
s = theFile.read()
theFile = cStringIO.StringIO(s)

My first attempt at a workaround was to use:

s = theFile.read()
parser.parseString(s)

but the expat parser does not support parseString...

Right - you would have to use xml.sax.parseString (which is a global
function, not a method).

Of course, parseString just does what you did: create a cStringIO
object and operate on that.
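
A short sketch of the module-level function, written for current Python 3, where `xml.sax.parseString` accepts bytes. The handler class `TagCounter` and the sample document are hypothetical and serve only to show the call shape.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


counter = TagCounter()
# parseString is called on the xml.sax module, not on a parser object.
xml.sax.parseString(b"<leo_file><vnodes/><tnodes/></leo_file>", counter)
print(counter.count)
```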

Regards,
Martin
 
Fredrik Lundh

Martin said:
Yes. While you can pass Unicode strings as file names to many Python
functions, you can't pass them to Expat, as Expat requires the file name
as a byte string. Hence the error.

sounds like a bug in the xml.sax layer, really (ET also uses Expat, and
doesn't seem to have any problems dealing with unicode filenames...)

</F>
 
Martin v. Löwis

Fredrik said:
sounds like a bug in the xml.sax layer, really (ET also uses Expat, and
doesn't seem to have any problems dealing with unicode filenames...)

That's because ET never invokes XML_SetBase. Without testing, this
suggests that there might be a problem in ET with relative URIs
in parsed external entities. XML_SetBase expects a char* for the
base URI.
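
For reference, Python exposes XML_SetBase as the `SetBase` method of the parser objects in `xml.parsers.expat`, with `GetBase` as its counterpart. A minimal sketch for current Python 3; the base URI is a made-up example, and the sketch does not test the relative-URI behaviour discussed above.

```python
import xml.parsers.expat

parser = xml.parsers.expat.ParserCreate()
# SetBase stores the base used to resolve relative URIs in system identifiers.
parser.SetBase("file:///c:/prog/leo/test/")  # hypothetical base URI
base = parser.GetBase()
print(base)
```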

Regards,
Martin
 
