sax barfs on unicode filenames

Discussion in 'Python' started by Edward K. Ream, Oct 4, 2006.

  1. Hi. Presumably this is an easy question, but anyone who understands the sax
    docs thinks completely differently than I do :)



    Following the usual cookbook examples, my app parses an open file as
    follows::

    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_external_ges, 1)
    # Hopefully the content handler can figure out the encoding from the
    # <?xml> element.
    handler = saxContentHandler(c, inputFileName, silent)
    parser.setContentHandler(handler)
    parser.parse(theFile)

    Here 'theFile' is an open file. Usually this works just fine, but when the
    filename contains u'\u8116' I get the following exception:



    Traceback (most recent call last):
      File "c:\prog\tigris-cvs\leo\src\leoFileCommands.py", line 2159, in parse_leo_file
        parser.parse(theFile)
      File "c:\python25\lib\xml\sax\expatreader.py", line 107, in parse
        xmlreader.IncrementalParser.parse(self, source)
      File "c:\python25\lib\xml\sax\xmlreader.py", line 119, in parse
        self.prepareParser(source)
      File "c:\python25\lib\xml\sax\expatreader.py", line 111, in prepareParser
        self._parser.SetBase(source.getSystemId())
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u8116' in position 44: ordinal not in range(128)



    Presumably the documentation at
    http://docs.python.org/lib/module-xml.sax.xmlreader.html
    would be sufficient for a sax-head, but I have absolutely no idea of how to
    create an InputSource that can handle non-ascii filenames.



    Any help would be appreciated. Thanks!



    Edward
    --------------------------------------------------------------------
    Edward K. Ream email:
    Leo: http://webpages.charter.net/edreamleo/front.html
    --------------------------------------------------------------------
     
    Edward K. Ream, Oct 4, 2006
    #1

  2. Edward K. Ream wrote:

    > Hi. Presumably this is an easy question, but anyone who understands the
    > sax docs thinks completely differently than I do :)
    >
    > Following the usual cookbook examples, my app parses an open file as
    > follows::
    >
    > parser = xml.sax.make_parser()
    > parser.setFeature(xml.sax.handler.feature_external_ges,1)
    > # Hopefully the content handler can figure out the encoding from the
    > # <?xml> element.
    > handler = saxContentHandler(c,inputFileName,silent)
    > parser.setContentHandler(handler)
    > parser.parse(theFile)
    >
    > Here 'theFile' is an open file. Usually this works just fine, but when
    > [...]


    Filenames are expected to be bytestrings. So what happens is that the
    unicode string you pass as filename gets implicitly converted using the
    default encoding.

    You have to encode the unicode string according to your filesystem
    beforehand.

    Diez
     
    Diez B. Roggisch, Oct 4, 2006
    #2

  3. Diez B. Roggisch wrote:

    > Filenames are expected to be bytestrings. So what happens is that the
    > unicode string you pass as filename gets implicitly converted using the
    > default encoding.


    it is ?

    >>> f = open(u"\u8116", "w")
    >>> f.write("hello")
    >>> f.close()
    >>> f = open(u"\u8116", "r")
    >>> f.read()
    'hello'

    </F>
     
    Fredrik Lundh, Oct 4, 2006
    #3
  4. > Filenames are expected to be bytestrings.

    The exception happens in a method to which no fileName is passed as an
    argument.

    parse_leo_file:
    'C:\\prog\\tigris-cvs\\leo\\test\\unittest\\chinese?folder\\chinese?test.leo'
    (trace of converted fileName)

    Unexpected exception parsing
    C:\prog\tigris-cvs\leo\test\unittest\chinese?folder\chinese?test.leo
    Traceback (most recent call last):
      File "c:\prog\tigris-cvs\leo\src\leoFileCommands.py", line 2162, in parse_leo_file
        parser.parse(theFile)
      File "c:\python25\lib\xml\sax\expatreader.py", line 107, in parse
        xmlreader.IncrementalParser.parse(self, source)
      File "c:\python25\lib\xml\sax\xmlreader.py", line 119, in parse
        self.prepareParser(source)
      File "c:\python25\lib\xml\sax\expatreader.py", line 111, in prepareParser
        self._parser.SetBase(source.getSystemId())
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u8116' in position 44: ordinal not in range(128)

    To repeat, theFile is an open file. I believe the actual filename is passed
    nowhere as an argument to sax in my code. Just to make sure, I converted
    the filename to ascii in my code, and got (no surprise) exactly the same
    crash. I suppose a workaround would be to pass a file-like object to sax
    instead of an open file, so that theFile.getSystemId won't crash. But this
    looks like a bug to me.
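
One way to try the file-like-object idea without reading the whole file into memory is to wrap the open file in an `InputSource` and leave the system id unset. This is a minimal sketch, assuming the crash comes from the system id that sax derives from the open file (as the `source.getSystemId()` frame in the traceback suggests). `ElementNames` and `parse_without_system_id` are hypothetical names, the code is written for a current Python, and it has not been tried against Python 2.5:

```python
import io
import xml.sax
import xml.sax.handler
import xml.sax.xmlreader


class ElementNames(xml.sax.handler.ContentHandler):
    """Hypothetical stand-in for the application's content handler."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        # Record each element name as the parser reports it.
        self.names.append(name)


def parse_without_system_id(binary_file, handler):
    # Hand sax an explicit InputSource so it is never given binary_file.name.
    source = xml.sax.xmlreader.InputSource()  # system id stays None
    source.setByteStream(binary_file)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)


# An in-memory stream stands in for a file opened in binary mode.
handler = ElementNames()
parse_without_system_id(io.BytesIO(b"<leo><node/></leo>"), handler)
```

With no system id on the source, relative references to external entities would have no base to resolve against, which may matter if the document relies on them.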

    BTW:

    Python 2.5.0, Tk 8.4.12, Pmw 1.2
    Windows 5, 1, 2600, 2, Service Pack 2

    Edward
     
    Edward K. Ream, Oct 4, 2006
    #4
  5. Diez B. Roggisch wrote:
    > Edward K. Ream wrote:
    >
    > > Hi. Presumably this is an easy question, but anyone who understands the
    > > sax docs thinks completely differently than I do :)
    > >
    > > Following the usual cookbook examples, my app parses an open file as
    > > follows::
    > >
    > > parser = xml.sax.make_parser()
    > > parser.setFeature(xml.sax.handler.feature_external_ges,1)
    > > # Hopefully the content handler can figure out the encoding from the
    > > # <?xml> element.
    > > handler = saxContentHandler(c,inputFileName,silent)
    > > parser.setContentHandler(handler)
    > > parser.parse(theFile)
    > >
    > > Here 'theFile' is an open file. Usually this works just fine, but when
    > > [...]
    >
    > Filenames are expected to be bytestrings. So what happens is that the
    > unicode string you pass as filename gets implicitly converted using the
    > default encoding.
    >
    > You have to encode the unicode string according to your filesystem
    > beforehand.


    Not if your filesystem supports Unicode names, as Windows does.
    Edward's point is that something is (whether by accident or "design")
    trying to coerce it to str, and failing.
     
    John Machin, Oct 4, 2006
    #5
  6. Re: sax barfs on unicode filenames: workaround

    Happily, the workaround is easy. Replace theFile with:

    # Use cStringIO to avoid a crash in sax when inputFileName has unicode
    # characters.
    s = theFile.read()
    theFile = cStringIO.StringIO(s)

    My first attempt at a workaround was to use:

    s = theFile.read()
    parser.parseString(s)

    but the expat parser does not support parseString...

    Edward
     
    Edward K. Ream, Oct 4, 2006
    #6
  7. Fredrik Lundh wrote:
    > Diez B. Roggisch wrote:
    >
    >> Filenames are expected to be bytestrings. So what happens is that the
    >> unicode string you pass as filename gets implicitly converted using the
    >> default encoding.
    >
    > it is ?


    Yes. While you can pass Unicode strings as file names to many Python
    functions, you can't pass them to Expat, as Expat requires the file name
    as a byte string. Hence the error.

    Regards,
    Martin

    P.S. and just to anticipate nit-picking: yes, you can pass a Unicode
    string to Expat, too, as long as the Unicode string only contains
    ASCII characters. And yes, it doesn't have to be ASCII, if you change
    the system default encoding.
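
The failing conversion can be reproduced without sax. This is a small sketch, assuming the implicit conversion behaves like an explicit `encode("ascii")`; the strings are made up for illustration:

```python
# u'\u8116' is the character named in the traceback.
name = u"chinese\u8116folder"

try:
    name.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    # Same exception type as in the traceback: the character has no ASCII form.
    error = exc

# A Unicode string holding only ASCII characters converts without complaint.
ascii_only = u"plain_folder".encode("ascii")
```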
     
    Martin v. Löwis, Oct 4, 2006
    #7
  8. Re: sax barfs on unicode filenames: workaround

    Edward K. Ream wrote:
    > Happily, the workaround is easy. Replace theFile with:
    >
    > # Use cStringIO to avoid a crash in sax when inputFileName has unicode
    > # characters.
    > s = theFile.read()
    > theFile = cStringIO.StringIO(s)
    >
    > My first attempt at a workaround was to use:
    >
    > s = theFile.read()
    > parser.parseString(s)
    >
    > but the expat parser does not support parseString...


    Right - you would have to use xml.sax.parseString (which is a global
    function, not a method).

    Of course, parseString just does what you did: create a cStringIO
    object and operate on that.
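
For reference, a minimal sketch of the module-level call, written for a current Python rather than the 2.5 in this thread. `ElementNames` is a hypothetical handler and the XML is made-up sample data:

```python
import xml.sax
import xml.sax.handler


class ElementNames(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records element names."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


# In practice the bytes would come from theFile.read().
data = b"<leo><node/></leo>"
handler = ElementNames()
# parseString is a function of the xml.sax module, not a parser method.
xml.sax.parseString(data, handler)
```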

    Regards,
    Martin
     
    Martin v. Löwis, Oct 4, 2006
    #8
  9. Martin v. Löwis wrote:

    > Yes. While you can pass Unicode strings as file names to many Python
    > functions, you can't pass them to Expat, as Expat requires the file name
    > as a byte string. Hence the error.


    sounds like a bug in the xml.sax layer, really (ET also uses Expat, and
    doesn't seem to have any problems dealing with unicode filenames...)

    </F>
     
    Fredrik Lundh, Oct 4, 2006
    #9
  10. Fredrik Lundh wrote:
    > Martin v. Löwis wrote:
    >
    >> Yes. While you can pass Unicode strings as file names to many Python
    >> functions, you can't pass them to Expat, as Expat requires the file name
    >> as a byte string. Hence the error.
    >
    > sounds like a bug in the xml.sax layer, really (ET also uses Expat, and
    > doesn't seem to have any problems dealing with unicode filenames...)


    That's because ET never invokes XML_SetBase. Without testing, this
    suggests that there might be a problem in ET with relative URIs
    in parsed external entities. XML_SetBase expects a char* for the
    base URI.
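
The same Expat call is exposed through Python's `xml.parsers.expat` module. This is a minimal sketch on a current Python, using a made-up ASCII base URI so that it makes no claim about how non-ASCII values are treated:

```python
import xml.parsers.expat

parser = xml.parsers.expat.ParserCreate()
# SetBase records the base that is later passed to the external entity
# reference handler, so the application can resolve relative system
# identifiers against it. GetBase returns what was stored.
parser.SetBase("file:///tmp/docs/")
base = parser.GetBase()
```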

    Regards,
    Martin
     
    Martin v. Löwis, Oct 4, 2006
    #10