SAX unicode and ascii parsing problem

Discussion in 'Python' started by goldtech, Nov 30, 2010.

  1. goldtech

    goldtech Guest

    Hi,

    I'm trying to parse an xml file using SAX. About half-way through a
    file I get this error:

    Traceback (most recent call last):
      File "C:\Python26\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 325, in RunScript
        exec codeObject in __main__.__dict__
      File "E:\sc\b2.py", line 58, in <module>
        parser.parse(open(r'ppb5.xml'))
      File "C:\Python26\Lib\xml\sax\expatreader.py", line 107, in parse
        xmlreader.IncrementalParser.parse(self, source)
      File "C:\Python26\Lib\xml\sax\xmlreader.py", line 123, in parse
        self.feed(buffer)
      File "C:\Python26\Lib\xml\sax\expatreader.py", line 207, in feed
        self._parser.Parse(data, isFinal)
      File "C:\Python26\Lib\xml\sax\expatreader.py", line 304, in end_element
        self._cont_handler.endElement(name)
      File "E:\sc\b2.py", line 51, in endElement
        d.write(csv+"\n")
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 146-147: ordinal not in range(128)

    I'm using ActivePython 2.6. I'm trying to figure out the simplest fix.
    If there's a Python way to just take the source XML file and
    convert/process it so this will not happen - that would be best. Or
    should I just update to Python 3?

    I tried this but nothing changed. I thought this might convert it and
    then I'd parse the new file - didn't work:

    uc = open(r'E:\sc\ppb4.xml').read().decode('utf8')
    ascii = uc.decode('ascii')
    mex9 = open( r'E:\scrapes\ppb5.xml', 'w' )
    mex9.write(ascii)
    mex9.close()

    Again, I'm looking for something simple, even if it's a few more lines of
    code...or upgrade(?)

    Thanks, appreciate any help.
    goldtech, Nov 30, 2010
    #1

  2. Steve Holden

    Steve Holden Guest

    On 11/30/2010 3:43 PM, goldtech wrote:
    > snip...


    I'm just as stumped as I was when you first asked this question 13
    minutes ago. ;-)

    regards
    Steve

    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    PyCon 2011 Atlanta March 9-17 http://us.pycon.org/
    See Python Video! http://python.mirocommunity.org/
    Holden Web LLC http://www.holdenweb.com/
    Steve Holden, Nov 30, 2010
    #2

  3. goldtech

    goldtech Guest

    Re: SAX unicode and ascii parsing problem

    snip...
    >
    > I'm just as stumped as I was when you first asked this question 13
    > minutes ago. ;-)
    >
    > regards
    >  Steve
    >

    snip...

    Hi Steve,

    Think I found it, for example:

    line = 'my big string'
    line.encode('ascii', 'ignore')

    I processed the problem strings during parsing with this and it works
    now. Got this from:

    http://stackoverflow.com/questions/2365411/python-convert-unicode-to-ascii-without-errors


    Best, Lee

    :^)
    goldtech, Nov 30, 2010
    #3
  4. Re: SAX unicode and ascii parsing problem

    goldtech, 30.11.2010 22:15:
    > Think I found it, for example:
    >
    > line = 'my big string'
    > line.encode('ascii', 'ignore')
    >
    > I processed the problem strings during parsing with this and it works
    > now.


    That's not the right way of dealing with encodings, though. You should open
    the output file with a well-defined encoding (using codecs.open(), or
    io.open() in Python >= 2.6), and then write the unicode strings into it just
    as you get them.
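
    A minimal sketch of what that could look like, assuming the output should
    be UTF-8. The helper name write_rows, the sample rows, and the temporary
    output path are made up for illustration; in the real script the strings
    would come from the SAX handler. It is written so that it should run on
    both Python 2.6+ and Python 3:

```python
import io
import os
import tempfile


def write_rows(path, rows, encoding="utf-8"):
    """Write unicode strings to `path`, one per line, in an explicit encoding."""
    # io.open() returns a text-mode file object that encodes on write,
    # so the unicode strings are passed through unchanged.
    with io.open(path, "w", encoding=encoding) as out:
        for row in rows:
            out.write(row + u"\n")


# Hypothetical sample data standing in for what the SAX handler collects.
rows = [u"name,city", u"Jos\xe9,M\xe9xico"]
path = os.path.join(tempfile.mkdtemp(), "out.csv")
write_rows(path, rows)

# Read the file back with the same encoding to check the round trip.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read().splitlines()
```

    Nothing is dropped or replaced this way; the non-ASCII characters end up
    in the file as UTF-8 bytes.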

    Stefan
    Stefan Behnel, Dec 1, 2010
    #4
  5. goldtech wrote:
    > I tried this but nothing changed. I thought this might convert it and
    > then I'd parse the new file - didn't work:
    >
    > uc = open(r'E:\sc\ppb4.xml').read().decode('utf8')
    > ascii = uc.decode('ascii')
    > mex9 = open( r'E:\scrapes\ppb5.xml', 'w' )
    > mex9.write(ascii)


    This doesn't make sense either. decode() converts bytes into (Unicode)
    characters. After the first decode('utf8'), you already have those, so
    calling decode('ascii') on the result doesn't make sense. If you want ASCII,
    as the variable name suggests, you need to _encode_ the string. Be aware
    that not all characters can be represented in ASCII, though, and the
    presence of such a character seems to have caused your initial problem.
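
    To illustrate the difference between the two directions, here is a small
    sketch. The sample bytes are made up and not taken from the actual file;
    the error handlers 'ignore' and 'xmlcharrefreplace' are standard ones, but
    whether losing or escaping characters is acceptable depends on the data:

```python
raw = b"caf\xc3\xa9"        # UTF-8 bytes, as read from a file opened in binary mode
text = raw.decode("utf-8")  # bytes -> unicode characters

try:
    text.encode("ascii")    # strict ASCII fails on the non-ASCII character
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'ignore' silently loses the character.
dropped = text.encode("ascii", "ignore")
# 'xmlcharrefreplace' keeps it as a numeric character reference.
escaped = text.encode("ascii", "xmlcharrefreplace")
```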

    BTW:
    - XML is not necessarily UTF-8, but that's a different issue.
    - I would suggest you open files with 'rb' or 'wb' in order to suppress any
    conversions on line endings. Writing UTF-16 in particular would break if
    that conversion is active.
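
    A sketch of both points, assuming the document declares its own encoding
    in the XML declaration. The sample document and the TextCollector class
    are hypothetical, and a BytesIO object stands in for a file opened with
    'rb'. The parser is given raw bytes and does the decoding itself:

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects all character data as unicode strings."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them.
        self.chunks.append(content)


# Hypothetical document: Latin-1 bytes with a matching encoding declaration.
doc = b'<?xml version="1.0" encoding="iso-8859-1"?><city>M\xe9xico</city>'

handler = TextCollector()
# A binary stream stands in for open(path, 'rb').
xml.sax.parse(io.BytesIO(doc), handler)
parsed_text = u"".join(handler.chunks)
```

    The handler receives unicode strings regardless of the encoding the file
    was stored in, so no manual decode() step is needed before parsing.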

    Good luck!

    Uli

    --
    Domino Laser GmbH
    Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
    Ulrich Eckhardt, Dec 1, 2010
    #5