SAX unicode and ascii parsing problem

goldtech · Nov 30, 2010

Hi,

I'm trying to parse an xml file using SAX. About half-way through a
file I get this error:

Traceback (most recent call last):
  File "C:\Python26\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 325, in RunScript
    exec codeObject in __main__.__dict__
  File "E:\sc\b2.py", line 58, in <module>
    parser.parse(open(r'ppb5.xml'))
  File "C:\Python26\Lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python26\Lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python26\Lib\xml\sax\expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "C:\Python26\Lib\xml\sax\expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "E:\sc\b2.py", line 51, in endElement
    d.write(csv+"\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 146-147: ordinal not in range(128)

I'm using ActivePython 2.6. I'm trying to figure out the simplest fix.
If there's a Python way to just take the source XML file and
convert/process it so this will not happen - that would be best. Or
should I just update to Python 3?

I tried this, thinking it might convert the file so that I could then
parse the new file, but nothing changed - it didn't work:

uc = open(r'E:\sc\ppb4.xml').read().decode('utf8')
ascii = uc.decode('ascii')
mex9 = open( r'E:\scrapes\ppb5.xml', 'w' )
mex9.write(ascii)
mex9.close()

Again, I'm looking for something simple, even if it's a few more lines
of code... or upgrade(?)

Thanks, appreciate any help.

Steve Holden · Nov 30, 2010

I'm just as stumped as I was when you first asked this question 13
minutes ago. ;-)

regards
Steve

goldtech · Nov 30, 2010

snip...

I'm just as stumped as I was when you first asked this question 13
minutes ago. ;-)

regards
Steve

snip...

Hi Steve,

Think I found it, for example:

line = 'my big string'
line.encode('ascii', 'ignore')

I processed the problem strings during parsing with this and it works
now. Got this from:

http://stackoverflow.com/questions/2365411/python-convert-unicode-to-ascii-without-errors

Best, Lee

:^)
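
For reference, a small sketch of what the 'ignore' error handler actually does to a string that contains non-ASCII characters. The sample text is made up for illustration and is not taken from the original XML file. The point to notice is that 'ignore' drops characters silently, while other handlers at least leave a trace of them:

```python
# Sample text is an assumption for illustration only.
text = u'Caf\xe9 M\xfcller'

# 'ignore' silently drops every character ASCII cannot represent.
ignored = text.encode('ascii', 'ignore')

# 'replace' substitutes a question mark for each such character.
replaced = text.encode('ascii', 'replace')

# 'xmlcharrefreplace' keeps the information as XML character references.
as_refs = text.encode('ascii', 'xmlcharrefreplace')

print(ignored)
print(replaced)
print(as_refs)
```

Whether losing those characters is acceptable depends on what the output file is used for.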

Stefan Behnel · Dec 1, 2010

goldtech, 30.11.2010 22:15:

Think I found it, for example:

line = 'my big string'
line.encode('ascii', 'ignore')

I processed the problem strings during parsing with this and it works
now.

That's not the right way of dealing with encodings, though. You should open
the file with a well defined encoding (using codecs.open() or io.open() in
Python >= 2.6), and then write the unicode strings into it just as you get
them.

Stefan
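
A minimal sketch of the approach described above, assuming the output should be UTF-8. The handler class, the element name 'name', the sample document and the output path are all hypothetical; the original b2.py is not shown in the thread, so this is not a reconstruction of it:

```python
import io
import os
import tempfile
import xml.sax


class RowHandler(xml.sax.ContentHandler):
    """Hypothetical handler: writes the text of each <name> element as one line."""

    def __init__(self, out):
        xml.sax.ContentHandler.__init__(self)
        self.out = out
        self.buf = []

    def startElement(self, name, attrs):
        # Start collecting text afresh for every element.
        self.buf = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        self.buf.append(content)

    def endElement(self, name):
        if name == 'name':
            # The file object encodes to UTF-8, so no manual encode() is needed.
            self.out.write(u''.join(self.buf) + u'\n')


# Made-up sample document containing non-ASCII characters.
xml_bytes = u'<people><name>Caf\xe9</name><name>M\xfcller</name></people>'.encode('utf-8')

out_path = os.path.join(tempfile.mkdtemp(), 'out.csv')
with io.open(out_path, 'w', encoding='utf-8') as out:
    xml.sax.parseString(xml_bytes, RowHandler(out))

# Read the file back with the same encoding to check the result.
with io.open(out_path, 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

Because the encoding is fixed when the file is opened, the unicode strings coming out of the parser can be written unchanged and nothing is lost.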

Ulrich Eckhardt · Dec 1, 2010

goldtech said:
I tried this, thinking it might convert the file so that I could then
parse the new file, but nothing changed - it didn't work:

uc = open(r'E:\sc\ppb4.xml').read().decode('utf8')
ascii = uc.decode('ascii')
mex9 = open( r'E:\scrapes\ppb5.xml', 'w' )
mex9.write(ascii)

This doesn't make sense either. decode() will convert bytes into (Unicode)
characters. After the first decode('utf8'), you have those already. Calling
decode('ascii') on that doesn't make sense. If you want ASCII, as the
name of the variable you assign to suggests, you need to _encode_ the string. Be aware that not all
characters can be represented as ASCII though, and the presence of such a
character seems to have caused your initial problem.
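
The direction of the two calls can be sketched as follows. The byte string is a made-up sample. In Python 2, calling decode() on a unicode object first encodes it implicitly with the default ASCII codec, which is one way to end up with an error like the one in the traceback; in Python 3, str has no decode() method at all:

```python
# Made-up sample: the UTF-8 bytes for u'Caf\xe9', as read from a file in binary mode.
raw = b'Caf\xc3\xa9'

# decode(): bytes -> text (unicode characters)
text = raw.decode('utf-8')

# encode(): text -> bytes
back = text.encode('utf-8')

# Encoding to ASCII fails as soon as one character is outside the ASCII range.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(text), repr(back), ascii_ok)
```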

BTW:
- XML is not necessarily UTF-8, but that's a different issue.
- I would suggest you open files with 'rb' or 'wb' in order to suppress any
conversions on line endings. Especially writing UTF-16 would fail if that
is active.
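
A short sketch of the binary-mode suggestion, assuming the text is encoded explicitly before writing. The file name and sample text are made up for illustration:

```python
import os
import tempfile

# Made-up sample text with a non-ASCII character and a line ending.
text = u'Caf\xe9\n'

path = os.path.join(tempfile.mkdtemp(), 'sample-utf16.txt')

# 'wb' writes the bytes exactly as given; no line-ending conversion is applied.
with open(path, 'wb') as f:
    f.write(text.encode('utf-16'))

# 'rb' returns the bytes exactly as stored; decoding is done explicitly.
with open(path, 'rb') as f:
    data = f.read()

roundtrip = data.decode('utf-16')
print(repr(roundtrip))
```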

Good luck!

Uli
