SAX unicode and ascii parsing problem


goldtech

Hi,

I'm trying to parse an xml file using SAX. About half-way through a
file I get this error:

Traceback (most recent call last):
  File "C:\Python26\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 325, in RunScript
    exec codeObject in __main__.__dict__
  File "E:\sc\b2.py", line 58, in <module>
    parser.parse(open(r'ppb5.xml'))
  File "C:\Python26\Lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python26\Lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python26\Lib\xml\sax\expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "C:\Python26\Lib\xml\sax\expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "E:\sc\b2.py", line 51, in endElement
    d.write(csv+"\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 146-147: ordinal not in range(128)

I'm using ActivePython 2.6. I'm trying to figure out the simplest fix.
If there's a Python way to just take the source XML file and
convert/process it so this will not happen - that would be best. Or
should I just update to Python 3?

I tried this but nothing changed. I thought this might convert it and
then I'd parse the new file, but it didn't work:

uc = open(r'E:\sc\ppb4.xml').read().decode('utf8')
ascii = uc.decode('ascii')
mex9 = open( r'E:\scrapes\ppb5.xml', 'w' )
mex9.write(ascii)
mex9.close()

Again I'm looking for something simple, even if it's a few more lines
of code...or upgrade(?)

Thanks, appreciate any help.
 
Steve Holden

I'm just as stumped as I was when you first asked this question 13
minutes ago. ;-)

regards
Steve
 
Stefan Behnel

goldtech, 30.11.2010 22:15:
Think I found it, for example:

line = 'my big string'
line.encode('ascii', 'ignore')

I processed the problem strings during parsing with this and it works
now.

That's not the right way of dealing with encodings, though. You should open
the file with a well defined encoding (using codecs.open() or io.open() in
Python >= 2.6), and then write the unicode strings into it just as you get
them.

Stefan
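
For illustration, here is a minimal sketch of what Stefan describes: open the output file with an explicit encoding and write the unicode strings unchanged. The row contents and the temporary output path are made up for the example, and UTF-8 is only an assumed choice of output encoding; the original b2.py is not shown in the thread.

```python
import io
import os
import tempfile

# Hypothetical row; in the real script this would be built in endElement().
csv_row = u"Z\u00fcrich,caf\u00e9,42"

# Hypothetical output path, placed in a temp directory for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "out.csv")

# io.open() takes an explicit encoding, so unicode strings are encoded to
# UTF-8 on write instead of going through the default 'ascii' codec.
with io.open(out_path, "w", encoding="utf-8") as d:
    d.write(csv_row + u"\n")

# Reading back with the same encoding returns the original text.
with io.open(out_path, "r", encoding="utf-8") as f:
    read_back = f.read()
```

With a file opened this way, a line like d.write(csv + u"\n") should no longer need any encode('ascii', 'ignore') workaround, provided csv is a unicode string.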
 
Ulrich Eckhardt

goldtech said:
I tried this but nothing changed, I thought this might convert it and
then I'd parse the new file - didn't work:

uc = open(r'E:\sc\ppb4.xml').read().decode('utf8')
ascii = uc.decode('ascii')
mex9 = open( r'E:\scrapes\ppb5.xml', 'w' )
mex9.write(ascii)

This doesn't make sense either. decode() converts bytes into (Unicode)
characters. After the first decode('utf8'), you already have those, so calling
decode('ascii') on the result doesn't make sense. If you want ASCII, as the
assignment suggests, you need to _encode_ the string. Be aware that not all
characters can be represented in ASCII, though, and the presence of such a
character seems to have caused your initial problem.
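
A small sketch of the decode/encode distinction described above. The sample bytes are invented for the example; the error handler names ('ignore', 'xmlcharrefreplace') are standard codec options.

```python
# UTF-8 bytes, as they would come from a file opened in 'rb' mode.
raw = b"caf\xc3\xa9"

# decode(): bytes -> Unicode characters.
text = raw.decode("utf-8")

# encode(): Unicode characters -> bytes. Strict ASCII encoding fails here
# because the accented character has no ASCII representation.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops the character, which loses data.
ignored = text.encode("ascii", "ignore")

# 'xmlcharrefreplace' keeps the information as a numeric character reference.
escaped = text.encode("ascii", "xmlcharrefreplace")
```

The 'ignore' variant is what makes the error go away in the workaround quoted earlier in the thread, at the price of throwing the non-ASCII characters away.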

BTW:
- XML is not necessarily UTF-8, but that's a different issue.
- I would suggest you open files with 'rb' or 'wb' in order to suppress any
conversions on line endings. Especially writing UTF-16 would fail if that
is active.
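
Both points can be sketched together, under some assumptions: the XML snippet, the TextCollector handler and the file names are hypothetical, and UTF-16 is used for the output only because it is the case mentioned above. The input declares ISO-8859-1 and is handed to the parser as bytes, so the parser picks the encoding from the XML declaration; the output is encoded explicitly and written in 'wb' mode so no line-ending translation touches the bytes.

```python
import os
import tempfile
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.xml")
dst_path = os.path.join(tmp_dir, "out.txt")

# An XML document that declares ISO-8859-1, not UTF-8.
xml_bytes = b'<?xml version="1.0" encoding="ISO-8859-1"?><r>caf\xe9</r>'
with open(src_path, "wb") as f:
    f.write(xml_bytes)

# Binary mode: the parser sees the raw bytes and honours the declaration.
handler = TextCollector()
with open(src_path, "rb") as f:
    xml.sax.parse(f, handler)
text = u"".join(handler.chunks)

# Encode explicitly and write bytes; 'wb' leaves line endings untouched.
with open(dst_path, "wb") as f:
    f.write((text + u"\n").encode("utf-16"))

with open(dst_path, "rb") as f:
    round_trip = f.read().decode("utf-16")
```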

Good luck!

Uli
 
