cElementTree encoding woes

Diez B. Roggisch

Hi,

I've got to deal with a pretty huge XML-document, and to do so I use the
cElementTree.iterparse functionality. Working great.

Only trouble: the guys creating that chunk of XML - well, let's just say they
are "encodingly challenged", so they don't produce utf-8 but cp1252, and
they label it with a weird name (Windows-1252). That name is not part of the
standard codecs module. cp1252 is, of course.

But that name won't work for iterparse. So currently, I manually change the
declared encoding to utf-8 and use a stream recoder.
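
For readers who want to see the shape of that workaround, here is a minimal sketch. It is not the poster's actual script: it assumes Python 3, uses the standard library's xml.etree.ElementTree as a stand-in for cElementTree, and the sample document and the helper name recode_to_utf8 are made up for illustration. The sample's declaration already says utf-8, mirroring the manual edit described above, while the bytes themselves are cp1252.

```python
import codecs
import io
import xml.etree.ElementTree as ET  # assumption: stand-in for cElementTree

# Hypothetical sample: cp1252 bytes whose declaration was hand-edited to utf-8.
sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<doc><title>Introduction aux Probabilités</title></doc>"
).encode("cp1252")


def recode_to_utf8(raw, source_encoding="cp1252"):
    """Wrap a binary stream so that read() returns UTF-8 encoded bytes."""
    return codecs.EncodedFile(
        raw, data_encoding="utf-8", file_encoding=source_encoding
    )


titles = []
for event, elem in ET.iterparse(recode_to_utf8(io.BytesIO(sample))):
    if elem.tag == "title":
        titles.append(elem.text)
    elem.clear()  # drop processed content to keep memory use low

print(titles)
```

codecs.EncodedFile returns a StreamRecoder, so the parser only ever sees UTF-8 bytes and never has to know the original encoding name.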

However, I was wondering if I could teach cElementTree about that encoding
name. I tried to register cp1252 under the name Windows-1252, but had no
luck - cET won't buy it.
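
For illustration, registering an extra name with Python's codec registry might look like the sketch below. The alias x-legacy-1252 is hypothetical and chosen only so it does not collide with a name Python may already know. This only changes what Python's own codecs lookup accepts; whether an XML parser consults that registry depends on the parser and its version.

```python
import codecs

ALIAS = "x_legacy_1252"  # hypothetical alias, not a real encoding name


def alias_search(name):
    """Codec search function that maps the hypothetical alias onto cp1252."""
    # Names arrive lowercased; hyphen handling differs between Python versions,
    # so normalize hyphens to underscores before comparing.
    if name.replace("-", "_") == ALIAS:
        return codecs.lookup("cp1252")
    return None


codecs.register(alias_search)

encoded = "é".encode("X-Legacy-1252")
print(repr(encoded))
```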

Any suggestions?

Diez
 
Peter Otten

Diez said:
Only trouble: the guys creating that chunk of XML - well, let's just say
they are "encodingly challenged", so they don't produce utf-8 but cp1252,
and they label it with a weird name (Windows-1252). That name is not part of
the standard codecs module. cp1252 is, of course.

Both my python2.3 and python2.4 interpreters seem to know "Windows-1252":
<open file 'windows.xml', mode 'rb' at 0x403737e0>

Maybe the problem lies in the python installation rather than cElementTree?
Just guessing, though.
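
One way to check this on a given installation is to ask the codec registry directly. This is only a sketch and assumes Python 3 syntax; codecs.lookup raises LookupError if the name is unknown to the interpreter.

```python
import codecs

# Ask the codec registry which codec the name resolves to.
info = codecs.lookup("Windows-1252")
print(info.name)
```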

Peter
 
Diez B. Roggisch

Peter said:
Both my python2.3 and python2.4 interpreters seem to know "Windows-1252":
<open file 'windows.xml', mode 'rb' at 0x403737e0>

Maybe the problem lies in the python installation rather than
cElementTree? Just guessing, though.

Hm. No idea why I was under the impression it wasn't there - but still, it
doesn't work. This code:

import sys
import cElementTree

inf = file(sys.argv[1])
# stream recoding switched off for this test:
#inf = codecs.StreamRecoder(inf, encoder, decoder, reader, writer)

for event, elem in cElementTree.iterparse(inf):
    pass

pukes on me with

Traceback (most recent call last):
  File "./splitter.py", line 31, in ?
    for event, elem in cElementTree.iterparse(inf):
  File "<string>", line 61, in __iter__
SyntaxError: not well-formed (invalid token): line 35, column 34

That is the position of the first French character encountered:

"""<title>Introduction aux Probabilités</title>"""


So the problem is not that the codec is being ignored; it simply is not
working.

Regards,

Diez
 
Fredrik Lundh

Diez said:
However, I was wondering if I could teach cElementTree about that encoding
name. I tried to register cp1252 under the name Windows-1252, but had no
luck - cET won't buy it.

You need cET 1.0.5 or later for this to work. For earlier versions, you have
to use stream recoding:

http://effbot.org/zone/celementtree-encoding.htm
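
As a rough illustration of the "just works" case, the sketch below parses a document declared as Windows-1252 directly, with no recoding step. It assumes Python 3 and the standard library's xml.etree.ElementTree rather than cET 1.0.5 itself, and the sample document is made up from the title line quoted earlier in the thread.

```python
import io
import xml.etree.ElementTree as ET  # assumption: modern stand-in for cET

# Hypothetical sample: cp1252 bytes that are declared as Windows-1252.
sample = (
    '<?xml version="1.0" encoding="Windows-1252"?>\n'
    "<doc><title>Introduction aux Probabilités</title></doc>"
).encode("cp1252")

# Collect the text of every <title> element as its end tag is seen.
titles = [
    elem.text
    for event, elem in ET.iterparse(io.BytesIO(sample), events=("end",))
    if elem.tag == "title"
]
print(titles)
```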

</F>
 
