cElementTree encoding woes

Diez B. Roggisch · Feb 20, 2006

Hi,

I've got to deal with a pretty huge XML-document, and to do so I use the
cElementTree.iterparse functionality. Working great.

Only trouble: The guys creating that chunk of XML - well, let's just say they
are "encodingly challenged", so they don't produce utf-8, but only cp1252
instead, together with some weird name (Windows-1252) for that. That is not
part of the standard codecs module. cp1252 is, of course.

But that won't work for iterparse. So currently, I manually change the
encoding given to utf-8, and use a stream-recoder.

However, I was wondering if I could teach cElementTree about that encoding
name. I tried to register cp1252 under the name Windows-1252, but had no
luck - cET won't buy it.

Any suggestions?

Diez
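
The workaround described above (declaration changed to utf-8 by hand, bytes recoded on the fly) might look roughly like this in current Python. This is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree instead of the old cElementTree package, codecs.EncodedFile as a shortcut for building the StreamRecoder, and a made-up in-memory sample document in place of the real file.

```python
import codecs
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the declaration already says utf-8 (edited by hand),
# but the bytes themselves are still cp1252.
document = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<book><title>Introduction aux Probabilit\u00e9s</title></book>"
).encode("cp1252")

raw_stream = io.BytesIO(document)

# EncodedFile returns a codecs.StreamRecoder: bytes read from raw_stream are
# decoded as cp1252 and handed to the caller re-encoded as UTF-8.
recoded = codecs.EncodedFile(raw_stream, data_encoding="utf-8", file_encoding="cp1252")

titles = [elem.text for _event, elem in ET.iterparse(recoded) if elem.tag == "title"]
print(titles)
```

The parser only ever sees UTF-8 here, so the encoding name in the original file no longer matters.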

Peter Otten · Feb 20, 2006

Diez said:
I've got to deal with a pretty huge XML-document, and to do so I use the
cElementTree.iterparse functionality. Working great.

Only trouble: The guys creating that chunk of XML - well, let's just say
they are "encodingly challenged", so they don't produce utf-8, but only
cp1252 instead, together with some weird name (Windows-1252) for that.
That is not part of the standard codecs module. cp1252 is, of course.

But that won't work for iterparse. So currently, I manually change the
encoding given to utf-8, and use a stream-recoder.

However, I was wondering if I could teach cElementTree about that encoding
name. I tried to register cp1252 under the name Windows-1252, but had no
luck - cET won't buy it.

Any suggestions?

Both my python2.3 and python2.4 interpreters seem to know "Windows-1252":
<open file 'windows.xml', mode 'rb' at 0x403737e0>

Maybe the problem lies in the python installation rather than cElementTree?
Just guessing, though.

Peter
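
A quick way to check whether a given interpreter knows the name is to ask the codecs module directly. The snippet is a minimal sketch written for current Python 3, not the 2.3/2.4 interpreters mentioned in the post.

```python
import codecs

# Encoding names are looked up case-insensitively, and "Windows-1252"
# is a built-in alias for the cp1252 codec.
info = codecs.lookup("Windows-1252")
print(info.name)

# Both names decode the byte 0xE9 to the same character (e with acute accent).
same_result = b"\xe9".decode("Windows-1252") == b"\xe9".decode("cp1252")
print(same_result)
```

If the lookup raises LookupError instead, the alias really is missing from that installation.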

Diez B. Roggisch · Feb 20, 2006

Both my python2.3 and python2.4 interpreters seem to know "Windows-1252":

<open file 'windows.xml', mode 'rb' at 0x403737e0>

Maybe the problem lies in the python installation rather than
cElementTree? Just guessing, though.

Hm. No idea why I was under the impression they weren't there - but still,
it doesn't work: I get

inf = file(sys.argv[1])
#inf = codecs.StreamRecoder(inf, encoder, decoder, reader, writer)

for event, elem in cElementTree.iterparse(inf):
    pass

pukes on me with

Traceback (most recent call last):
  File "./splitter.py", line 31, in ?
    for event, elem in cElementTree.iterparse(inf):
  File "<string>", line 61, in __iter__
SyntaxError: not well-formed (invalid token): line 35, column 34

That is the first French character encountered.

"""<title>Introduction aux Probabilités</title>"""

So - then the problem is not the codec being ignored, but it simply is not
working.

Regards,

Diez
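
One plausible reading of the traceback is that the parser ignored the declared encoding and treated the bytes as UTF-8. That is an assumption, not something the thread confirms, but the numbers fit: in cp1252 the accented letter is the single byte 0xE9, which is not valid UTF-8 when followed by a plain "s", and it sits at offset 34 of the title line if the line has no leading whitespace.

```python
# The title line from the post, with the accented letter written as an escape.
line = "<title>Introduction aux Probabilit\u00e9s</title>"

raw = line.encode("cp1252")
accent_offset = raw.index(b"\xe9")  # position of the single-byte accented e

try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
    failure_offset = None
except UnicodeDecodeError as exc:
    decoded_as_utf8 = False
    failure_offset = exc.start  # first byte the UTF-8 decoder rejects

print(accent_offset, decoded_as_utf8, failure_offset)
```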

Fredrik Lundh · Feb 20, 2006

Diez said:
I've got to deal with a pretty huge XML-document, and to do so I use the
cElementTree.iterparse functionality. Working great.

Only trouble: The guys creating that chunk of XML - well, let's just say they
are "encodingly challenged", so they don't produce utf-8, but only cp1252
instead, together with some weird name (Windows-1252) for that. That is not
part of the standard codecs module. cp1252 is, of course.

But that won't work for iterparse. So currently, I manually change the
encoding given to utf-8, and use a stream-recoder.

However, I was wondering if I could teach cElementTree about that encoding
name. I tried to register cp1252 under the name Windows-1252, but had no
luck - cET won't buy it.

you need cET 1.0.5 or later for this to work. for earlier versions, you have to use
stream recoding:

http://effbot.org/zone/celementtree-encoding.htm

</F>
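
For readers on a current Python: the standard library's xml.etree.ElementTree, which replaced the separate cElementTree package, accepts the declared name without any recoding. The sketch below assumes that module and a made-up in-memory sample document. It says nothing about how cET 1.0.5 itself behaves.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: declared as Windows-1252 and really encoded that way.
document = (
    '<?xml version="1.0" encoding="Windows-1252"?>\n'
    "<book><title>Introduction aux Probabilit\u00e9s</title></book>"
).encode("cp1252")

titles = []
for event, elem in ET.iterparse(io.BytesIO(document)):
    if elem.tag == "title":
        titles.append(elem.text)
    elem.clear()  # release element contents once handled, as for huge documents

print(titles)
```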
