cElementTree encoding woes

Discussion in 'Python' started by Diez B. Roggisch, Feb 20, 2006.

  1. Hi,

    I've got to deal with a pretty huge XML-document, and to do so I use the
    cElementTree.iterparse functionality. Working great.

    Only trouble: The guys creating that chunk of XML - well, let's just say they
    are "encodingly challenged", so they don't produce utf-8, but only cp1252
    instead, together with some weird name (Windows-1252) for that. That is not
    part of the standard codecs module. cp1252 is, of course.

    But that won't work for iterparse. So currently, I manually change the
    encoding given to utf-8, and use a stream-recoder.

    However, I was wondering if I could teach cElementTree about that encoding
    name. I tried to register cp1252 under the name Windows-1252, but had no
    luck - cET won't buy it.
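
For reference, here is a minimal sketch of what registering an extra codec name looks like with the standard `codecs` module. The alias `x-vendor-1252` and the function name `_alias_search` are made up for illustration. This only teaches Python's own codec registry about the name; a parser that keeps its own encoding table may well ignore it, which is the behaviour described above.

```python
import codecs


def _alias_search(name):
    """Codec search function mapping a hypothetical alias onto cp1252."""
    # The registry hands over a lowercased name; depending on the Python
    # version, hyphens may or may not already be turned into underscores.
    if name.replace("-", "_") == "x_vendor_1252":
        return codecs.lookup("cp1252")
    return None  # not our alias, let other search functions try


codecs.register(_alias_search)

# The alias now behaves like cp1252 for ordinary encode/decode calls.
encoded = "Probabilit\u00e9s".encode("x-vendor-1252")
decoded = encoded.decode("X-Vendor-1252")
```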

    Any suggestions?

    Diez
    Diez B. Roggisch, Feb 20, 2006
    #1

  2. Peter Otten

    Diez B. Roggisch wrote:

    > I've got to deal with a pretty huge XML-document, and to do so I use the
    > cElementTree.iterparse functionality. Working great.
    >
    > Only trouble: The guys creating that chunk of XML - well, let's just say
    > they are "encodingly challenged", so they don't produce utf-8, but only
    > cp1252 instead, together with some weird name (Windows-1252) for that.
    > That is not part of the standard codecs module. cp1252 is, of course.
    >
    > But that won't work for iterparse. So currently, I manually change the
    > encoding given to utf-8, and use a stream-recoder.
    >
    > However, I was wondering if I could teach cElementTree about that encoding
    > name. I tried to register cp1252 under the name Windows-1252, but had no
    > luck - cET won't buy it.
    >
    > Any suggestions?


    Both my python2.3 and python2.4 interpreters seem to know "Windows-1252":

    >>> import codecs
    >>> codecs.open("windows.xml", encoding="windows-1252")

    <open file 'windows.xml', mode 'rb' at 0x403737e0>

    Maybe the problem lies in the python installation rather than cElementTree?
    Just guessing, though.
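
A quick way to check this on any installation is to ask the codec registry directly. This is a small sketch for a current Python 3 interpreter; the helper name `resolves_to` is invented here.

```python
import codecs


def resolves_to(alias, canonical):
    """Return True if both names resolve to the same registered codec."""
    try:
        return codecs.lookup(alias).name == codecs.lookup(canonical).name
    except LookupError:
        # At least one of the names is unknown to this installation.
        return False


print(resolves_to("Windows-1252", "cp1252"))
```

If this prints False, the installation's codec aliases are the place to look; if it prints True, the name is known to Python and the problem lies elsewhere.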

    Peter
    Peter Otten, Feb 20, 2006
    #2

  3. Peter Otten wrote:

    > Both my python2.3 and python2.4 interpreters seem to know "Windows-1252":
    >
    > >>> import codecs
    > >>> codecs.open("windows.xml", encoding="windows-1252")
    > <open file 'windows.xml', mode 'rb' at 0x403737e0>
    >
    > Maybe the problem lies in the python installation rather than
    > cElementTree? Just guessing, though.


    Hm. No idea why I was under the impression they weren't there - but still,
    it doesn't work:

    import sys
    import cElementTree

    inf = file(sys.argv[1])
    #inf = codecs.StreamRecoder(inf, encoder, decoder, reader, writer)

    for event, elem in cElementTree.iterparse(inf):
        pass

    pukes on me with

    Traceback (most recent call last):
      File "./splitter.py", line 31, in ?
        for event, elem in cElementTree.iterparse(inf):
      File "<string>", line 61, in __iter__
    SyntaxError: not well-formed (invalid token): line 35, column 34

    That is the first French character encountered.

    """<title>Introduction aux Probabilités</title>"""


    So the problem is not that the codec is being ignored; it simply is not
    working.

    Regards,

    Diez
    Diez B. Roggisch, Feb 20, 2006
    #3
  4. Diez B. Roggisch wrote:

    > I've got to deal with a pretty huge XML-document, and to do so I use the
    > cElementTree.iterparse functionality. Working great.
    >
    > Only trouble: The guys creating that chunk of XML - well, let's just say they
    > are "encodingly challenged", so they don't produce utf-8, but only cp1252
    > instead, together with some weird name (Windows-1252) for that. That is not
    > part of the standard codecs module. cp1252 is, of course.
    >
    > But that won't work for iterparse. So currently, I manually change the
    > encoding given to utf-8, and use a stream-recoder.
    >
    > However, I was wondering if I could teach cElementTree about that encoding
    > name. I tried to register cp1252 under the name Windows-1252, but had no
    > luck - cET won't buy it.


    you need cET 1.0.5 or later for this to work. for earlier versions, you have to use
    stream recoding:

    http://effbot.org/zone/celementtree-encoding.htm
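
A rough sketch of the stream-recoding idea, written for a current Python 3 with `xml.etree.ElementTree` rather than the cElementTree of the time. It is not the code from the page linked above. The class name `Utf8Recoder` and the declaration-patching regex are assumptions made for this example, and it assumes the XML declaration arrives whole in the first chunk read.

```python
import codecs
import io
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_RE = re.compile(rb'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2')


class Utf8Recoder:
    """File-like wrapper: reads cp1252 bytes, hands out UTF-8 bytes."""

    def __init__(self, raw, source_encoding="cp1252"):
        # EncodedFile decodes with source_encoding and re-encodes as UTF-8.
        self._recoder = codecs.EncodedFile(raw, "utf-8", source_encoding)
        self._first_chunk = True

    def read(self, size=-1):
        data = self._recoder.read(size)
        if self._first_chunk:
            self._first_chunk = False
            # Make the declaration agree with the bytes we now deliver.
            data = _DECL_RE.sub(rb"\g<1>\g<2>utf-8\g<2>", data, count=1)
        return data


def iter_titles(raw):
    """Yield the text of every <title> element, clearing as we go."""
    for _event, elem in ET.iterparse(Utf8Recoder(raw)):
        if elem.tag == "title":
            yield elem.text
        elem.clear()


sample = (
    '<?xml version="1.0" encoding="Windows-1252"?>'
    "<root><title>Introduction aux Probabilit\u00e9s</title></root>"
).encode("cp1252")
titles = list(iter_titles(io.BytesIO(sample)))
```

With a real file, the `io.BytesIO(sample)` part would be replaced by a file opened in binary mode.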

    </F>
    Fredrik Lundh, Feb 20, 2006
    #4