elementtree XML() unicode

Discussion in 'Python' started by Kee Nethery, Nov 4, 2009.

  1. Kee Nethery

    Kee Nethery Guest

    Having an issue with elementtree XML() in python 2.6.4.

    This code works fine:

    from xml.etree import ElementTree as et
    getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
    <customer><shipping><state>bobble</state><city>head</city><street>city</street></shipping></customer>'''
    theResponseXml = et.XML(getResponse)

    This code errors out when it tries to do the et.XML()

    from xml.etree import ElementTree as et
    getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
    <customer><shipping><state>\ue58d83\ue89189\ue79c8C</state><city>\ue69f8f\ue5b882</city><street>\ue9ab98\ue58d97\ue58fb03</street></shipping></customer>'''
    theResponseXml = et.XML(getResponse)

    In my real code, I'm pulling the getResponse data from a web page that
    returns as XML and when I display it in the browser you can see the
    Japanese characters in the data. I've removed all the stuff in my code
    and tried to distill it down to just what is failing. Hopefully I have
    not removed something essential.

    Why is this not working and what do I need to do to use Elementtree
    with unicode?

    Thanks, Kee Nethery
    Kee Nethery, Nov 4, 2009
    #1

  2. John Machin

    John Machin Guest

    On Nov 4, 11:01 am, Kee Nethery <> wrote:
    > Having an issue with elementtree XML() in python 2.6.4.
    >
    > This code works fine:
    >
    >       from xml.etree import ElementTree as et
    >       getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>  
    > <customer><shipping><state>bobble</state><city>head</
    > city><street>city</street></shipping></customer>'''
    >       theResponseXml = et.XML(getResponse)
    >
    > This code errors out when it tries to do the et.XML()
    >
    >       from xml.etree import ElementTree as et
    >       getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>  
    > <customer><shipping><state>\ue58d83\ue89189\ue79c8C</state><city>
    > \ue69f8f\ue5b882</city><street>\ue9ab98\ue58d97\ue58fb03</street></
    > shipping></customer>'''
    >       theResponseXml = et.XML(getResponse)
    >
    > In my real code, I'm pulling the getResponse data from a web page that  
    > returns as XML and when I display it in the browser you can see the  
    > Japanese characters in the data. I've removed all the stuff in my code  
    > and tried to distill it down to just what is failing. Hopefully I have  
    > not removed something essential.
    >
    > Why is this not working and what do I need to do to use Elementtree  
    > with unicode?


    What you need to do is NOT feed it unicode. You feed it a str object
    and it gets decoded according to the encoding declaration found in the
    first line. So take the str object that you get from the web (should
    be UTF8-encoded already unless the header is lying), and throw that at
    ET ... like this:

    | Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
    | Type "help", "copyright", "credits" or "license" for more information.
    | >>> from xml.etree import ElementTree as et
    | >>> ucode = u'''<?xml version="1.0" encoding="UTF-8"?>
    | ... <customer><shipping>
    | ... <state>\ue58d83\ue89189\ue79c8C</state>
    | ... <city>\ue69f8f\ue5b882</city>
    | ... <street>\ue9ab98\ue58d97\ue58fb03</street>
    | ... </shipping></customer>'''
    | >>> xml = et.XML(ucode)
    | Traceback (most recent call last):
    |   File "<stdin>", line 1, in <module>
    |   File "C:\python26\lib\xml\etree\ElementTree.py", line 963, in XML
    |     parser.feed(text)
    |   File "C:\python26\lib\xml\etree\ElementTree.py", line 1245, in feed
    |     self._parser.Parse(data, 0)
    | UnicodeEncodeError: 'ascii' codec can't encode character u'\ue58d' in position 69: ordinal not in range(128)
    | # as expected
    | >>> strg = ucode.encode('utf8')
    | # encoding as utf8 is for DEMO purposes.
    | # i.e. use the original web str object, don't convert it to unicode
    | # and back to utf8.
    | >>> xml2 = et.XML(strg)
    | >>> xml2.tag
    | 'customer'
    | >>> for c in xml2.getchildren():
    | ...     print c.tag, repr(c.text)
    | ...
    | shipping '\n'
    | >>> for c in xml2[0].getchildren():
    | ...     print c.tag, repr(c.text)
    | ...
    | state u'\ue58d83\ue89189\ue79c8C'
    | city u'\ue69f8f\ue5b882'
    | street u'\ue9ab98\ue58d97\ue58fb03'
    | >>>
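
    For readers on Python 3, the same advice can be sketched roughly as
    follows. This is an illustration only, not part of the 2.6 session
    above: in Python 3 the bytes type plays the role of the 2.x str
    object, and both the raw payload and its Japanese sample text are
    made-up stand-ins for the undecoded body of a web response.

```python
from xml.etree import ElementTree as et

# Hypothetical stand-in for the undecoded body of an HTTP response.
# The payload is UTF-8 encoded, matching its own XML declaration.
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<customer><shipping>'
    '<state>\u5343\u8449\u770c</state>'
    '<city>\u67cf\u5e02</city>'
    '</shipping></customer>'
).encode('utf-8')

# Hand the encoded bytes straight to the parser; it decodes them
# according to the encoding declaration in the document.
root = et.XML(raw)

# Element text comes back as decoded (Unicode) strings.
texts = dict((child.tag, child.text) for child in root[0])
print(texts)
```

    The point of the sketch is that no decode/encode round trip happens
    in user code: the bytes go in as received, and Unicode text comes
    out of the element tree.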

    By the way: (1) it usually helps to be more explicit than "errors
    out", preferably with the exact copied/pasted output as shown above,
    although this is one of the rare cases where the error message is
    predictable; (2) PLEASE don't start a new topic in a reply in somebody
    else's thread.
    John Machin, Nov 4, 2009
    #2

  3. Kee Nethery

    Kee Nethery Guest

    On Nov 3, 2009, at 5:27 PM, John Machin wrote:

    > On Nov 4, 11:01 am, Kee Nethery <> wrote:
    >> Having an issue with elementtree XML() in python 2.6.4.
    >>
    >> This code works fine:
    >>
    >> from xml.etree import ElementTree as et
    >> getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
    >> <customer><shipping><state>bobble</state><city>head</
    >> city><street>city</street></shipping></customer>'''
    >> theResponseXml = et.XML(getResponse)
    >>
    >> This code errors out when it tries to do the et.XML()
    >>
    >> from xml.etree import ElementTree as et
    >> getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
    >> <customer><shipping><state>\ue58d83\ue89189\ue79c8C</state><city>
    >> \ue69f8f\ue5b882</city><street>\ue9ab98\ue58d97\ue58fb03</street></
    >> shipping></customer>'''
    >> theResponseXml = et.XML(getResponse)
    >>
    >> In my real code, I'm pulling the getResponse data from a web page
    >> that
    >> returns as XML and when I display it in the browser you can see the
    >> Japanese characters in the data. I've removed all the stuff in my
    >> code
    >> and tried to distill it down to just what is failing. Hopefully I
    >> have
    >> not removed something essential.
    >>
    >> Why is this not working and what do I need to do to use Elementtree
    >> with unicode?

    >
    > What you need to do is NOT feed it unicode. You feed it a str object
    > and it gets decoded according to the encoding declaration found in the
    > first line.


    That it uses "the encoding declaration found in the first line" is the
    nugget of data that is not in the documentation that has stymied me
    for days. Thank you!

    The other thing that has been confusing is that I've been using "dump"
    to view what is in the elementtree instance and the non-ASCII
    characters have been displayed as "numbered
    entities" (<city>&#26575;&#24066;</city>) and I know that is not the
    representation I want the data to be in. A co-worker suggested that
    instead of "dump" that I use "et.tostring(theResponseXml,
    encoding='utf-8')" and then print that to see the characters. That
    process causes the non-ASCII characters to display as the glyphs I
    know them to be.
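
    A minimal sketch of that difference, written for Python 3 and using
    a made-up one-element document (an assumption for illustration, not
    the real response): tostring() without an encoding argument falls
    back to US-ASCII and writes non-ASCII characters as numeric
    character references, while encoding='utf-8' writes the characters
    themselves as UTF-8 bytes.

```python
from xml.etree import ElementTree as et

# Made-up sample element; the text is two Japanese characters.
elem = et.XML('<city>\u67cf\u5e02</city>'.encode('utf-8'))

# Default encoding is US-ASCII, so non-ASCII text is written as
# numeric character references.
ascii_xml = et.tostring(elem)

# Asking for UTF-8 keeps the characters as encoded bytes.
utf8_xml = et.tostring(elem, encoding='utf-8')

print(ascii_xml)
print(utf8_xml.decode('utf-8'))
```

    Both results describe the same document; only the serialized form
    of the non-ASCII characters differs.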

    If there was a place in the official docs for me to append these
    nuggets of information to the sections for
    "xml.etree.ElementTree.XML(text)" and
    "xml.etree.ElementTree.dump(elem)" I would absolutely do so.

    Thank you!
    Kee Nethery


    Kee Nethery, Nov 4, 2009
    #3
  4. John Machin

    John Machin Guest

    On Nov 4, 1:06 pm, Kee Nethery <> wrote:
    > On Nov 3, 2009, at 5:27 PM, John Machin wrote:
    >
    >
    >
    > > On Nov 4, 11:01 am, Kee Nethery <> wrote:


    > >> Why is this not working and what do I need to do to use Elementtree
    > >> with unicode?

    >
    > > What you need to do is NOT feed it unicode. You feed it a str object
    > > and it gets decoded according to the encoding declaration found in the
    > > first line.

    >
    > That it uses "the encoding declaration found in the first line" is the  
    > nugget of data that is not in the documentation that has stymied me  
    > for days. Thank you!


    And under the "don't repeat" principle, it shouldn't be in the
    ElementTree docs; it's nothing special about ET -- it's part of the
    definition of an XML document. For universal loss-free
    transportability a document naturally must be encoded somehow, and
    it must state what its own encoding is unless that is the default
    (UTF-8).
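
    A small sketch of that rule, assuming Python 3 and a made-up
    Latin-1 document (neither comes from the thread): the same bytes
    parse correctly when the declaration names their encoding, and fail
    when the declaration is missing and the parser falls back to UTF-8.

```python
from xml.etree import ElementTree as et

# Made-up document: one non-ASCII character, stored as Latin-1 bytes.
body = '<city>Orl\u00e9ans</city>'.encode('iso-8859-1')

# With a matching declaration the parser decodes the bytes correctly.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
city = et.XML(declared).text

# Without a declaration the parser assumes UTF-8, and these bytes are
# not valid UTF-8, so parsing fails.
try:
    et.XML(body)
    undeclared_ok = True
except et.ParseError:
    undeclared_ok = False

print(city, undeclared_ok)
```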

    > The other thing that has been confusing is that I've been using "dump"  
    > to view what is in the elementtree instance and the non-ASCII  
    > characters have been displayed as "numbered
    > entities" (<city>&#26575;&#24066;</city>) and I know that is not the
    > representation I want the data to be in. A co-worker suggested that  
    > instead of "dump" that I use "et.tostring(theResponseXml,  
    > encoding='utf-8')" and then print that to see the characters. That  
    > process causes the non-ASCII characters to display as the glyphs I  
    > know them to be.
    >
    > If there was a place in the official docs for me to append these  
    > nuggets of information to the sections for  
    > "xml.etree.ElementTree.XML(text)" and  
    > "xml.etree.ElementTree.dump(elem)" I would absolutely do so.


    I don't understand ... tostring() is in the same section as dump(),
    about two screen-heights away. Do you want to include the tostring()
    docs in the dump() docs? The usual idea is not to get bogged down in
    the first function that at first glance looks like it might do what
    you want ("look at the glyphs") but doesn't (dump() writes a
    transportable XML stream); instead, press on to the next plausible
    candidate.
    John Machin, Nov 4, 2009
    #4
  5. Gabriel Genellina

    On Tue, 03 Nov 2009 23:06:58 -0300, Kee Nethery <> wrote:

    > If there was a place in the official docs for me to append these nuggets
    > of information to the sections for "xml.etree.ElementTree.XML(text)" and
    > "xml.etree.ElementTree.dump(elem)" I would absolutely do so.


    http://bugs.python.org/ applies to documentation too.

    --
    Gabriel Genellina
    Gabriel Genellina, Nov 4, 2009
    #5
  6. Kee Nethery

    Kee Nethery Guest

    On Nov 3, 2009, at 7:06 PM, Gabriel Genellina wrote:

    > On Tue, 03 Nov 2009 23:06:58 -0300, Kee Nethery <> wrote:
    >
    >> If there was a place in the official docs for me to append these
    >> nuggets of information to the sections for
    >> "xml.etree.ElementTree.XML(text)" and
    >> "xml.etree.ElementTree.dump(elem)" I would absolutely do so.

    >
    > http://bugs.python.org/ applies to documentation too.


    I've submitted documentation bugs in the past; no action was taken
    on them and the bugs were closed. I'm guessing that information "that
    everyone knows" not being in the documentation is not considered a
    bug. It's my fault I'm a newbie and I accept that. Thanks to you two
    for helping me get past this block.

    Kee
    Kee Nethery, Nov 4, 2009
    #6