ElementTree cannot parse UTF-8 Unicode?

Discussion in 'Python' started by Erik Bethke, Jan 19, 2005.

  1. Erik Bethke

    Hello All,

    I am getting a "not well-formed" error at the beginning of the Korean
    text in the second example. Am I doing something wrong with how I am
    encoding my Korean? Do I need more of a wrapper around it than simple
    quotes? Is there some sort of XML syntax for indicating a Unicode
    string, or does the ElementTree library just not support reading
    Unicode?

    here is my test snippet:

    from elementtree import ElementTree
    vocabXML = ElementTree.parse('test2.xml').getroot()

    where I have two data files:

    this one works:
    <?xml version="1.0" encoding="UTF-8"?>
    <Vocab>
    <Word L1='Hahha'></Word>
    </Vocab>

    this one fails:
    <?xml version="1.0" encoding="UTF-8"?>
    <Vocab>
    <Word L1="어녕하세요!"></Word>
    </Vocab>
    Erik Bethke, Jan 19, 2005
    #1

  2. Erik Bethke wrote:

    > I am getting a "not well-formed" error at the beginning of the Korean
    > text in the second example. Am I doing something wrong with how I am
    > encoding my Korean? Do I need more of a wrapper around it than simple
    > quotes? Is there some sort of XML syntax for indicating a Unicode
    > string, or does the ElementTree library just not support reading
    > Unicode?


    XML is Unicode, and ElementTree supports all common encodings just
    fine (including UTF-8).

    > this one fails:
    > <?xml version="1.0" encoding="UTF-8"?>
    > <Vocab>
    > <Word L1="어녕하세요!"></Word>
    > </Vocab>


    this works just fine on my machine.

    what's the exact error message?

    what does

    print repr(open("test2.xml").read())

    print on your machine?

    what happens if you attempt to parse

    <Vocab>
    <Word L1="&#xC5B4;&#xB155;&#xD558;&#xC138;&#xC694;!" />
    </Vocab>

    ?

    </F>
    Fredrik Lundh, Jan 19, 2005
    #2
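The two checks suggested in this reply can be sketched in present-day Python 3 (the thread itself uses Python 2's `print repr(...)`). The temporary file and its contents are assumptions made so the example is self-contained. Reading in binary mode shows exactly which bytes are on disk, and an ASCII-only document with numeric character references tests the parser without depending on the file's encoding.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# The Korean greeting from the thread, written with escapes so this source stays ASCII.
korean = "\uc5b4\ub155\ud558\uc138\uc694!"

# Assumption: a stand-in for test2.xml, saved here as genuine UTF-8.
xml_text = ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<Vocab>\n<Word L1="' + korean + '"></Word>\n</Vocab>\n')
path = os.path.join(tempfile.mkdtemp(), "test2.xml")
with open(path, "wb") as f:
    f.write(xml_text.encode("utf-8"))

# Check 1: look at the raw bytes; repr() makes every non-ASCII byte visible.
with open(path, "rb") as f:
    raw = f.read()
print(repr(raw))

# Check 2: an ASCII-only document that spells the same characters as
# numeric character references, so the file encoding cannot interfere.
ascii_doc = ('<Vocab><Word L1="&#xC5B4;&#xB155;&#xD558;&#xC138;&#xC694;!" />'
             '</Vocab>')
from_refs = ET.fromstring(ascii_doc).find("Word").get("L1")
```

If the first check prints bytes that do not decode as UTF-8, the file on disk is not what its XML declaration claims, whatever it looks like in an editor.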

  3. Erik Bethke

    Hello Fredrik,

    1) The exact error is in line 1160 of self._parser.Parse(data, 0 ):
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 3,
    column 16

    2) You are right in that the print of the file read works just fine.

    3) You are also right in that the digitally encoded unicode also works
    fine. However, this solution has two new problems:

    1) The xml file is now not human readable
    2) After ElementTree gets done parsing it, I am feeding the text to a
    wx.TextCtrl via .SetValue() but that is now giving me an error message
    of being unable to convert that style of string

    So it seems to me, that ElementTree is just not expecting to run into
    the Korean characters for it is at column 16 that these begin. Am I
    formatting the XML properly?

    Thank you,
    -Erik
    Erik Bethke, Jan 20, 2005
    #3
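One way to get this kind of error, offered as an assumption since the thread never shows the raw bytes of the failing file, is a document that declares UTF-8 but whose Korean text was saved in a legacy code page. The sketch below uses Python 3's `xml.etree.ElementTree` (the standard-library descendant of the `elementtree` package) and picks `cp949` purely for illustration.

```python
import xml.etree.ElementTree as ET

korean = "\uc5b4\ub155\ud558\uc138\uc694!"
text = ('<?xml version="1.0" encoding="UTF-8"?>\n'
        '<Vocab>\n'
        '<Word L1="' + korean + '"></Word>\n'
        '</Vocab>')

good_bytes = text.encode("utf-8")   # bytes match the declaration
bad_bytes = text.encode("cp949")    # bytes contradict the declaration (assumed scenario)

# The correctly encoded document parses and yields the original text.
good_value = ET.fromstring(good_bytes).find("Word").get("L1")

# The mislabeled document is rejected by the parser.
error = None
try:
    ET.fromstring(bad_bytes)
except ET.ParseError as exc:
    error = exc
    print("parse failed:", exc)
```

Both documents look identical when pasted into a mail or viewed in an editor that guesses the encoding, which is why comparing the raw bytes is more informative than comparing the text.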
  4. On Wed, 19 Jan 2005 16:35:23 -0800, Erik Bethke wrote:
    > So it seems to me, that ElementTree is just not expecting to run into the
    > Korean characters for it is at column 16 that these begin. Am I
    > formatting the XML properly?


    You should post the file somewhere on the web. (I wouldn't expect Usenet
    to transmit it properly.)

    (Just jumping in to possibly save you a reply cycle.)
    Jeremy Bowers, Jan 20, 2005
    #4
  5. Erik Bethke wrote:

    > 2) You are right in that the print of the file read works just fine.


    but what does it look like? I saved a raw copy of your original mail,
    fixed the quoted-printable encoding, and got a UTF-8 encoded file
    that works just fine. the thing you've been parsing, and that you've
    cut and pasted into your mail, must be different, in some way.

    > 3) You are also right in that the digitally encoded unicode also works
    > fine. However, this solution has two new problems:


    that was just a test to make sure that your version of elementtree could
    handle Unicode characters on your platform.

    > 1) The xml file is now not human readable
    > 2) After ElementTree gets done parsing it, I am feeding the text to a
    > wx.TextCtrl via .SetValue() but that is now giving me an error message
    > of being unable to convert that style of string


    on my machine, the L1 attribute contains a Unicode string:

    >>> print repr(root.find("Word").get("L1"))
    u'\uc5b4\ub155\ud558\uc138\uc694!'

    what does it give you on your machine? (looks like wxPython cannot handle
    Unicode strings, but can that really be true?)

    > So it seems to me, that ElementTree is just not expecting to run into
    > the Korean characters for it is at column 16 that these begin. Am I
    > formatting the XML properly?


    nobody knows...

    </F>
    Fredrik Lundh, Jan 20, 2005
    #5
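For readers on Python 3, the same check might look like the sketch below. It assumes `xml.etree.ElementTree` in place of the standalone `elementtree` package, uses an in-memory `BytesIO` object as a stand-in for `test2.xml`, and uses `ascii()` where the post uses Python 2's `repr()`.

```python
import io
import xml.etree.ElementTree as ET

# The document from the first post, as UTF-8 bytes.
data = ('<?xml version="1.0" encoding="UTF-8"?>\n'
        '<Vocab>\n<Word L1="\uc5b4\ub155\ud558\uc138\uc694!"></Word>\n</Vocab>'
        ).encode("utf-8")

# ET.parse accepts a file name or a binary file object.
root = ET.parse(io.BytesIO(data)).getroot()
value = root.find("Word").get("L1")

# ascii() escapes every non-ASCII character, so the code points are visible.
shown = ascii(value)
print(shown)
```

The attribute comes back as a text string of six characters, not as encoded bytes, so any widget that receives it has to accept Unicode text.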
  6. Hi !

    >>> ...Usenet to transmit it properly


    newsgroups (NNTP) : yes, it does it
    usenet : perhaps (that depends on the newsgroups)
    clp : no





    Michel Claveau
    Do Re Mi chel La Si Do, Jan 20, 2005
    #6
  7. Fredrik Lundh, Thursday 20 January 2005 05:17, wrote:

    > what does it give you on your machine? (looks like wxPython cannot handle
    > Unicode strings, but can that really be true?)


    It does support Unicode if it was built to do so...

    --
    Godoy.
    Jorge Luiz Godoy Filho, Jan 20, 2005
    #7
  8. Jorge Luiz Godoy Filho wrote:

    >> what does it give you on your machine? (looks like wxPython cannot handle
    >> Unicode strings, but can that really be true?)

    >
    > It does support Unicode if it was built to do so...


    Python has supported Unicode in releases 1.6, 2.0, 2.1, 2.2, 2.3 and 2.4, so
    you might think that Unicode should be enabled by default in a UI toolkit for
    Python...

    </F>
    Fredrik Lundh, Jan 20, 2005
    #8
  9. Erik Bethke

    There is something wrong with the physical file... I downloaded a trial
    version of XML Spy home edition and built an equivalent of the Korean
    test file. With that file I got past the ElementTree error, and now
    I am stuck with the wx.TextCtrl error.

    To build the xml file in the first place I had code that looked like
    this:

    d = wxFileDialog(self, message="Choose a file",
                     defaultDir=os.getcwd(), defaultFile="", wildcard="*.xml",
                     style=wx.SAVE | wxOVERWRITE_PROMPT | wx.CHANGE_DIR)
    if d.ShowModal() == wx.ID_OK:
        # This returns a Python list of files that were selected.
        paths = d.GetPaths()
        layout = '<?xml version="1.0" encoding="UTF-8"?>\n'
        L1Word = self.t1.GetValue()
        L2Word = 'undefined'

        layout += '<Vocab>\n'
        layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'
        layout += '</Vocab>'
        open(paths[0], 'w').write(layout)
    d.Destroy()

    So apparently there is something wrong with physically constructing the
    file in this manner?

    Thank you,
    -Erik
    Erik Bethke, Jan 20, 2005
    #9
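Building the document by string concatenation leaves both the encoding and the escaping of quotes, `<` and `&` to the caller. A possible alternative, sketched here for Python 3's `xml.etree.ElementTree` and not for the 2005 `elementtree` package, is to let the library serialize the tree. `l1_word` is a hypothetical stand-in for the value returned by `self.t1.GetValue()`, and the output path is a temporary file chosen for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for self.t1.GetValue(); the quotes and ampersand
# are there to show that the serializer escapes what needs escaping.
l1_word = "\uc5b4\ub155\ud558\uc138\uc694! 'R&D'"

root = ET.Element("Vocab")
ET.SubElement(root, "Word", {"L1": l1_word})

path = os.path.join(tempfile.mkdtemp(), "vocab.xml")
# write() encodes the output as UTF-8 and emits a matching XML declaration.
ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# Round trip: parse the file back and read the attribute.
reloaded = ET.parse(path).getroot().find("Word").get("L1")
```

With this approach the declared encoding and the bytes on disk cannot drift apart, because the same call produces both.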
  10. Erik Bethke wrote:

    > layout += '<Vocab>\n'
    > layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'


    what does "print repr(L1Word)" print (that is, what does wxPython return?).
    it should be a Unicode string, but that would give you an error when you write
    it out:

    >>> f = open("file.txt", "w")
    >>> f.write(u'\uc5b4\ub155\ud558\uc138\uc694!')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode characters
    in position 0-4: ordinal not in range(128)

    have you hacked the default encoding in site/sitecustomize?

    what happens if you replace the L1Word term with L1Word.encode("utf-8")?

    can you post the repr() (either of what's in your file or of the thing, whatever
    it is, that wxPython returns)?

    </F>
    Fredrik Lundh, Jan 20, 2005
    #10
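The failure and the suggested fix can be reproduced in Python 3 with a small sketch. Python 2 applied the ASCII codec implicitly when a Unicode string was written to a plain file; here the ASCII codec is requested explicitly to get the same effect. The temporary path is an assumption for the example.

```python
import os
import tempfile

word = "\uc5b4\ub155\ud558\uc138\uc694!"
path = os.path.join(tempfile.mkdtemp(), "file.txt")

# Writing Korean text through the ASCII codec fails, as in the traceback above.
failed = False
try:
    with open(path, "w", encoding="ascii") as f:
        f.write(word)
except UnicodeEncodeError as exc:
    failed = True
    print(exc)

# Encoding by hand, as suggested in the post, and writing the bytes works.
with open(path, "wb") as f:
    f.write(word.encode("utf-8"))

with open(path, "rb") as f:
    stored = f.read()
```

Each Hangul syllable takes three bytes in UTF-8, so the six-character string becomes sixteen bytes on disk.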
  11. Erik Bethke

    That was a great clue. I am an idiot and tapped on the wrong download
    link... now I can read and parse the XML file fine, as long as I
    create it in XML Spy. If I create it by this method:

    d = wxFileDialog(self, message="Choose a file",
                     defaultDir=os.getcwd(), defaultFile="", wildcard="*.xml",
                     style=wx.SAVE | wxOVERWRITE_PROMPT | wx.CHANGE_DIR)
    if d.ShowModal() == wx.ID_OK:
        # This returns a Python list of files that were selected.
        paths = d.GetPaths()
        layout = '<?xml version="1.0" encoding="UTF-8"?>\n'
        L1Word = self.t1.GetValue()
        L2Word = 'undefined'

        layout += '<Vocab>\n'
        layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'
        layout += '</Vocab>'
        open(paths[0], 'w').write(layout)

    I get hung up on the write statement. I am off to look for a
    Unicode-capable file write, I think...

    -Erik
    Erik Bethke, Jan 20, 2005
    #11
  12. Erik Bethke

    Woo-hoo! Everything is working now!

    Thank you everyone!

    The TWO problems I had:

    1) I needed to save my XML file in the first place with this code:
    f = codecs.open(paths[0], 'w', 'utf8')
    2) I needed to download the UNICODE version of wxPython, duh.

    So why are there non-UNICODE versions of wxPython??? To save memory or
    something???

    Thank you all!

    Best!
    -Erik
    Erik Bethke, Jan 20, 2005
    #12
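The first fix can be sketched in Python 3, where the built-in `open` takes an `encoding` argument and plays the role that `codecs.open(paths[0], 'w', 'utf8')` plays in the post. The temporary path and `l1_word` are hypothetical stand-ins for `paths[0]` and the text control's value.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

l1_word = "\uc5b4\ub155\ud558\uc138\uc694!"   # stand-in for self.t1.GetValue()

# Same hand-built layout as in the earlier posts.
layout = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          "<Vocab>\n"
          " <Word L1='" + l1_word + "'></Word>\n"
          "</Vocab>")

path = os.path.join(tempfile.mkdtemp(), "vocab.xml")
# The file object encodes text to UTF-8, matching the XML declaration.
with open(path, "w", encoding="utf-8") as f:
    f.write(layout)

# The saved file now parses, and the attribute round-trips unchanged.
parsed = ET.parse(path).getroot().find("Word").get("L1")
```

This keeps the string-concatenation approach from the thread, so a value containing a quote, `<` or `&` would still need escaping by hand.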
  13. Jarek Zgoda

    Erik Bethke wrote:

    > So why are there non-UNICODE versions of wxPython??? To save memory or
    > something???


    Win95, Win98, WinME have problems with unicode. GTK1 does not support
    unicode at all.

    --
    Jarek Zgoda
    http://jpa.berlios.de/ | http://www.zgodowie.org/
    Jarek Zgoda, Jan 20, 2005
    #13
  14. Jarek Zgoda wrote:

    >> So why are there non-UNICODE versions of wxPython??? To save memory or
    >> something???

    >
    >
    > Win95, Win98, WinME have problems with unicode.


    This problem can be solved - on W9x, wxPython would have to
    pass all Unicode strings to WideCharToMultiByte, using
    CP_ACP, and then pass the result to the API function.

    Regards,
    Martin
    Martin v. Löwis, Jan 20, 2005
    #14
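A rough Python analogue of that conversion, offered only as an illustration and not as a description of what wxPython does internally: `str.encode` with a named code page stands in for `WideCharToMultiByte`, and the choice of `cp1252` and `cp949` as candidate ANSI code pages is an assumption. It shows that the approach preserves only text the active code page can represent.

```python
word = "\uc5b4\ub155\ud558\uc138\uc694!"

# A Western European code page has no Hangul syllables, so with the
# "replace" error handler each one becomes a question mark.
lossy = word.encode("cp1252", errors="replace")

# A Korean code page can represent the same text, so it survives a round trip.
korean_bytes = word.encode("cp949")
restored = korean_bytes.decode("cp949")

print(lossy, restored == word)
```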
  15. wxPython unicode/ansi builds [was Re: ElementTree cannot parse UTF-8 Unicode?]

    Martin v. Löwis wrote:
    > Jarek Zgoda wrote:
    >>> So why are there non-UNICODE versions of wxPython??? To save memory or
    >>> something???


    Robin Dunn has an explanation here:

    http://wiki.wxpython.org/index.cgi/UnicodeBuild

    ... which is the first hit from a Google search on
    "wxpython unicode build".

    Also, from the wxPython downloads page:

    "There are two versions of wxPython for each of the supported
    Python versions on Win32. They are nearly identical, except one
    of them has been compiled with support for the Unicode version of
    the platform APIs. If you don't know what that means then you
    probably don't need the Unicode version, get the ANSI version
    instead. The Unicode version works best on Windows NT/2000/XP. It
    will also mostly work on Windows 95/98/Me systems, but it is
    based on a Microsoft hack called MSLU (or unicows.dll) that
    translates unicode API calls to ansi API calls, but the coverage
    of the API is not complete so there are some difficult bugs
    lurking in there."

    Steve
    Stephen Waterbury, Jan 20, 2005
    #15
