Parsing unicode (devanagari) text with xml.dom.minidom

Discussion in 'Python' started by rparimi@gmail.com, Mar 8, 2009.

  1. Guest

    Hello,

    I am trying to process an xml file that contains unicode characters
    (see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
    entire content of the website into an xml file. Using
    xml.dom.minidom, I wrote a few lines of python code to parse out the
    xml file, but am stuck with the following error:

    >>> import xml.dom.minidom
    >>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
    >>> titles = dom.getElementsByTagName("title")
    >>> for title in titles:
    ...     print "childNode = ", title.childNodes
    ...
    childNode = [<DOM Text node "Sanskrit N...">]
    childNode = [<DOM Text node "Sanskrit N...">]
    childNode = []
    childNode = []
    childNode = [<DOM Text node "1-1-1">]
    childNode = Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)
    >>>


    Python exited when it was trying to parse the following node:
    <title>अनॠ</title>

    The xml header tells me that the document is UTF-8:
    <?xml version="1.0" encoding="UTF-8"?>

    I am running Python 2.5.1 on Mac OS X 10.5.6 and my locale settings are
    as below:
    $ locale
    LANG="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_CTYPE="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_ALL=


    I googled around for similar errors, and tried using unicode but that
    didn't help either:
    >>> foo = unicode(titles[5].childNodes)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)

    I'm a novice with unicode, and am not sure about how best to
    handle the unicode text I'm dealing with (devanagari). Any
    suggestions will be helpful.

    Thanks
     
    , Mar 8, 2009
    #1

  2. wrote:
    > I am trying to process an xml file that contains unicode characters
    > (see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
    > entire content of the website into an xml file. Using
    > xml.dom.minidom, I wrote a few lines of python code to parse out the
    > xml file, but am stuck with the following error:
    >
    > >>> import xml.dom.minidom
    > >>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
    > >>> titles = dom.getElementsByTagName("title")
    > >>> for title in titles:
    > ...     print "childNode = ", title.childNodes
    > ...
    > childNode = [<DOM Text node "Sanskrit N...">]
    > childNode = [<DOM Text node "Sanskrit N...">]
    > childNode = []
    > childNode = []
    > childNode = [<DOM Text node "1-1-1">]
    > childNode = Traceback (most recent call last):
    > File "<stdin>", line 2, in <module>
    > UnicodeEncodeError: 'ascii' codec can't encode characters in position
    > 16-18: ordinal not in range(128)


    That's because you are printing it out to your console, in which case you
    need to make sure it's encoded properly for printing. repr() might also help.
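
    A minimal sketch of what "encoded properly for printing" can look like. The sample document and the helper name `title_texts` are made up for illustration, since the thread doesn't show the real export file. The idea is to pull the character data out of each title's text nodes and encode it explicitly, instead of printing the node list itself.

```python
import xml.dom.minidom

# Hypothetical stand-in for the WordPress export (the real file is not shown).
SAMPLE = (u'<?xml version="1.0" encoding="UTF-8"?>'
          u'<rss><title>Sanskrit Notes</title>'
          u'<title>\u0905\u0928\u0941</title><title/></rss>')


def title_texts(xml_bytes):
    """Return the text of every <title> element as a unicode string."""
    dom = xml.dom.minidom.parseString(xml_bytes)
    texts = []
    for title in dom.getElementsByTagName("title"):
        # Join the character data of the text children rather than
        # printing the node list itself.
        parts = [node.data for node in title.childNodes
                 if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)]
        texts.append(u"".join(parts))
    return texts


texts = title_texts(SAMPLE.encode("utf-8"))

# The ASCII codec cannot represent Devanagari, which is what the
# traceback in the question reports.
try:
    texts[1].encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

for text in texts:
    # Encode explicitly; repr() of the encoded bytes is always ASCII-safe.
    print(repr(text.encode("utf-8")))
```

    On the Python 2.5 from the question, `print text.encode("utf-8")` would write the encoded bytes directly, which should display correctly if the terminal really is UTF-8, as the posted locale suggests.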

    Regarding minidom, you might be happier with the xml.etree package that
    comes with Python 2.5 and later (it's also available for older versions).
    It's a lot easier to use, more memory friendly and also much faster.
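
    For comparison, a rough sketch of the same title extraction with `xml.etree.ElementTree`. The sample document is again invented for illustration, and the variable names are not from the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the WordPress export (the real file is not shown).
SAMPLE = (u'<?xml version="1.0" encoding="UTF-8"?>'
          u'<rss><title>Sanskrit Notes</title>'
          u'<title>\u0905\u0928\u0941</title><title/></rss>')

root = ET.fromstring(SAMPLE.encode("utf-8"))

# el.text is None for an empty element, so fall back to an empty string.
etree_titles = [el.text or u"" for el in root.findall(".//title")]

for text in etree_titles:
    # As with minidom, encode explicitly before writing to the console.
    print(repr(text.encode("utf-8")))
```

    On Python 2.5, `xml.etree.cElementTree` can be imported in place of `xml.etree.ElementTree` with the same interface. The console encoding question stays the same whichever parser is used.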

    Stefan
     
    Stefan Behnel, Mar 8, 2009
    #2

  3. > Regarding minidom, you might be happier with the xml.etree package that
    > comes with Python 2.5 and later (it's also available for older versions).
    > It's a lot easier to use, more memory friendly and also much faster.


    OTOH, choice of XML library is completely irrelevant for the issue at
    hand. If the OP is happy with minidom, we shouldn't talk him into using
    something else.

    Regards,
    Martin
     
    Martin v. Löwis, Mar 8, 2009
    #3
  4. Martin v. Löwis wrote:
    >> Regarding minidom, you might be happier with the xml.etree package that
    >> comes with Python 2.5 and later (it's also available for older versions).
    >> It's a lot easier to use, more memory friendly and also much faster.

    >
    > OTOH, choice of XML library is completely irrelevant for the issue at
    > hand.


    For the described problem, maybe. But certainly not for the application.
    The background was parsing the XML dump of an entire web site, which I
    would expect to be larger than what minidom is designed to handle
    gracefully. Switching to cElementTree before major code gets written is
    almost certainly a good idea here.

    Stefan
     
    Stefan Behnel, Mar 8, 2009
    #4
  5. > For the described problem, maybe. But certainly not for the application.
    > The background was parsing the XML dump of an entire web site, which I
    > would expect to be larger than what minidom is designed to handle
    > gracefully. Switching to cElementTree before major code gets written is
    > almost certainly a good idea here.


    I think minidom is designed to handle the very same documents that
    cElementTree is designed to handle (namely, documents that fit into
    memory).

    Regards,
    Martin
     
    Martin v. Löwis, Mar 8, 2009
    #5
  6. comparing (c)ElementTree and minidom (was: Parsing unicode (devanagari) text with xml.dom.minidom)

    Martin v. Löwis wrote:
    >> The background was parsing the XML dump of an entire web site, which I
    >> would expect to be larger than what minidom is designed to handle
    >> gracefully. Switching to cElementTree before major code gets written is
    >> almost certainly a good idea here.

    >
    > I think minidom is designed to handle the very same documents that
    > cElementTree is designed to handle (namely, documents that fit into
    > memory).


    I do not doubt that a machine running a cElementTree application can handle
    exactly the same documents as a machine with, say, ten times as much memory
    that runs a minidom application. However, when deciding which library to
    choose for a new application, it does matter what hardware you want to use
    it on. And if you can handle multiple times larger documents on the same
    hardware, that might be as much of a reason to choose cElementTree as the
    (likely) shorter and more readable code (which usually translates into
    shorter development and debugging times) and the higher execution speed.
    Honestly, I haven't seen a reason in a while why preferring minidom over
    any of the ElementTree derivatives would be a good idea when starting a new
    application.

    Stefan
     
    Stefan Behnel, Mar 8, 2009
    #6
  7. Guest

    On Mar 8, 12:42 am, Stefan Behnel wrote:
    > wrote:
    > > I am trying to process an xml file that contains unicode characters
    > > (see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
    > > entire content of the website into an xml file. Using
    > > xml.dom.minidom, I wrote a few lines of python code to parse out the
    > > xml file, but am stuck with the following error:

    >
    > > >>> import xml.dom.minidom
    > > >>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
    > > >>> titles = dom.getElementsByTagName("title")
    > > >>> for title in titles:
    > > ...     print "childNode = ", title.childNodes
    > > ...
    > > childNode =  [<DOM Text node "Sanskrit N...">]
    > > childNode =  [<DOM Text node "Sanskrit N...">]
    > > childNode =  []
    > > childNode =  []
    > > childNode =  [<DOM Text node "1-1-1">]
    > > childNode =  Traceback (most recent call last):
    > >   File "<stdin>", line 2, in <module>
    > > UnicodeEncodeError: 'ascii' codec can't encode characters in position
    > > 16-18: ordinal not in range(128)

    >
    > That's because you are printing it out to your console, in which case you
    > need to make sure it's encoded properly for printing. repr() might also help.
    >
    > Regarding minidom, you might be happier with the xml.etree package that
    > comes with Python 2.5 and later (it's also available for older versions).
    > It's a lot easier to use, more memory friendly and also much faster.
    >
    > Stefan


    Thanks for the reply. I didn't realize that printing to console was
    causing the problem. I am now able to parse out the relevant portions
    of my xml file. Will also look at the xml.etree module.

    Regards
     
    , Mar 8, 2009
    #7