doc.toxml() gives ASCII encoding error

Discussion in 'Python' started by Jim Hefferon, Feb 18, 2004.

  1. Jim Hefferon

    Hello,

    I'm having trouble with .xml files that have non-ascii characters.
    Here is a small example.

    .................................
    #!/usr/bin/python2.2
    import sys, os, os.path, re
    import xml.dom.minidom

    doc=xml.dom.minidom.parse(sys.argv[1])
    print doc.toxml()
    ................................

    On an .xml that contains only ascii characters, it works just fine.
    But in one of my documents is the string
    <name>Martin Schröder</name>
    and running the above script on that file gives:
    Traceback (most recent call last):
      File "/home/web/catalogue_read.py", line 6, in ?
        print doc.toxml()
    UnicodeError: ASCII encoding error: ordinal not in range(128)

    I had the idea that the parser reads the xml declaration in the .xml
    file (it is UTF-8), encodes the text parts into whatever is the
    internal representation for unicode, and then .toxml sends it back out
    again as a python unicode string. But I can't reconcile that idea
    with this outcome.
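The mental model described above can be checked with a minimal sketch. It assumes Python 3 (not the Python 2.2 used in this thread), where `str` is the Unicode text type, and the sample document is made up for illustration.

```python
import xml.dom.minidom

# Hypothetical sample document: UTF-8 bytes with an XML declaration.
# \u00f6 is the o-umlaut from the name in the post.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<name>Martin Schr\u00f6der</name>"
).encode("utf-8")

doc = xml.dom.minidom.parseString(xml_bytes)

# The parser decodes the bytes; text nodes hold Unicode strings.
text = doc.documentElement.firstChild.data
print(type(text).__name__, len(text))

# With no encoding argument, toxml() returns a Unicode string, not bytes.
out = doc.toxml()
print(type(out).__name__)
```

Under these assumptions, both the text node and the result of `toxml()` are Unicode strings, so any failure has to come from a later step that converts the text to bytes.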

    I'm simply lost; can anyone tell me what (no doubt clueless) thing
    that I am doing wrong? I'm running under Fedora, so I have Python 2.2,
    if that's any help.

    Thanks,
    Jim
     
    Jim Hefferon, Feb 18, 2004

  2. Jim Hefferon wrote:

    > I'm simply lost; can anyone tell me what (no doubt clueless) thing
    > that I am doing wrong?


    You are not doing anything wrong; it's a bug. Try Python 2.3, or
    try PyXML.

    Regards,
    Martin
     
    "Martin v. Löwis", Feb 18, 2004

  3. Jim Hefferon

    "Martin v. Löwis" wrote:
    > You are not doing anything wrong; it's a bug. Try Python 2.3, or
    > try PyXML.

    Thanks. I understand that getting 2.3 to go on Fedora is non-trivial
    (although I recently saw that RPMs are now available, so maybe now is
    my chance).

    I've decided that doc.toxml().encode('UTF-8') is what I want. I have
    to admit that while I have gotten used to thinking of modules as black
    boxes, the XML stuff seems to me to be such a big box that I often am
    not sure just what I want to do. I don't think I have the whole
    infoset thing inside my brain yet.
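The explicit-encode approach mentioned above can be sketched as follows. This is a minimal illustration assuming Python 3 rather than the Python 2.2 from the thread; the sample document is hypothetical, and the `encoding` argument to `toxml()` is shown only as an alternative, not as what the poster used.

```python
import xml.dom.minidom

# Hypothetical sample; \u00f6 is the o-umlaut from the name in the thread.
doc = xml.dom.minidom.parseString(
    "<name>Martin Schr\u00f6der</name>".encode("utf-8")
)

# Approach from the post: serialize to text, then encode explicitly.
encoded = doc.toxml().encode("UTF-8")

# Alternative: let minidom do the encoding, which also records the
# encoding name in the XML declaration of the output.
declared = doc.toxml(encoding="UTF-8")
```

Both results are byte strings, so they can be written to a file opened in binary mode without relying on the default encoding of the output stream.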

    Thanks again,
    Jim
     
    Jim Hefferon, Feb 19, 2004