xml processing and sys.setdefaultencoding

Discussion in 'Python' started by christof hoeke, Jul 20, 2003.

  1. hi,
    i wrote a small application which extracts a javadoc similar documentation
    for xslt stylesheets using python, xslt and pyana.
    using non-ascii characters was a problem. so i set the default encoding to
    UTF-8 and now everything works (at least it seems so, need to do more
    testing though).

    it may not be the most elegant solution (according to python in a nutshell)
    but it almost seems when doing xml processing it is mandatory to set the
    default encoding. xml processing should almost only work with unicode
    strings and this seems the easiest solution.

    any comments on this? are there better ways to work?

    thanks
    chris
     
    christof hoeke, Jul 20, 2003
    #1

  2. Alan Kennedy Guest

    christof hoeke wrote:

    > i wrote a small application which extracts a javadoc similar
    > documentation
    > for xslt stylesheets using python, xslt and pyana.
    > using non-ascii characters was a problem.


    That's odd. Did your stylesheets contain non-ascii characters? If yes,
    did you declare the character encoding at the beginning of the
    document, e.g.

    <?xml version="1.0" encoding="iso-8859-1"?>

    > so i set the [python] default encoding to
    > UTF-8 and now everything works (at least it seems so, need to do more
    > testing though).


    If you don't put an encoding declaration in your XML documents
    (including XSLT style/transform sheets), then an XML parser would by
    default treat the document content as UTF-(8|16), as the XML standard
    mandates.

    Are you working from XML documents which are stored as strings inside
    a python module? In which case, your special characters will actually
    be encoded in whatever encoding your python module is stored. So you
    might need to put an encoding declaration on your python module:-

    http://www.python.org/peps/pep-0263.html
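
As a minimal sketch of what the PEP 263 declaration does, the snippet below compiles the same Latin-1 bytes twice, once with a coding line and once without. It assumes Python 3, where undeclared source is read as UTF-8 (in the Python 2 of this thread the undeclared default was ASCII). The variable names are made up for illustration.

```python
# Source bytes as an editor would save them in ISO-8859-1 (Latin-1):
# the single byte 0xF6 is "ö" in that encoding.
declared = b'# -*- coding: iso-8859-1 -*-\ntext = "\xf6"\n'
undeclared = b'text = "\xf6"\n'

# With the declaration, the compiler decodes the bytes as Latin-1.
namespace = {}
exec(compile(declared, "<declared>", "exec"), namespace)
text = namespace["text"]

# Without it, the bytes are read with the default source encoding
# and the lone 0xF6 byte is rejected.
try:
    compile(undeclared, "<undeclared>", "exec")
    undeclared_ok = True
except (SyntaxError, UnicodeDecodeError):
    undeclared_ok = False
```

The declaration only tells Python how to decode the module file itself; it says nothing about the encoding of XML files read at run time.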

    > it may not be the most elegant solution (according to python in a
    > nutshell)
    > but it almost seems when doing xml processing it is mandatory to set the
    > default encoding. xml processing should almost only work with unicode
    > strings and this seems the easiest solution.


    It is always recommended to explicitly state the encoding on your XML
    documents. If you don't, then the parser assumes UTF-(8|16). If your
    documents aren't really UTF-(8|16), then you will get seemingly random
    mapping of characters to other characters.
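
A rough illustration of the three cases, using the standard library's expat-based `xml.etree.ElementTree` rather than Pyana (an assumption made only so the sketch is self-contained). With this parser, undeclared Latin-1 bytes are rejected outright instead of being mapped to other characters; other parsers, or byte sequences that happen to be valid UTF-8, may behave differently.

```python
import xml.etree.ElementTree as ET

body = "<doc>öäü</doc>"

# No declaration, and the bytes really are UTF-8: the default applies.
utf8_text = ET.fromstring(body.encode("utf-8")).text

# The declaration matches the actual bytes (ISO-8859-1).
latin1_doc = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
              + body.encode("iso-8859-1"))
latin1_text = ET.fromstring(latin1_doc).text

# No declaration, but the bytes are ISO-8859-1: they are read as UTF-8
# and this parser reports the document as not well-formed.
try:
    ET.fromstring(body.encode("iso-8859-1"))
    mismatch_detected = False
except ET.ParseError:
    mismatch_detected = True
```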

    > any comments on this? better ways to work


    If you're not dealing specifically with ASCII, then declare your
    encodings, in both your python modules and your xml documents. Find
    out what is the default character set used by your text editor. Find
    out how to change which character set is in use.

    If you create, sell or maintain text editing or processing software,
    make it easy for your users to find out what character encodings are
    in effect.

    HTH,

    --
    alan kennedy
    -----------------------------------------------------
    check http headers here: http://xhaus.com/headers
    email alan: http://xhaus.com/mailto/alan
     
    Alan Kennedy, Jul 20, 2003
    #2

  3. "christof hoeke" writes:

    > using non-ascii characters was a problem. so i set the default encoding to
    > UTF-8 and now everything works (at least it seems so, need to do more
    > testing though).


    Can you please be more precise as to what problem exactly you have
    observed?

    Regards,
    Martin
     
    Martin v. Löwis, Jul 20, 2003
    #3
  4. Re: xml processing and sys.setdefaultencoding (more info)

    hi,
    first, thanks for the info. i need to try the encoding declaration in the
    python module.

    some more details about the problem i had (regarding the posts by Alan and
    Martin):

    the original problem with the app was that the Pyana transformation
    complained about the string "xml" when it came over as unicode. so i used
    str(xml) but that gave the usual "ordinal not in range" error when the xslt
    contained e.g. german umlauts. i had not tried that before...
    setting the default encoding to utf-8 fixed that. the reason is not entirely
    clear to me yet though.

    - the used xslt stylesheets should have been in utf-8 as i did not state an
    encoding explicitly
    - xslt with latin-1 (iso8859-1) encoding should work too though
    - xslt contains german umlauts öäü etc.
    - i did extract parts of the xslt in python strings, yes

    i read the other threads about unicode and also about PEP 0263. i have not
    tried to set the encoding of the python file yet. but sounds promising.
    i am wondering though, if i set the python file encoding to e.g. utf-8 and
    then use a stylesheet with, lets say latin-1 encoding, i still have a
    mismatch, don't i?

    if you are interested in the code, download it from
    http://cthedot.de/pyxsldoc/
    it is my first "bigger" python project, so the code is not the best i guess
    and the version which does not work is still online. i need to put on the
    version with the changed default encoding.

    chris



     
    christof hoeke, Jul 20, 2003
    #4
  5. Re: xml processing and sys.setdefaultencoding (more info)

    "christof hoeke" writes:

    > the original problem with the app was that the Pyana transformation
    > complained about the string "xml" when it came over as unicode. so i used
    > str(xml) but that gave the usual "ordinal not in range" error when the xslt
    > contained e.g. german umlauts.


    At that point, you should have done

    xml = xml.encode("utf-8")

    where you might need to make sure that the string "utf-8" matches the
    encoding= given in the xml header.
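
One way to keep the two in step is to read the encoding name out of the declaration before encoding. This is only a sketch: `encode_for_parser` and the regular expression are hypothetical helpers, not part of Pyana or the standard library, and it falls back to UTF-8 when no encoding is declared.

```python
import re

# Hypothetical helper pattern: captures the name in encoding="..." inside
# an XML declaration at the very start of the text.
_DECLARED_ENCODING = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def encode_for_parser(xml_text):
    """Encode a unicode string using the encoding its XML header declares."""
    match = _DECLARED_ENCODING.match(xml_text)
    encoding = match.group(1) if match else "utf-8"
    return xml_text.encode(encoding)

latin1_bytes = encode_for_parser(
    '<?xml version="1.0" encoding="iso-8859-1"?><a>ö</a>'
)
utf8_bytes = encode_for_parser("<a>ö</a>")
```

The resulting byte string is what would then be handed to the transformation in place of `str(xml)`.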

    > i had not tried that before... setting the default encoding to
    > utf-8 fixed that. the reason is not entirely clear to me yet though.


    For any Unicode object X, str(X) is equivalent to
    X.encode(sys.getdefaultencoding()). Since that defaults to "ascii",
    str(X) is normally the same as X.encode("ascii"), which fails if you
    have non-ASCII in your string.
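
The effect can be reproduced with an explicit encode. The sketch below is written for Python 3, where `str` is already the unicode type, so `py2_style_str` is a hypothetical stand-in for what Python 2's `str()` did to a unicode object.

```python
def py2_style_str(text, default_encoding="ascii"):
    """Hypothetical stand-in for Python 2's str() on a unicode object."""
    return text.encode(default_encoding)

# Pure ASCII text encodes without trouble.
plain = py2_style_str("xml")

# Umlauts fail under the "ascii" default with the familiar message.
try:
    py2_style_str("öäü")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Changing the default to UTF-8 makes the same call succeed, which is
# why sys.setdefaultencoding appeared to fix the problem.
with_utf8_default = py2_style_str("öäü", default_encoding="utf-8")
```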

    > it is my first "bigger" python project, so the code is not the best
    > i guess and the version which does not work is still online. i need
    > to put on the version with the changed default encoding.


    I advise that you get rid of the need to set the default
    encoding. Many users will have set this to a value different from
    "utf-8".

    Regards,
    Martin
     
    Martin v. Löwis, Jul 21, 2003
    #5
