xml processing and sys.setdefaultencoding

christof hoeke

hi,
i wrote a small application which extracts javadoc-like documentation
for xslt stylesheets using python, xslt and pyana.
using non-ascii characters was a problem. so i set the default encoding to
UTF-8 and now everything works (at least it seems so, need to do more
testing though).

it may not be the most elegant solution (according to python in a nutshell)
but it almost seems mandatory to set the default encoding when doing xml
processing. xml processing should work almost exclusively with unicode
strings, and this seems the easiest solution.

any comments on this? are there better ways to work?

thanks
chris
 
Alan Kennedy

christof said:
i wrote a small application which extracts javadoc-like documentation
for xslt stylesheets using python, xslt and pyana.
using non-ascii characters was a problem.

That's odd. Did your stylesheets contain non-ascii characters? If yes,
did you declare the character encoding in the XML declaration at the
beginning of the document?

so i set the [python] default encoding to
UTF-8 and now everything works (at least it seems so, need to do more
testing though).

If you don't put an encoding declaration in your XML documents
(including XSLT style/transform sheets), then an XML parser would by
default treat the document content as UTF-(8|16), as the XML standard
mandates.
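
A minimal sketch of the default described above. It uses the standard library's xml.etree.ElementTree instead of Pyana (an assumption, made only so the example is self-contained), and the sample document is invented. Latin-1 bytes parse correctly when the document declares its encoding; the same bytes without a declaration are read as UTF-8 and rejected.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content containing a German umlaut (u-umlaut).
body = u"<doc>f\u00fcr</doc>"

# Latin-1 bytes WITH a matching encoding declaration.
declared = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
            + body.encode("iso-8859-1"))

# The same Latin-1 bytes WITHOUT a declaration: the parser assumes UTF-8.
undeclared = body.encode("iso-8859-1")

# The declared document decodes to the original Unicode text.
text_declared = ET.fromstring(declared).text

# The undeclared Latin-1 bytes are not valid UTF-8, so parsing fails.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Real UTF-8 bytes need no declaration at all.
text_utf8 = ET.fromstring(body.encode("utf-8")).text
```

With this parser the undeclared Latin-1 input is refused outright; other parsers or other byte sequences may instead yield wrong characters, which is the symptom to look for.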

Are you working from XML documents which are stored as strings inside
a python module? In which case, your special characters will actually
be encoded in whatever encoding your python module is stored. So you
might need to put an encoding declaration on your python module:-

http://www.python.org/peps/pep-0263.html

christof said:
it may not be the most elegant solution (according to python in a
nutshell)
but it almost seems mandatory to set the default encoding when doing
xml processing. xml processing should work almost exclusively with
unicode strings, and this seems the easiest solution.

It is always recommended to explicitly state the encoding on your XML
documents. If you don't, then the parser assumes UTF-(8|16). If your
documents aren't really UTF-(8|16), then you will get seemingly random
mapping of characters to other characters.

christof said:
any comments on this? better ways to work

If you're not dealing specifically with ASCII, then declare your
encodings, in both your python modules and your xml documents. Find
out what is the default character set used by your text editor. Find
out how to change which character set is in use.
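
A short sketch of declaring both encodings, assuming the module is saved as UTF-8 and using xml.etree.ElementTree as a stand-in parser (both are assumptions; the names `label` and `xml_bytes` are invented for illustration). The PEP 263 line covers the Python source, the XML declaration covers the document, and the two do not have to agree because each is decoded to Unicode on its own.

```python
# -*- coding: utf-8 -*-
# The line above is the PEP 263 source encoding declaration. It has to be
# the first or second line of the module and must match how the editor
# actually saves the file.
import xml.etree.ElementTree as ET

# A Unicode literal with non-ASCII characters typed directly in the source.
label = u"öäü"

# A hypothetical XML document that declares a different encoding (Latin-1)
# and is encoded to bytes accordingly.
xml_bytes = (u'<?xml version="1.0" encoding="iso-8859-1"?>'
             u"<title>" + label + u"</title>").encode("iso-8859-1")

# The parser decodes the bytes using the declared encoding, so the result
# is the same Unicode text regardless of the module's own encoding.
parsed = ET.fromstring(xml_bytes).text
same = (parsed == label)
```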

If you create, sell or maintain text editing or processing software,
make it easy for your users to find out what character encodings are
in effect.

HTH,
 
Martin v. Löwis

christof hoeke said:
using non-ascii characters was a problem. so i set the default encoding to
UTF-8 and now everything works (at least it seems so, need to do more
testing though).

Can you please be more precise as to what problem exactly you have
observed?

Regards,
Martin
 
christof hoeke

hi,
first, thanks for the info. i need to try the encoding declaration in the
python module.

some more details about the problem i had (regarding the posts by Alan and
Martin):

the original problem with the app was that the Pyana transformation
complained about the string "xml" when it came over as unicode. so i used
str(xml) but that gave the usual "ordinal not in range" error when the xslt
contained e.g. german umlauts. i had not tried that before...
setting the default encoding to utf-8 fixed that. the reason is not entirely
clear to me yet though.

- the xslt stylesheets used should have been in utf-8, as i did not state an
encoding explicitly
- xslt with latin-1 (iso-8859-1) encoding should work too though
- the xslt contains german umlauts (öäü etc.)
- yes, i did extract parts of the xslt into python strings

i read the other threads about unicode and also about PEP 0263. i have not
tried to set the encoding of the python file yet, but it sounds promising.
i am wondering though: if i set the python file encoding to e.g. utf-8 and
then use a stylesheet with, let's say, latin-1 encoding, i still have a
mismatch, don't i?

if you are interested in the code, download it from
http://cthedot.de/pyxsldoc/
it is my first "bigger" python project, so the code is not the best, i guess,
and the version which does not work is still online. i need to put up the
version with the changed default encoding.

chris
 
Martin v. Löwis

christof hoeke said:
the original problem with the app was that the Pyana transformation
complained about the string "xml" when it came over as unicode. so i used
str(xml) but that gave the usual "ordinal not in range" error when the xslt
contained e.g. german umlauts.

At that point, you should have done

xml = xml.encode("utf-8")

where you might need to make sure that the string "utf-8" matches the
encoding= given in the xml header.

christof hoeke said:
i had not tried that before... setting the default encoding to
utf-8 fixed that. the reason is not entirely clear to me yet though.

For any Unicode object X, str(X) is equivalent to
X.encode(sys.getdefaultencoding()). Since that defaults to "ascii",
str(X) is normally the same as X.encode("ascii"), which fails if you
have non-ASCII in your string.

christof hoeke said:
it is my first "bigger" python project, so the code is not the best
i guess and the version which does not work is still online. i need
to put up the version with the changed default encoding.

I advise that you get rid of the need to set the default
encoding. Many users will have set this to a value different from
"utf-8".

Regards,
Martin
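
A minimal sketch of the advice in this reply: encode explicitly at the point where a byte string is required instead of changing the default encoding. The helper name `to_xml_bytes` and the sample text are invented for illustration, and the explicit `encode("ascii")` call stands in for what str() did implicitly on a Unicode object under the default setting.

```python
# Hypothetical Unicode XML fragment containing German umlauts.
text = u"<doc>\u00f6\u00e4\u00fc</doc>"

# Encoding to ASCII reproduces the "ordinal not in range" failure.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)


def to_xml_bytes(unicode_xml, encoding="utf-8"):
    """Hypothetical helper: encode explicitly, independent of any
    interpreter-wide default. The encoding should match the one named
    in the document's XML declaration, if it has one."""
    return unicode_xml.encode(encoding)


# Works the same on every installation, whatever its default encoding is.
xml_bytes = to_xml_bytes(text)
round_trip = xml_bytes.decode("utf-8")
```

Because the encoding is passed at the call site, the program no longer depends on how a particular user has configured their interpreter.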
 
