Eurosymbol in xml document

Hellmut Weber

Hi,
I'm new to this list.

I'm developing a little program using an XML document. So far it's easy
going, but when parsing an XML document which contains the euro symbol
('€') I get an error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\xa4' in
position 11834: character maps to <undefined>

the relevant piece of code is:

from xml.dom.minidom import Document, parse, parseString
...
doc = parse(inFIleName)


leo@brunello usexml $ locale
LANG=de_DE@euro
LC_CTYPE="de_DE@euro"
LC_NUMERIC="de_DE@euro"
LC_TIME="de_DE@euro"
LC_COLLATE="de_DE@euro"
LC_MONETARY="de_DE@euro"
LC_MESSAGES="de_DE@euro"
LC_PAPER="de_DE@euro"
LC_NAME="de_DE@euro"
LC_ADDRESS="de_DE@euro"
LC_TELEPHONE="de_DE@euro"
LC_MEASUREMENT="de_DE@euro"
LC_IDENTIFICATION="de_DE@euro"
LC_ALL=de_DE@euro


any help appreciated

Hellmut
 
Diez B. Roggisch

Hellmut said:
Hi,
i'm new here in this list.

i'm developing a little program using an xml document. So far it's easy
going, but when parsing an xml document which contains the EURO symbol
('€') then I get an error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\xa4' in
position 11834: character maps to <undefined>

the relevant piece of code is:

from xml.dom.minidom import Document, parse, parseString
...
doc = parse(inFIleName)

The contents of the file must be encoded with the proper encoding which is
given in the XML-header, or has to be utf-8 if no header is given.

From the above I think you have a latin1-based document. Does the encoding
header match?
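
A minimal sketch of that rule, written for Python 3 (the thread itself uses Python 2) and using a made-up `<price>` element that is not from the original file. It shows that a document without an encoding declaration parses when the euro sign is UTF-8 encoded, and is rejected when the euro is stored as the single byte 0xA4:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# No XML declaration, so the parser assumes UTF-8.
utf8_doc = "<price>10 \u20ac</price>".encode("utf-8")
text_ok = parseString(utf8_doc).documentElement.firstChild.data

# Same (hypothetical) document, but the euro is the single byte 0xA4,
# as in ISO-8859-15. Without a declaration this is not valid UTF-8.
mislabeled_doc = b"<price>10 \xa4</price>"
try:
    parseString(mislabeled_doc)
    parse_failed = False
except ExpatError:
    parse_failed = True

print(repr(text_ok), parse_failed)
```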

leo@brunello usexml $ locale
LANG=de_DE@euro
LC_CTYPE="de_DE@euro"
LC_NUMERIC="de_DE@euro"
LC_TIME="de_DE@euro"
LC_COLLATE="de_DE@euro"
LC_MONETARY="de_DE@euro"
LC_MESSAGES="de_DE@euro"
LC_PAPER="de_DE@euro"
LC_NAME="de_DE@euro"
LC_ADDRESS="de_DE@euro"
LC_TELEPHONE="de_DE@euro"
LC_MEASUREMENT="de_DE@euro"
LC_IDENTIFICATION="de_DE@euro"
LC_ALL=de_DE@euro

This is irrelevant.

Diez
 
Stephan Diehl

Hello Helmut,
Hi,
i'm new here in this list.

i'm developing a little program using an xml document. So far it's easy
going, but when parsing an xml document which contains the EURO symbol
('€') then I get an error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\xa4' in
position 11834: character maps to <undefined>

First of all, Unicode handling is a little bit difficult when you encounter
it for the first time, but in the end it really makes a lot of sense :)
Please read a Python Unicode tutorial such as
http://www.amk.ca/python/howto/unicode

If you open up a Python interactive prompt, you can try the different
encodings yourself. Encoding the euro sign as latin-1 fails:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in
position 0:

\u20ac is the Unicode code point for the euro sign, so u'\u20ac' is the
Unicode euro sign in Python. The different encode calls translate the
Unicode string into actual encodings.
What you are seeing in your XML document is the iso-8859-15 encoded euro
sign. As Diez already noted, you must make sure that either
1. the whole XML document is encoded in iso-8859-15 (also called Latin-9)
and the encoding header reflects that
or
2. the UTF-8 encoded euro sign is in your XML document.
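
As a rough illustration of the encode calls mentioned above, here is a small Python 3 sketch (the thread uses Python 2, where the literal would be written u'\u20ac'). The variable names are only for illustration:

```python
euro = "\u20ac"  # EURO SIGN

utf8_bytes = euro.encode("utf-8")          # three bytes in UTF-8
latin9_bytes = euro.encode("iso-8859-15")  # a single byte, 0xA4

# ISO-8859-1 (latin-1) has no euro sign, so this encode call fails.
try:
    euro.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(utf8_bytes, latin9_bytes, latin1_ok)
```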

Hope that makes sense

Stephan
 
Robert Bossy

Diez said:
The contents of the file must be encoded with the proper encoding which is
given in the XML-header, or has to be utf-8 if no header is given.

From the above I think you have a latin1-based document. Does the encoding
header match?
If the file is declared as latin-1 and contains a euro symbol, then the
file is actually invalid, since the euro is not defined in iso-8859-1. If
there is no encoding declaration, as Diez already said, the file should
be encoded as utf-8.

Try replacing or adding the encoding declaration with iso-8859-15 (also
known as Latin-9), which is the same as latin-1 with a few changes,
including the euro symbol:

<?xml version="1.0" encoding="iso-8859-15" ?>
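
To see what the declaration changes, here is a hedged Python 3 sketch. The `<price>` element and the helper `price_text` are hypothetical and only serve the illustration; the same bytes are parsed once with each declared encoding:

```python
from xml.dom.minidom import parseString

# Hypothetical content: the euro stored as the single byte 0xA4.
body = b"<price>10 \xa4</price>"

def price_text(declared_encoding):
    """Parse `body` behind an XML declaration naming the given encoding."""
    header = '<?xml version="1.0" encoding="%s"?>' % declared_encoding
    doc = parseString(header.encode("ascii") + body)
    return doc.documentElement.firstChild.data

as_latin9 = price_text("iso-8859-15")  # byte 0xA4 is read as U+20AC (euro)
as_latin1 = price_text("iso-8859-1")   # byte 0xA4 is read as U+00A4 (currency sign)

print(repr(as_latin9), repr(as_latin1))
```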


If your file has a lot of unusual diacritics, you might take a look at the
small differences between latin-1 and iso-8859-15 to make sure
that your file won't be broken:
http://en.wikipedia.org/wiki/ISO_8859-15

Cheers,
RB
 
Diez B. Roggisch

If the file is declared as latin-1 and contains a euro symbol, then the
file is actually invalid, since the euro is not defined in iso-8859-1. If
there is no encoding declaration, as Diez already said, the file should
be encoded as utf-8.

You are right of course - latin1 doesn't contain the euro-symbol.
ISO-8859-15 it is. Dumb me.

Diez
 
Richard Brodie

If the file is declared as latin-1 and contains a euro symbol, then the file is
actually invalid, since the euro is not defined in iso-8859-1.

Paradoxical would be a better description than invalid, if it contains
things that it can't contain. If you decoded iso-8859-15 as if it were
iso-8859-1, you would get u'\xa4' (Currency Sign) instead of the
Euro. From the original error:

"UnicodeEncodeError: 'charmap' codec can't encode character u'\xa4' in
position 11834: character maps to <undefined>"

that seems to be what happened, as you said.
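
The mix-up can be reproduced with plain codecs. This is a Python 3 sketch of one plausible way to arrive at the reported error; it assumes the output was being encoded as ISO-8859-15, which the thread does not confirm:

```python
raw = b"\xa4"  # the euro sign as stored in an ISO-8859-15 file

wrong = raw.decode("iso-8859-1")   # U+00A4 CURRENCY SIGN
right = raw.decode("iso-8859-15")  # U+20AC EURO SIGN

# ISO-8859-15 has no slot for U+00A4, so writing the wrongly decoded
# character back out in that encoding fails (assumed output encoding).
try:
    wrong.encode("iso-8859-15")
    reencode_failed = False
except UnicodeEncodeError as exc:
    reencode_failed = True
    failing_char = exc.object[exc.start:exc.end]

print(repr(wrong), repr(right), reencode_failed)
```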
 
