Eurosymbol in xml document

Hellmut Weber · Mar 4, 2008

Hi,
I'm new to this list.

I'm developing a small program that uses an XML document. So far it has been
easy going, but when I parse an XML document that contains the euro symbol
('€') I get an error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\xa4' in
position 11834: character maps to <undefined>

the relevant piece of code is:

from xml.dom.minidom import Document, parse, parseString
...
doc = parse(inFIleName)

leo@brunello usexml $ locale
LANG=de_DE@euro
LC_CTYPE="de_DE@euro"
LC_NUMERIC="de_DE@euro"
LC_TIME="de_DE@euro"
LC_COLLATE="de_DE@euro"
LC_MONETARY="de_DE@euro"
LC_MESSAGES="de_DE@euro"
LC_PAPER="de_DE@euro"
LC_NAME="de_DE@euro"
LC_ADDRESS="de_DE@euro"
LC_TELEPHONE="de_DE@euro"
LC_MEASUREMENT="de_DE@euro"
LC_IDENTIFICATION="de_DE@euro"
LC_ALL=de_DE@euro

any help appreciated

Hellmut

Diez B. Roggisch · Mar 4, 2008

Hellmut said:
Hi,
I'm new to this list.

I'm developing a small program that uses an XML document. So far it has been
easy going, but when I parse an XML document that contains the euro symbol
('€') I get an error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\xa4' in
position 11834: character maps to <undefined>

the relevant piece of code is:

from xml.dom.minidom import Document, parse, parseString
...
doc = parse(inFIleName)

The contents of the file must be encoded with the proper encoding which is
given in the XML-header, or has to be utf-8 if no header is given.

From the above I think you have a latin1-based document. Does the encoding
header match?
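
One way to check what the file declares is to read the XML declaration straight from the raw bytes. This is a minimal sketch: `declared_encoding` is a hypothetical helper (not part of minidom), it assumes an ASCII-compatible file, and it ignores byte order marks and UTF-16 input.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(xml_bytes, default="utf-8"):
    """Return the encoding named in the XML declaration, or the default."""
    match = _DECL.match(xml_bytes)
    return match.group(1).decode("ascii") if match else default

print(declared_encoding(b'<?xml version="1.0" encoding="iso-8859-15" ?><a/>'))
print(declared_encoding(b'<a/>'))
```

Note that a latin-1 declaration can never be caught by a failed decode, because every byte value is valid latin-1; the declaration has to be compared against how the file was actually saved.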

leo@brunello usexml $ locale
LANG=de_DE@euro
LC_CTYPE="de_DE@euro"
LC_NUMERIC="de_DE@euro"
LC_TIME="de_DE@euro"
LC_COLLATE="de_DE@euro"
LC_MONETARY="de_DE@euro"
LC_MESSAGES="de_DE@euro"
LC_PAPER="de_DE@euro"
LC_NAME="de_DE@euro"
LC_ADDRESS="de_DE@euro"
LC_TELEPHONE="de_DE@euro"
LC_MEASUREMENT="de_DE@euro"
LC_IDENTIFICATION="de_DE@euro"
LC_ALL=de_DE@euro

This is irrelevant.

Diez

Stephan Diehl · Mar 4, 2008

Hello Hellmut,

Hi,
I'm new to this list.

I'm developing a small program that uses an XML document. So far it has been
easy going, but when I parse an XML document that contains the euro symbol
('€') I get an error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\xa4' in
position 11834: character maps to <undefined>

First of all, Unicode handling is a little difficult when you encounter it
for the first time, but in the end it really makes a lot of sense.

Please read a Python Unicode tutorial such as
http://www.amk.ca/python/howto/unicode

If you open up a Python interactive prompt, you can do the following:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in
position 0:
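
The kind of encode calls that lead to such a traceback can be sketched as follows. This is an illustrative reconstruction in Python 3 syntax, not the original session, and the variable names are assumptions.

```python
euro = u'\u20ac'  # the euro sign as a Unicode string

# Encodings that contain the euro sign produce bytes.
utf8_bytes = euro.encode('utf-8')          # b'\xe2\x82\xac'
latin9_bytes = euro.encode('iso-8859-15')  # b'\xa4'

# latin-1 has no euro sign, so this encode call fails.
try:
    euro.encode('latin-1')
    latin1_error = None
except UnicodeEncodeError as exc:
    latin1_error = exc

print(utf8_bytes, latin9_bytes)
print(latin1_error)
```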

\u20ac is the unicode code point for the Euro sign, so u'\u20ac' is the
unicode euro sign in python. The different encode calls translate the
unicode into actual encodings.
What you are seeing in your XML document is the iso-8859-15 encoded euro
sign. As Diez already noted, you must make sure that either
1. the whole XML document is encoded in iso-8859-15 (Latin-9) and the
encoding header reflects that
or
2. the euro sign in your XML document, like the rest of the document, is
UTF-8 encoded.
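
The second option can be sketched as a re-encoding step. This is only a sketch: `latin9_to_utf8` is a hypothetical helper, and it assumes the file really is iso-8859-15 and that its declaration is spelled exactly as in the `replace` call.

```python
from xml.dom.minidom import parseString

def latin9_to_utf8(xml_bytes):
    """Decode iso-8859-15 bytes and re-encode them as UTF-8."""
    text = xml_bytes.decode("iso-8859-15")
    # Assumption: the declaration uses exactly this spelling and quoting.
    text = text.replace('encoding="iso-8859-15"', 'encoding="utf-8"', 1)
    return text.encode("utf-8")

original = b'<?xml version="1.0" encoding="iso-8859-15"?><p>\xa4</p>'
converted = latin9_to_utf8(original)

# The euro sign is now the three-byte UTF-8 sequence and still parses.
doc = parseString(converted)
print(converted)
print(repr(doc.documentElement.firstChild.data))
```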

Hope that makes sense

Stephan

Robert Bossy · Mar 4, 2008

Diez said:
The contents of the file must be encoded with the proper encoding which is
given in the XML-header, or has to be utf-8 if no header is given.

From the above I think you have a latin1-based document. Does the encoding
header match?

If the file is declared as latin-1 and contains a euro symbol, then the
file is actually invalid, since the euro is not defined in iso-8859-1. If
there is no encoding declaration, as Diez already said, the file should
be encoded as utf-8.

Try replacing the encoding declaration, or adding one, with iso-8859-15
(also known as Latin-9), which is the same as latin-1 with a few changes,
including the euro symbol:

<?xml version="1.0" encoding="iso-8859-15" ?>
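
A minimal sketch of what that declaration changes when parsing with minidom. The `<price>` element and its content are made up for illustration; the only assumption about the real file is that its euro sign is stored as the single byte 0xA4.

```python
from xml.dom.minidom import parseString

# Hypothetical document: the byte 0xA4 is the euro sign in iso-8859-15.
raw = b'<?xml version="1.0" encoding="iso-8859-15" ?><price>10 \xa4</price>'

doc = parseString(raw)
text = doc.documentElement.firstChild.data

# With the matching declaration the parser yields U+20AC, the euro sign.
print(repr(text))
```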

If your file has a lot of unusual diacritics, take a look at the small
differences between latin-1 and iso-8859-15 to make sure that your file
won't be broken:
http://en.wikipedia.org/wiki/ISO_8859-15
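
The differences can also be listed from Python's own codec tables. A small sketch, with `differing_bytes` as a hypothetical helper name:

```python
def differing_bytes(enc_a="iso-8859-1", enc_b="iso-8859-15"):
    """Return (byte value, char in enc_a, char in enc_b) where they differ."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        a, b = raw.decode(enc_a), raw.decode(enc_b)
        if a != b:
            diffs.append((value, a, b))
    return diffs

for value, a, b in differing_bytes():
    print(hex(value), repr(a), repr(b))
```

Only eight byte values differ, so a file that uses none of those eight latin-1 characters reads the same under either declaration.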

Cheers,
RB

Diez B. Roggisch · Mar 4, 2008

If the file is declared as latin-1 and contains a euro symbol, then the
file is actually invalid, since the euro is not defined in iso-8859-1. If
there is no encoding declaration, as Diez already said, the file should
be encoded as utf-8.

You are right of course - latin1 doesn't contain the euro-symbol.
ISO-8859-15 it is. Dumb me.

Diez

Richard Brodie · Mar 4, 2008

If the file is declared as latin-1 and contains a euro symbol, then the file is
actually invalid, since the euro is not defined in iso-8859-1.

Paradoxical would be a better description than invalid, if it contains
things that it can't contain. If you decoded iso-8859-15 as if it were
iso-8859-1, you would get u'\xa4' (Currency Sign) instead of the
Euro. From the original error:

"UnicodeEncodeError: 'charmap' codec can't encode character u'\xa4' in
position 11834: character maps to <undefined>"

that seems to be what happened, as you said.
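
The mix-up can be reproduced with a single byte. This is a sketch in Python 3 syntax; the final encode step is an assumption about where the original error came from (writing the text back out in an iso-8859-15 environment), not something the traceback confirms.

```python
raw = b'\xa4'  # how iso-8859-15 stores the euro sign

as_latin1 = raw.decode('iso-8859-1')   # U+00A4, the currency sign
as_latin9 = raw.decode('iso-8859-15')  # U+20AC, the euro sign

# iso-8859-15 gave up the currency sign to make room for the euro, so
# encoding U+00A4 back into it fails in the charmap-based codec.
try:
    as_latin1.encode('iso-8859-15')
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(repr(as_latin1), repr(as_latin9))
print(encode_error)
```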
