Java XML parsing: euro signs disappear

flm

I've got an XML document that contains euro signs and looks like this:

<?xml version="1.0" encoding="utf-8"?>
<merchant id="52">
<product
offerid="03543068131"
deliverycost="6,90 €"
/>
....

I use this bit of Java (JDK 1.4.2) code to parse it:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse( file_ );

The problem is that the euro signs are transformed into the character '?'
(printing the value of getAttribute( "deliverycost" ) gives ? on a
UTF-8 terminal).

Thanks for any help,
FL
 
Arnaud Berger

Hi,

Try using the encoding "ISO-8859-15".

Regards,

Arnaud
 
flm

No, the XML file is UTF-8 encoded. What matters is the parser output.
After some googling I found out that the euro sign is not part of the
ISO-8859-1 charset, which is the one used by the VM on my system, so I
guess it's normal that the parser changes the euro signs into '?'.
The problem now is: how can I change this behavior?
Calling java with -Dfile.encoding=UTF-8 doesn't change anything.

Regards,
FLM
 
Arnaud Berger

Sorry, but if the file is correctly "UTF-8 encoded", then there can't be any
euro symbol in it....
This is because this symbol isn't part of UTF-8.

Regards,

Arnaud
 
flm

The euro symbol is Unicode code point U+20AC (20ac in hex), which UTF-8
encodes as the bytes e2 82 ac. That's what I've got in my XML. Anyway,
if I replace it with & # 8364; (without the spaces) or & # x20ac;
I get the same behaviour.

regards,
FLM
 
Thomas Weidenfeller

Arnaud said:
Sorry, but if the file is correctly "UTF-8 encoded", then there can't be any
euro symbol in it....
This is because this symbol isn't part of UTF-8.

Of course it is. UTF-8 can encode every Unicode character, and both the
euro sign and even the stillborn ECU have a place in Unicode.
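
As a minimal sketch of that point (the class name EuroEncodingDemo and the helper hexBytes are made up for illustration and are not from the thread), the following encodes the euro sign in a few charsets. It assumes a modern JDK where java.nio.charset.StandardCharsets is available, which was not the case in JDK 1.4.2, and that the ISO-8859-15 charset is installed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EuroEncodingDemo {
    // Hypothetical helper: encodes s in the given charset and returns
    // the resulting bytes as lowercase hex, separated by spaces.
    public static String hexBytes(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The \u escape keeps this source file independent of its own encoding.
        String euro = "\u20ac";

        // The char value itself (the Unicode code point).
        System.out.println(Integer.toHexString(euro.charAt(0)));            // 20ac

        // UTF-8 can represent it, using three bytes.
        System.out.println(hexBytes(euro, StandardCharsets.UTF_8));         // e2 82 ac

        // ISO-8859-1 has no euro sign, so getBytes substitutes '?' (0x3f).
        System.out.println(hexBytes(euro, StandardCharsets.ISO_8859_1));    // 3f

        // ISO-8859-15 does have one, at 0xa4.
        System.out.println(hexBytes(euro, Charset.forName("ISO-8859-15"))); // a4
    }
}
```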

http://www.unicode.org/

/Thomas
 
Thomas Weidenfeller

flm said:
No, the XML file is UTF-8 encoded. What matters is the parser output.
After some googling I found out that the euro sign is not part of the
ISO-8859-1 charset, which is the one used by the VM on my system, so I
guess it's normal that the parser changes the euro signs into '?'.

No, Java always uses Unicode internally for strings and chars.
ISO-8859-1 is just the charset Java uses for I/O conversions in many,
but not all, places.

The XML parser does not really care about this I/O charset. Unless you
use an InputSource with a Reader, it should take the text encoding
declaration from the input XML byte stream.
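
A rough way to check this without a file or a terminal in the loop is to parse a small UTF-8 document from raw bytes and look at the attribute value in memory. The sketch below is only an illustration: the class name ParseEuroDemo and the method parseDeliveryCost are invented here, the inline document mimics the one from the original post, and StandardCharsets assumes a JDK newer than 1.4.2.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

public class ParseEuroDemo {
    // Hypothetical helper: parses a UTF-8 byte stream (no Reader involved)
    // and returns the deliverycost attribute of the first product element.
    public static String parseDeliveryCost() {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<merchant id=\"52\">"
                + "<product offerid=\"03543068131\" deliverycost=\"6,90 \u20ac\"/>"
                + "</merchant>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The parser reads the encoding from the XML declaration in the bytes.
            Document document = builder.parse(new ByteArrayInputStream(bytes));
            Element product =
                    (Element) document.getElementsByTagName("product").item(0);
            return product.getAttribute("deliverycost");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String cost = parseDeliveryCost();
        // Inspect the last char as a number rather than printing it as text,
        // so the console encoding cannot distort the result.
        char last = cost.charAt(cost.length() - 1);
        System.out.println(Integer.toHexString(last)); // 20ac
    }
}
```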

Your problem is most likely not how you parse the XML data, but how you
try to print out the character. So, would you please grab a debugger and
verify the result of the parser? We here can only guess.

/Thomas
 
flm

Looking at the content of my variable with jdb gives me '6,90 ?'.
Looking at the binary output of a System.out.println with 'od -t x1'
gives me 3f, which is the value for a '?'.
So getAttribute definitely gives me a '?'.
 
John McGrath

Looking at the content of my variable with jdb gives me '6,90 ?'.
Looking at the binary output of a System.out.println with 'od -t x1'
gives me 3f, which is the value for a '?'.
So getAttribute definitely gives me a '?'.

In both cases, you are writing the character to System.out, which results
in the character being encoded to ISO-8859-1. Since the character cannot
be represented in ISO-8859-1, it is converted to "?". Looking at it after
it has been trashed will not tell you anything.

Try printing out the character using:

System.out.println( Integer.toHexString( ch ) );
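
The same idea can be extended to a whole string, and the '?' substitution can be reproduced by printing through a PrintStream with an explicit charset. This is only a sketch: the class PrintEuroDemo and the helpers codeUnits and printedBytes are hypothetical names, and whether a UTF-8 PrintStream displays correctly still depends on the terminal.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintEuroDemo {
    // Hypothetical helper: returns each char of s as hex, space-separated.
    public static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    // Hypothetical helper: prints s through a PrintStream that uses the
    // named charset and returns the bytes actually written, as hex.
    public static String printedBytes(String s, String charsetName) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buf, true, charsetName);
            out.print(s);
            out.flush();
            StringBuilder sb = new StringBuilder();
            for (byte b : buf.toByteArray()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        String cost = "6,90 \u20ac";

        // The string in memory still holds the euro sign.
        System.out.println(codeUnits(cost));                      // 36 2c 39 30 20 20ac

        // Printing with ISO-8859-1 is where the '?' (0x3f) appears.
        System.out.println(printedBytes("\u20ac", "ISO-8859-1")); // 3f
        System.out.println(printedBytes("\u20ac", "UTF-8"));      // e2 82 ac

        // Wrapping System.out with an explicit encoding avoids the substitution.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(cost);
    }
}
```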
 
flm

You are right, I do have 20ac in hex. So the conversion occurs when I
insert my data into the DB. Then I just need to find out how to change
this behaviour.
Thanks.

Regards,
FLM
 
John McGrath

So the conversion occurs when I insert my data into the DB. Then I
just need to find out how to change this behaviour.

This depends on the JDBC driver and the database.
 
