Incorrect parsing of special characters

Dario Di Bella · Jun 17, 2004

Hi all,
I hope someone can help me on this. I need to parse the following XML:

....
<area name="promotore">
<item id="004" code="003" description="attivita promotore">
<![CDATA[» Attività Promotore]]>
</item>
</area>
....

As you can see, I used a CDATA section to include special characters.
Unfortunately, when I parse the file, the "item" element content turns
out to be:

Â» AttivitÃ  Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".

I'm using the javax.xml.parsers.DocumentBuilder parser.

Has anyone got any clue? Thanks.

Dario

Bjoern Hoehrmann · Jun 17, 2004

* Dario Di Bella wrote in comp.text.xml:

As you can see I used the CDATA section to include special characters.
Unfortunately as I parse the file, the "item" element content turns to
be:

Â» AttivitÃ  Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".

I'm using the javax.xml.parsers.DocumentBuilder parser.

Has anyone got any clue? Thanks.

The output seems to be UTF-8, which you are viewing in some application
that assumes the output is encoded as ISO-8859-1 or similar. Could you
elaborate on which kind of problem you are trying to solve? Everything
should be fine as long as the second application supports UTF-8 and
knows that the data is UTF-8 encoded.
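
To illustrate the diagnosis above, here is a minimal sketch (not from the original posts) that reproduces the garbled text by encoding the string as UTF-8 and then decoding those bytes as ISO-8859-1. The class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00bb Attivit\u00e0 Promotore"; // "» Attività Promotore"
        String garbled = misread(original);
        // Print code points rather than glyphs, so the result does not
        // depend on the encoding of the console itself.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Each non-ASCII character becomes two characters: "»" (bytes C2 BB) turns into "Â»", and "à" (bytes C3 A0) turns into "Ã" followed by a no-break space, which matches the text reported in the first post.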

Dario Di Bella · Jun 18, 2004

Bjoern,
Thanks for your reply.
I am building a JSP tag library to build a dynamic JavaScript menu on
a web page. The JavaScript code is mm_menu.js, shipped with
Dreamweaver. The menu items should be loaded dynamically after the
user logs in, based on the user's permissions (i.e. some menu items will
be enabled, others won't). The common menu configuration is stored
in an XML file. Each <item> tag represents a menu item. The CDATA
section is the text that will be displayed on the page, and hence contains
HTML-specific codes (" ") and some special characters currently
used in the Italian language ("à").

Basically what I want to do is to print that CDATA section in an HTML
page, using a jsp custom tag.

I don't understand your observation regarding a second application:
even if I parse the XML and echo the node content to the system
output (i.e. System.out.println(element.getData())), I obtain the same
wrong text.

Any suggestion?
Thanks and regards.

Dario.

Thomas Weidenfeller · Jun 18, 2004

Dario said:
<![CDATA[» Attività Promotore]]>
Â» AttivitÃ  Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".

Check your charset encoding. This looks very much as if the encoding in
which the XML comes and the encoding used to read it don't match.

/Thomas

Michael Borgwardt · Jun 18, 2004

Dario said:
As you can see I used the CDATA section to include special characters.
Unfortunately as I parse the file, the "item" element content turns to
be:

Â» AttivitÃ  Promotore

i.e. the "Â" character is inserted at the beginning of the string and
the "à" character is translated into "Ã ".

Does your document correctly declare its encoding? If you specify
none, the default is UTF-8 whereas Windows text editors usually
default to CP1252. Trying to parse CP1252-encoded text as UTF-8
could easily lead to the weirdness you describe.
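
As a rough sketch of what a correct declaration buys you (my own example, not code from this thread): the document below is encoded as ISO-8859-1 and says so in its XML declaration, so the parser decodes the bytes correctly. The class name, method name and the shortened one-element document are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DeclaredEncodingDemo {
    // Parse raw bytes; the parser picks the encoding from the XML declaration.
    static String parseItemText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<item><![CDATA[\u00bb Attivit\u00e0 Promotore]]></item>";
        // The bytes really are ISO-8859-1, matching the declaration.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        String text = parseItemText(latin1);
        System.out.println(text.equals("\u00bb Attivit\u00e0 Promotore"));
    }
}
```

Passing the parser a byte stream (or a File) rather than a Reader matters here: with a Reader the bytes have already been decoded by Java, and the declaration in the document is no longer used to pick the encoding.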

Bjoern Hoehrmann · Jun 18, 2004

* Dario Di Bella wrote in comp.text.xml:

Basically what I want to do is to print that CDATA section in an HTML
page, using a jsp custom tag.

To make that work you need to ensure that the HTML document and the
output of your code use the same encoding. It's all about the bytes.
You could try to copy and paste the following fragment into an HTML
document

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<meta http-equiv=Content-Type content="text/html;charset=utf-8">
<title></title>
<p>Â» AttivitÃ  Promotore</p>

and load that into your browser. All your characters should be just
fine. If you change the fragment to

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<meta http-equiv=Content-Type content="text/html;charset=iso-8859-1">
<title></title>
<p>Â» AttivitÃ  Promotore</p>

It breaks. So you need to change the encoding of the surrounding HTML
document (and/or its encoding declaration), transcode the data, or
use character references.
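
The character-reference option could look roughly like the sketch below, which is only an illustration (the class and method names are invented, and it assumes Java 8 or later for codePoints()). Every non-ASCII character is written as a numeric character reference, so the output is plain ASCII and displays the same whatever encoding the page declares. Markup characters such as "<" and "&" are deliberately left alone, on the assumption that the CDATA text is meant to carry HTML.

```java
public class CharRefDemo {
    // Replace each non-ASCII code point with a decimal character reference.
    static String toCharRefs(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.append((char) cp);                    // ASCII passes through
            } else {
                out.append("&#").append(cp).append(';');  // e.g. &#224; for "à"
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCharRefs("\u00bb Attivit\u00e0 Promotore"));
        // prints: &#187; Attivit&#224; Promotore
    }
}
```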

The java.lang.String class, for example, provides a getBytes(...)
method. You can do e.g. the following:

class Foo {
    public static void main(String[] argv) {
        try {
            // U+00F6 ("ö") encoded as UTF-8: bytes 0xC3 0xB6
            System.out.write("\u00f6".getBytes("UTF-8"));
            System.out.println();
            // The same character encoded as ISO-8859-1: single byte 0xF6
            System.out.write("\u00f6".getBytes("ISO-8859-1"));
            System.out.println();
            System.out.write(0x94); /* CP850, but it is not supported... */
            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Depending on your operating system, locale, etc., one of the writes
will most likely show an "ö". On the Windows command line it is most
likely the last write(). If you redirect the output to a file
(`java Foo > file.txt`) and open file.txt in Notepad, you would
notice that it is now the second write() that shows the "ö". You can
also create a new text file containing "ö", go to the command line
prompt and type "type C:\...\file.txt", which would then likely show
"÷" rather than "ö".

HTH...
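
A related sketch, added as an illustration rather than taken from the thread: a PrintStream can be given an explicit encoding, so the bytes written no longer depend on the platform default. The class and method names are made up; the example writes into a byte buffer so the resulting bytes can be inspected directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitEncodingDemo {
    // Print text through a PrintStream that uses the named charset
    // and return the bytes that were actually written.
    static byte[] encodeWithPrintStream(String text, String charsetName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        for (String name : new String[] {"UTF-8", "ISO-8859-1"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encodeWithPrintStream("\u00f6", name)) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(name + ": " + hex.toString().trim());
        }
        // For real console output, the same constructor can wrap System.out:
        // PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
    }
}
```

This prints "UTF-8: C3 B6" and "ISO-8859-1: F6", i.e. the same character as two different byte sequences. Whether the console then shows an "ö" still depends on the console using the same encoding.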

Dario Di Bella · Jun 18, 2004

Bjoern/Michael/Thomas,

I solved this issue by declaring a different encoding ("iso-8859-1"
instead of "utf-8"). Thank you very much for your help, and excuse me
for bothering you with a trivial problem ;-)

Best regards.

Dario.
