Transforming XML containing Asian characters?

mikeyjudkins · Jun 8, 2005

I have an XML file containing localized strings in 9 languages, encoded
in Unicode (UTF-8). I'm trying to process this XML document via XSLT
(Apache Xalan) to selectively render localized strings depending on a
user's selected language.

The problem I'm running into is that when the XML document is sent
through the XSLT stylesheet, all European special characters (such as
umlauts, accents, etc.) are converted to HTML character entities (as
expected), but the Asian characters are shown as question marks in the
page source. It seems as if the XSLT engine does not know how to
convert the Asian Unicode strings to usable character entities (or it
is incorrectly trying to convert them to something that the browser
then cannot understand).

As an example, below is the output I get using the simplest of XSLT to
take the localized XML and output it to a UTF-8 encoded html file. Note
the 2 lines in question marks, which in the XML source, appear
correctly as Japanese and Chinese.

French
Français
Französisch
Français
Frans
Francese
???????????????
??????
Francés

Been banging my head against my desk for a while now. Any ideas
thankfully accepted!

mikeyjudkins · Jun 8, 2005

For some reason when I set the xsl:output encoding attribute to
"UTF-8", Xalan does not convert Asian symbols into numeric entity
codes. It keeps the symbols in Unicode. But it does convert European
double byte characters to numeric entities. Is there any way around
this without changing the xsl:output encoding? I need it to convert all
double byte characters (including Chinese, Japanese and European) into
entity codes, while still maintaining a UTF-8 output encoding.

Is this possible?

David Carlisle · Jun 8, 2005

> For some reason when I set the xsl:output encoding attribute to
> "UTF-8", Xalan does not convert Asian symbols into numeric entity
> codes. It keeps the symbols in Unicode. But it does convert European
> double byte characters to numeric entities. Is there any way around
> this without changing the xsl:output encoding? I need it to convert all
> double byte characters (including Chinese, Japanese and European) into
> entity codes, while still maintaining a UTF-8 output encoding.
>
> Is this possible?

Specify US-ASCII as the output encoding. The end result is then as you
wish: the file itself will be ASCII encoded (which is the same as UTF-8
encoding for ASCII characters) and all non-ASCII characters will be
accessed by reference.
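
To make that concrete, here is a minimal Java sketch (Java only because
Xalan and the JSP page are Java). The class and method names are
hypothetical and not part of Xalan; the helper merely imitates what a
serializer does for an ASCII encoding. It replaces every non-ASCII
character with a decimal character reference and checks that the result
has identical bytes in US-ASCII and UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Xalan or JAXP.
final class CharRefs {

    /** Replaces every code point above U+007F with a decimal character reference. */
    static String toCharRefs(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);                  // plain ASCII stays as-is
            } else {
                out.append("&#").append(cp).append(';');  // e.g. U+00E7 becomes &#231;
            }
        });
        return out.toString();
    }

    /** True when the text encodes to the same bytes in US-ASCII and UTF-8. */
    static boolean sameBytesInAsciiAndUtf8(String text) {
        return Arrays.equals(
                text.getBytes(StandardCharsets.US_ASCII),
                text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "Francais" with a cedilla, followed by three Japanese characters
        String escaped = toCharRefs("Fran\u00e7ais \u65e5\u672c\u8a9e");
        System.out.println(escaped);
        System.out.println(sameBytesInAsciiAndUtf8(escaped));
    }
}
```

Because the escaped text contains only ASCII characters, it can be
served or embedded as UTF-8 without changing a single byte.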

David

mikeyjudkins · Jun 8, 2005

That works, but then if I have any Unicode characters on the page, it
shows up as garble (since these are not in the US-ASCII range). You
probably are wondering why would there be any unicode characters on the
page if they are all converted into entities during the XSLT
conversion.

This is because not all the content on the page is being output by the
XSLT template (we have a JSP page containing a Java bean which in turn
calls the Xalan engine and outputs the results to the page). The XSLT
output is only a portion of the larger page, which contains
Chinese/Japanese Unicode characters.

The output type set on the XSL template (US-ASCII) then overrides the
encoding of the jsp page (UTF-8) and although we see the correct
conversions within the XSLT output "zone", all the Unicode characters
outside this zone are no longer readable.

What I really need to know is, is there a way to keep the UTF-8 output
encoding and have Xalan still convert Chinese/Japanese characters into
entities? It seems like it should do this by default, as these symbols
are just another character in the Unicode range.

Thanks,

Mike

David Carlisle · Jun 8, 2005

> That works, but then if I have any Unicode characters on the page, it
> shows up as garble (since these are not in the US-ASCII range). You
> probably are wondering why would there be any unicode characters on the
> page if they are all converted into entities during the XSLT
> conversion.

character references (which are not entity references, but yes)

> This is because not all the content on the page is being output by the
> XSLT template (we have a JSP page containing a java bean which in turn
> calls the Xalan engine and outputs the results to the page) but it is
> only a portion of the larger page, which contains Chinese/Japanese
> Unicode characters.

So in that case, what's the problem with the XSLT derived portions being
in utf8?

> The output type set on the XSL template (US-ASCII) then overrides the
> encoding of the jsp page (UTF-8) and although we see the correct
> conversions within the XSLT output "zone", all the Unicode characters
> outside this zone are no longer readable.
>
> What I really need to know is, is there a way to keep the UTF-8 output
> encoding and have Xalan still convert Chinese/Japanese characters into
> entities?

I'm not sure about Xalan; Saxon has extension attributes on xsl:output
that can control this.

In XSLT1 I have in the past specified us-ascii (to get non-ASCII
characters as references) and then just removed the encoding declaration
from the result in a post process (with sed/perl/whatever). That way the
files will be parsed as utf8 and utf8 characters will work.

In XSLT2 (e.g. Saxon 8) you will be able to specify us-ascii and also
specify that no XML declaration is output. This is explicitly for use
cases such as yours, where the output from XSLT needs to be merged with
other things.
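
Since the bean already drives the transform from Java, the same effect
can probably be had through the standard JAXP API instead of a sed/perl
pass. The following is a sketch under stated assumptions, not a tested
Xalan recipe: the class name is hypothetical, an identity transform
stands in for the real stylesheet, and the output properties are set in
code (where they take precedence over xsl:output) rather than in the
stylesheet. The exact form of the references is up to the serializer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical wrapper for illustration; the real bean would load its own stylesheet.
final class AsciiFragment {

    /** Serializes the input as an ASCII-only XML fragment without an XML declaration. */
    static String transformToAsciiFragment(String xml) {
        try {
            // Assumption: identity transform; pass a stylesheet Source to
            // newTransformer(...) to run a real XSLT instead.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            // Characters outside US-ASCII cannot be written directly, so the
            // serializer has to emit them as character references.
            t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
            // No declaration, so nothing in the fragment claims an encoding.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        // Three Japanese characters inside a test element
        System.out.println(transformToAsciiFragment("<s>\u65e5\u672c\u8a9e</s>"));
    }
}
```

The returned string carries no encoding claim of its own, so it can be
written into the surrounding UTF-8 page as-is. With the html output
method the serializer may also add a META content-type element, which
this sketch does not address.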

> It seems like it should do this by default, as these symbols
> are just another character in the Unicode range.

All XML characters fit this description. If you specify an encoding that
includes the characters (and utf8 includes them all) then the normal
behaviour is that the characters are output as character data. (You
indicate that accented letters are being output as numeric references,
which would be surprising, but conformant, behaviour.)
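
Either form means the same thing to whatever parses the result. A small
Java sketch using the standard JAXP DOM parser shows this; the class and
method names are hypothetical. It parses one document that uses
character references and one that uses the literal characters, then
compares the text content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration.
final class RefEquivalence {

    /** Parses the XML string and returns the text content of its root element. */
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String viaReferences = textOf("<s>&#26085;&#26412;&#35486;</s>");
        String viaLiterals = textOf("<s>\u65e5\u672c\u8a9e</s>");
        // Both documents carry the same three characters.
        System.out.println(viaReferences.equals(viaLiterals));
    }
}
```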
