Transforming XML containing Asian characters?

mikeyjudkins

I have an XML file containing localized strings in 9 languages, encoded
in Unicode (UTF-8). I'm trying to transform this XML document with XSLT
(Apache Xalan) to selectively render localized strings depending on the
user's selected language.

The problem I'm running into is that when the XML document is run
through the XSLT stylesheet, all European special characters (umlauts,
accents, etc.) are converted to HTML character entities, as expected,
but the Asian characters show up as question marks in the page source.
It seems the XSLT engine either does not know how to convert the Asian
Unicode strings to usable character entities, or converts them
incorrectly to something the browser cannot understand.

As an example, below is the output I get using the simplest possible
XSLT to take the localized XML and write it to a UTF-8 encoded HTML
file. Note the two lines of question marks, which appear correctly as
Japanese and Chinese in the XML source.

French
Français
Französisch
Français
Frans
Francese
???????????????
??????
Francés

I've been banging my head against my desk for a while now. Any ideas
thankfully accepted!
 
mikeyjudkins

For some reason when I set the xsl:output encoding attribute to
"UTF-8", Xalan does not convert Asian symbols into numeric entity
codes. It keeps the symbols in Unicode. But it does convert European
double byte characters to numeric entities. Is there any way around
this without changing the xsl:output encoding? I need it to convert all
double byte characters (including Chinese, Japanese and European) into
entity codes, while still maintaining a UTF-8 output encoding.

Is this possible?
 
David Carlisle

> For some reason when I set the xsl:output encoding attribute to
> "UTF-8", Xalan does not convert Asian symbols into numeric entity
> codes. It keeps the symbols in Unicode. But it does convert European
> double byte characters to numeric entities. Is there any way around
> this without changing the xsl:output encoding? I need it to convert all
> double byte characters (including Chinese, Japanese and European) into
> entity codes, while still maintaining a UTF-8 output encoding.
>
> Is this possible?

Specify US-ASCII as the output encoding. The end result is then as you
wish: the file itself will be ASCII encoded (which is the same as UTF-8
encoding for ASCII characters) and all non-ASCII characters will be
written as character references.

David
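
A minimal sketch of the suggestion above, assuming the transform is driven through the standard JAXP API (javax.xml.transform) with whatever XSLT processor the JDK supplies, which may not behave exactly like the Xalan build discussed in this thread. The class name, the stylesheet and the sample XML are all hypothetical; the only part taken from the thread is encoding="US-ASCII" on xsl:output.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class AsciiOutputDemo {
    // Hypothetical stylesheet: the part that matters is encoding="US-ASCII".
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" encoding=\"US-ASCII\"/>"
        + "<xsl:template match=\"/\">"
        + "<p><xsl:value-of select=\"/strings/s\"/></p>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Runs the stylesheet over the given XML and returns the serialized bytes
    // decoded as ASCII. Characters outside ASCII cannot be written directly,
    // so the serializer has to emit them as references instead.
    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample input (made up): one accented Latin word and three kanji.
        String xml = "<strings><s>Fran\u00e7ais \u65e5\u672c\u8a9e</s></strings>";
        System.out.println(transform(xml));
    }
}
```

The exact form of the references (named, decimal or hexadecimal) is left to the serializer, so the sketch does not rely on it.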
 
mikeyjudkins

That works, but then any Unicode characters on the page show up as
garbage (since they are not in the US-ASCII range). You are probably
wondering why there would be any Unicode characters on the page if they
are all converted into entities during the XSLT conversion.

This is because not all the content on the page is output by the XSLT
template. We have a JSP page containing a Java bean, which in turn
calls the Xalan engine and writes the results to the page, but that is
only a portion of the larger page, which contains Chinese/Japanese
Unicode characters.

The output encoding set on the XSL template (US-ASCII) then overrides
the encoding of the JSP page (UTF-8), and although we see the correct
conversions within the XSLT output "zone", all the Unicode characters
outside this zone are no longer readable.

What I really need to know is, is there a way to keep the UTF-8 output
encoding and have Xalan still convert Chinese/Japanese characters into
entities? It seems like it should do this by default, as these symbols
are just another character in the Unicode range.

Thanks,

Mike
 
David Carlisle

> That works, but then if I have any Unicode characters on the page, it
> shows up as garble (since these are not in the US-ASCII range). You
> probably are wondering why would there be any unicode characters on the
> page if they are all converted into entities during the XSLT
> conversion.

Character references (which are not entity references, but yes).

> This is because not all the content on the page is being output by the
> XSLT template (we have a JSP page containing a java bean which in turn
> calls the Xalan engine and outputs the results to the page) but it is
> only a portion of the larger page, which contains Chinese/Japanese
> Unicode characters.

So in that case, what's the problem with the XSLT-derived portions being
in UTF-8?

> The output type set on the XSL template (US-ASCII) then overrides the
> encoding of the jsp page (UTF-8) and although we see the correct
> conversions within the XSLT output "zone", all the Unicode characters
> outside this zone are no longer readable.
>
> What I really need to know is, is there a way to keep the UTF-8 output
> encoding and have Xalan still convert Chinese/Japanese characters into
> entities?

I'm not sure about Xalan; Saxon has extension attributes on xsl:output
that can control this.

In XSLT 1 I have in the past specified US-ASCII (to get non-ASCII
characters as references) and then just removed the encoding declaration
from the result in a post-process (with sed/perl/whatever). That way the
files will be parsed as UTF-8 and UTF-8 characters will work.

In XSLT 2 (e.g. Saxon 8) you will be able to specify US-ASCII and also
specify that no XML declaration is output; this is explicitly for use
cases such as yours where the output from XSLT needs to be merged with
other things.

> It seems like it should do this by default, as these symbols
> are just another character in the Unicode range.

All XML characters fit this description. If you specify an encoding that
includes the characters (and UTF-8 includes them all) then the normal
behaviour is that the characters are output as character data. (You
indicate that accented letters are being output as numeric references,
which would be surprising, but conformant, behaviour.)
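
For the case where the XSLT result is only a fragment of a larger UTF-8 page, the idea of combining US-ASCII output with a suppressed declaration might look roughly like the sketch below. It assumes the calling bean uses JAXP and overrides the output properties on the Transformer rather than editing the stylesheet; the class name, stylesheet and sample strings are hypothetical, and behaviour may differ between XSLT processors.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class FragmentMergeDemo {
    // Hypothetical stylesheet with no xsl:output element; the output
    // settings are supplied from the Java side instead.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"/\">"
        + "<span><xsl:value-of select=\"/strings/s\"/></span>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Returns an ASCII-only fragment with no XML declaration, so it carries
    // no encoding claim of its own and can sit inside a UTF-8 page.
    static String renderFragment(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String fragment =
            renderFragment("<strings><s>\u65e5\u672c\u8a9e</s></strings>");
        // The surrounding page keeps its own raw UTF-8 text (made-up sample).
        String page = "<div>\u4e2d\u6587</div>" + fragment;
        System.out.println(page);
    }
}
```

Because the fragment is pure ASCII, it is also valid UTF-8, so the page-level encoding stays whatever the JSP declares.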
 
