Transforming XML containing Asian characters?

Discussion in 'XML' started by mikeyjudkins@yahoo.com, Jun 8, 2005.

  1. Guest

    I have an XML file containing localized strings in 9 languages, encoded
    in Unicode (UTF-8). I'm trying to transform this XML document via XSLT
    (Apache Xalan) to selectively render localized strings depending on a
    user's selected language.

    The problem I'm running into is that when the XML document is sent
    through the XSLT stylesheet, all European special characters (such as
    umlauts, accents, etc.) are converted to HTML character entities (the
    expected behavior), but the Asian characters are shown as
    question marks in the page source. It seems as if the XSLT engine does
    not know how to convert the Asian Unicode strings to usable character
    entities (or it is incorrectly converting them to something that
    the browser then cannot understand).

    As an example, below is the output I get using the simplest of XSLT to
    take the localized XML and output it to a UTF-8 encoded HTML file. Note
    the two lines of question marks, which appear correctly as Japanese and
    Chinese in the XML source.

    French
    Français
    Französisch
    Français
    Frans
    Francese
    ???????????????
    ??????
    Francés

    I've been banging my head against my desk for a while now. Any ideas
    thankfully accepted!
    , Jun 8, 2005
    #1

  2. Guest

    For some reason, when I set the xsl:output encoding attribute to
    "UTF-8", Xalan does not convert Asian symbols into numeric entity
    codes. It keeps the symbols in Unicode. But it does convert European
    multi-byte characters to numeric entities. Is there any way around
    this without changing the xsl:output encoding? I need it to convert all
    multi-byte characters (including Chinese, Japanese and European) into
    entity codes, while still maintaining a UTF-8 output encoding.

    Is this possible?
    , Jun 8, 2005
    #2

  3. writes:

    > For some reason, when I set the xsl:output encoding attribute to
    > "UTF-8", Xalan does not convert Asian symbols into numeric entity
    > codes. It keeps the symbols in Unicode. But it does convert European
    > multi-byte characters to numeric entities. Is there any way around
    > this without changing the xsl:output encoding? I need it to convert all
    > multi-byte characters (including Chinese, Japanese and European) into
    > entity codes, while still maintaining a UTF-8 output encoding.
    >
    > Is this possible?


    Specify US-ASCII as the output encoding. The end result is then as you
    wish: the file itself will be ASCII encoded (which is the same as UTF-8
    encoding for ASCII characters) and all non-ASCII characters will be
    written as character references.
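
    As a rough illustration of the suggestion above, here is a minimal
    sketch using the JAXP API that ships with the JDK (not necessarily the
    same Xalan build used in this thread). The stylesheet, the <s> input
    element and the class and method names are all made up for the example;
    only the encoding="US-ASCII" attribute on xsl:output comes from the
    advice itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class AsciiOutputSketch {
    // Hypothetical stylesheet: copies the text of <s> into a <p> element and
    // asks the serializer to write the result in US-ASCII.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' encoding='US-ASCII'/>"
      + "<xsl:template match='/'><p><xsl:value-of select='/s'/></p></xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Serializing to a byte stream lets the output encoding take effect.
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes keep this source file itself pure ASCII.
        System.out.println(transform("<s>Fran\u00e7ais \u65e5\u672c\u8a9e</s>"));
    }
}
```

    The result should contain only ASCII bytes, with the accented and
    Japanese characters replaced by references; the exact form of each
    reference (named or numeric) is up to the serializer.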

    David
    David Carlisle, Jun 8, 2005
    #3
  4. Guest

    That works, but then if I have any Unicode characters on the page, they
    show up as garbage (since these are not in the US-ASCII range). You
    are probably wondering why there would be any Unicode characters on the
    page if they are all converted into entities during the XSLT
    conversion.

    This is because not all the content on the page is output by the
    XSLT template. We have a JSP page containing a Java bean, which in turn
    calls the Xalan engine and outputs the results to the page, but that is
    only a portion of the larger page, which contains Chinese/Japanese
    Unicode characters.

    The output type set on the XSL template (US-ASCII) then overrides the
    encoding of the jsp page (UTF-8) and although we see the correct
    conversions within the XSLT output "zone", all the Unicode characters
    outside this zone are no longer readable.

    What I really need to know is, is there a way to keep the UTF-8 output
    encoding and have Xalan still convert Chinese/Japanese characters into
    entities? It seems like it should do this by default, as these symbols
    are just another character in the Unicode range.

    Thanks,

    Mike
    , Jun 8, 2005
    #4
  5. writes:

    > That works, but then if I have any Unicode characters on the page, they
    > show up as garbage (since these are not in the US-ASCII range). You
    > are probably wondering why there would be any Unicode characters on the
    > page if they are all converted into entities during the XSLT
    > conversion.

    Character references (which are not entity references), but yes.

    >
    > This is because not all the content on the page is output by the
    > XSLT template. We have a JSP page containing a Java bean, which in turn
    > calls the Xalan engine and outputs the results to the page, but that is
    > only a portion of the larger page, which contains Chinese/Japanese
    > Unicode characters.


    So in that case, what's the problem with the XSLT derived portions being
    in utf8?


    >
    > The output type set on the XSL template (US-ASCII) then overrides the
    > encoding of the jsp page (UTF-8) and although we see the correct
    > conversions within the XSLT output "zone", all the Unicode characters
    > outside this zone are no longer readable.
    >
    > What I really need to know is, is there a way to keep the UTF-8 output
    > encoding and have Xalan still convert Chinese/Japanese characters into
    > entities?


    I'm not sure about Xalan; Saxon has extension attributes on xsl:output
    that can control this.

    In XSLT 1 I have in the past specified us-ascii (to get non-ASCII
    characters as references) and then just removed the encoding declaration
    from the result in a post-process (with sed/perl/whatever). That way the
    files will be parsed as UTF-8 and UTF-8 characters will work.

    In XSLT 2 (e.g. Saxon 8) you will be able to specify us-ascii and also
    specify that no XML declaration is output; this is explicitly for use
    cases such as yours, where the output from XSLT needs to be merged with
    other things.
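
    For a Java caller, one way to approximate the "us-ascii, no declaration"
    combination is to set the output properties on the JAXP Transformer
    directly. This is only a sketch against the JDK's built-in transformer,
    using an identity transform instead of a real stylesheet; the class and
    method names are invented, and whether the Xalan version in question
    behaves identically is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FragmentOutputSketch {
    // Serializes the input as an ASCII-only fragment with no XML declaration,
    // so it carries no encoding label of its own when merged into a larger page.
    static String toAsciiFragment(String xml) throws Exception {
        // No stylesheet argument: this is the identity transform.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input; Unicode escapes stand in for the Japanese text.
        System.out.println(toAsciiFragment("<p>\u65e5\u672c\u8a9e</p>"));
    }
}
```

    Because the fragment is pure ASCII, it reads the same whether the
    enclosing page is later decoded as US-ASCII or as UTF-8.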

    > It seems like it should do this by default, as these symbols
    > are just another character in the Unicode range.


    All XML characters fit this description. If you specify an encoding that
    includes the characters (and UTF-8 includes them all) then the normal
    behaviour is that the characters are output as character data. (You
    indicate that accented letters are being output as numeric references,
    which would be surprising, but conformant, behaviour.)
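
    The rule described above can be mimicked outside any XSLT engine. The
    following is a hand-rolled sketch, not how Xalan or Saxon implement it:
    a character is written as-is when the target encoding can represent it,
    and as a decimal character reference otherwise. The class and method
    names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CharRefSketch {
    // Replaces every code point the target charset cannot encode with a
    // decimal character reference. Markup characters such as '&' and '<'
    // are NOT escaped here; a real serializer handles those as well.
    static String escapeUnencodable(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                sb.append(ch);                           // representable: keep as character data
            } else {
                sb.append("&#").append(cp).append(';');  // otherwise: numeric reference
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "Fran\u00e7ais \u65e5\u672c\u8a9e";
        System.out.println(escapeUnencodable(s, StandardCharsets.US_ASCII));
        System.out.println(escapeUnencodable(s, StandardCharsets.UTF_8));
    }
}
```

    With US-ASCII as the target, "Fran\u00e7ais" comes back as
    "Fran&#231;ais"; with UTF-8 the string is returned unchanged, since
    every character is representable.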

    >
    > Thanks,
    >
    > Mike
    David Carlisle, Jun 8, 2005
    #5