Strangeness with Japanese, XML, Java

Discussion in 'XML' started by Robert M. Gary, Apr 15, 2005.

  1. I'm using JRE 1.5 on Solaris Japanese (Sparc). The JVM claims its default
    character set is EUC-JP
    I'm seeing two strange things when using Japanese character sets...

    1) If I write a program that does
    // assume those are Japanese characters that are multibyte under EUC-JP
    System.out.println("$^%$%^^" );
    The resulting output looks NOTHING like the characters I typed in.
    Apparently the character set being used to read the literal is different
    from the default.

    2) If I create an XML document using the built-in DOM which contains
    elements with values in Japanese, I get strangeness when I transform that
    into an XML document. If I do not set the character set in the transformer,
    the document (that is, its XML header) will say it's in UTF-8. However, the
    actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
    (which know nothing of XML, just character sets), and when I read the
    document telling uconv it is UTF-8, it claims it is invalid UTF-8.
    However, if I read it telling uconv the document is EUC-JP, it says it's
    good.
    Also, when I change the transformer to use EUC-JP it creates the same
    document bit-for-bit (other than changing the XML header to say EUC-JP).
    Other character sets (UTC, etc) result in a different document.
    So, my conclusion is that by default the XML DOM says it's UTF-8 in the
    header, but ALWAYS uses the platform default unless you specify something
    else (UTC for example).

    Has anyone else seen this??
    Here is my transformer...

    Document new_document = documentBuilder.parse("japan2.xml");
    System.out.println("I just read japan2.xml");
    DOMSource new_source = new DOMSource(new_document);
    StringWriter new_writer = new StringWriter();
    StreamResult new_result = new StreamResult(new_writer);

    Properties p = transformer.getOutputProperties();
    //try explicit EUC
    //p.setProperty(OutputKeys.ENCODING, "EUC-JP");

    //try default (EUC)
    //p.setProperty(OutputKeys.ENCODING,
    // new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());

    //try UTF explicitly
    //p.setProperty(OutputKeys.ENCODING, "UTF-8" );

    transformer.setOutputProperties(p);
    Properties p2 = transformer.getOutputProperties();
    p2.list(System.out);

    transformer.transform(new_source, new_result);

    String new_text_doc = new_writer.toString();
    System.out.println("XML doc is "+new_text_doc );


    Resulting document...
    XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
    confirmed="true"
    invokeId="2"><AlertList><Alert><Name>ja_alert-とちつなのに</Name><AffectedObjects
    type="Obj"><Obj><Name>ja_mo-あえいおう</Name></Obj></AffectedObjects><Properties><Property><Name>Severity</Name><Value>major</Value></Property><Property><Name>Manager</Name><Value>NetExpert</Value></Property></Properties></Alert></AlertList><AttrList><Attr
    name="TOD"><Int32>1112980583</Int32></Attr><Attr
    name="DMPAlarmObject"><Str>ja_mo-あえいおう</Str></Attr><Attr
    name="CLASS"><Str>NetExpert</Str></Attr><Attr
    name="MANAGER"><Str>NetExpert</Str></Attr><Attr
    name="DMPAlarmName"><Str>ja_alert-とちつなのに</Str></Attr><Attr
    name="ARCHIVE_LENGTH"><Int32>0</Int32></Attr><Attr
    name="DMPAlarmSeverity"><Str>major</Str></Attr><Attr
    name="MsgType"><Str>Alarm</Str></Attr><Attr
    name="MGR_PORT_KEY"><Int32>93</Int32></Attr><Attr
    name="ARCHIVE_OFFSET"><Int32>0</Int32></Attr></AttrList></GenAlertsReq>

    When I try to read it using IBM's ICU character set tool uconv I get the
    following...
    => uconv -f UTF-8 ~/test/xml/japan.xml
    Conversion to Unicode from codepage failed at input byte position 116.
    Bytes: a4 Error: Illegal character found
    <?xml version="1.0" encoding="UTF-8"?>
    <GenAlertsReq confirmed="true"
    invokeId="1"><AlertList><Alert><Name>ja_alert-

    However, when I tell it the document is EUC-JP it works...
    => uconv -f EUC-JP ~/test/xml/japan.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <GenAlertsReq confirmed="true" invokeId=......

    So, the document appears to be EUC-JP even though the Java DOM says it's
    UTF-8.
    -Robert
     
    Robert M. Gary, Apr 15, 2005

  2. Robert M. Gary

    Soren Kuula Guest

    Hi
    Robert M. Gary wrote:
    > I'm using JRE 1.5 on Solaris Japanese (Sparc). The JVM claims its default
    > character set is EUC-JP
    > I'm seeing two strange things when using Japanese character sets...


    > 1) If I write a program that does
    > // assume those are Japanese characters that are multibyte under EUC-JP
    > System.out.println("$^%$%^^" );
    > The resulting output looks NOTHING like the characters I typed in.
    > Apparently the character set being used to read the literal is different
    > from the default.


    1) Find out which encoding your Java source editor uses when it saves your
    source files. Check your result.

    2) javac -encoding <whatever you found above> ...java
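
One way to take the editor's encoding out of the picture while testing is to write the literal with Unicode escapes and look at the bytes each charset produces. The following is only a sketch: the class name LiteralBytes and the two sample characters are made up for illustration, and it assumes the JVM ships an EUC-JP charset (the check is skipped if it does not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LiteralBytes {
    // Hypothetical sample text: HIRAGANA LETTER A followed by HIRAGANA LETTER I,
    // written as Unicode escapes so it does not depend on the javac -encoding flag.
    static final String SAMPLE = "\u3042\u3044";

    // Render bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 : " + hex(SAMPLE.getBytes(StandardCharsets.UTF_8)));
        // EUC-JP is an extended charset, so guard against JVMs that lack it.
        if (Charset.isSupported("EUC-JP")) {
            System.out.println("EUC-JP: " + hex(SAMPLE.getBytes(Charset.forName("EUC-JP"))));
        }
    }
}
```

If the bytes printed here are right but a literal typed directly into the source still comes out wrong, the mismatch is most likely between the editor's encoding and the one javac assumed.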

    > 2) If I create an XML document using the built in DOM which contains
    > elements with values in Japanese, I get strangeness when I transform that
    > into an XML document. If I do not set the character set in the transformer
    > the document will say its in UTF-8 (the XML header will). However, the
    > actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
    > (it knows nothing of XML, just character sets) and when I try to read the
    > document when telling uconv it is UTF-8 it claims it is invalid UTF-8.
    > However, if I try to read it telling it the document is EUC-JP it says its
    > good.


    How do you serialize your DOMs? I guess you will have
    UTF-8-decode(EUC-JP-encode(UTF-8-decode(EUC-JP-encode(literals))))
    if you edit in EUC-JP, compile as UTF-8 and run your data through a
    Writer that takes the platform default encoding ... that's a mess :)

    Check that you override the platform default encoding and really go
    UTF-8 when you serialize.
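
A sketch of what "really go UTF-8" could look like with the JAXP transformer: hand it a StreamResult that wraps an OutputStream, so the transformer itself turns characters into bytes using the encoding it declares. The class name SerializeUtf8, the helper names and the sample element are assumptions for illustration, not taken from the code in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializeUtf8 {
    // Serialize a DOM to bytes. Because the result wraps an OutputStream
    // rather than a Writer, the transformer performs the character encoding.
    static byte[] toBytes(Document doc, String encoding) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Build a tiny hypothetical document with one Japanese text node.
    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element name = doc.createElement("Name");
        name.setTextContent("ja_mo-\u3042\u3044");
        doc.appendChild(name);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toBytes(sample(), "UTF-8");
        // Decode with the charset named in the header to confirm the round trip.
        // Note that printing still goes through System.out's own encoding.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Writing the byte array (or a FileOutputStream used in its place) straight to disk keeps the platform default encoding out of the path entirely.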

    > Also, when I change the transformer to use EUC-JP it creates the same
    > document bit-for-bit (other than changing the XML header to say EUC-JP).


    Problem is where you serialize the document, not where you construct,
    modify or transform it. And possibly in the decoding (by javac) of your
    program text literals.

    > Other character sets (UTC, etc) result in a different document.


    Probably the document is read in correctly .. anything other than Unicode
    and EUC will not be able to contain all the Japanese, and will bust.

    > So, my conclusion is that by default the XML DOM says its UTF-8 in the
    > header, but ALWAYS uses the platform default unless you specify something
    > else (UTC for example).


    I'm pretty sure the error is where you output the data (you haven't
    shown it..)

    > Has anyone else seen this??


    All the time...

    > Document new_document = documentBuilder.parse("japan2.xml");


    Verify until you are bloody sure what the encoding is of your input
    document, and that it really matches with what the header says.
    I think a mismatch will not result in an exception or anything, only bad
    contents...
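
For checking the input file from Java rather than with uconv, a strict decoder can be used: it reports the first malformed byte sequence instead of silently substituting a replacement character. This is a rough sketch; the class name EncodingCheck and the method decodesAs are hypothetical, and the fallback bytes a4 a2 are simply the EUC-JP form of one hiragana character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EncodingCheck {
    // Returns true only if every byte sequence is legal in the given charset.
    static boolean decodesAs(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no argument, fall back to two sample bytes (a4 a2), which are
        // the EUC-JP encoding of HIRAGANA LETTER A and not valid UTF-8.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : new byte[] {(byte) 0xa4, (byte) 0xa2};
        System.out.println("valid UTF-8: " + decodesAs(data, StandardCharsets.UTF_8));
    }
}
```

A failed check proves the file is not in that encoding. A passed check is weaker evidence, since the same bytes can be legal in more than one charset.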
    > System.out.println("I just read japan2.xml");
    > DOMSource new_source = new DOMSource(new_document);
    > StringWriter new_writer = new StringWriter();
    > StreamResult new_result = new StreamResult(new_writer);
    > Properties p = transformer.getOutputProperties();
    > //try explicit EUC
    > //p.setProperty(OutputKeys.ENCODING, "EUC-JP");
    >
    > //try default (EUC)
    > //p.setProperty(OutputKeys.ENCODING,
    > // new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());
    >
    > //try UTF explicitly
    > //p.setProperty(OutputKeys.ENCODING, "UTF-8" );
    >
    > transformer.setOutputProperties(p);
    > Properties p2 = transformer.getOutputProperties();
    > p2.list(System.out);
    >
    > transformer.transform(new_source, new_result);
    >
    > String new_text_doc = new_writer.toString();
    > System.out.println("XML doc is "+new_text_doc );


    Please show us how it got into that file.
    > Resulting document...
    > XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
    > confirmed="true"

    ....

    Soren
     
    Soren Kuula, Apr 15, 2005

  3. Robert M. Gary

    Soren Kuula Guest

    Hi, Robert and myself,
    Soren Kuula wrote:

    >> Also, when I change the transformer to use EUC-JP it creates the same
    >> document bit-for-bit (other than changing the XML header to say EUC-JP).

    >
    >
    > Problem is where you serialize the document, not where you construct,
    > modify or transform it. And possibly in the decoding (by javac) of your
    > program text literals.
    >
    >> Other character sets (UTC, etc) result in a different document.

    >
    > Probably the document is read in correctly .. anything other than Unicode
    > and EUC will not be able to contain all the Japanese, and will bust.


    Sorry, I misunderstood you there .. you mean, the OUTput is identical
    except for the header?

    I would take that as an indication that whatever you use for serializing
    the DOM to a byte sequence (file) does not look at what you set the
    transformer to use. You will have to control that elsewhere.

    Are you by any chance instantiating your own Writers when serializing?
    Have you tried giving them different encoding settings?
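
To illustrate why a Writer matters here: when the transformer writes into a Writer it only produces characters, so its encoding setting can affect the XML declaration (and which characters get escaped) but not the bytes. Those are decided by whatever later turns the characters into bytes. The sketch below assumes that setup; WriterEncoding, toText and encode are illustrative names, not anything from the original program.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class WriterEncoding {
    // Transform into a StringWriter: the result is characters, so the
    // declared encoding ends up in the header but no bytes exist yet.
    static String toText(String xml, String declared) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, declared);
        StringWriter w = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(w));
        return w.toString();
    }

    // The bytes are decided here, by the charset handed to the Writer.
    static byte[] encode(String text, Charset cs) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(out, cs);
        w.write(text);
        w.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String text = toText("<Name>ja_mo-\u3042\u3044</Name>", "UTF-8");
        // Same characters, same header, but the byte counts may differ.
        System.out.println("UTF-8 bytes:   " + encode(text, StandardCharsets.UTF_8).length);
        System.out.println("default bytes: " + encode(text, Charset.defaultCharset()).length);
    }
}
```

Printing the string with System.out.println behaves like the second call: the header still claims UTF-8 while the bytes follow the platform default.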

    Soren
     
    Soren Kuula, Apr 15, 2005