Strangeness with Japanese, XML, Java

Robert M. Gary · Apr 15, 2005

I'm using JRE 1.5 on Japanese Solaris (SPARC). The JVM reports that its default
character set is EUC-JP.
I'm seeing two strange things when using Japanese character sets...

1) If I write a program that does
System.out.println("$^%$%^^"); // assume those are Japanese characters that are multibyte under EUC-JP
the resulting output looks NOTHING like the characters I typed in.
Apparently the character set being used to read the literal is different
from the default.

2) If I create an XML document using the built-in DOM which contains
elements with values in Japanese, I get strangeness when I transform that
into an XML document. If I do not set the character set in the transformer,
the document will say it's in UTF-8 (the XML header will). However, the
actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
(they know nothing of XML, just character sets), and when I read the
document while telling uconv it is UTF-8, it claims it is invalid UTF-8.
However, if I read it while telling uconv the document is EUC-JP, it says it's
good.
Also, when I change the transformer to use EUC-JP it creates the same
document bit-for-bit (other than changing the XML header to say EUC-JP).
Other character sets (UTC, etc) result in a different document.
So, my conclusion is that by default the XML DOM says it's UTF-8 in the
header, but ALWAYS uses the platform default unless you specify something
else (UTC for example).

Has anyone else seen this??
Here is my transformer...

Document new_document = documentBuilder.parse("japan2.xml");
System.out.println("I just read japan2.xml");
DOMSource new_source = new DOMSource(new_document);
StringWriter new_writer = new StringWriter();
StreamResult new_result = new StreamResult(new_writer);

Properties p = transformer.getOutputProperties();
// try explicit EUC
//p.setProperty(OutputKeys.ENCODING, "EUC-JP");

// try default (EUC)
//p.setProperty(OutputKeys.ENCODING,
//    new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());

// try UTF explicitly
//p.setProperty(OutputKeys.ENCODING, "UTF-8");

transformer.setOutputProperties(p);
Properties p2 = transformer.getOutputProperties();
p2.list(System.out);

transformer.transform(new_source, new_result);

String new_text_doc = new_writer.toString();
System.out.println("XML doc is "+new_text_doc );

Resulting document...
XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
confirmed="true"
invokeId="2"><AlertList><Alert><Name>ja_alert-¤È¤Á¤Ä¤Ê¤Î¤Ë</Name><AffectedObjects
type="Obj"><Obj><Name>ja_mo-¤¢¤¨¤¤¤ª¤¦</Name></Obj></AffectedObjects><Properties><Property><Name>Severity</Name><Value>major</Value></Property><Property><Name>Manager</Name><Value>NetExpert</Value></Property></Properties></Alert></AlertList><AttrList><Attr
name="TOD"><Int32>1112980583</Int32></Attr><Attr
name="DMPAlarmObject"><Str>ja_mo-¤¢¤¨¤¤¤ª¤¦</Str></Attr><Attr
name="CLASS"><Str>NetExpert</Str></Attr><Attr
name="MANAGER"><Str>NetExpert</Str></Attr><Attr
name="DMPAlarmName"><Str>ja_alert-¤È¤Á¤Ä¤Ê¤Î¤Ë</Str></Attr><Attr
name="ARCHIVE_LENGTH"><Int32>0</Int32></Attr><Attr
name="DMPAlarmSeverity"><Str>major</Str></Attr><Attr
name="MsgType"><Str>Alarm</Str></Attr><Attr
name="MGR_PORT_KEY"><Int32>93</Int32></Attr><Attr
name="ARCHIVE_OFFSET"><Int32>0</Int32></Attr></AttrList></GenAlertsReq>

When I try to read it using IBM's ICU character set tool uconv I get the
following...
=> uconv -f UTF-8 ~/test/xml/japan.xml
Conversion to Unicode from codepage failed at input byte position 116.
Bytes: a4 Error: Illegal character found
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true"
invokeId="1"><AlertList><Alert><Name>ja_alert-

However, when I tell it the document is EUC-JP it works...
=> uconv -f EUC-JP ~/test/xml/japan.xml
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true" invokeId=......

So, the document appears to be EUC-JP even though the Java DOM says it's
UTF-8.
-Robert

Soren Kuula · Apr 15, 2005

Hi,

Robert said:
I'm using JRE 1.5 on Japanese Solaris (SPARC). The JVM reports that its default
character set is EUC-JP.
I'm seeing two strange things when using Japanese character sets...

1) If I write a program that does
System.out.println("$^%$%^^"); // assume those are Japanese characters that are multibyte under EUC-JP
the resulting output looks NOTHING like the characters I typed in.
Apparently the character set being used to read the literal is different
from the default.

1) Find out which encoding your Java source editor uses when it saves your
source files.

2) Compile with javac -encoding set to that encoding, then check your result.
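To see why the encoding assumed for the source file matters, here is a minimal sketch of what happens when EUC-JP bytes are decoded under a different charset. The class and method names are made up for illustration, and it assumes the JVM ships the EUC-JP charset (standard JDK builds normally do). The literal is written with Unicode escapes so that the example compiles the same way under any -encoding setting.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original program.
class LiteralEncodingDemo {

    // Encode text with the charset the bytes are really in, then decode
    // those bytes with the charset a tool wrongly assumes.
    static String misdecode(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        Charset eucJp = Charset.forName("EUC-JP"); // assumption: available in this JVM
        String hiragana = "\u3042\u3048";           // two hiragana characters, as escapes
        String garbled = misdecode(hiragana, eucJp, StandardCharsets.ISO_8859_1);

        // Print code points rather than glyphs so the terminal encoding
        // cannot distort the result a second time.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Each two-byte EUC-JP character comes back as two Latin-1 characters (U+00A4 followed by another symbol), which is the same kind of garbage as the "¤¢¤¨" runs in the output posted in the question.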
2) If I create an XML document using the built-in DOM which contains
elements with values in Japanese, I get strangeness when I transform that
into an XML document. If I do not set the character set in the transformer,
the document will say it's in UTF-8 (the XML header will). However, the
actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
(they know nothing of XML, just character sets), and when I read the
document while telling uconv it is UTF-8, it claims it is invalid UTF-8.
However, if I read it while telling uconv the document is EUC-JP, it says it's
good.

How do you serialize your DOMs? I guess you will get
UTF-8-decode(EUC-JP-encode(UTF-8-decode(EUC-JP-encode(literals))))
if you edit in EUC-JP, compile as UTF-8 and run your data through a
Writer that uses the platform default encoding ... that's a mess.

Check that you override the platform default encoding and really use
UTF-8 when you serialize.
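One way to make sure the platform default never gets involved is to hand the transformer an OutputStream instead of a Writer, so that the transformer itself turns characters into bytes using the encoding it declares. The following is only a sketch under that assumption: the class name, the helper names and the tiny sample document are invented for illustration and are not taken from the original program.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; names are illustrative only.
class SerializeAsUtf8 {

    // Build a one-element document containing two hiragana characters.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element name = doc.createElement("Name");
            name.setTextContent("ja_mo-\u3042\u3048");
            doc.appendChild(name);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serialize to bytes. Because the result wraps an OutputStream, the
    // transformer does the character-to-byte encoding itself.
    static byte[] toUtf8Bytes(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toUtf8Bytes(sampleDocument());
        // Decoding explicitly as UTF-8 should give the Japanese text back.
        String roundTrip = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(roundTrip.contains("\u3042\u3048"));
    }
}
```

For a real file, a FileOutputStream can take the place of the ByteArrayOutputStream; the point is that no Writer with an implicit charset sits between the transformer and the bytes.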

Also, when I change the transformer to use EUC-JP it creates the same
document bit-for-bit (other than changing the XML header to say EUC-JP).

The problem is where you serialize the document, not where you construct,
modify or transform it. And possibly in the decoding (by javac) of your
program text literals.

Other character sets (UTC, etc) result in a different document.

Probably the document is read in correctly ... anything other than Unicode
and EUC will not be able to contain all the Japanese, and will break.
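Whether a given charset can hold the text at all can be checked before serializing. A small sketch using the standard CharsetEncoder.canEncode call follows; the class and helper names are invented for illustration, and the charsets are looked up by name on the assumption that the JVM provides them.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; names are illustrative only.
class CanEncodeDemo {

    // True if every character of the text has a representation in the charset.
    static boolean canHold(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String sample = "ja_mo-\u3042\u3048"; // ASCII prefix plus two hiragana
        String[] names = {"US-ASCII", "ISO-8859-1", "EUC-JP", "UTF-8"};
        for (String name : names) {
            System.out.println(name + ": " + canHold(sample, name));
        }
    }
}
```

When a plain Writer meets a character its charset cannot represent, it substitutes a replacement byte (usually '?'), so the Japanese text is silently lost.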

So, my conclusion is that by default the XML DOM says it's UTF-8 in the
header, but ALWAYS uses the platform default unless you specify something
else (UTC for example).

I'm pretty sure the error is where you output the data (you haven't
shown it..)

Has anyone else seen this??

All the time...

Document new_document = documentBuilder.parse("japan2.xml");

Verify, until you are bloody sure, what the encoding of your input
document is, and that it really matches what the header says.
I think a mismatch will not result in an exception or anything, only bad
contents...
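The kind of check uconv performs can also be done from Java with a strict decoder. This is a minimal sketch, not a full XML-aware validator: it only tests whether a byte sequence is well-formed in a given charset. The class and method names are invented here, and EUC-JP is assumed to be available in the JVM.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical demo class; names are illustrative only.
class EncodingCheck {

    // Decode strictly: report malformed or unmappable input instead of
    // silently replacing it, and turn the outcome into a boolean.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Use the file named on the command line, or fall back to the two
        // bytes a4 a2 (the byte a4 is the one uconv rejected as UTF-8).
        byte[] bytes = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : new byte[] {(byte) 0xA4, (byte) 0xA2};
        System.out.println("valid UTF-8:  " + isValid(bytes, StandardCharsets.UTF_8));
        System.out.println("valid EUC-JP: " + isValid(bytes, Charset.forName("EUC-JP")));
    }
}
```

Passing this check does not prove that a file is in a given encoding, since many byte sequences are well-formed in several charsets, but failing it does rule that encoding out.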

System.out.println("I just read japan2.xml");
DOMSource new_source = new DOMSource(new_document);
StringWriter new_writer = new StringWriter();
StreamResult new_result = new StreamResult(new_writer);
Properties p = transformer.getOutputProperties();
// try explicit EUC
//p.setProperty(OutputKeys.ENCODING, "EUC-JP");

// try default (EUC)
//p.setProperty(OutputKeys.ENCODING,
//    new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());

// try UTF explicitly
//p.setProperty(OutputKeys.ENCODING, "UTF-8");

transformer.setOutputProperties(p);
Properties p2 = transformer.getOutputProperties();
p2.list(System.out);

transformer.transform(new_source, new_result);

String new_text_doc = new_writer.toString();
System.out.println("XML doc is "+new_text_doc );

Please show us how it got into that file.

Resulting document...
XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
confirmed="true"

....

Soren

Soren Kuula · Apr 15, 2005

Hi, Robert and myself,

Soren said:
The problem is where you serialize the document, not where you construct,
modify or transform it. And possibly in the decoding (by javac) of your
program text literals.

Probably the document is read in correctly ... anything other than Unicode
and EUC will not be able to contain all the Japanese, and will break.

Sorry, I misunderstood you there .. you mean, the OUTput is identical
except for the header?

I would take that as an indication that whatever you use for serializing
the DOM to a byte sequence (file) does not look at what you set the
transformer to use. You will have to control that elsewhere.

Are you by any chance instantiating your own Writers when serializing?
Have you tried giving them different encoding settings?
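The effect described here can be reproduced in a few lines. When the transformer writes into a StringWriter it only produces characters, so OutputKeys.ENCODING ends up in the XML declaration while the Writer used afterwards decides the actual bytes. The sketch below makes that visible; the class name, helper names and sample document are assumptions made for illustration, and EUC-JP stands in for the platform default.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; names are illustrative only.
class WriterEncodingDemo {

    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element name = doc.createElement("Name");
            name.setTextContent("ja_mo-\u3042\u3048");
            doc.appendChild(name);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Transform into characters only; the encoding name reaches the XML
    // declaration, but no bytes are produced at this point.
    static String transformToString(Document doc, String declared) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, declared);
            StringWriter w = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(w));
            return w.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // The charset of this Writer, not the transformer, determines the bytes.
    static byte[] writeWith(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String xml = transformToString(sample(), "UTF-8");
        byte[] asEuc = writeWith(xml, Charset.forName("EUC-JP")); // stand-in for the platform default
        byte[] asUtf8 = writeWith(xml, StandardCharsets.UTF_8);
        System.out.println("declaration says UTF-8: " + xml.contains("encoding=\"UTF-8\""));
        System.out.println("same bytes either way:  " + Arrays.equals(asEuc, asUtf8));
    }
}
```

If the bytes have to go through your own Writer, construct it with the same charset you pass to OutputKeys.ENCODING so that the declaration and the content agree.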

Soren
