Errors parsing Japanese chars

Sriv Chakravarthy · Jul 8, 2003

I am trying to use the xerces-c SAX parser to parse Japanese characters. I
have a <?xml... utf-8> line in the XML file. When the parser
encounters the Japanese characters it throws a UTFDataFormatException.
I am quite new to XML and I am not sure how to deal with this
situation.
Is there a way to parse the Japanese characters? Or should the Japanese
characters be escaped in the XML file (e.g. &#1234;) for this to work?

Alan J. Flavell · Jul 8, 2003

I am trying to use the xerces-c SAX parser to parse Japanese characters. I
have a <?xml... utf-8> line in the XML file. When the parser
encounters the Japanese characters it throws a UTFDataFormatException.

That seems to indicate that the Japanese characters are not in fact
encoded in utf-8, then.
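
One way to test that suspicion before involving the parser at all is to check whether the raw bytes of the file are well-formed UTF-8. The following is only a minimal sketch: the function name is_valid_utf8 is made up for illustration and is not part of xerces-c, and it assumes the whole file has already been read into a std::string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a xerces-c API): returns true if `bytes` is
// well-formed UTF-8. Rejects stray or missing continuation bytes, overlong
// forms, surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {  // plain ASCII, one byte
            ++i;
            continue;
        }
        std::size_t len = 0;
        unsigned long cp = 0;
        if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // continuation byte in lead position, or 0xF8..0xFF

        if (i + len > n) return false;  // sequence runs past the end of the data
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len == 2 && cp < 0x80) return false;      // overlong encoding
        if (len == 3 && cp < 0x800) return false;     // overlong encoding
        if (len == 4 && cp < 0x10000) return false;   // overlong encoding
        if (cp > 0x10FFFF) return false;              // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}
```

If this returns false for a file whose declaration says utf-8, the text was most likely saved in some other encoding (Shift_JIS and EUC-JP are common for Japanese), which would fit the exception being reported.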

I am quite new to xml and I am not sure how to deal with this
situation.

Irrespective of xml or not xml, any text file needs to be accompanied
with information on its encoding if it's to be reliably read. (Modulo
some heuristics which claim to auto-recognise a limited number of
encodings[1]).

Is there a way to parse the Japanese characters?

If I've understood what you're reporting, it's not a matter of
_parsing_ them, it's a matter of understanding them in the first
place.

Or should the Japanese
characters be escaped in the XML file (e.g. &#1234;) for this to work?

Not necessarily. And indeed it's a most inefficient way to represent
them if a large quantity of CJK text is involved. But yes, it's
certainly a legal possibility.
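
To make the escaping option concrete, here is a small sketch that turns every non-ASCII code point into a decimal numeric character reference. The name escape_non_ascii is hypothetical, and the sketch assumes the text has already been decoded into code points (a std::u32string); it does not escape markup characters such as < or &.

```cpp
#include <string>

// Hypothetical helper: replaces each non-ASCII code point with a decimal
// numeric character reference (&#N;) and copies ASCII through unchanged.
// Markup characters such as '<' and '&' are NOT escaped here.
std::string escape_non_ascii(const std::u32string& text) {
    std::string out;
    for (const char32_t cp : text) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else {
            out += "&#";
            out += std::to_string(static_cast<unsigned long>(cp));
            out += ';';
        }
    }
    return out;
}
```

The inefficiency is easy to see from the output: U+65E5 becomes the eight ASCII bytes "&#26085;", where UTF-8 would need three bytes for the same character.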

Can you view your data (e.g. as plain text) in a web browser? (Or if
you haven't got a web browser, try MSIE...) Which character coding
does the browser need to be set to in order to make sense of the
Japanese? (You might try its auto recognition options and if it's
successful, then check to see which encoding it has chosen).

Then, if the encoding is one that's supported by the parser software,
just nominate it as the encoding on the <?xml... declaration.

hope this helps.

[1] or of course the BOM, if you know for a fact that it's
a unicode encoding that you're dealing with.
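
As a rough illustration of the BOM check, the sketch below looks at the first few bytes of a file and reports which Unicode encoding form the BOM suggests. The function name detect_bom and the returned labels are made up for this example; an empty result only means no BOM was found, not that the data isn't Unicode.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: inspects the leading bytes for a Unicode byte order
// mark and returns a label for the encoding form, or "" if no BOM is present.
std::string detect_bom(const std::string& bytes) {
    const auto starts_with = [&bytes](const char* sig, std::size_t len) {
        return bytes.size() >= len && bytes.compare(0, len, sig, len) == 0;
    };
    if (starts_with("\xEF\xBB\xBF", 3)) return "UTF-8";
    // The UTF-32 checks must come before UTF-16: FF FE 00 00 also begins
    // with the UTF-16LE BOM (FF FE). A UTF-16LE file whose first character
    // is U+0000 would be misreported, which is rare in text files.
    if (starts_with("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (starts_with("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (starts_with("\xFF\xFE", 2)) return "UTF-16LE";
    if (starts_with("\xFE\xFF", 2)) return "UTF-16BE";
    return "";
}
```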

