Errors parsing Japanese chars

Discussion in 'XML' started by Sriv Chakravarthy, Jul 8, 2003.

  1. I am trying to use the xerces-c SAX parser to parse Japanese characters. I
    have a <?xml... utf-8> line in the XML file. When the parser
    encounters the Japanese characters it throws a UTFDataFormatException.
    I am quite new to XML and I am not sure how to deal with this
    situation.
    Is there a way to parse the Japanese characters? Or should the Japanese
    characters be escaped in the XML file (e.g. &#1234;) for this to work?
    Sriv Chakravarthy, Jul 8, 2003

  2. On Tue, Jul 8, Sriv Chakravarthy inscribed on the eternal scroll:

    > I am trying to use the xerces-c SAX parser to parse Japanese characters. I
    > have a <?xml... utf-8> line in the XML file. When the parser
    > encounters the Japanese characters it throws a UTFDataFormatException.


    That seems to indicate that the Japanese characters are not in fact
    encoded in utf-8, then.
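
    A minimal sketch of that mismatch, using Python's standard-library
    parser rather than xerces-c (so the exception type differs from
    UTFDataFormatException). The Shift_JIS sample is an assumption; the
    thread does not say which encoding the file really uses.

```python
import xml.etree.ElementTree as ET

# The declaration claims UTF-8 in both documents below.
DECL = b'<?xml version="1.0" encoding="UTF-8"?>'

def parses(doc: bytes) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# Same text, two byte encodings. Only one matches the declaration.
mislabelled = DECL + "<t>日本語</t>".encode("shift_jis")  # assumed real encoding
honest = DECL + "<t>日本語</t>".encode("utf-8")

print(parses(mislabelled), parses(honest))
```

    The mislabelled document is rejected because its bytes are not valid
    utf-8, whatever the declaration says.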

    > I am quite new to xml and I am not sure how to deal with this
    > situation.


    Irrespective of xml or not xml, any text file needs to be accompanied
    with information on its encoding if it's to be reliably read. (Modulo
    some heuristics which claim to auto-recognise a limited number of
    encodings[1]).

    > Is there a way to parse the Japanese characters?


    If I've understood what you're reporting, it's not a matter of
    _parsing_ them, it's a matter of understanding them in the first
    place.

    > Or should the Japanese
    > characters be escaped in the XML file (e.g. &#1234;) for this to work?


    Not necessarily. And indeed it's a most inefficient way to represent
    them if a large quantity of CJK text is involved. But yes, it's
    certainly a legal possibility.
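
    To put a rough number on the size cost, here is a small sketch in
    Python (not xerces-c). The three-character sample string is only an
    illustration.

```python
import xml.etree.ElementTree as ET

text = "日本語"

# xmlcharrefreplace turns every non-ASCII character into a decimal
# numeric character reference, leaving a pure-ASCII byte string.
escaped = text.encode("ascii", "xmlcharrefreplace")
direct = text.encode("utf-8")

print(escaped)                     # the references themselves
print(len(escaped), len(direct))   # bytes needed by each representation

# A parser hands back the same characters either way.
root = ET.fromstring(b"<t>" + escaped + b"</t>")
print(root.text == text)
```

    Each reference here takes 8 bytes against 3 bytes per character in
    utf-8, which is why escaping scales badly for large amounts of CJK text.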

    Can you view your data (e.g. as plain text) in a web browser? (Or if
    you haven't got a web browser, try MSIE...) Which character coding
    does the browser need to be set to in order to make sense of the
    Japanese? (You might try its auto recognition options and if it's
    successful, then check to see which encoding it has chosen).
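
    The same kind of check can be sketched in a few lines of Python. The
    candidate list is an assumption (the usual Japanese encodings), and
    like the browser's auto recognition this is only a heuristic: several
    candidates may decode the same bytes without error.

```python
# Assumed shortlist of encodings commonly used for Japanese text.
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]

def plausible_encodings(data):
    """Return the candidates under which `data` decodes without error."""
    ok = []
    for name in CANDIDATES:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok

# Hypothetical sample; in practice read the file with open(path, "rb").
sample = "日本語".encode("shift_jis")
print(plausible_encodings(sample))
```

    A clean decode does not prove the guess is right, so it is still
    worth looking at the decoded text.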

    Then, if the encoding is one that's supported by the parser software,
    just name it as the encoding on the <?xml... thingy.
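
    A rough sketch of both options in Python, assuming an ASCII-compatible
    encoding and a document body with no declaration of its own. The
    helper names are hypothetical. The second helper is a fallback that
    the thread does not mention: if the parser lacks the encoding,
    transcode the data so that a utf-8 declaration becomes true.

```python
import codecs
import xml.etree.ElementTree as ET

def with_declaration(body, encoding):
    """Prepend an XML declaration naming `encoding` to an undeclared body."""
    codecs.lookup(encoding)  # raises LookupError if Python does not know the name
    decl = '<?xml version="1.0" encoding="{}"?>\n'.format(encoding)
    return decl.encode("ascii") + body

def to_utf8_document(body, encoding):
    """Fallback: re-encode the body as UTF-8 and declare it as such."""
    return with_declaration(body.decode(encoding).encode("utf-8"), "UTF-8")

body = "<t>日本語</t>".encode("shift_jis")   # assumed real encoding

# Option 1: label the bytes as what they are, for a parser that supports it.
labelled = with_declaration(body, "Shift_JIS")

# Option 2: transcode, so any parser that reads UTF-8 can take it.
root = ET.fromstring(to_utf8_document(body, "shift_jis"))
print(root.text)
```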

    Hope this helps.

    [1] Or of course the BOM, if you know for a fact that it's
    a Unicode encoding that you're dealing with.
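
    For illustration, a BOM check might look like the following Python
    sketch. The function name is hypothetical; the BOM constants come from
    the standard codecs module.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF8 + b"<t/>"))
print(sniff_bom(b"<t/>"))
```

    A missing BOM says nothing either way, since utf-8 files are often
    written without one.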
    Alan J. Flavell, Jul 8, 2003
