Java sax UTF-8 parsing troubles -- PLEASE HELP...

Discussion in 'XML' started by Aleksandar Matijaca, Aug 30, 2004.

  1. Hi there,

    I am in some need of help. I am trying to use the Apache SAX parser
    to parse a file that has valid UTF-8 characters, but I keep getting a
    sun.io.MalformedInputException error.

    This is my code:

    infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<display_values><currency_display>\u00A5 Japanese Yen"
        + "</currency_display></display_values>";

    // the above is perfectly valid UNICODE symbol for Yen

    XMLReader xr = new org.apache.xerces.parsers.SAXParser();

    xr.setContentHandler(this);
    xr.setErrorHandler(this);

    ByteArrayInputStream bi = new ByteArrayInputStream(infile.getBytes());
    Reader reader = new InputStreamReader(bi,"UTF-8");
    InputSource is = new InputSource(reader);
    is.setEncoding("UTF-8");
    xr.parse(is); // CRASHES RIGHT HERE...

    this is the complete trace...

    [8/29/04 22:38:40:756 GMT-05:00] 692c692c SystemErr R
    sun.io.MalformedInputException
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    java.lang.Throwable.<init>(Throwable.java)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    sun.nio.cs.StreamDecoder.read(StreamDecoder.java)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    java.io.InputStreamReader.read(InputStreamReader.java)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    org.apache.xerces.impl.XMLEntityScanner.scanQName(Unknown Source)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown
    Source)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
    Source)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
    Source)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    [8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at
    com.polyorb.tipranavir.pdf.ConvertXML.cparse(ConvertXML.java)


    What am I doing wrong here???

    Thank you for any guidance...

    Regards, Alex.
     
    Aleksandar Matijaca, Aug 30, 2004
    #1

  2. Aleksandar Matijaca wrote:
    : Hi there,

    : I am in some need of help. I am trying to use the Apache SAX parser
    : to parse a file that has valid UTF-8 characters, but I keep getting a
    : sun.io.MalformedInputException error.

    : This is my code:

    : infile = "<?xml version=\"1.0\"
    : encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
    : Yen</currency_display></display_values>";

    The string in Java is not UTF-8, it's UTF-16, so if you pass the "raw
    bytes" of the string to the parser then it isn't UTF-8.

    However, I haven't ever used the specific set of instructions you are
    using, so I don't know for sure that is the problem.
     
    Malcolm Dew-Jones, Aug 30, 2004
    #2

  3. Aleksandar Matijaca wrote:


    > I am in some need of help. I am trying to use the Apache SAX parser
    > to parse a file that has valid UTF-8 characters, but I keep getting a
    > sun.io.MalformedInputException error.
    >
    > This is my code:
    >
    > infile = "<?xml version=\"1.0\"
    > encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
    > Yen</currency_display></display_values>";
    >
    > // the above is perfectly valid UNICODE symbol for Yen
    >
    > XMLReader xr = new org.apache.xerces.parsers.SAXParser();
    >
    > xr.setContentHandler(this);
    > xr.setErrorHandler(this);
    >
    > ByteArrayInputStream bi = new
    > ByteArrayInputStream(infile.getBytes());


    I suspect the problem is here: getBytes() uses the platform's default
    encoding (character set), while you want UTF-8, so try
    infile.getBytes("UTF8")
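
    A minimal sketch of the difference, assuming Java 7 or later for
    StandardCharsets (on the JVM from this thread you would pass the charset
    name as a String instead). The class name GetBytesDemo and the hex helper
    are made up for illustration. It prints the bytes that getBytes produces
    for the Yen sign under the platform default, UTF-8, and ISO-8859-1:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Hex-dump a byte array so the encodings can be compared by eye.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String yen = "\u00A5";
        // Depends on the JVM's default charset, so this line varies by platform.
        System.out.println(Charset.defaultCharset() + ": " + hex(yen.getBytes()));
        // Always the two-byte UTF-8 sequence for U+00A5: C2 A5
        System.out.println("UTF-8: " + hex(yen.getBytes(StandardCharsets.UTF_8)));
        // A single byte (A5), which is not a valid UTF-8 sequence on its own.
        System.out.println("ISO-8859-1: " + hex(yen.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

    If the first line of output is anything other than C2 A5, decoding those
    bytes as UTF-8 will not give back the Yen sign.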


    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
     
    Martin Honnen, Aug 30, 2004
    #3
  4. MARTIN - THIS FIXED IT!!! It was the infile.getBytes("UTF-8")
    Martin and Malcolm, thank you very much for your suggestions.

    All the best, Alex.
    (Toronto)


     
    Aleksandar Matijaca, Aug 30, 2004
    #4
  5. Aleksandar Matijaca wrote:
    > Hi there,


    Hi, I can see you got your problem solved, but are you sure it is
    _really_ doing what you want it to do (and are you aware of what is
    happening)?

    Assuming the type of your parameter infile is String:

    Character encoding is the translation between character strings and byte
    strings. I assume also that whatever made the String infile, it has
    somehow managed to get the right chars out of the bytes in your file.

    I think that this happens:

    > infile = "<?xml version=\"1.0\"
    > encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
    > Yen</currency_display></display_values>";
    >
    > // the above is perfectly valid UNICODE symbol for Yen
    >
    > XMLReader xr = new org.apache.xerces.parsers.SAXParser();
    >
    > xr.setContentHandler(this);
    > xr.setErrorHandler(this);
    > ByteArrayInputStream bi = new
    > ByteArrayInputStream(infile.getBytes());

    1a) (before the fix) getBytes() returns the PLATFORM-DEFAULT-ENCODED
    byte string representation of your String,
    or
    1b) (after the fix) getBytes("UTF-8") returns the UTF-8-ENCODED byte
    string representation of your String.

    2) Your Reader then DEcodes the byte stream into chars again, as UTF-8:
    > Reader reader = new InputStreamReader(bi,"UTF-8");
    > InputSource is = new InputSource(reader);

    3) The setEncoding statement should really have no effect; the
    InputSource does not have the challenge of turning bytes into chars, as
    it already has a Reader (a source of chars, as opposed to a Stream, a
    source of bytes) to extract characters from. In other words, the
    decoding work should have been done already.
    > is.setEncoding("UTF-8");
    > xr.parse(is); // CRASHES RIGHT HERE...


    The crash happens because UTF_DECODE(SOME_OTHER_ENCODING_ENCODE(s)) is
    not necessarily equal to s for a String s.
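
    A small sketch of that mismatch, assuming (this is a guess, the thread
    never says) that the platform default was something like ISO-8859-1.
    The class and method names are invented for the example, and it needs
    Java 7 or later for StandardCharsets. Note that new String(bytes, charset)
    quietly substitutes U+FFFD for malformed input, whereas the converter in
    the 2004 stack trace threw MalformedInputException:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode with the given charset, then decode as UTF-8,
    // which is what the Reader in the original code does.
    static String encodeThenDecodeUtf8(String s, Charset encodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u00A5 Japanese Yen";
        String viaLatin1 = encodeThenDecodeUtf8(s, StandardCharsets.ISO_8859_1);
        String viaUtf8 = encodeThenDecodeUtf8(s, StandardCharsets.UTF_8);
        System.out.println(s.equals(viaLatin1)); // false: byte A5 alone is malformed UTF-8
        System.out.println(s.equals(viaUtf8));   // true: same charset both ways
    }
}
```

    The round trip only gives back s when both directions use the same
    charset, which is what the getBytes("UTF-8") fix arranges.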

    I think you read a file into a String (correctly decoded, maybe by
    coincidence).
    Then you encode that String (a String is just a sequence of chars) into
    bytes and decode that back into chars again. No need for that !!

    I suggest you let an InputStream read from your file, and pass that
    InputStream DIRECTLY to your InputSource. Reason: the parser reading the
    InputSource may be clever enough (I think it is) to UNDERSTAND the <?xml
    encoding="blah"... declaration IN the XML file itself. Then it will
    automagically use the proper decoding.
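
    Roughly like this, as a sketch rather than a drop-in replacement. To keep
    it runnable without extra jars it gets the XMLReader from the JDK's
    SAXParserFactory instead of instantiating org.apache.xerces.parsers.SAXParser
    directly, and a ByteArrayInputStream stands in for a FileInputStream on
    your real file. The class name, the parseText helper, and the ISO-8859-1
    sample document are all made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ByteStreamParseDemo {
    // Parse XML from raw bytes and return the character data the handler saw.
    static String parseText(InputStream in) {
        final StringBuilder text = new StringBuilder();
        try {
            XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            xr.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            // No Reader and no setEncoding(): the parser reads the
            // encoding from the XML declaration in the byte stream.
            xr.parse(new InputSource(in));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<currency_display>\u00A5 Japanese Yen</currency_display>";
        // Stand-in for a file on disk that was saved as ISO-8859-1.
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(parseText(in));
    }
}
```

    The sample is deliberately not UTF-8, to show that the declared encoding
    is what gets used when the parser is handed bytes.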

    If that fails, you may try opening a Reader on an InputStream from the
    file and supplying the encoding yourself (taking the risk that one day you
    will prefer to write your XML files in some other encoding, and your
    program will not work anymore).

    Anyway encoding a String into bytes and then back to a source of chars
    (a Reader) only adds confusion.
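
    If the XML really does start life as a String, a StringReader skips the
    byte step entirely. A sketch under the same assumptions as before: the
    JDK's SAXParserFactory is used so it runs without the Xerces jar, and the
    class and helper names are invented:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class StringParseDemo {
    // Parse XML held in a String and return the character data the handler saw.
    static String parseText(String xml) {
        final StringBuilder text = new StringBuilder();
        try {
            XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            xr.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            // The String is already chars, so no getBytes() and no charset are involved.
            xr.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";
        System.out.println(parseText(infile));
    }
}
```

    With a character stream the platform default encoding never comes into
    play, so the same code behaves identically on every machine.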

    Soren
     
    Soren Kuula, Aug 31, 2004
    #5
  6. Yes, actually, the string does have some UTF-8 characters which I am indeed
    expecting. I am expecting a combination of Yen currency characters, British
    pounds etc... This is an XML stream that needs to be parsed, modified, and
    sent to FOP for PDF generation.

    I have always dealt with SAX parsing with plain Strings, and
    that has always worked; however, I really did get stuck on this one...


    Regards, Alex.

     
    Aleksandar Matijaca, Sep 1, 2004
    #6
