Java SAX UTF-8 parsing troubles -- PLEASE HELP...

  • Thread starter Aleksandar Matijaca

Aleksandar Matijaca

Hi there,

I need some help. I am trying to parse, using the Apache SAX parser,
a file that has valid UTF-8 characters, but I keep getting a

sun.io.MalformedInputException error.

This is my code:

infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
    + "<display_values><currency_display>\u00A5 Japanese Yen"
    + "</currency_display></display_values>";

// the above \u00A5 is a perfectly valid Unicode escape for the Yen symbol

XMLReader xr = new org.apache.xerces.parsers.SAXParser();

xr.setContentHandler(this);
xr.setErrorHandler(this);

ByteArrayInputStream bi = new ByteArrayInputStream(infile.getBytes());
Reader reader = new InputStreamReader(bi,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
xr.parse(is); // CRASHES RIGHT HERE...

This is the complete trace:

[8/29/04 22:38:40:756 GMT-05:00] 692c692c SystemErr R sun.io.MalformedInputException
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at java.lang.Throwable.<init>(Throwable.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at sun.nio.cs.StreamDecoder.read(StreamDecoder.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at java.io.InputStreamReader.read(InputStreamReader.java)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLEntityScanner.scanQName(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
[8/29/04 22:38:40:776 GMT-05:00] 692c692c SystemErr R at com.polyorb.tipranavir.pdf.ConvertXML.cparse(ConvertXML.java)


What am I doing wrong here???

Thank you for any guidance...

Regards, Alex.
 
Malcolm Dew-Jones

Aleksandar Matijaca wrote:
: Hi there,

: I need some help. I am trying to parse, using the Apache SAX parser,
: a file that has valid UTF-8 characters, but I keep getting a

: sun.io.MalformedInputException error.

: This is my code:

: infile = "<?xml version=\"1.0\"
: encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
: Yen</currency_display></display_values>";

A Java String is not UTF-8; internally it is UTF-16. So if you pass the
"raw bytes" of the string to the parser, they are not UTF-8.

However, I have never used the specific set of calls you are
using, so I don't know for sure that this is the problem.
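
To make the point about encodings concrete, here is a minimal sketch of what bytes the Yen character turns into under a few charsets. The class name YenBytes and the helper hex are made up for illustration, and StandardCharsets needs Java 7 or later, which postdates this thread.

```java
import java.nio.charset.StandardCharsets;

public class YenBytes {
    // Formats a byte array as space-separated uppercase hex, e.g. "C2 A5".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In memory this is a single char; bytes only exist once a charset is chosen.
        String yen = "\u00A5";
        System.out.println("UTF-8:      " + hex(yen.getBytes(StandardCharsets.UTF_8)));      // C2 A5
        System.out.println("ISO-8859-1: " + hex(yen.getBytes(StandardCharsets.ISO_8859_1))); // A5
        System.out.println("UTF-16BE:   " + hex(yen.getBytes(StandardCharsets.UTF_16BE)));   // 00 A5
    }
}
```

The same character gives three different byte sequences, so the bytes handed to the parser are only UTF-8 if they were produced with a UTF-8 encoder.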
 
Martin Honnen

Aleksandar Matijaca wrote:

I need some help. I am trying to parse, using the Apache SAX parser,
a file that has valid UTF-8 characters, but I keep getting a

sun.io.MalformedInputException error.

This is my code:

infile = "<?xml version=\"1.0\"
encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
Yen</currency_display></display_values>";

// the above is perfectly valid UNICODE symbol for Yen

XMLReader xr = new org.apache.xerces.parsers.SAXParser();

xr.setContentHandler(this);
xr.setErrorHandler(this);

ByteArrayInputStream bi = new
ByteArrayInputStream(infile.getBytes());

I suspect the problem is here: getBytes() uses the platform's default
encoding (character set), whereas you want UTF-8, so try
infile.getBytes("UTF8")
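
A minimal sketch of the suggested fix, for reference. It assumes the JAXP SAXParserFactory bundled with the JDK instead of instantiating org.apache.xerces.parsers.SAXParser directly, and the names ExplicitCharsetParse and TextCollector are made up for illustration. The important part is that the charset used to encode the String matches the one used by the InputStreamReader.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ExplicitCharsetParse {
    // Collects all character data so the parse result can be inspected.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    static String parse(String xml) throws Exception {
        XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        TextCollector handler = new TextCollector();
        xr.setContentHandler(handler);
        xr.setErrorHandler(handler);

        // Encode with an explicit charset instead of the platform default...
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        // ...and decode with the same charset.
        Reader reader = new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        xr.parse(new InputSource(reader));
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";
        System.out.println(parse(infile).equals("\u00A5 Japanese Yen")); // true
    }
}
```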
 
Aleksandar Matijaca

MARTIN - THIS FIXED IT!!! It was the infile.getBytes("UTF-8")
Martin and Malcolm, thank you very much for your suggestions.

All the best, Alex.
(Toronto)
 
Soren Kuula

Aleksandar said:
Hi there,

Hi, I can see you got your problem solved, but are you sure it is
_really_ doing what you want it to do (and are you aware of what is
happening)?

Assuming the type of your parameter infile is String:

Character encoding is the translation between character strings and byte
strings. I assume also that whatever made the String infile, it has
somehow managed to get the right chars out of the bytes in your file.

I think that this happens:
infile = "<?xml version=\"1.0\"
encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
Yen</currency_display></display_values>";

// the above is perfectly valid UNICODE symbol for Yen

XMLReader xr = new org.apache.xerces.parsers.SAXParser();

xr.setContentHandler(this);
xr.setErrorHandler(this);
ByteArrayInputStream bi = new
ByteArrayInputStream(infile.getBytes());

1a) (as before) getBytes() returns the PLATFORM-DEFAULT-ENCODED byte string
representation of your String,
or
1b) (after the fix) getBytes("UTF-8") returns the UTF-8-ENCODED byte string
representation of your String.

Reader reader = new InputStreamReader(bi,"UTF-8");

2) Your Reader then decodes the byte stream (as UTF-8) into chars again.

InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");

3) The setEncoding statement should really have no effect. The
InputSource does not have the challenge of turning bytes into chars, as
it already has a Reader (a source of chars, as opposed to a Stream, a
source of bytes) to extract characters from. In other words, the
decoding work should have been done already.

xr.parse(is); // CRASHES RIGHT HERE...

- because UTF_DECODE(SOME_OTHER_ENCODING_ENCODE(s)) is not necessarily
= s for some String s.
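
The round-trip mismatch can be reproduced without any XML parser. This is a sketch under two assumptions: ISO-8859-1 stands in for "some other encoding" (the actual platform default in the original setup is not known), and a strict java.nio decoder is used, so the exception is java.nio.charset.MalformedInputException, not the older sun.io one from the trace. RoundTrip and decodeUtf8Strict are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    // Decodes bytes as UTF-8 and throws on malformed input
    // instead of silently substituting U+FFFD.
    static String decodeUtf8Strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        String s = "\u00A5 Japanese Yen";

        // Matching encode/decode: the round trip gives back s.
        System.out.println(decodeUtf8Strict(s.getBytes(StandardCharsets.UTF_8)).equals(s)); // true

        // Mismatched encode/decode: the lone byte 0xA5 is not valid UTF-8.
        try {
            decodeUtf8Strict(s.getBytes(StandardCharsets.ISO_8859_1));
        } catch (CharacterCodingException e) {
            System.out.println(e.getClass().getName()); // java.nio.charset.MalformedInputException
        }
    }
}
```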

I think you read a file into a String (correctly decoded, maybe by
coincidence).
Then you encode that String (a String is just a sequence of chars) into
bytes and decode that back into chars again. There is no need for that!
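
If the XML really is already in a String, one way to skip the encode/decode round trip is to give the parser a StringReader. This is a sketch, not code from the thread: it assumes the JDK's JAXP SAXParserFactory, and StringReaderParse and TextCollector are made-up names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class StringReaderParse {
    // Collects all character data so the parse result can be inspected.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    static String parse(String xml) throws Exception {
        XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        TextCollector handler = new TextCollector();
        xr.setContentHandler(handler);
        xr.setErrorHandler(handler);
        // The parser reads chars directly; no bytes and no charset are involved.
        xr.parse(new InputSource(new StringReader(xml)));
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";
        System.out.println(parse(infile).equals("\u00A5 Japanese Yen")); // true
    }
}
```

With a character stream the parser disregards the encoding declaration in the document, so it does not matter what the XML declaration claims.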

I suggest you let an InputStream read from your file, and use that
InputStream DIRECTLY as the argument to your InputSource. Reason: the
parser may be clever enough (I think it is) to UNDERSTAND the <?xml
encoding="blah"... declaration IN the XML file ITSELF. Then it will
automatically use the proper decoding.
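
A sketch of that approach, assuming the JDK's JAXP SAXParserFactory. A ByteArrayInputStream stands in for the file here so the example is self-contained; with a real file you would pass a FileInputStream instead. StreamParse and TextCollector are illustrative names.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class StreamParse {
    // Collects all character data so the parse result can be inspected.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Hands raw bytes to the parser, which picks the decoder
    // from the encoding named in the XML declaration.
    static String parse(InputStream in) throws Exception {
        XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        TextCollector handler = new TextCollector();
        xr.setContentHandler(handler);
        xr.setErrorHandler(handler);
        xr.parse(new InputSource(in));
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String body = "<currency_display>\u00A5 Japanese Yen</currency_display>";

        // The same document stored in two different encodings, each declared correctly.
        byte[] asUtf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body)
                .getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(parse(new ByteArrayInputStream(asUtf8)).equals("\u00A5 Japanese Yen"));   // true
        System.out.println(parse(new ByteArrayInputStream(asLatin1)).equals("\u00A5 Japanese Yen")); // true
    }
}
```

The calling code never names a charset, so changing the encoding of the XML files later only requires the declaration in each file to be accurate.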

If that fails, you may try opening a Reader on an InputStream from the
file, and then supply the encoding yourself (taking the risk that one day
you will prefer to write your XML files in some other encoding, and your
program will not work anymore).

Anyway encoding a String into bytes and then back to a source of chars
(a Reader) only adds confusion.

Soren
 
Aleksandar Matijaca

Yes, actually, the string does have some UTF-8 characters, which I am indeed
expecting. I am expecting a combination of Yen currency characters, British
pounds etc... This is an XML stream that needs to be parsed, modified, and
sent to FOP for PDF generation.

I have always dealt with SAX parsing with plain Strings, and
that has always worked; however, I really did get stuck on this one...


Regards, Alex.
 
