SAX & UTF-8 problem

Chris

My SAX parser is choking on UTF-8 encoded files (a "document root element is
missing" error). The problem is three bytes that appear at the beginning of
each file:

0xEF 0xBB 0xBF

If I delete the bytes the problem goes away.

I'm accessing the file by using a FileInputStream and then wrapping it in a
SAX InputSource. My guess is that the InputSource is converting bytes to
chars using the platform's default encoding, rather than UTF-8.

Is there any existing InputSource class or Reader class that will
automatically detect UTF-8 and encode chars correctly? Or do I have to write
my own Reader class to do it?
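For reference, those three bytes are the UTF-8 encoding of the byte order mark (U+FEFF). A minimal sketch of how to detect them on a raw stream, assuming a hypothetical helper class `BomCheck` (not part of any library):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: checks whether a stream starts with the UTF-8
// BOM bytes 0xEF 0xBB 0xBF described above.
public class BomCheck {
    static boolean startsWithUtf8Bom(InputStream in) throws IOException {
        // read() returns the next byte as an unsigned int (0-255), or -1 at EOF
        int b1 = in.read(), b2 = in.read(), b3 = in.read();
        return b1 == 0xEF && b2 == 0xBB && b3 == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?' };
        System.out.println(startsWithUtf8Bom(new ByteArrayInputStream(withBom))); // prints "true"
    }
}
```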
 
Roedy Green

Is there any existing InputSource class or Reader class that will
automatically detect UTF-8 and encode chars correctly? Or do I have to write
my own Reader class to do it?

Your best chance of success is a Reader constructed with an explicit
UTF-8 encoding.

Hopefully it will just discard the signature.
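A sketch of that suggestion, with one caveat: in my experience Java's UTF-8 decoder does not discard the signature, so the BOM comes through the Reader as the single character U+FEFF and still has to be skipped by hand before handing the Reader to the InputSource. Class and method names here are my own, for illustration:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;

public class Utf8BomReader {
    // Wrap the byte stream in a Reader with an explicit UTF-8 encoding,
    // then skip a leading U+FEFF (the decoded BOM) if one is present.
    static Reader utf8ReaderSkippingBom(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, "UTF-8"), 1);
        int first = reader.read();
        if (first != 0xFEFF && first != -1) {
            reader.unread(first); // no BOM: push the character back
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // A tiny UTF-8 document with a BOM in front of it.
        byte[] xml = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>' };
        Reader r = utf8ReaderSkippingBom(new ByteArrayInputStream(xml));
        System.out.println(new BufferedReader(r).readLine()); // prints "<a/>"
    }
}
```

The resulting Reader can then be passed to `new InputSource(reader)` in place of the raw FileInputStream, so the parser never sees the signature.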
 
Van Ly

Chris said:
My SAX parser is choking on UTF-8 encoded files (a "document root
element is missing" error). The problem is three bytes that appear at
the beginning of each file:

0xEF 0xBB 0xBF

If I delete the bytes the problem goes away.

I'm getting the same problem with Javadoc. With the three bytes above
(the byte order mark, or BOM) at the start of a UTF-8 file, Javadoc will
choke because it considers them illegal characters. Of course, removing
the three bytes gets Javadoc going again.

I've tried the "-encoding UTF8" and "-encoding UTF-8" options of Javadoc,
but it still bombs. Can anyone reading this help?

Thanks,
Van
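Since the -encoding options don't help here, one workaround is to strip the three BOM bytes from the source files before running Javadoc. A hypothetical utility along those lines (the class name and in-place rewrite are my own choices, not anything the JDK provides):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StripBom {
    // Returns the input without a leading UTF-8 BOM; returns it
    // unchanged if no BOM is present.
    static byte[] stripUtf8Bom(byte[] bytes) {
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            byte[] out = new byte[bytes.length - 3];
            System.arraycopy(bytes, 3, out, 0, out.length);
            return out;
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java StripBom SomeFile.java  (rewrites the file in place)
        if (args.length > 0) {
            Path file = Paths.get(args[0]);
            Files.write(file, stripUtf8Bom(Files.readAllBytes(file)));
        }
    }
}
```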
 
