SAX & UTF-8 problem

Chris

My SAX parser is choking on UTF-8 encoded files (a "document root element is
missing" error). The problem is three bytes that appear at the beginning of
each file:

0xEF 0xBB 0xBF

If I delete the bytes the problem goes away.

I'm accessing the file by using a FileInputStream and then wrapping it in a
SAX InputSource. My guess is that the InputSource is converting bytes to
chars using the platform's default encoding, rather than UTF-8.

Is there any existing InputSource class or Reader class that will
automatically detect UTF-8 and encode chars correctly? Or do I have to write
my own Reader class to do it?
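For reference, those three bytes are the UTF-8 encoding of the byte order mark (U+FEFF). A minimal sketch of how to detect them on a raw stream, assuming a hypothetical helper class `BomCheck` (not part of any library):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: checks whether a stream starts with the UTF-8
// BOM bytes 0xEF 0xBB 0xBF described above.
public class BomCheck {
    static boolean startsWithUtf8Bom(InputStream in) throws IOException {
        // read() returns the next byte as an unsigned int (0-255), or -1 at EOF
        int b1 = in.read(), b2 = in.read(), b3 = in.read();
        return b1 == 0xEF && b2 == 0xBB && b3 == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?' };
        System.out.println(startsWithUtf8Bom(new ByteArrayInputStream(withBom))); // prints "true"
    }
}
```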
 
Roedy Green

Is there any existing InputSource class or Reader class that will
automatically detect UTF-8 and encode chars correctly? Or do I have to write
my own Reader class to do it?

Your best chance of success is a Reader constructed with an explicit
UTF-8 encoding.

Hopefully it will just discard the signature.
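A sketch of that suggestion, with one caveat: in my experience Java's UTF-8 decoder does not discard the signature, so the BOM comes through the Reader as the single character U+FEFF and still has to be skipped by hand before handing the Reader to the InputSource. Class and method names here are my own, for illustration:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;

public class Utf8BomReader {
    // Wrap the byte stream in a Reader with an explicit UTF-8 encoding,
    // then skip a leading U+FEFF (the decoded BOM) if one is present.
    static Reader utf8ReaderSkippingBom(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, "UTF-8"), 1);
        int first = reader.read();
        if (first != 0xFEFF && first != -1) {
            reader.unread(first); // no BOM: push the character back
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // A tiny UTF-8 document with a BOM in front of it.
        byte[] xml = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>' };
        Reader r = utf8ReaderSkippingBom(new ByteArrayInputStream(xml));
        System.out.println(new BufferedReader(r).readLine()); // prints "<a/>"
    }
}
```

The resulting Reader can then be passed to `new InputSource(reader)` in place of the raw FileInputStream, so the parser never sees the signature.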
 
Van Ly

Chris said:
My SAX parser is choking on UTF-8 encoded files (a "document root
element is missing" error). The problem is three bytes that appear at
the beginning of each file:

0xEF 0xBB 0xBF

If I delete the bytes the problem goes away.

I'm getting the same problem with Javadoc. With the three bytes above
(the byte order mark, or BOM) at the start of a UTF-8 file, Javadoc will
choke because it considers them illegal characters. Of course, removing
the three bytes gets Javadoc going again.

I've tried the "-encoding UTF8" and "-encoding UTF-8" options of Javadoc,
but it still bombs. Can anyone reading this help?

Thanks,
Van
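Since the -encoding options don't help here, one workaround is to strip the three BOM bytes from the source files before running Javadoc. A hypothetical utility along those lines (the class name and in-place rewrite are my own choices, not anything the JDK provides):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StripBom {
    // Returns the input without a leading UTF-8 BOM; returns it
    // unchanged if no BOM is present.
    static byte[] stripUtf8Bom(byte[] bytes) {
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            byte[] out = new byte[bytes.length - 3];
            System.arraycopy(bytes, 3, out, 0, out.length);
            return out;
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java StripBom SomeFile.java  (rewrites the file in place)
        if (args.length > 0) {
            Path file = Paths.get(args[0]);
            Files.write(file, stripUtf8Bom(Files.readAllBytes(file)));
        }
    }
}
```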
 
