XML-Parsing with UTF-8 Byte-Order-Mark (BOM)

Patrick.Gebhardt · Jun 25, 2007

Hello,

I have a really weird problem.

The environment is a client-server application, where the client
reads a UTF-8 encoded XML file (with Cyrillic characters, e.g.) which
is then sent to the server, where it is parsed in 2 different ways:
first using a normal SaxParser, then via Castor (which is using the
_same_ parser library).

relevant Libs: xercesImpl 2.9.0, castor 0.9.5

The client XML file is UTF-8 with BOM (hex: EF BB BF).

The client sends this file via a commons-httpclient POST call to the
server using the correct content type.
I ensure on the server side that the file is received correctly; I can
read the Cyrillic characters in the logfile after doing the following
in the servlet:

The following is obviously pseudocode:

doPost() {
request.setCharacterEncoding("UTF8");
InputStream in = request.getInputStream();

ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buffer = new byte[1024];

int count = in.read(buffer);
while( count != -1) {
baos.write(buffer, 0, count);
count = in.read(buffer);
}

byte[] xml = baos.toByteArray();
String s = new String(xml, "UTF8"); --> string is correct, contains
Cyrillic characters

--- until here, everything is fine.

--- Now I have to parse the XML to find a node attribute and decide,
based on its value, into which
--- Castor classes I have to unmarshal the XML.
--- To be able to call Castor, I need a second input stream which
Castor will be using.
--- Therefore I copy the byte[] and create a second stream.
--- (The files are really small, therefore I don't expect memory
problems.)

byte[] xmlCastor = new byte[xml.length];
System.arraycopy(xml, 0, xmlCastor, 0, xml.length);

ByteArrayInputStream bais = new ByteArrayInputStream(xml);
ByteArrayInputStream baisCastor = new
ByteArrayInputStream(xmlCastor);

-- I can verify in the logfile that these 2 byte arrays contain the
same Cyrillic characters.

-- Now I call the SaxParser with the first stream, and I receive the
node attribute.
-- Then I pass the second stream to Castor ... and bummer ...

Caused by: org.xml.sax.SAXException: Parsing Error: Content is not
allowed in prolog.

-- that is because of the byte-order mark, the Parser does not like
it.
-- 2 identical streams (as far as I can tell) parsed by the same
parser ... one runs into an exception,
-- the other does not.

-- I have _exactly one_ parser in my Tomcat in WEB-INF/lib, and that
is xercesImpl-2.9.0.jar.
-- Is it somehow possible that Tomcat provides a different version? I
cannot verify how Castor is
-- choosing its XML parser, but I do it the following way:

SAXParserFactory pf = SAXParserFactory.newInstance();
XMLReader parser = pf.newSAXParser().getXMLReader();
parser.parse(new InputSource(bais));

Any helpful tips appreciated!

P.S.: I can't change very much of the infrastructure ... Castor, e.g., is
definitely a set condition.

Mike Schilling · Jun 25, 2007

Create a FilterInputStream that doesn't pass the BOM through, and parse
*that*.
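
A minimal sketch of that suggestion, assuming the BOM is the 3-byte
UTF-8 signature EF BB BF at the very start of the stream. The class
name BomSkippingInputStream is made up for illustration; it is not
part of the JDK, Xerces or Castor. It peeks at the first three bytes
through a PushbackInputStream and pushes them back if they are not a
BOM, so everything downstream sees the document starting at "<".

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical helper: hides a leading UTF-8 BOM (EF BB BF) from the reader. */
class BomSkippingInputStream extends FilterInputStream {

    BomSkippingInputStream(InputStream source) throws IOException {
        // Pushback buffer of 3 bytes is enough to undo the peek below.
        super(new PushbackInputStream(source, 3));
        PushbackInputStream pin = (PushbackInputStream) this.in;

        // Read up to 3 bytes; a single read() may return fewer than asked for.
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r == -1) {
                break; // stream is shorter than a BOM
            }
            n += r;
        }

        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: give the bytes back so nothing is lost.
        if (!bom && n > 0) {
            pin.unread(head, 0, n);
        }
    }
}
```

With the variables from the original post, this would be used as
new InputSource(new BomSkippingInputStream(bais)), and the second
stream would be wrapped the same way before it is handed to Castor.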

Roedy Green · Jun 29, 2007

Patrick.Gebhardt said:
-- that is because of the byte-order mark, the Parser does not like
it.

Java i/o is not clever enough to read the byte order mark, adjust its
notion of the encoding and discard it as would be done in heaven.

Instead it just passes it through to the app as a single character.
So all application software needs to just discard such characters.

Have a look at the FilterInputStream and FilterOutputStream. It
should be possible to insert a layer of filtering to discard such
characters.
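
Since the servlet in the original post already holds the whole
document in a byte[], the discarding can also be done without a
filter class. The following is only a sketch under that assumption;
BomBytes and its two methods are hypothetical names, not existing
library calls. It checks the first three bytes and, if they are the
UTF-8 BOM, opens the stream at offset 3 instead of 0.

```java
import java.io.ByteArrayInputStream;

/** Hypothetical helper for documents that are already fully in memory. */
final class BomBytes {
    private BomBytes() {}

    /** Returns 3 if the array starts with the UTF-8 BOM (EF BB BF), else 0. */
    static int utf8BomLength(byte[] data) {
        boolean bom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return bom ? 3 : 0;
    }

    /** Opens a stream over the same array that starts after the BOM (no copy). */
    static ByteArrayInputStream streamWithoutBom(byte[] data) {
        int skip = utf8BomLength(data);
        return new ByteArrayInputStream(data, skip, data.length - skip);
    }
}
```

Because ByteArrayInputStream only reads from the array, both streams
from the original post could be created from the same xml array this
way, which would also make the System.arraycopy step unnecessary.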

Mike Schilling · Jun 29, 2007

Roedy Green said:
Java i/o is not clever enough to read the byte order mark, adjust its
notion of the encoding and discard it as would be done in heaven.

It does with UTF-16, I think. The BOM for UTF-8 is a Microsoft invention,
not a standard. (Though Java should handle it anyway; interoperability is
more important than corporate politics.)
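
The difference between the two decoders can be checked with a few
lines. This is just an illustrative sketch; BomDecodingDemo and its
method names are invented for the example, and it uses
StandardCharsets, which needs Java 7 or later (on older JDKs the
charset names "UTF-8" and "UTF-16" can be passed as strings instead).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: how the JDK decoders treat a leading BOM. */
final class BomDecodingDemo {
    private BomDecodingDemo() {}

    /** "<a/>" encoded as UTF-8 with the EF BB BF signature in front. */
    static byte[] utf8WithBom() {
        return new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
    }

    /** "<a/>" encoded as big-endian UTF-16 with the FE FF BOM in front. */
    static byte[] utf16WithBom() {
        return new byte[] {(byte) 0xFE, (byte) 0xFF, 0, '<', 0, 'a', 0, '/', 0, '>'};
    }

    /** The UTF-8 decoder keeps the BOM as the character U+FEFF. */
    static String decodeUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    /** The "UTF-16" decoder uses the BOM to pick the byte order and drops it. */
    static String decodeUtf16(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }
}
```

decodeUtf8(utf8WithBom()) has length 5 and starts with '\uFEFF',
while decodeUtf16(utf16WithBom()) is exactly "<a/>" with length 4.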
