Invalid byte 2 of 3-byte UTF-8 sequence - inconsistent behavior

KN · Nov 15, 2007

I am having problems with some Danish characters in XML files that
I am processing
using javax.xml.parsers.DocumentBuilder
I have several related problems, which I will now list

Issue 1:

The XML file has the encoding set in the XML declaration:

<?xml version="1.0" encoding="utf-8"?>

however - when I open the file in a hex editor I can see that the BOM
is set as

FF FE

Which translates to UTF-16 little-endian

- is this a problem - I had assumed that the XML declaration took
precedence ?

Issue 2:

I am getting conflicting data in the input files over a period of
time:

in some input files the character

æ
(LATIN SMALL LETTER AE)
C3 A6 (UTF-8 code-units)
E6 (Hex code units)

is received

... but under other (unknown) circumstances - this character is
received in identical
records on different occasions as:

Ã¦

C3 83 C2 A6 (UTF-8 code-units)
C3 A6 (Hex code units)

- depending on how I create the InputSource (see section "parsing
methods")

character:
Ã¦
using method [1] and method [2] is parsed successfully

character:
æ
using method [1] and method [2] is NOT parsed successfully

(so the Ã¦ representation (which is displayed/represented incorrectly)
- is parsed correctly, but the æ character causes a SAXParseException
no matter what method I use)

----
Parsing methods

I am processing the file using a DOM parser but I get a
SAXParseException with the message
"Invalid byte 2 of 3-byte UTF-8 sequence"

I create a DOM parser using the following procedure (am including
fully-qualified class names when
types are first encountered for clarity)

javax.xml.parsers.DocumentBuilderFactory dbf =
DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);

javax.xml.parsers.DocumentBuilder db = dbf.newDocumentBuilder();

java.io.ByteArrayInputStream bis = XXX;
org.xml.sax.InputSource inputSource = new InputSource(bis);
org.w3c.dom.Document doc = db.parse(inputSource);

- now this is where things get even more confusing - I get failures at
different points
in the XML file depending on the method I use to create the
ByteArrayInputStream object,

Method [1]

byte[] b = <get-the-byte-array-from-file>

ByteArrayInputStream bis = new ByteArrayInputStream (b);

or ...

method [2]

bis = new ByteArrayInputStream (new String(b, "UTF-8").getBytes());

Can anyone shed any light on what's going on here ??

regards

Richard Tobin · Nov 15, 2007

KN said:
Issue 1:

The XML file has the encoding set in the XML declaration:

<?xml version="1.0" encoding="utf-8"?>

however - when I open the file in a hex editor I can see that the BOM
is set as

FF FE

Which translates to UTF-16 little-endian

- is this a problem - I had assumed that the XML declaration took
precedence ?

It's not a question of precedence. The file has to be in one
encoding, and if it's in UTF-8 it can't have FF FE in because that's
not a legal UTF-8 byte sequence.
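
To see which encoding the leading bytes of a file actually imply, a small check along these lines may help. It is only a sketch: the class name BomSniffer and the helper detectBom are made up for illustration, and it looks at just the three BOMs mentioned in this thread (it does not try to tell UTF-16LE apart from UTF-32LE, whose BOM also starts with FF FE).

```java
class BomSniffer {
    // Returns the encoding implied by a leading byte order mark,
    // or null if none of the three BOMs checked here is present.
    static String detectBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    // Compares the first bytes of data (as unsigned values) with prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // FF FE followed by "<?" in UTF-16 little-endian
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00, 0x3F, 0x00};
        System.out.println(detectBom(sample)); // UTF-16LE
    }
}
```

If this reports UTF-16LE for a file whose declaration says encoding="utf-8", the declaration and the bytes disagree, and one of them has to be corrected at the source.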

(LATIN SMALL LETTER AE)
C3 A6 (UTF-8 code-units)
E6 (Hex code units)

is received

... but under other (unknown) circumstances - this character is
received in identical
records on different occasions as:

Ã¦

C3 83 C2 A6 (UTF-8 code-units)
C3 A6 (Hex code units)

It appears that you have read UTF-8 text as Latin-1, so that
the two UTF-8 bytes were interpreted as two characters, which
were then written out in UTF-8 as two pairs of two bytes.
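
The byte values quoted above can be reproduced with a few lines of Java. This is a sketch of the mechanism only; the class and method names are invented for the example, and it does not claim to show where in KN's pipeline the misreading happens.

```java
import java.nio.charset.StandardCharsets;

class DoubleEncodingDemo {
    // Formats bytes as space-separated upper-case hex, e.g. "C3 A6".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // The faulty step: UTF-8 bytes are decoded as ISO-8859-1,
    // and the resulting characters are written out as UTF-8 again.
    static byte[] doubleEncode(byte[] utf8) {
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // The reverse step: decode as UTF-8, then map each character
    // back to a single ISO-8859-1 byte to recover the original bytes.
    static byte[] undo(byte[] doubled) {
        String misread = new String(doubled, StandardCharsets.UTF_8);
        return misread.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] good = "\u00E6".getBytes(StandardCharsets.UTF_8); // LATIN SMALL LETTER AE
        byte[] doubled = doubleEncode(good);
        System.out.println(hex(good));          // C3 A6
        System.out.println(hex(doubled));       // C3 83 C2 A6
        System.out.println(hex(undo(doubled))); // C3 A6
    }
}
```

The undo step only works when every character of the misread text fits in ISO-8859-1, so it is a diagnostic aid rather than a general repair.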

-- Richard

Philippe Poulard · Nov 15, 2007

Richard Tobin wrote:

It's not a question of precedence. The file has to be in one
encoding, and if it's in UTF-8 it can't have FF FE in because that's
not a legal UTF-8 byte sequence.

It is specified somewhere (1) that a BOM can be present in UTF-8 (which
is totally useless IMHO).

Thus, if you want to have U+FEFF in a UTF-8 encoded file, the right
byte sequence is:
EF BB BF

(2) is a smart tool for playing with unicode

(1) http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info
(2) http://people.w3.org/rishida/scripts/uniview/conversion.php

--
Regards,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !

Philippe Poulard · Nov 15, 2007

(2) is a smart tool for playing with unicode

...and as Richard said, you can check that the bytes FE FF are an invalid
sequence in UTF-8, and that the code point U+FEFF is encoded as EF BB BF,
as expected
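
The same check can be done without the web tool. The sketch below uses the JDK's strict UTF-8 decoder; the class name BomInUtf8 and its helper methods are only illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class BomInUtf8 {
    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // True if the bytes decode as UTF-8 when malformed input is
    // reported as an error instead of being silently replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The code point U+FEFF written out in UTF-8
        System.out.println(hex("\uFEFF".getBytes(StandardCharsets.UTF_8))); // EF BB BF
        // The raw bytes FE FF never occur in well-formed UTF-8
        System.out.println(isValidUtf8(new byte[] {(byte) 0xFE, (byte) 0xFF})); // false
        System.out.println(isValidUtf8(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF})); // true
    }
}
```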

(2) http://people.w3.org/rishida/scripts/uniview/conversion.php


Richard Tobin · Nov 15, 2007

(2) is a smart tool for playing with unicode
(2) http://people.w3.org/rishida/scripts/uniview/conversion.php

I have something similar but less fancy at

http://www.cogsci.ed.ac.uk/~richard/utf-8.html

It has one extra feature: accepting and displaying the UTF-8 byte
sequence as if it were Latin-1, which is often useful for explaining
mysterious errors.

-- Richard

Andreas Prilop · Nov 15, 2007

I have something similar but less fancy at
http://www.cogsci.ed.ac.uk/~richard/utf-8.html
It has one extra feature: accepting and displaying the UTF-8 byte
sequence as if it were Latin-1, which is often useful for explaining
mysterious errors.

Just what is "Latin-1"? Is it ISO-8859-1 or Windows-1252 or
something else? And what do you do with undefined code positions
in ISO-8859-1 or Windows-1252?

Richard Tobin · Nov 15, 2007

Andreas Prilop said:
Just what is "Latin-1"? Is it ISO-8859-1
Yes.

And what do you do with undefined code positions
in ISO-8859-1 or Windows-1252?

They'll display as whatever the browser displays them as. I think many
display them as if they were in the Windows encoding, because that's
what's there in the font.
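
The difference between the two encodings in the 0x80-0x9F range can be seen in Java as well. This is a sketch, assuming the JDK in use ships the windows-1252 charset (it is not one of the charsets every Java platform is required to support); the class and method names are made up for the example. Byte 0x83 is used because it occurs in the C3 83 C2 A6 sequence discussed earlier in the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1VsWindows1252 {
    // Decodes a single byte value with the given charset and
    // returns the code point of the resulting character.
    static int decodeSingleByte(int value, Charset charset) {
        String s = new String(new byte[] {(byte) value}, charset);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");
        // In ISO-8859-1 the byte maps to a C1 control character.
        System.out.printf("ISO-8859-1:   U+%04X%n",
                decodeSingleByte(0x83, StandardCharsets.ISO_8859_1)); // U+0083
        // In windows-1252 the same byte is a printable character.
        System.out.printf("windows-1252: U+%04X%n",
                decodeSingleByte(0x83, cp1252)); // U+0192
    }
}
```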

-- Richard
