Invalid byte 2 of 3-byte UTF-8 sequence - inconsistent behavior


KN

I am having problems with some Danish characters in XML files that I am
processing with javax.xml.parsers.DocumentBuilder.
I have several related problems, which I list below.

Issue 1:

The XML file has the encoding set in the XML declaration:

<?xml version="1.0" encoding="utf-8"?>

However, when I open the file in a hex editor I can see that it starts
with the BOM

FF FE

which indicates UTF-16 little-endian.

Is this a problem? I had assumed that the XML declaration took
precedence.


Issue 2:

I am getting conflicting data in the input files over a period of
time:

in some input files the character

æ
(LATIN SMALL LETTER AE)
C3 A6 (UTF-8 code-units)
E6 (Hex code units)

is received

... but under other (unknown) circumstances the same character, in
otherwise identical records, is received as:

Ã¦

C3 83 C2 A6 (UTF-8 code-units)
C3 A6 (Hex code units)

The outcome for each form, by parsing method (see "Parsing methods"
below):

character:
Ã¦
using method [1] and method [2] is parsed successfully

character:
æ
using method [1] and method [2] is NOT parsed successfully

(So the Ã¦ representation, which is displayed incorrectly, is parsed
correctly, but the æ character causes a SAXParseException no matter
which method I use.)
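
The wording of the exception in the title, "Invalid byte 2 of 3-byte UTF-8 sequence", hints at a possible cause. In UTF-8, a byte of the form 1110xxxx announces a three-byte sequence, and each following byte must look like 10xxxxxx. The byte E6 (æ in ISO-8859-1) has the 1110xxxx form, so a UTF-8 decoder that meets a lone E6 followed by an ordinary ASCII letter would be expected to complain about byte 2. A minimal sketch of that classification follows; the class and method names are made up for illustration, the check is simplified to bit patterns only, and it is not the parser's actual code.

```java
final class Utf8LeadByte {
    /**
     * Number of bytes a UTF-8 sequence starting with this byte should have,
     * judged by bit pattern only, or -1 if the byte cannot start a sequence.
     */
    static int expectedLength(int b) {
        b &= 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // continuation byte, or never valid (FE, FF)
    }

    /** True if the byte has the 10xxxxxx form required of bytes 2..n. */
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        // E6 claims to start a 3-byte sequence...
        System.out.println(expectedLength(0xE6));
        // ...but an ASCII letter after it is not a continuation byte.
        System.out.println(isContinuation('r'));
    }
}
```

By the same test, C3 is the lead byte of a two-byte sequence and A6 is a valid continuation, which is why C3 A6 decodes cleanly.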

----
Parsing methods

I am processing the file using a DOM parser, but I get a
SAXParseException with the message
"Invalid byte 2 of 3-byte UTF-8 sequence".



I create a DOM parser using the following procedure (I include
fully-qualified class names where types are first encountered, for
clarity):

javax.xml.parsers.DocumentBuilderFactory dbf =
DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);

javax.xml.parsers.DocumentBuilder db = dbf.newDocumentBuilder();

java.io.ByteArrayInputStream bis = XXX;
org.xml.sax.InputSource inputSource = new InputSource(bis);
org.w3c.dom.Document doc = db.parse(inputSource);
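
A self-contained sketch of the procedure above, for anyone who wants to reproduce the behaviour. The sample document and the class name ParseDemo are invented for illustration, and the XXX placeholder is replaced by two different encodings of the same string. The parser's default error handler may also print a fatal-error line to stderr for the failing case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ParseDemo {
    // Hypothetical sample document; the declaration claims UTF-8.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?><r>bl\u00E6r</r>";

    /** Returns the root element's text, or null if the bytes could not be parsed. */
    static String parse(byte[] bytes) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            InputSource inputSource = new InputSource(new ByteArrayInputStream(bytes));
            Document doc = db.parse(inputSource);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) { // SAXException, IOException, ParserConfigurationException
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = XML.getBytes(StandardCharsets.UTF_8);        // æ becomes C3 A6
        byte[] latin1 = XML.getBytes(StandardCharsets.ISO_8859_1); // æ becomes E6
        System.out.println("UTF-8 bytes parsed:   " + (parse(utf8) != null));
        System.out.println("Latin-1 bytes parsed: " + (parse(latin1) != null));
    }
}
```

The bytes have to match what the declaration says: the same text encoded as ISO-8859-1 but declared as utf-8 is rejected.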

Now this is where things get even more confusing: I get failures at
different points in the XML file depending on the method I use to
create the ByteArrayInputStream object.

Method [1]

byte[] b = <get-the-byte-array-from-file>

ByteArrayInputStream bis = new ByteArrayInputStream (b);

or ...

Method [2]

bis = new ByteArrayInputStream (new String(b, "UTF-8").getBytes());
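
One thing worth noting about method [2]: getBytes() without an argument encodes with the platform's default charset, so the bytes handed to the parser can differ from machine to machine even though the XML declaration still says utf-8. In addition, new String(b, "UTF-8") does not fail on malformed input; it substitutes U+FFFD. The sketch below shows both effects with explicit charsets; the class name and the hex helper are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

final class RoundTrip {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xC3, (byte) 0xA6};            // æ encoded as UTF-8
        String s = new String(b, StandardCharsets.UTF_8); // decodes to "æ"

        // With an explicit charset the result no longer depends on the platform.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 A6
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // E6

        // A lone E6 is not valid UTF-8; decoding replaces it instead of failing.
        String bad = new String(new byte[] {(byte) 0xE6}, StandardCharsets.UTF_8);
        System.out.println(bad.contains("\uFFFD"));                       // true
    }
}
```

If the default charset happens to be ISO-8859-1 or similar, method [2] would turn C3 A6 into E6 before the parser ever sees it.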

Can anyone shed any light on what's going on here?

regards
 

Richard Tobin

KN said:
Issue 1:

The XML file has the encoding set in the XML declaration:

<?xml version="1.0" encoding="utf-8"?>

however - when I open the file in a hex editor I can see that the BOM
is set as

FF FE

Which translates to UTF-16 little-endian

- is this a problem - I had assumed that the XML declaration took
precedent ?

It's not a question of precedence. The file has to be in one
encoding, and if it's in UTF-8 it can't have FF FE in because that's
not a legal UTF-8 byte sequence.
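
A quick way to check which case a given file falls into is to look at its first bytes before handing it to the parser. The sketch below is an illustration only: the class name BomSniffer is made up, and UTF-32 marks are deliberately ignored to keep it short.

```java
final class BomSniffer {
    /**
     * Returns the encoding implied by a leading byte order mark,
     * or null if no UTF-8 or UTF-16 mark is present.
     */
    static String detect(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return null;
    }

    public static void main(String[] args) {
        // First bytes as reported in the question: FF FE, then '<' in UTF-16LE.
        byte[] fromQuestion = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        System.out.println(detect(fromQuestion));

        // A plain UTF-8 file with no mark at all.
        byte[] plain = {0x3C, 0x3F, 0x78, 0x6D, 0x6C};
        System.out.println(detect(plain));
    }
}
```

If this reports UTF-16LE for a file whose declaration says utf-8, the file and its declaration disagree and one of them has to change.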

æ
(LATIN SMALL LETTER AE)
C3 A6 (UTF-8 code-units)
E6 (Hex code units)

is received

... but under other (unknown) circumstances - this character is
received in identical
records on different occasions as:

Ã¦

C3 83 C2 A6 (UTF-8 code-units)
C3 A6 (Hex code units)

It appears that you have read UTF-8 text as Latin-1, so that
the two UTF-8 bytes were interpreted as two characters, which
were then written out in UTF-8 as two pairs of two bytes.
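
The effect described here can be reproduced in a few lines, which may help when hunting for the step that does the wrong decoding. This is a sketch with made-up class and method names; it also shows that the damage is reversible as long as no bytes were lost along the way.

```java
import java.nio.charset.StandardCharsets;

final class DoubleEncoding {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Reads UTF-8 bytes as if they were Latin-1, then writes the result as UTF-8. */
    static byte[] misread(byte[] utf8) {
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        return wrong.getBytes(StandardCharsets.UTF_8);
    }

    /** Undoes misread(): decodes as UTF-8, then maps each character back to one byte. */
    static byte[] repair(byte[] doubled) {
        String s = new String(doubled, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] original = "\u00E6".getBytes(StandardCharsets.UTF_8);
        byte[] doubled = misread(original);
        System.out.println(hex(original));        // C3 A6
        System.out.println(hex(doubled));         // C3 83 C2 A6
        System.out.println(hex(repair(doubled))); // C3 A6
    }
}
```

The doubled form is valid UTF-8, which would explain why it parses without error even though it displays wrongly.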

-- Richard
 

Philippe Poulard

Richard Tobin wrote:
It's not a question of precedence. The file has to be in one
encoding, and if it's in UTF-8 it can't have FF FE in because that's
not a legal UTF-8 byte sequence.

It is specified (1) that a BOM can be present in UTF-8 (which is
totally useless IMHO).

Thus, if you want a BOM (U+FEFF) in a UTF-8 encoded file, the right
byte sequence is:
EF BB BF
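
The three byte sequences mentioned in this thread are all the same character, U+FEFF, in different encodings. A small sketch to confirm that (the class name and hex helper are assumptions for illustration):

```java
import java.nio.charset.StandardCharsets;

final class BomBytes {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF";
        // The explicit LE/BE charsets encode exactly the characters given,
        // without adding a mark of their own.
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16LE))); // FF FE
    }
}
```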

(2) is a smart tool for playing with Unicode


(1) http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info
(2) http://people.w3.org/rishida/scripts/uniview/conversion.php

--
Regards,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !
 

Andreas Prilop

Richard Tobin said:
I have something similar but less fancy at
http://www.cogsci.ed.ac.uk/~richard/utf-8.html
It has one extra feature: accepting and displaying the UTF-8 byte
sequence as if it were Latin-1, which is often useful for explaining
mysterious errors.

Just what is "Latin-1"? Is it ISO-8859-1 or Windows-1252 or
something else? And what do you do with undefined code positions
in ISO-8859-1 or Windows-1252?
 

Richard Tobin

Andreas Prilop said:
Just what is "Latin-1"? Is it ISO-8859-1

Yes.

And what do you do with undefined code positions
in ISO-8859-1 or Windows-1252?

They'll display as whatever the browser displays them as. I think many
display them as if they were in the Windows encoding, because that's
what's there in the font.
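
In Java terms, since that is what the original question uses, the two encodings agree on bytes such as E6 and differ in the 80 to 9F range, where ISO-8859-1 has control characters and Windows-1252 has printable ones. The sketch below assumes the runtime provides a charset named "windows-1252", which is common but not among the charsets every Java platform is required to support; the class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1VsCp1252 {
    /** Decodes a single byte with the given charset and returns the code point. */
    static int decode(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this charset is available on the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        for (int b : new int[] {0x80, 0x85, 0xE6}) {
            System.out.printf("%02X  ISO-8859-1: U+%04X  windows-1252: U+%04X%n",
                b, decode(b, StandardCharsets.ISO_8859_1), decode(b, cp1252));
        }
    }
}
```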

-- Richard
 
