I am having problems with some Danish characters in some XML files that
I am processing
using javax.xml.parsers.DocumentBuilder.
I have several related problems, which I list below.
Issue 1:
The XML file has the encoding set in the XML declaration:
<?xml version="1.0" encoding="utf-8"?>
However, when I open the file in a hex editor I can see that the BOM
is
FF FE
which indicates UTF-16 little-endian.
Is this a problem? I had assumed that the XML declaration took
precedence.
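To double-check what the hex editor shows, a minimal sketch along these lines could report which BOM (if any) a byte array starts with. The class and method names are made up for illustration, and UTF-32 BOMs are deliberately ignored.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class BomSniffer {
    /** Returns the charset implied by a leading BOM, or null if no known BOM is present. */
    static Charset detect(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;      // EF BB BF
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;   // FF FE (UTF-32LE, FF FE 00 00, is not distinguished here)
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;   // FE FF
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xFF, (byte) 0xFE, '<', 0};
        System.out.println(detect(head));
    }
}
```

If this reports UTF-16LE for a file whose declaration says utf-8, the file and its declaration disagree, and that mismatch is worth resolving before looking at individual characters.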
Issue 2:
I am getting conflicting data in the input files over a period of
time:
in some input files the character
æ
(LATIN SMALL LETTER AE)
C3 A6 (UTF-8 code-units)
E6 (Hex code units)
is received
... but under other (unknown) circumstances the same character is
received, in otherwise identical
records on different occasions, as:
Ã¦
C3 83 C2 A6 (UTF-8 code-units)
C3 A6 (Hex code units)
Depending on how I create the InputSource (see section "Parsing
methods"):
character:
Ã¦
is parsed successfully using both method [1] and method [2]
character:
æ
is NOT parsed successfully using either method [1] or method [2]
(so the Ã¦ representation, which is displayed incorrectly,
is parsed without error, but the æ character causes a SAXParseException
no matter which method I use)
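The byte sequence C3 83 C2 A6 is what results when the UTF-8 bytes of æ are misread as ISO-8859-1 and then encoded as UTF-8 a second time. The sketch below reproduces that; whether this is what actually happens upstream of these files is an assumption, and the class name is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class DoubleEncodingDemo {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte x : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", x & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes s as UTF-8, misreads those bytes as ISO-8859-1, then encodes as UTF-8 again. */
    static byte[] doubleEncode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ae = "\u00E6"; // LATIN SMALL LETTER AE
        System.out.println(hex(ae.getBytes(StandardCharsets.UTF_8))); // C3 A6
        System.out.println(hex(doubleEncode(ae)));                    // C3 83 C2 A6
    }
}
```

Both outputs are well-formed UTF-8, which would fit the observation that the double-encoded form parses without error even though it displays wrongly.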
----
Parsing methods
I am processing the file using a DOM parser, but I get a
SAXParseException with the message
"Invalid byte 2 of 3-byte UTF-8 sequence"
I create a DOM parser using the following procedure (I am including
fully-qualified class names where
types are first encountered, for clarity):
javax.xml.parsers.DocumentBuilderFactory dbf =
DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
javax.xml.parsers.DocumentBuilder db = dbf.newDocumentBuilder();
java.io.ByteArrayInputStream bis = XXX;
org.xml.sax.InputSource inputSource = new InputSource(bis);
org.w3c.dom.Document doc = db.parse(inputSource);
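One variation on the procedure above is to hand the parser a character stream instead of a byte stream. According to the org.xml.sax.InputSource documentation, a parser given a character stream reads it directly and disregards the encoding declaration in the document. The sketch below assumes the caller already knows the real encoding of the bytes; the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
class ExplicitCharsetParse {
    /** Parses xmlBytes, decoding them with the given charset rather than the declared one. */
    static Document parse(byte[] xmlBytes, Charset actual) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            // The Reader does the byte-to-char decoding, so the parser never sees raw bytes.
            Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), actual);
            return db.parse(new InputSource(reader));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Declared as utf-8 but actually encoded as ISO-8859-1 (the ae is the single byte E6).
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><a>\u00E6</a>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

This only works around a wrong declaration; it does not help if the encoding differs from file to file and cannot be determined in advance.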
Now this is where things get even more confusing: I get failures at
different points
in the XML file depending on the method I use to create the
ByteArrayInputStream object.
Method [1]
byte[] b = <get-the-byte-array-from-file>
ByteArrayInputStream bis = new ByteArrayInputStream (b);
or ...
Method [2]
bis = new ByteArrayInputStream (new String(b, "UTF-8").getBytes());
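Two details of method [2] may explain why the failure moves around. new String(b, "UTF-8") does not reject malformed input but silently replaces it with U+FFFD, and getBytes() without an argument uses the platform default charset. The sketch below contrasts a strict UTF-8 check with that lenient round trip. It assumes the failing files carry æ as the single byte E6, which in UTF-8 announces a 3-byte sequence. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class DecodeModes {
    /** Returns true only if the bytes are well-formed UTF-8. */
    static boolean isStrictlyValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Method [2] with the charset made explicit on both sides. */
    static byte[] roundTrip(byte[] b) {
        // Malformed input is replaced with U+FFFD here instead of raising an error.
        return new String(b, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] goodUtf8 = {(byte) 0xC3, (byte) 0xA6};   // ae encoded as UTF-8
        byte[] latin1 = {'b', (byte) 0xE6, 'r'};        // "baer" prefix with ae as the single byte E6
        System.out.println(isStrictlyValidUtf8(goodUtf8));
        System.out.println(isStrictlyValidUtf8(latin1));
        System.out.println(new String(latin1, StandardCharsets.UTF_8).indexOf('\uFFFD') >= 0);
    }
}
```

Running the strict check on the raw file bytes before parsing would show whether the input is really UTF-8 at all, independent of what the parser reports.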
Can anyone shed any light on what's going on here?
Regards