Facing exception: Invalid byte 2 of 4-byte UTF-8 sequence.

dk · Jan 21, 2010

Hi All,

When I use some UTF-8 characters in my xml and parse it with the JDOM
parser, I get the exception below:

Malformed XML, Caused by: 'Invalid byte 2 of 4-byte UTF-8 sequence.'
    at com.clarify.boss.utility.xml.SimpleXmlParser.build(SimpleXmlParser.java:236)
    at com.clarify.boss.msf.handler.RespHeaderInitiateHandler.getStandardHeader(RespHeaderInitiateHandler.java:366)
    at com.clarify.boss.msf.handler.RespHeaderInitiateHandler.execute(RespHeaderInitiateHandler.java:289)
    at com.clarify.boss.utility.appcontroller.support.AbstractHandler.execute(AbstractHandler.java:42)
    at com.clarify.boss.utility.appcontroller.support.ApplicationControllerImpl.handleRequest(ApplicationControllerImpl.java:174)
    at com.clarify.boss.utility.appcontroller.support.ApplicationControllerImpl.execute(ApplicationControllerImpl.java:311)
    at com.clarify.boss.msf.support.ServiceFaultPublisherAB.executeImpl(ServiceFaultPublisherAB.java:87)
    at com.clarify.boss.common.base.BossActionBeanBase.execute(BossActionBeanBase.java:125)
    at com.clarify.boss.sa.msf.xbean.InvokeResponseXB.executeImpl(InvokeResponseXB.java:198)
    at com.clarify.cbo.XBeanImpl.baselineExecuteImpl_(XBeanImpl.java:275)
    at com.amdocs.oss.sm.core.common.XBeanBase.baselineExecuteImpl_(XBeanBase.java:75)
    at com.clarify.cbo.XBeanImpl.execute(XBeanImpl.java:197)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:615)
    at com.clarify.sam.JavaDispatch.invokeMethodImp(JavaDispatch.java:396)
    at com.clarify.sam.JavaDispatch.invokeMethod(JavaDispatch.java:348)
    at com.clarify.sam.ActionBeanService.invokeBeanMethod(ActionBeanService.java:509)
    at com.clarify.sam.ActionBeanService.invokeAifOperation(ActionBeanService.java:128)
    at com.clarify.sam.AppFrameworkBindingHandler.executeOperation(AppFrameworkBindingHandler.java:69)
    at com.amdocs.aif.consumer.ServiceContext.executeWithRetries(ServiceContext.java:900)
    at com.amdocs.aif.consumer.ServiceContext.executeOperationImpl(ServiceContext.java:756)
    at com.amdocs.aif.consumer.ServiceContext.executeOperation(ServiceContext.java:676)
    at com.amdocs.aif.consumer.ServiceContext.executeOperation(ServiceContext.java:323)
    at com.clarify.boss.errorhandler.resolver.ResolverLauncherSynchXB.executeImpl(ResolverLauncherSynchXB.java:157)
    ... 35 more
Caused by: org.jdom.input.JDOMParseException: Error on line 72: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:770)
    at com.clarify.boss.utility.xml.SimpleXmlParser.build(SimpleXmlParser.java:231)
    ... 60 more
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
    ... 62 more

I have declared the encoding to be used while parsing, in my xml as
UTF-8:
<?xml version="1.0" encoding="UTF-8"?>

Initially I suspected that the xml backup had some problem, because
the same xml worked as input on the same application server but
failed when sent from one of my friends' machines. Could this be the
cause?

But now I have found something even more interesting. I tried
changing the encoding to ISO-8859-1, i.e. <?xml version="1.0"
encoding="ISO-8859-1"?>, and to my surprise it worked.

Now this has led to a confusion. I thought ISO-8859-1 is a charset
which is subset of UTF-8. Then why didn't UTF-8 work whereas
ISO-8859-1 worked?

And lastly I can't change this encoding in my xml as in turn I would
have to do all the regression once again on my application. So please
let me know where I have gone wrong.

The Java code that I'm using is:

/*
 * (non-Javadoc)
 *
 * @see com.clarify.boss.utility.xml.XmlParser#build(org.springframework.core.io.Resource)
 */
public Document build(Resource source) {
    try {
        return (getSystemId() == null
                ? getSaxBuilder().build(source.getInputStream())
                : getSaxBuilder().build(source.getInputStream(), getSystemId()));
    } catch (Exception e) {
        e.printStackTrace();
        BossErrorCode bossErrorCode = new BossErrorCode(ErrorCode.BOSS_MALFORMED_XML);
        throw new BossException(bossErrorCode, new String[] {e.getCause().getMessage()}, e);
    }
}

the sax builder method is:

/**
 * Getter method for the <b>saxBuilder </b> property
 *
 * @return Returns the saxBuilder.
 */
private PropertyAwareSAXBuilder getSaxBuilder() {
    if (saxBuilder == null) {

        PropertyAwareSAXBuilder myParser = new PropertyAwareSAXBuilder(isValidate());

        myParser.setFeature("http://apache.org/xml/features/validation/schema", isValidate());
        myParser.setFeature("http://xml.org/sax/features/namespaces", true);

        //CatalogResolver myResolver = new CatalogResolver();

        CatalogResolver myResolver = getCatalogResolver();

        myParser.setEntityResolver(myResolver);
        setSaxBuilder(myParser);

        Iterator it = getProperties().keySet().iterator();
        while (it.hasNext()) {
            String name = (String) it.next();
            saxBuilder.setProperty(name, getProperties().get(name));
        }
    }
    return saxBuilder;
}

Regards,
Dhirendra

Roedy Green · Jan 21, 2010

When I use some UTF-8 characters in my xml and parse it with the JDOM
parser, I get the exception below:

Partition your problem. Is it that the file is malformed or is the
problem getting the XML parser to understand the file is in UTF-8
encoding?

You can examine your file in a hex viewer if you are familiar with
UTF-8 encoding, or you could feed it to the Sun utility native2ascii
to see if it likes it.

See http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/encoding.html
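
As a rough sketch of the "is the file malformed?" half of that question, the JDK's own strict UTF-8 decoder can report where the bytes first stop being valid UTF-8. The class name Utf8Check and the built-in sample bytes are invented for illustration; the sketch assumes Java 7 or later and a file small enough to read into memory.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class Utf8Check {

    /** Returns the offset of the first malformed UTF-8 sequence, or -1 if every byte decodes. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // Decoded UTF-8 never needs more chars than it had bytes.
        CharBuffer out = CharBuffer.allocate(data.length);
        CoderResult result = decoder.decode(in, out, true);
        // On error the input position is left at the start of the bad sequence.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java Utf8Check input.xml
        // Without an argument, a made-up sample is checked instead:
        // 0xF0 starts a 4-byte sequence, but '<' is not a continuation byte.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : new byte[] {'o', 'k', (byte) 0xF0, '<', '/', 'a', '>'};
        int offset = firstInvalidOffset(data);
        System.out.println(offset < 0
                ? "valid UTF-8"
                : "malformed sequence at byte offset " + offset);
    }
}
```

The reported offset is where to look in a hex viewer. If the file turns out to be valid, the problem is more likely in how the bytes reach the parser than in the file itself.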

You could also give up and use entities (NCRs).
see http://mindprod.com/jgloss/xml.html#AWKWARD
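
For the NCR route, a minimal sketch might look like the following. The class and method names are hypothetical. It only rewrites non-ASCII characters and leaves markup alone, so it suits character data and attribute values; NCRs are not allowed inside element or attribute names.

```java
import java.util.Locale;

final class NcrEscaper {

    /** Replaces every non-ASCII code point with a hexadecimal numeric character reference. */
    static String toNcr(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Read whole code points so a surrogate pair becomes one reference.
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                  .append(';');
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e-acute (U+00E9) followed by a character outside the BMP (U+1F600).
        System.out.println(toNcr("caf\u00e9 \uD83D\uDE00"));
    }
}
```

Once everything outside ASCII is written this way, the file's bytes are the same in UTF-8 and ISO-8859-1, so the declared encoding stops mattering for those characters.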

dk · Jan 21, 2010

Partition your problem. Is it that the file is malformed or is the
problem getting the XML parser to understand the file is in UTF-8
encoding?

You can examine your file in a hex viewer if you are familiar with
UTF-8 encoding, or you could feed it to the Sun utility native2ascii
to see if it likes it.

See http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/encoding.html

You could also give up and use entities (NCRs).
see http://mindprod.com/jgloss/xml.html#AWKWARD

@BugBear: yeah the xml is a well formed and properly validated xml.

@Roedy: right now I'm using UltraEdit and inserting the characters
from the ASCII table that it has. I have even tried seeing it in hex
mode and I got the same value from both the places.

Meanwhile I have found something more interesting: when reading the
input stream from my xml, if I explicitly define it to be UTF-8 in
getByteStream, it works fine. So is this a Java bug (1.5.0.12), or
something else?
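
The getByteStream code is not shown in the thread, so the following is only a guess at the kind of change described: decoding the bytes in application code and handing the parser a character stream. It uses the JDK's built-in JAXP parser so that it is self-contained; JDOM's SAXBuilder also has a build(Reader) overload. The class name and the sample document are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ExplicitCharsetParse {

    /** The parser decodes the bytes itself, guided by the encoding declaration. */
    static Document parseBytes(byte[] xml) throws Exception {
        return newBuilder().parse(new ByteArrayInputStream(xml));
    }

    /** The caller decodes the bytes; the parser only sees characters. */
    static Document parseChars(byte[] xml, Charset charset) throws Exception {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), charset);
        return newBuilder().parse(new InputSource(reader));
    }

    private static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    public static void main(String[] args) throws Exception {
        // Declared as UTF-8 but actually written as ISO-8859-1:
        // e-acute is the single byte 0xE9, which is not valid UTF-8 here.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a>\u00e9</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        try {
            parseBytes(xml);
            System.out.println("byte stream: parsed");
        } catch (Exception e) {
            System.out.println("byte stream: " + e.getMessage());
        }
        Document doc = parseChars(xml, StandardCharsets.ISO_8859_1);
        System.out.println("character stream: " + doc.getDocumentElement().getTextContent());
    }
}
```

One caveat: InputStreamReader does not fail on malformed input, it silently substitutes U+FFFD. A parse that succeeds through a Reader therefore does not prove the bytes were valid in the chosen charset.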

Mike Schilling · Jan 21, 2010

It may be a clue that 4-byte UTF-8 sequences only occur with
surrogates, which there are two reasonable ways to encode:

1. Encode the code point as 4 bytes
2. Encode each 16-bit "char" as 3 bytes

Only 1 is correct, but I'm sure there's lots of non-surrogate-aware
code that does 2.
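
To make the two options concrete, here is a small sketch that produces both byte sequences for one supplementary character. U+1F600 is an arbitrary example and the class and method names are made up. The second scheme is the one Unicode calls CESU-8. Valid UTF-8 does not allow surrogate code points to be encoded, so a conforming decoder should reject it.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateEncoding {

    /** Option 1 (correct): encode the whole code point, giving 4 bytes. */
    static byte[] utf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Option 2 (not valid UTF-8): encode each UTF-16 char as its own 3-byte
     * sequence. Only meaningful for code points above U+FFFF.
     */
    static byte[] perCharThreeBytes(int codePoint) {
        char[] units = Character.toChars(codePoint);
        byte[] out = new byte[units.length * 3];
        int i = 0;
        for (char c : units) {
            out[i++] = (byte) (0xE0 | (c >> 12));
            out[i++] = (byte) (0x80 | ((c >> 6) & 0x3F));
            out[i++] = (byte) (0x80 | (c & 0x3F));
        }
        return out;
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        System.out.println("code point as 4 bytes:   " + hex(utf8(cp)));              // F0 9F 98 80
        System.out.println("each char as 3 bytes:    " + hex(perCharThreeBytes(cp))); // ED A0 BD ED B8 80
    }
}
```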

Lew · Jan 21, 2010

dk said:
@BugBear: yeah the xml [sic] is a well formed and properly validated xml [sic].

That didn't answer his question. Answer his question.
"Have you checked that your data IS valid UTF-8 ?"

Clearly there is an improperly-encoded character in your XML file.
Find that and fix it.

@Roedy: right now I'm using UltraEdit and inserting the characters
from the ASCII table that it has. I have even tried seeing it in hex
mode and I got the same value from both the places.

ASCII != UTF-8.

That hex value for the bad character, does it match the UTF-8 code
point for that character? It's four bytes long? What character is
it, and what is the hex value you observe? (Note: that's four
questions, so there ought to be four answers.)

Meanwhile I have found something more interesting: when reading the
input stream from my xml [sic], if I explicitly define it to be UTF-8 in
getByteStream, it works fine. So is this a Java bug (1.5.0.12), or
something else?

It's not a Java bug.

Now this has led to a confusion. I thought ISO-8859-1 is a charset

Did you mean "encoding"?

which is subset of UTF-8. Then why didn't UTF-8 work whereas
ISO-8859-1 worked?

Because you were wrong. The two encodings differ.

If you have an assumption, let's call it an hypothesis, and the
evidence contradicts the hypothesis, then the hypothesis is wrong.
Simple.
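
For anyone who wants to see how the two encodings differ, here is a small sketch using e-acute (U+00E9) as an arbitrary example; the class and method names are invented. The characters of ISO-8859-1 are a subset of what UTF-8 can represent, but only the ASCII range 0x00 to 0x7F uses the same bytes in both.

```java
import java.nio.charset.StandardCharsets;

final class Latin1VsUtf8 {

    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decoding as ISO-8859-1 cannot fail: each byte maps to one character. */
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";
        System.out.println("ISO-8859-1: " + hex(latin1(eAcute))); // E9
        System.out.println("UTF-8:      " + hex(utf8(eAcute)));   // C3 A9
        // UTF-8 bytes read as ISO-8859-1 give two wrong characters, not an error.
        System.out.println("chars after misreading: " + decodeLatin1(utf8(eAcute)).length()); // 2
    }
}
```

This is also why switching the declaration to ISO-8859-1 makes the exception go away: every byte value is acceptable in that encoding, so the parser has nothing to reject. It does not mean the characters were read correctly.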

Arne Vajhøj · Jan 22, 2010

Meanwhile I have found something more interesting: when reading the
input stream from my xml, if I explicitly define it to be UTF-8 in
getByteStream, it works fine. So is this a Java bug (1.5.0.12), or
something else?

If you post the XML input and the Java code, then we can
tell you.

Arne

Roedy Green · Jan 22, 2010

@Roedy: right now I'm using UltraEdit and inserting the characters
from the ASCII table that it has. I have even tried seeing it in hex
mode and I got the same value from both the places.

You need to know what the hex SHOULD look like.
See http://mindprod.com/jgloss/utf8.html

You need a tool to see what it DOES look like.
See http://www.sweetscape.com/010editor/
http://funduc.com/otsoft.htm#hexview

And a tool to validate the encoding:
http://mindprod.com/jgloss/native2asciiexe.html
http://mindprod.com/applet/ecodingrecogniser.html
