Facing exception: Invalid byte 2 of 4-byte UTF-8 sequence.



Hi All,

When I use some UTF-8 characters in my xml and parse it with the JDOM
parser, I get the exception below:

Malformed XML, Caused by: 'Invalid byte 2 of 4-byte UTF-8 sequence.'
at com.clarify.boss.utility.xml.SimpleXmlParser.build
at com.clarify.boss.msf.handler.RespHeaderInitiateHandler.execute
at com.clarify.boss.msf.support.ServiceFaultPublisherAB.executeImpl
at com.clarify.boss.common.base.BossActionBeanBase.execute
at com.clarify.boss.sa.msf.xbean.InvokeResponseXB.executeImpl
at com.clarify.cbo.XBeanImpl.baselineExecuteImpl_(XBeanImpl.java:275)
at com.amdocs.oss.sm.core.common.XBeanBase.baselineExecuteImpl_
at com.clarify.cbo.XBeanImpl.execute(XBeanImpl.java:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke
at sun.reflect.DelegatingMethodAccessorImpl.invoke
at java.lang.reflect.Method.invoke(Method.java:615)
at com.clarify.sam.JavaDispatch.invokeMethodImp(JavaDispatch.java:
at com.clarify.sam.JavaDispatch.invokeMethod(JavaDispatch.java:348)
at com.clarify.sam.ActionBeanService.invokeBeanMethod
at com.clarify.sam.ActionBeanService.invokeAifOperation
at com.clarify.sam.AppFrameworkBindingHandler.executeOperation
at com.amdocs.aif.consumer.ServiceContext.executeWithRetries
at com.amdocs.aif.consumer.ServiceContext.executeOperationImpl
at com.amdocs.aif.consumer.ServiceContext.executeOperation
at com.amdocs.aif.consumer.ServiceContext.executeOperation
... 35 more
Caused by: org.jdom.input.JDOMParseException: Error on line 72:
Invalid byte 2 of 4-byte UTF-8 sequence.
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:770)
at com.clarify.boss.utility.xml.SimpleXmlParser.build
... 60 more
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte
UTF-8 sequence.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException
(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl
$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument
(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
... 62 more

I have declared the encoding to be used while parsing, in my xml as
<?xml version="1.0" encoding="UTF-8"?>

Initially I suspected that the xml backup had a problem, because the
same xml worked as input on the application server but failed when
sent from a friend's machine. Could this be the cause?

But now I have found something even more interesting. I tried
changing the encoding to ISO-8859-1, i.e.
<?xml version="1.0" encoding="ISO-8859-1"?>, and to my surprise it worked.

Now this has led to some confusion. I thought ISO-8859-1 is a charset
which is a subset of UTF-8. Then why didn't UTF-8 work whereas
ISO-8859-1 worked?

And lastly, I can't change this encoding in my xml, because I would
then have to redo all the regression testing on my application. So
please let me know where I have gone wrong.

The Java code that I'm using is:

/*
 * (non-Javadoc)
 *
 * @see com.clarify.boss.utility.xml.XmlParser#build
 */
public Document build(Resource source) {
    try {
        return (getSystemId() == null
                ? getSaxBuilder().build(source.getInputStream())
                : getSaxBuilder().build(source.getInputStream(), getSystemId()));
    } catch (Exception e) {
        BossErrorCode bossErrorCode = new BossErrorCode
        throw new BossException(bossErrorCode, new String[] {e.getCause

The SAX builder method is:

/**
 * Getter method for the <b>saxBuilder</b> property.
 *
 * @return Returns the saxBuilder.
 */
private PropertyAwareSAXBuilder getSaxBuilder() {
    if (saxBuilder == null) {

        PropertyAwareSAXBuilder myParser = new PropertyAwareSAXBuilder(

        schema", isValidate());

        //CatalogResolver myResolver = new CatalogResolver();

        CatalogResolver myResolver = getCatalogResolver();

        Iterator it = getProperties().keySet().iterator();
        while (it.hasNext()) {
            String name = (String) it.next();
            saxBuilder.setProperty(name, getProperties().get(name));
    return saxBuilder;


Roedy Green

dk said:
When I use some UTF-8 characters in my xml and parse it with the JDOM
parser, I get the exception below:

Partition your problem. Is the file malformed, or is the problem
getting the XML parser to understand that the file is in UTF-8?

You can examine your file in a hex viewer if you are familiar with
UTF-8 encoding, or you could feed it to the Sun utility native2ascii
to see if it likes it.
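
If no hex viewer is at hand, the same check can be sketched in a few lines of Java. This is only an illustration, not something from the original posts: the class name Utf8Check and the method firstInvalidOffset are made up here, and it assumes Java 7 or later. It runs a strict UTF-8 decoder over the raw bytes of the file and reports the byte offset of the first sequence that is not valid UTF-8, which is the position to look at in hex mode.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Check {

    /** Returns the byte offset of the first malformed UTF-8 sequence, or -1 if all bytes decode. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            // endOfInput = true, so a truncated trailing sequence is also an error
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // position is at the start of the bad sequence
            }
            if (result.isUnderflow()) {
                return -1; // every byte was consumed without complaint
            }
            out.clear(); // overflow: drop the decoded chars and continue
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int offset = firstInvalidOffset(data);
        if (offset < 0) {
            System.out.println("valid UTF-8");
        } else {
            System.out.printf("invalid UTF-8 at byte offset %d (0x%X)%n", offset, offset);
        }
    }
}
```

A result of -1 means the file is at least valid UTF-8, so the problem would then lie in how the bytes reach the parser rather than in the file itself.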

See http://mindprod.com/jgloss/utf.html

You could also give up and use entities (NCRs).
see http://mindprod.com/jgloss/xml.html#AWKWARD
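
As a rough sketch of the NCR route (assuming Java 8 or later; NcrEscaper and toNcr are hypothetical names, not part of JDOM or of the poster's code), every non-ASCII code point in a piece of text can be replaced by a decimal numeric character reference, so the resulting file is pure ASCII and reads the same whatever encoding the parser assumes:

```java
class NcrEscaper {

    /** Replaces every non-ASCII code point with a decimal numeric character reference. */
    static String toNcr(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        // Iterate by code point so a surrogate pair becomes ONE reference
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00F1 is 241 decimal, U+1F600 is 128512 decimal
        System.out.println(toNcr("Se\u00F1or \uD83D\uDE00")); // prints Se&#241;or &#128512;
    }
}
```

Note that this only handles non-ASCII characters; it does not escape markup characters such as & or <, and NCRs are only usable in text and attribute values, not in element or attribute names.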


dk

@BugBear: yeah the xml is a well formed and properly validated xml.

@Roedy: right now I'm using UltraEdit and inserting the characters
from the ASCII table that it has. I have even tried seeing it in hex
mode and I got the same value from both the places.

Meanwhile I have found something more interesting: while reading the
input stream from my xml, if I explicitly define it to be UTF-8 in
getByteStream, it works fine. So is this a Java bug or something else?

Mike Schilling

It may be a clue that 4-byte UTF-8 sequences only occur for characters
outside the BMP, which Java represents as surrogate pairs and which
there are two plausible ways to encode:

1. Encode the code point as 4 bytes
2. Encode each 16-bit "char" of the pair as 3 bytes (6 bytes in total)

Only 1 is correct, but I'm sure there's lots of non-surrogate-aware
code that does 2.
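
The two variants can be seen side by side with nothing but the JDK. In this sketch (the class SurrogateEncoding and its helpers are illustrative names only), String.getBytes with UTF-8 produces variant 1, while DataOutputStream.writeUTF, which is documented to write "modified UTF-8", produces variant 2. U+1F600 is just an arbitrary character outside the BMP chosen for the demonstration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SurrogateEncoding {

    /** Formats bytes[from..] as space-separated upper-case hex. */
    static String hex(byte[] bytes, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < bytes.length; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes s as modified UTF-8; the first two bytes are a length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // One code point, stored in a Java String as the surrogate pair D83D DE00
        String s = new String(Character.toChars(0x1F600));

        // Variant 1: real UTF-8, one 4-byte sequence
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8), 0)); // F0 9F 98 80

        // Variant 2: each surrogate encoded separately as 3 bytes (skip the 2-byte length)
        System.out.println(hex(modifiedUtf8(s), 2)); // ED A0 BD ED B8 80
    }
}
```

A strict UTF-8 decoder, such as the one inside an XML parser, is expected to accept only the first form.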


dk said:
@BugBear: yeah the xml [sic] is a well formed and properly validated xml [sic].

That didn't answer his question. Answer his question.
"Have you checked that your data IS valid UTF-8 ?"

Clearly there is an improperly-encoded character in your XML file.
Find that and fix it.

dk said:
@Roedy: right now I'm using UltraEdit and inserting the characters
from the ASCII table that it has. I have even tried seeing it in hex
mode and I got the same value from both the places.


That hex value for the bad character, does it match the UTF-8
encoding of that character's code point? Is it four bytes long? What
character is it, and what is the hex value you observe? (Note: that's
four questions, so there ought to be four answers.)

dk said:
Meanwhile I have found something more interesting: while reading the
input stream from my xml [sic], if I explicitly define it to be UTF-8
in getByteStream, it works fine. So is this a Java bug or something else?

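It's not a Java bug.

dk said:
Now this has led to some confusion. I thought ISO-8859-1 is a charset

Did you mean "encoding"?

dk said:
which is a subset of UTF-8. Then why didn't UTF-8 work whereas
ISO-8859-1 worked?

Because you were wrong. The two encodings differ. Only the ASCII
range (bytes 0x00 to 0x7F) is encoded identically; ISO-8859-1 stores
each character from U+0080 to U+00FF as a single byte, while UTF-8
uses two bytes for each of them.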

If you have an assumption, let's call it an hypothesis, and the
evidence contradicts the hypothesis, then the hypothesis is wrong.
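
To make the difference concrete, here is a small sketch using only the JDK (EncodingMismatch and strictDecode are names invented for this example, and the sample text is an assumption, since the actual bytes of the failing file were never posted). It decodes the same bytes under both encodings with error reporting switched on. A lone 0xF1 byte, which is "ñ" in ISO-8859-1, looks to a UTF-8 decoder like the lead byte of a 4-byte sequence, and the ASCII byte after it is not a valid second byte. That is one plausible way to get an "Invalid byte 2 of 4-byte UTF-8 sequence" message from a file that was really saved as ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingMismatch {

    /** Strictly decodes bytes in the given charset; returns null if they are malformed. */
    static String strictDecode(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "Señor" saved as ISO-8859-1: 53 65 F1 6F 72
        byte[] latin1 = "Se\u00F1or".getBytes(StandardCharsets.ISO_8859_1);
        // The same text saved as UTF-8: 53 65 C3 B1 6F 72
        byte[] utf8 = "Se\u00F1or".getBytes(StandardCharsets.UTF_8);

        // Read back as UTF-8, the ISO-8859-1 bytes are rejected:
        // F1 announces a 4-byte sequence, but 6F is not a continuation byte
        System.out.println(strictDecode(latin1, StandardCharsets.UTF_8) == null); // true

        // ISO-8859-1 maps every possible byte to a character, so it never fails;
        // fed UTF-8 bytes it silently yields 6 characters instead of 5
        System.out.println(strictDecode(utf8, StandardCharsets.ISO_8859_1).length()); // 6
    }
}
```

This is also why switching the declaration to ISO-8859-1 "works": that decoder cannot report an error, so success there says nothing about whether the file is valid UTF-8.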

Arne Vajhøj

dk said:
Meanwhile I have found something more interesting: while reading the
input stream from my xml, if I explicitly define it to be UTF-8 in
getByteStream, it works fine. So is this a Java bug or something else?

If you post the XML input and the Java code, then we can
tell you.


Roedy Green

dk said:
@Roedy: right now I'm using UltraEdit and inserting the characters
from the ASCII table that it has. I have even tried seeing it in hex
mode and I got the same value from both the places.

You need to know what the hex SHOULD look like.
See http://mindprod.com/jgloss/utf8.html

You need a tool to see what it DOES look like.
See http://www.sweetscape.com/010editor/

And a tool to validate the encoding:
