Facing exception: Invalid byte 2 of 4-byte UTF-8 sequence.

Discussion in 'Java' started by dk, Jan 21, 2010.

  1. dk

    dk Guest

    Hi All,

    While I'm trying to use some UTF-8 characters in my xml while parsing
    the xml using JDOM parser I'm getting this below exception:

    Malformed XML, Caused by: 'Invalid byte 2 of 4-byte UTF-8 sequence.'
        at com.clarify.boss.utility.xml.SimpleXmlParser.build(SimpleXmlParser.java:236)
        at com.clarify.boss.msf.handler.RespHeaderInitiateHandler.getStandardHeader(RespHeaderInitiateHandler.java:366)
        at com.clarify.boss.msf.handler.RespHeaderInitiateHandler.execute(RespHeaderInitiateHandler.java:289)
        at com.clarify.boss.utility.appcontroller.support.AbstractHandler.execute(AbstractHandler.java:42)
        at com.clarify.boss.utility.appcontroller.support.ApplicationControllerImpl.handleRequest(ApplicationControllerImpl.java:174)
        at com.clarify.boss.utility.appcontroller.support.ApplicationControllerImpl.execute(ApplicationControllerImpl.java:311)
        at com.clarify.boss.msf.support.ServiceFaultPublisherAB.executeImpl(ServiceFaultPublisherAB.java:87)
        at com.clarify.boss.common.base.BossActionBeanBase.execute(BossActionBeanBase.java:125)
        at com.clarify.boss.sa.msf.xbean.InvokeResponseXB.executeImpl(InvokeResponseXB.java:198)
        at com.clarify.cbo.XBeanImpl.baselineExecuteImpl_(XBeanImpl.java:275)
        at com.amdocs.oss.sm.core.common.XBeanBase.baselineExecuteImpl_(XBeanBase.java:75)
        at com.clarify.cbo.XBeanImpl.execute(XBeanImpl.java:197)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:615)
        at com.clarify.sam.JavaDispatch.invokeMethodImp(JavaDispatch.java:396)
        at com.clarify.sam.JavaDispatch.invokeMethod(JavaDispatch.java:348)
        at com.clarify.sam.ActionBeanService.invokeBeanMethod(ActionBeanService.java:509)
        at com.clarify.sam.ActionBeanService.invokeAifOperation(ActionBeanService.java:128)
        at com.clarify.sam.AppFrameworkBindingHandler.executeOperation(AppFrameworkBindingHandler.java:69)
        at com.amdocs.aif.consumer.ServiceContext.executeWithRetries(ServiceContext.java:900)
        at com.amdocs.aif.consumer.ServiceContext.executeOperationImpl(ServiceContext.java:756)
        at com.amdocs.aif.consumer.ServiceContext.executeOperation(ServiceContext.java:676)
        at com.amdocs.aif.consumer.ServiceContext.executeOperation(ServiceContext.java:323)
        at com.clarify.boss.errorhandler.resolver.ResolverLauncherSynchXB.executeImpl(ResolverLauncherSynchXB.java:157)
        ... 35 more
    Caused by: org.jdom.input.JDOMParseException: Error on line 72: Invalid byte 2 of 4-byte UTF-8 sequence.
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:770)
        at com.clarify.boss.utility.xml.SimpleXmlParser.build(SimpleXmlParser.java:231)
        ... 60 more
    Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
        at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
        at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
        ... 62 more

    I have declared the encoding to be used while parsing, in my xml as
    UTF-8:
    <?xml version="1.0" encoding="UTF-8"?>

    Initially I suspected that the xml backup had a problem, because with
    the same application server and the same xml as input, it worked when
    I tried it but not from one of my friends' machines. Could this be
    the cause?

    But now I have found something even more interesting. I tried
    changing the encoding to ISO-8859-1, i.e. <?xml version="1.0"
    encoding="ISO-8859-1"?>, and to my surprise it worked.

    Now this has led to a confusion. I thought ISO-8859-1 is a charset
    which is subset of UTF-8. Then why didn't UTF-8 work whereas
    ISO-8859-1 worked?

    Lastly, I can't change the encoding in my xml, because I would then
    have to redo all the regression testing on my application. So please
    let me know where I have gone wrong.

    The Java code that I'm using is:

    /*
     * (non-Javadoc)
     *
     * @see com.clarify.boss.utility.xml.XmlParser#build(org.springframework.core.io.Resource)
     */
    public Document build(Resource source) {
        try {
            return (getSystemId() == null
                    ? getSaxBuilder().build(source.getInputStream())
                    : getSaxBuilder().build(source.getInputStream(), getSystemId()));
        } catch (Exception e) {
            e.printStackTrace();
            BossErrorCode bossErrorCode = new BossErrorCode(ErrorCode.BOSS_MALFORMED_XML);
            throw new BossException(bossErrorCode,
                    new String[] {e.getCause().getMessage()}, e);
        }
    }

    the sax builder method is:

    /**
     * Getter method for the <b>saxBuilder</b> property
     *
     * @return Returns the saxBuilder.
     */
    private PropertyAwareSAXBuilder getSaxBuilder() {
        if (saxBuilder == null) {

            PropertyAwareSAXBuilder myParser = new PropertyAwareSAXBuilder(isValidate());

            myParser.setFeature("http://apache.org/xml/features/validation/schema", isValidate());
            myParser.setFeature("http://xml.org/sax/features/namespaces", true);

            //CatalogResolver myResolver = new CatalogResolver();

            CatalogResolver myResolver = getCatalogResolver();

            myParser.setEntityResolver(myResolver);
            setSaxBuilder(myParser);

            Iterator it = getProperties().keySet().iterator();
            while (it.hasNext()) {
                String name = (String) it.next();
                saxBuilder.setProperty(name, getProperties().get(name));
            }
        }
        return saxBuilder;
    }

    Regards,
    Dhirendra
    dk, Jan 21, 2010
    #1

  2. dk

    Roedy Green Guest

    On Thu, 21 Jan 2010 02:13:27 -0800 (PST), dk <>
    wrote, quoted or indirectly quoted someone who said :

    >
    >While I'm trying to use some UTF-8 characters in my xml while parsing
    >the xml using JDOM parser I'm getting this below exception:


    Partition your problem. Is it that the file is malformed or is the
    problem getting the XML parser to understand the file is in UTF-8
    encoding?

    You can examine your file in a hex viewer if you are familiar with
    UTF-8 encoding, or you could feed it to the Sun utility native2ascii
    to see if it likes it.
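
    For readers who would rather check the file programmatically than in
    a hex viewer, the following is a minimal sketch of a strict UTF-8
    validity check. It assumes Java 7 or later (for StandardCharsets and
    Files); the class name Utf8Check and the method isValidUtf8 are
    hypothetical names chosen for this illustration and do not come from
    the thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: reports whether a byte array is well-formed UTF-8.
final class Utf8Check {

    /** Returns true only if every byte sequence decodes as strict UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java Utf8Check path/to/file.xml
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "valid UTF-8" : "NOT valid UTF-8");
    }
}
```

    If this reports the file as not valid, the problem is in the bytes of
    the file rather than in how the parser is configured.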

    See http://mindprod.com/jgloss/utf.html
    http://mindprod.com/jgloss/encoding.html

    You could also give up and use entities (NCRs).
    see http://mindprod.com/jgloss/xml.html#AWKWARD
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Responsible Development is the style of development I aspire to now. It can be summarized by answering the question, “How would I develop if it were my money?” I’m amazed how many theoretical arguments evaporate when faced with this question.
    ~ Kent Beck (born: 1961 age: 49) , evangelist for extreme programming .
    Roedy Green, Jan 21, 2010
    #2

  3. dk

    dk Guest

    On Jan 21, 6:26 pm, Roedy Green <>
    wrote:
    > On Thu, 21 Jan 2010 02:13:27 -0800 (PST), dk <>
    > wrote, quoted or indirectly quoted someone who said :
    >
    >
    >
    > >While I'm trying to use some UTF-8 characters in my xml while parsing
    > >the xml using JDOM parser I'm getting this below exception:

    >
    > Partition your problem.  Is it that the file is malformed or is the
    > problem getting the XML parser to understand the file is in UTF-8
    > encoding?
    >
    > You can examine your file in a hex viewer if you are familiar with
    > UTF-8 encoding, or you could feed it to the Sun utility native2ascii
    > to see if it likes it.
    >
    > See http://mindprod.com/jgloss/utf.html
    > http://mindprod.com/jgloss/encoding.html
    >
    > You could also give up and use entities (NCRs).
    > see http://mindprod.com/jgloss/xml.html#AWKWARD
    > --
    > Roedy Green Canadian Mind Products
    > http://mindprod.com
    > Responsible Development is the style of development I aspire to now. It can be summarized by answering the question, “How would I develop if it were my money?” I’m amazed how many theoretical arguments evaporate when faced with this question.
    > ~ Kent Beck (born: 1961 age: 49) , evangelist for extreme programming .



    @BugBear: yeah the xml is a well formed and properly validated xml.

    @Roedy: write now I'm using ultraEdit and inserting the characters
    from the ASCII table that it has. I have even tried seeing it in hex
    mode and I got the same value from both the places.

    Meanwhile I have found something more interesting while reading the
    input stream from my xml if I exclusively define it to be formatted to
    UTF-8 in getByteStream it is working fine. Now here is this a Java bug
    (1.5.0.12)? or something else?
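
    It is not clear from the description above exactly what was changed
    around getByteStream. As a general illustration of stating the
    decoding charset explicitly, here is a minimal sketch that decodes
    the bytes into a character stream before handing them to the parser.
    It uses the JDK's built-in JAXP parser rather than JDOM, assumes
    Java 7 or later, and the class name ExplicitCharsetParse and its
    parse method are hypothetical. Per the SAX InputSource contract, a
    parser given a character stream reads it directly and disregards
    the encoding named in the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parse XML bytes using a charset chosen by the caller.
final class ExplicitCharsetParse {

    /**
     * Decodes the bytes with the given charset first, then passes the
     * resulting character stream to the parser.
     */
    static Document parse(byte[] xmlBytes, Charset actualCharset) throws Exception {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), actualCharset);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // The declaration claims UTF-8, but the bytes are written as ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a>\u00F1b</a>";
        byte[] latin1Bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        Document doc = parse(latin1Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

    This only helps if the charset passed in is the one the bytes were
    really written in; it works around a mismatch between the
    declaration and the file rather than repairing it.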
    dk, Jan 21, 2010
    #3
  4. It may be a clue that 4-byte UTF-8 sequences only occur with
    surrogates, which there are two reasonable ways to encode:

    1. Encode the code point as 4 bytes
    2. Encode each 16-bit "char" as 3 bytes

    Only 1 is correct, but I'm sure there's lots of non-surrogate-aware
    code that does 2.
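
    The two options above can be compared byte for byte. The following
    is a minimal sketch, assuming Java 7 or later; the class
    SurrogateEncoding and its methods are hypothetical and written only
    for this illustration. The second method deliberately reproduces
    the non-surrogate-aware behaviour (the scheme usually called
    CESU-8) and is not something to use in real code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: correct UTF-8 versus encoding each surrogate separately.
final class SurrogateEncoding {

    /** Option 1 (correct): the JDK encodes the code point as one 4-byte sequence. */
    static byte[] encodeCodePoint(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Option 2 (incorrect for UTF-8): treats every 16-bit char, surrogates
     * included, as if it were a code point of its own.
     */
    static byte[] encodeEachCharSeparately(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);
            } else if (c < 0x800) {
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600: one code point, two Java chars
        System.out.println(hex(encodeCodePoint(s)));          // F0 9F 98 80
        System.out.println(hex(encodeEachCharSeparately(s))); // ED A0 BD ED B8 80
    }
}
```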
    Mike Schilling, Jan 21, 2010
    #4
  5. dk

    Lew Guest

    dk wrote:
    > @BugBear: yeah the xml [sic] is a well formed and properly validated xml [sic].
    >


    That didn't answer his question. Answer his question.
    "Have you checked that your data IS valid UTF-8 ?"

    Clearly there is an improperly-encoded character in your XML file.
    Find that and fix it.

    > @Roedy: write now I'm using ultraEdit and inserting the characters
    > from the ASCII table that it has. I have even tried seeing it in hex
    > mode and I got the same value from both the places.
    >


    ASCII != UTF-8.

    That hex value for the bad character, does it match the UTF-8 code
    point for that character? It's four bytes long? What character is
    it, and what is the hex value you observe? (Note: that's four
    questions, so there ought to be four answers.)

    > Meanwhile I have found something more interesting while reading the
    > input stream from my xml [sic] if I exclusively define it to be formatted to
    > UTF-8 in getByteStream it is working fine. Now here is this a Java bug
    > (1.5.0.12)? or something else?
    >


    It's not a Java bug.

    > Now this has led to a confusion. I thought ISO-8859-1 is a charset


    Did you mean "encoding"?

    > which is subset of UTF-8. Then why didn't UTF-8 work whereas
    > ISO-8859-1 worked?
    >


    Because you were wrong. The two encodings differ.

    If you have an assumption, let's call it an hypothesis, and the
    evidence contradicts the hypothesis, then the hypothesis is wrong.
    Simple.
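
    To make the difference between the two encodings concrete: they
    agree only on the ASCII range (bytes 0x00 to 0x7F). Every other
    ISO-8859-1 character is a single byte in ISO-8859-1 but two bytes in
    UTF-8. The sketch below, assuming Java 7 or later and using a
    hypothetical class name Latin1VersusUtf8, prints the bytes for one
    such character. Note that the ISO-8859-1 byte 0xF1 has the bit
    pattern 11110001, which UTF-8 reserves for the first byte of a
    4-byte sequence; a stray byte like that followed by ordinary text
    would be consistent with the reported error, although that is an
    assumption about the file and not something the thread confirms.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the same character as bytes in ISO-8859-1 and in UTF-8.
final class Latin1VersusUtf8 {

    /** Formats bytes as space-separated upper-case hex, e.g. "C3 B1". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String enye = "\u00F1"; // the character n-with-tilde, U+00F1
        System.out.println(hex(enye.getBytes(StandardCharsets.ISO_8859_1))); // F1
        System.out.println(hex(enye.getBytes(StandardCharsets.UTF_8)));      // C3 B1

        // ASCII characters are byte-identical in both encodings.
        System.out.println(hex("A".getBytes(StandardCharsets.ISO_8859_1)));  // 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));       // 41
    }
}
```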

    --
    Lew
    Lew, Jan 21, 2010
    #5
  6. dk

    Arne Vajhøj Guest

    On 21-01-2010 10:03, dk wrote:
    > Meanwhile I have found something more interesting while reading the
    > input stream from my xml if I exclusively define it to be formatted to
    > UTF-8 in getByteStream it is working fine. Now here is this a Java bug
    > (1.5.0.12)? or something else?


    If you post the XML input and the Java code, then we can
    tell you.

    Arne
    Arne Vajhøj, Jan 22, 2010
    #6
  7. dk

    Roedy Green Guest

    On Thu, 21 Jan 2010 07:03:23 -0800 (PST), dk <>
    wrote, quoted or indirectly quoted someone who said :

    >@Roedy: write now I'm using ultraEdit and inserting the characters
    >from the ASCII table that it has. I have even tried seeing it in hex
    >mode and I got the same value from both the places.


    You need to know what the hex SHOULD look like.
    See http://mindprod.com/jgloss/utf8.html

    You need a tool to see what it DOES look like.
    See http://www.sweetscape.com/010editor/
    http://funduc.com/otsoft.htm#hexview

    And a tool to validate the encoding:
    http://mindprod.com/jgloss/native2asciiexe.html
    http://mindprod.com/applet/ecodingrecogniser.html


    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Responsible Development is the style of development I aspire to now. It can be summarized by answering the question, “How would I develop if it were my money?” I’m amazed how many theoretical arguments evaporate when faced with this question.
    ~ Kent Beck (born: 1961 age: 49) , evangelist for extreme programming .
    Roedy Green, Jan 22, 2010
    #7