XML-Parsing with UTF-8 Byte-Order-Mark (BOM)

Discussion in 'Java' started by Patrick.Gebhardt@gmail.com, Jun 25, 2007.

  1. Guest

    Hello,

    I have a really weird problem.

    The environment is a client-server application, where the client
    reads a UTF-8 encoded XML file (with Cyrillic characters, for example),
    which is then sent to the server, where it is parsed in 2 different ways:
    first using a normal SaxParser, then via Castor (which is using the
    _same_ parser library).

    relevant Libs: xercesImpl 2.9.0, castor 0.9.5

    The client XML file is UTF-8 with a BOM (hex: EF BB BF).

    The client sends this file via a commons-httpclient POST call to the
    server using the correct content-type.
    I make sure on the server side that the file is received correctly: I can
    read the Cyrillic characters in the logfile after doing the following
    in the servlet.

    The following is obviously pseudocode:

    doPost() {
    // note: setCharacterEncoding only affects getReader() and parameter
    // parsing, not the raw bytes returned by getInputStream()
    request.setCharacterEncoding("UTF-8");
    InputStream in = request.getInputStream();

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024];

    // copy the whole request body into memory
    int count = in.read(buffer);
    while (count != -1) {
        baos.write(buffer, 0, count);
        count = in.read(buffer);
    }

    byte[] xml = baos.toByteArray();
    String s = new String(xml, "UTF-8"); // string is correct, contains Cyrillic characters

    --- Up to here, everything is fine.

    --- Now I have to parse the XML to find a node attribute and decide,
    based on its value, into which
    --- Castor classes I have to unmarshal the XML.
    --- To be able to call Castor, I need a second input stream which
    Castor will be using.
    --- Therefore I copy the byte[] and create a second stream.
    --- (The files are really small, so I don't expect memory
    problems.)

    byte[] xmlCastor = new byte[xml.length];
    System.arraycopy(xml, 0, xmlCastor, 0, xml.length);

    ByteArrayInputStream bais = new ByteArrayInputStream(xml);
    ByteArrayInputStream baisCastor = new ByteArrayInputStream(xmlCastor);

    -- I can verify in the logfile that these 2 byte arrays contain the
    same Cyrillic characters.

    -- Now I call the SaxParser with the first stream, and I receive the
    node attribute.
    -- Then I pass the second stream to Castor ... and bummer ...

    Caused by: org.xml.sax.SAXException: Parsing Error: Content is not
    allowed in prolog.

    -- That is because of the byte-order mark; the parser does not like
    it.
    -- 2 identical streams (as far as I can tell) passed to the same
    parser ... one runs into an exception,
    -- the other does not.

    -- I have _exactly one_ parser in my Tomcat in WEB-INF/lib, and that
    is xercesImpl-2.9.0.jar.
    -- Is it somehow possible that Tomcat provides a different version? I
    cannot verify how Castor is
    -- choosing its XML parser, but I do it the following way:

    SAXParserFactory pf = SAXParserFactory.newInstance();
    XMLReader parser = pf.newSAXParser().getXMLReader();
    parser.parse(new InputSource(bais));
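
    One way to check which parser implementation is actually picked up is to print the concrete classes JAXP returns and the location they were loaded from. This is only a diagnostic sketch: the class name ParserProbe and the helper describe() are made up for illustration, and it shows what SAXParserFactory resolves to in the calling code, not necessarily what Castor instantiates internally.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical diagnostic helper, not part of the original servlet.
class ParserProbe {

    // Returns "<class name> loaded from <location>" for any object.
    static String describe(Object o) {
        Class<?> c = o.getClass();
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Classes that ship with the JDK usually have no code source.
        String where = (src == null || src.getLocation() == null)
                ? "JDK/bootstrap class path"
                : src.getLocation().toString();
        return c.getName() + " loaded from " + where;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory pf = SAXParserFactory.newInstance();
        XMLReader reader = pf.newSAXParser().getXMLReader();
        System.out.println("factory: " + describe(pf));
        System.out.println("reader:  " + describe(reader));
    }
}
```

    Logging the same two lines from inside the servlet would show whether the classes come from xercesImpl-2.9.0.jar in the webapp or from somewhere else on the class path.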

    Any helpful tips appreciated!

    P.S.: I can't change very much of the infrastructure ... Castor, e.g., is
    definitely a set condition.
    Jun 25, 2007
    #1

  2. Create a FilterInputStream that doesn't pass the BOM through, and parse
    *that*.
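
    A minimal sketch of such a filter, assuming the BOM is the three UTF-8 bytes EF BB BF at the very start of the stream. It uses java.io.PushbackInputStream (itself a FilterInputStream subclass) to peek at the first bytes and push them back when they are not a BOM. The names BomSkipper and skipUtf8Bom are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper: wraps a stream so that a leading UTF-8 BOM is dropped.
class BomSkipper {

    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to 3 bytes, tolerating short reads and short streams.
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            // Not a BOM: give the bytes back so the caller sees them.
            pin.unread(head, 0, n);
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(xml));
        // The first byte the consumer sees is now '<'.
        System.out.println((char) clean.read());
    }
}
```

    In the servlet from the first post, the stream handed to Castor could be wrapped as skipUtf8Bom(new ByteArrayInputStream(xmlCastor)); streams without a BOM pass through unchanged.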
    Mike Schilling, Jun 25, 2007
    #2

  3. Roedy Green Guest

    On Mon, 25 Jun 2007 15:50:23 -0000, wrote,
    quoted or indirectly quoted someone who said :

    >-- that is because of the byte-order mark, the Parser does not like
    >it.


    Java I/O is not clever enough to read the byte order mark, adjust its
    notion of the encoding and discard it as would be done in heaven.

    Instead it just passes it through to the app as a single character.
    So all application software needs to just discard such characters.

    Have a look at the FilterInputStream and FilterOutputStream. It
    should be possible to insert a layer of filtering to discard such
    characters.
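
    To illustrate the "single character" point: when BOM-prefixed UTF-8 bytes are decoded, the BOM arrives as the character U+FEFF at index 0, and the application can drop it at the character level. A small sketch, assuming Java 7 or later for StandardCharsets; the names BomChars and stripLeadingBom are made up here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the BOM as a decoded character.
class BomChars {

    static final char BOM = '\uFEFF';

    // Removes a single leading U+FEFF if present, otherwise returns s as is.
    static String stripLeadingBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] xml = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        String decoded = new String(xml, StandardCharsets.UTF_8);
        // The decoded text is one character longer than "<a/>".
        System.out.println(decoded.length());
        System.out.println(decoded.charAt(0) == BOM);
        System.out.println(stripLeadingBom(decoded));
    }
}
```

    This variant fits when the XML is handed on as a String or Reader; for raw byte streams the filtering has to happen before decoding.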
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jun 29, 2007
    #3
  4. "Roedy Green" <> wrote in message
    news:...
    > On Mon, 25 Jun 2007 15:50:23 -0000, wrote,
    > quoted or indirectly quoted someone who said :
    >
    >>-- that is because of the byte-order mark, the Parser does not like
    >>it.

    >
    > Java i/o is not clever enough to read the byte order mark, adjust its
    > notion of the encoding and discard it as would be done in heaven.


    It does with UTF-16, I think. The BOM for UTF-8 is a Microsoft invention,
    not a standard. (Though Java should handle it anyway; interoperability is
    more important than corporate politics.)
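
    The UTF-16 half of this is easy to check with a few bytes. The sketch below decodes a BOM-prefixed 'A' with Java's "UTF-16" and "UTF-8" charsets and compares the resulting lengths; it assumes Java 7 or later for StandardCharsets, and the class name BomDecoding is invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical comparison of how the two decoders treat a leading BOM.
class BomDecoding {

    static String decodeUtf16(byte[] b) {
        return new String(b, StandardCharsets.UTF_16);
    }

    static String decodeUtf8(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};       // big-endian BOM + 'A'
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41}; // UTF-8 BOM + 'A'
        // The UTF-16 decoder uses the BOM to pick the byte order and drops it.
        System.out.println("UTF-16 length: " + decodeUtf16(utf16).length());
        // The UTF-8 decoder keeps the BOM as the character U+FEFF.
        System.out.println("UTF-8 length:  " + decodeUtf8(utf8).length());
    }
}
```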
    Mike Schilling, Jun 29, 2007
    #4
