XML-Parsing with UTF-8 Byte-Order-Mark (BOM)

Discussion in 'Java' started by Patrick.Gebhardt@gmail.com, Jun 25, 2007.

    I have a really weird problem.

    The environment is a client-server application in which the client
    reads a UTF-8 encoded XML file (containing Cyrillic characters, for
    example). The file is then sent to the server, where it is parsed in
    two different ways: first with a normal SAX parser, then via Castor
    (which uses the _same_ parser library).

    Relevant libs: xercesImpl 2.9.0, castor 0.9.5

    The client XML file is UTF-8 with a BOM (hex: EF BB BF).
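For reference, a minimal sketch of how the three BOM bytes could be detected and removed from a byte array before it is handed to a parser. The class and method names (`BomStripper`, `hasUtf8Bom`, `stripUtf8Bom`) are made up for illustration and are not part of Xerces or Castor.

```java
import java.util.Arrays;

// Hypothetical helper, not part of any library mentioned in this post.
final class BomStripper {
    // The three bytes of the UTF-8 byte-order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns true if the array starts with EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Returns a copy without the leading BOM, or the same array if there is none.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }
}
```

Stripping the BOM once, right after the request body has been read, would mean that every later consumer sees the same BOM-free bytes.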

    The client sends this file to the server via a commons-httpclient
    POST call, using the correct content type.
    On the server side I make sure that the file is received correctly:
    I can read the Cyrillic characters in the logfile after doing the
    following in the servlet.

    The following is obviously pseudocode:

    doPost() {
        InputStream in = request.getInputStream();

        // Read the complete request body into memory.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];

        int count = in.read(buffer);
        while (count != -1) {
            baos.write(buffer, 0, count);
            count = in.read(buffer);
        }

        byte[] xml = baos.toByteArray();
        // The string is correct and contains the Cyrillic characters.
        String s = new String(xml, "UTF-8");

    --- Up to here, everything is fine.

    --- Now I have to parse the XML to find a node attribute and, depending
    --- on its value, decide into which Castor classes to unmarshal the XML.
    --- To call Castor I need a second input stream for Castor to use,
    --- so I copy the byte[] and create a second stream.
    --- (the files are really small, therefore I don't expect memory

    byte[] xmlCastor = new byte[xml.length];
    System.arraycopy(xml, 0, xmlCastor, 0, xml.length);

    ByteArrayInputStream bais = new ByteArrayInputStream(xml);
    ByteArrayInputStream baisCastor = new ByteArrayInputStream(xmlCastor);

    -- I can verify in the logfile that these two byte arrays contain the
    same Cyrillic characters.
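A decoded string in a logfile will not show the BOM, because Java's UTF-8 decoder keeps it as the invisible character U+FEFF. Logging the leading bytes in hex makes it visible. A minimal sketch, with the hypothetical helper name `HexPeek.leadingHex`:

```java
// Hypothetical helper for logging the first bytes of a buffer as hex.
final class HexPeek {
    // Formats up to n leading bytes as space-separated upper-case hex pairs.
    static String leadingHex(byte[] data, int n) {
        StringBuilder sb = new StringBuilder();
        int limit = Math.min(n, data.length);
        for (int i = 0; i < limit; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative bytes are printed as two hex digits.
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

Calling `HexPeek.leadingHex(xml, 3)` and `HexPeek.leadingHex(xmlCastor, 3)` before parsing would show whether both arrays really start with the BOM.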

    -- Now I call the SAX parser with the first stream, and I receive the
    node attribute.
    -- Then I pass the second stream to Castor ... and bummer ...

    Caused by: org.xml.sax.SAXException: Parsing Error: Content is not
    allowed in prolog.

    -- That is because of the byte-order mark, which the parser does not like.
    -- Two identical streams (as far as I can tell), parsed by the same
    parser ... one runs into an exception,
    -- the other does not.
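One way such a difference can arise (an assumption, not verified for Castor 0.9.5) is that one code path gives the parser raw bytes while the other gives it characters. With a byte stream the parser can detect and skip the BOM itself. With a Reader the BOM has already been decoded to U+FEFF and appears as content before the prolog. The sketch below uses whatever JAXP parser is on the classpath; `BomParseDemo` and its methods are illustrative names.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: compares byte-stream parsing with Reader-based parsing.
final class BomParseDemo {
    // Builds a tiny UTF-8 document that is preceded by the BOM bytes EF BB BF.
    static byte[] sampleWithBom() throws Exception {
        byte[] body = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>".getBytes("UTF-8");
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    // Returns true if the source parses, false if the parser reports a SAX error.
    static boolean parses(InputSource source) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DefaultHandler handler = new DefaultHandler();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler); // DefaultHandler rethrows fatal errors
        try {
            reader.parse(source);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = sampleWithBom();
        // Raw bytes: the parser does its own encoding detection.
        System.out.println("byte stream: "
                + parses(new InputSource(new ByteArrayInputStream(xml))));
        // Characters: the BOM is now the character U+FEFF in front of "<?xml".
        System.out.println("reader:      "
                + parses(new InputSource(new StringReader(new String(xml, "UTF-8")))));
    }
}
```

If the failing call turns out to go through a Reader or a String, then passing the bytes directly, or removing the leading U+FEFF first, would be the things to try.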

    -- I have _exactly one_ parser in my Tomcat webapp, in WEB-INF/lib,
    and that is xercesImpl-2.9.0.jar.
    -- Is it somehow possible that Tomcat provides a different version? I
    cannot verify how Castor
    -- chooses its XML parser, but I do it the following way:

    SAXParserFactory pf = SAXParserFactory.newInstance();
    XMLReader parser = pf.newSAXParser().getXMLReader();
    parser.parse(new InputSource(bais));
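To find out which parser implementation is actually in use, the class name and the location it was loaded from can be logged. This is a minimal sketch using only JDK calls. `ParserOrigin` is a made-up name, and a way to get hold of the parser object Castor uses is not shown here.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical diagnostic helper: reports where an object's class was loaded from.
final class ParserOrigin {
    static String describe(Object o) {
        Class<?> c = o.getClass();
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Classes that ship with the JDK may have no code source at all.
        String location = (src == null || src.getLocation() == null)
                ? "(no code source, probably the JDK itself)"
                : src.getLocation().toString();
        return c.getName() + " loaded from " + location;
    }

    // Describes the XMLReader that the default JAXP lookup returns.
    static String describeDefaultSaxParser() throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        return describe(reader);
    }
}
```

Logging `ParserOrigin.describeDefaultSaxParser()` inside the servlet would show whether the class comes from xercesImpl-2.9.0.jar in the webapp or from somewhere else on the classpath.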

    Any helpful tips appreciated!

    P.S.: I can't change very much of the infrastructure ... Castor, for
    example, is definitely a fixed requirement.