Help needed parsing a UTF-8 XML file

Discussion in 'Java' started by Huzefa, Sep 4, 2004.

  1. Huzefa

    Huzefa Guest

    I have an XML file encoded in UTF-8. The parser works fine when
    there are only English characters in the file.

    However, when I put some Chinese characters in the file, I get the
    following error:

    org.xml.sax.SAXParseException: Content is not allowed in prolog.
    org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    com.xyz.pqr.ParseXmlFile.<init>(ParseXmlFile.java:34)
    org.apache.jsp.index3_jsp._jspService(index3_jsp.java:59)
    org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
    org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
    org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

    I am setting the character encoding of the InputSource.
    My code for doing so looks like this:

    InputSource input = new InputSource(file); //File is the FileReader
    input.setEncoding("UTF-8");

    DOMParser parser = new DOMParser();
    parser.parse(input);

    How can I get it to read Chinese/Japanese characters?

    Any help would be appreciated.

    Thanks

    Huzefa Khalil
    Huzefa, Sep 4, 2004
    #1

  2. "Huzefa" <> wrote in message
    news:...
    > I have an XML file encoded in UTF-8. The parser works fine when
    > there are only English characters in the file.
    >
    > However, when I put some Chinese characters in the file, I get the
    > following error:
    >
    > org.xml.sax.SAXParseException: Content is not allowed in prolog.
    > org.apache.xerces.parsers.DOMParser.parse(Unknown Source)


    The error suggests the XML may not be well-formed.

    It would be easier to diagnose this by looking at a set of sample XML files.
    Can you upload some samples to a server somewhere with public access? Or
    send me a zip file, email to kmc(at)world.std.com.

    /kmc
    Keith M. Corbett, Sep 4, 2004
    #2

  3. Huzefa () wrote:
    : I have an XML file encoded in UTF-8. The parser works fine when
    : there are only English characters in the file.

    : However, when I put some Chinese characters in the file, I get the
    : following error:

    : org.xml.sax.SAXParseException: Content is not allowed in prolog.

    Perhaps you put some white space at the top of the file. The <? must be
    the very first thing, and perhaps no white space before the first tag's <
    either.
    Malcolm Dew-Jones, Sep 4, 2004
    #3
  4. "Malcolm Dew-Jones" <> wrote in message
    news:...
    > Huzefa () wrote:
    > : I have an XML file encoded in UTF-8. The parser works fine when
    > : there are only English characters in the file.
    >
    > : However, when I put some Chinese characters in the file, I get the
    > : following error:
    >
    > : org.xml.sax.SAXParseException: Content is not allowed in prolog.
    >
    > Perhaps you put some white space at the top of the file. The <? must be
    > the very first thing, [snip]


    I believe a Unicode Byte Order Mark (BOM) may precede the XML declaration.
    Per the XML 1.1 TR:

    "Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin
    with the Byte Order Mark described in ISO/IEC 10646" etc.
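
For reference, the UTF-8 form of the BOM is the three bytes EF BB BF (the encoding of U+FEFF). A minimal sketch of checking a buffer for it, assuming the first bytes of the file have already been read into memory; the class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

final class Utf8BomCheck {
    /** Returns true if the buffer starts with EF BB BF, the UTF-8 encoding of U+FEFF. */
    static boolean startsWithUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        // "\uFEFF" encoded as UTF-8 yields exactly the three BOM bytes.
        byte[] withBom = "\uFEFF<?xml version=\"1.0\" encoding=\"UTF-8\"?><a/>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] withoutBom = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a/>"
                .getBytes(StandardCharsets.UTF_8);

        System.out.println(startsWithUtf8Bom(withBom));    // prints true
        System.out.println(startsWithUtf8Bom(withoutBom)); // prints false
    }
}
```

Some editors add these bytes when saving as UTF-8, so this kind of check can show whether the file on disk actually starts with `<?xml`.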

    > and perhaps no white space before the first tag's <
    > either.


    I believe white space may appear in the prolog, after the XML declaration
    and before or after the document type declaration.

    [22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?

    [27] Misc ::= Comment | PI | S
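
These grammar points can be checked by feeding small documents to a parser. A minimal sketch, assuming the JDK's built-in JAXP parser (`javax.xml.parsers`) rather than the Xerces `DOMParser` class from the original post; the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class PrologWhitespaceDemo {
    /** Returns true if the parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // White space after the XML declaration matches Misc* and is accepted.
        System.out.println(isWellFormed("<?xml version=\"1.0\"?>\n\n  <root/>"));
        // White space before the XML declaration is rejected.
        System.out.println(isWellFormed(" <?xml version=\"1.0\"?><root/>"));
        // With no XML declaration, white space before the root element is accepted.
        System.out.println(isWellFormed("\n  <root/>"));
    }
}
```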

    /kmc
    Keith M. Corbett, Sep 5, 2004
    #4
  5. Chris Uppal

    Chris Uppal Guest

    Huzefa wrote:

    > InputSource input = new InputSource(file); //File is the FileReader
    > input.setEncoding("UTF-8");


    From the JavaDoc for org.xml.sax.InputSource.setEncoding():

    This method has no effect when the application provides a character stream.

    which may be your problem, since you are providing a character stream in your
    constructor. There's more information in the intro to the class in the same
    JavaDoc.
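
A minimal sketch of the alternative this points to: hand the parser a byte stream, so that the encoding setting (or the parser's own detection) applies. It assumes the JDK's built-in JAXP parser instead of the Xerces `DOMParser`, and an in-memory byte array standing in for the file; with a real file a `java.io.FileInputStream` would take the place of the `ByteArrayInputStream`. The class and method names are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class ByteStreamParseDemo {
    /** Parses XML from raw bytes, leaving the decoding to the parser. */
    static Document parseBytes(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        InputSource input = new InputSource(in); // byte stream, not a Reader
        input.setEncoding("UTF-8");              // applies because no character stream is set
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(input);
    }

    public static void main(String[] args) throws Exception {
        // \u4F60\u597D is a two-character Chinese greeting used as sample content.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><greeting>\u4F60\u597D</greeting>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        Document doc = parseBytes(new ByteArrayInputStream(bytes));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```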

    BTW, on the subject of the BOM (which someone mentioned elsewhere in this
    thread) the JavaDoc for that constructor states:

    The character stream shall not include a byte order mark.
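
If a character stream has to be used anyway, one option is to drop a leading U+FEFF before the parser sees it. A minimal sketch, assuming the JDK's built-in JAXP parser and a `StringReader` standing in for the real reader; the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class BomSkippingReader {
    /** Wraps a reader so that a single leading U+FEFF, if present, is consumed. */
    static Reader skipBom(Reader in) throws IOException {
        PushbackReader reader = new PushbackReader(in, 1);
        int first = reader.read();
        if (first != -1 && first != 0xFEFF) {
            reader.unread(first); // not a BOM: put the character back
        }
        return reader;
    }

    public static void main(String[] args) throws Exception {
        // A decoded UTF-8 file that was saved with a BOM starts with U+FEFF.
        String decoded = "\uFEFF<?xml version=\"1.0\"?><root/>";

        Reader cleaned = skipBom(new StringReader(decoded));
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(cleaned));
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Java's UTF-8 decoder passes the BOM through as an ordinary character, which is why a reader-based source may need this step.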

    HTH.

    -- chris
    Chris Uppal, Sep 5, 2004
    #5