Problem using local xhtml DTD when parsing file with DocumentBuilder

Discussion in 'Java' started by Ryan McFall, Jun 13, 2007.

  1. Ryan McFall

    Ryan McFall Guest

    Hi:

    I've got some XHTML documents, and I'm using the classes in
    javax.xml.xpath to find certain tags in them. These documents contain
    a DTD declaration for XHTML, with a public identifier. Since my
    application needs to work without a network connection, I've
    downloaded the DTD and associated entities and made them available to
    my application as resources. I then set an EntityResolver on the
    document builder that I get from DocumentBuilderFactory.newInstance().
    Here's the relevant code from the resolveEntity method:

    url = getClass().getResource (identifierMap.get(publicId));
    return new InputSource (url.toString());

    When I run the application, I get the following message from the
    parser:
    com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
    Invalid byte 1 of 1-byte UTF-8 sequence.

    After browsing around a bit, I tried:

    url = getClass().getResource (identifierMap.get(publicId));
    FileReader reader = new FileReader (new File (url.toURI()));
    return new InputSource (reader);

    but this had the same problem.

    I downloaded the files from the W3C site, both by using Firefox and by
    using wget. In both cases I get the same behavior.

    I don't know much about character encodings, so I'm at a loss as to
    what to try next. Any suggestions would be greatly appreciated.

    Ryan
     
    Ryan McFall, Jun 13, 2007
    #1

  2. Lew

    Lew Guest

    Ryan McFall wrote:
    > Hi:
    >
    > I've got some XHTML documents, and I'm using the classes in
    > javax.xml.xpath to find certain tags in them. These documents
    > contain a DTD
    > [...]
    > I get the following message from the parser:
    > com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
    > Invalid byte 1 of 1-byte UTF-8 sequence.


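    Ideally, all XML documents should be in UTF-8 encoding. Without an
    encoding declaration, the parser assumes UTF-8 (or UTF-16 if there is a
    byte-order mark). Apparently either the DTD or your XML file is stored in
    some other encoding. When a file isn't UTF-8, its XML declaration (or, for
    a DTD, its text declaration) should specify the encoding.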
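
    To illustrate the point about the declaration, here is a minimal sketch.
    It assumes the file was saved as ISO-8859-1, which is only a guess since
    the actual encoding isn't known here. The class name, method name and
    sample content are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingDeclarationDemo {

    // Parses raw bytes and returns the text content of the root element.
    public static String parseRootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String body = "<p>caf\u00e9</p>"; // made-up sample content

        // Stored as ISO-8859-1 (assumed) and declared as such:
        // the parser decodes the bytes with the declared encoding.
        byte[] declared =
                ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                        .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseRootText(declared));

        // Same bytes without a declaration: the parser assumes UTF-8
        // and rejects the non-UTF-8 byte.
        byte[] undeclared = body.getBytes(StandardCharsets.ISO_8859_1);
        try {
            parseRootText(undeclared);
            System.out.println("parsed (not expected)");
        } catch (Exception e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

    The second parse should fail with a malformed-byte-sequence error much
    like the one quoted above, though the exact exception type and message
    can differ between parser versions.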

    > After browsing around a bit, I tried:
    >
    > url = getClass().getResource (identifierMap.get(publicId));
    > FileReader reader = new FileReader (new File (url.toURI()));
    > return new InputSource (reader);
    >
    > but this had the same problem.


    Have you considered using
    <http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream,%20java.nio.charset.Charset)>
    ?

    This will let you specify the document encoding to match how it's stored.
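
    Roughly, a resolver using that constructor could look like the sketch
    below. It is not your actual code: the class name, the in-memory map
    (standing in for your classpath resources and identifierMap), the public
    identifier and the ISO-8859-1 charset are all assumptions made for the
    example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class CharsetAwareResolver implements EntityResolver {
    // Hypothetical stand-in for resources looked up by public identifier.
    private final Map<String, byte[]> localCopies;
    private final Charset charset;

    CharsetAwareResolver(Map<String, byte[]> localCopies, Charset charset) {
        this.localCopies = localCopies;
        this.charset = charset;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        byte[] bytes = (publicId == null) ? null : localCopies.get(publicId);
        if (bytes == null) {
            return null; // let the parser fall back to its default lookup
        }
        // Decode with an explicit charset instead of the platform default.
        InputSource source = new InputSource(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    // Parses a small document whose DTD is served by the resolver.
    public static String demo() throws Exception {
        String publicId = "-//EXAMPLE//DTD Demo//EN"; // made-up identifier
        byte[] dtd = "<!ENTITY eacute \"\u00e9\">"
                .getBytes(StandardCharsets.ISO_8859_1); // assumed encoding
        String xml = "<!DOCTYPE p PUBLIC \"" + publicId
                + "\" \"http://example.invalid/demo.dtd\"><p>caf&eacute;</p>";

        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(new CharsetAwareResolver(
                Collections.singletonMap(publicId, dtd),
                StandardCharsets.ISO_8859_1));
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

    Because the InputSource wraps a Reader, the parser receives characters
    that are already decoded and doesn't have to guess the byte encoding. A
    FileReader, by contrast, decodes with the platform default encoding,
    which may not match the file.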

    --
    Lew
     
    Lew, Jun 13, 2007
    #2

  3. Ryan McFall

    Ryan McFall Guest

    Pardon my stupidity - the XML file was saved by someone else, and
    apparently it was saved in some encoding other than UTF-8. Re-saving
    it as UTF-8 solved my problem.
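
    For anyone who would rather script the conversion than re-save by hand,
    something like the following sketch should work. The original encoding
    of the file isn't known, so ISO-8859-1 is just an assumed example, and
    the class and method names are made up.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ResaveAsUtf8 {

    // Reads the source file using sourceCharset and writes it out as UTF-8.
    public static void resave(Path source, Charset sourceCharset, Path target)
            throws IOException {
        String text = new String(Files.readAllBytes(source), sourceCharset);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for the real document.
        Path source = Files.createTempFile("latin1-", ".xhtml");
        Path target = Files.createTempFile("utf8-", ".xhtml");
        Files.write(source,
                "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1));

        resave(source, StandardCharsets.ISO_8859_1, target); // assumed charset
        System.out.println(new String(Files.readAllBytes(target),
                StandardCharsets.UTF_8));
    }
}
```

    This only converts the bytes. If the file has an XML declaration that
    names the old encoding, that declaration has to be edited as well.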

    Ryan
     
    Ryan McFall, Jun 13, 2007
    #3
