Encoding problem with SAX parser

Discussion in 'Java' started by Martin Schlatter, Dec 10, 2003.

  1. I'm parsing an XML document with a SAX parser.
    I initialise it in the following way:

    javax.xml.parsers.DocumentBuilderFactory docBuilderFactory =
    javax.xml.parsers.DocumentBuilderFactory.newInstance();
    docBuilder = docBuilderFactory.newDocumentBuilder();
    doc = docBuilder.parse(new File(fname));

    But while parsing, I get an exception because there are characters
    which are not valid UTF-8. I cannot change the input file. Is there
    any way to skip over the invalid characters? Could I use
    docBuilder.parse(InputStream) and skip the invalid characters in the
    stream?

    Jens Martin Schlatter

    --
    "As a human you are too stupid, and as a pig your ears are too short."
    Norbert Gleissner of the Triple-D-Ranch, in de.rec.tiere.pferde
    Martin Schlatter, Dec 10, 2003
    #1

  2. "Martin Schlatter" <> wrote in message
    news:...
    > I'm parsing an XML document with a SAX parser.
    > I initialise it in the following way:
    >
    > javax.xml.parsers.DocumentBuilderFactory docBuilderFactory =
    > javax.xml.parsers.DocumentBuilderFactory.newInstance();
    > docBuilder = docBuilderFactory.newDocumentBuilder();
    > doc = docBuilder.parse(new File(fname));
    >
    > But while parsing, I get an exception because there are characters
    > which are not valid utf-8 chars. I cannot change the input file.


    Is the file in UTF-8? If not, is it in any valid encoding? If so, try
    replacing your last line with

    org.xml.sax.InputSource src = new org.xml.sax.InputSource(
        new java.io.FileInputStream(fname));
    // setEncoding takes the encoding name as a String, e.g. "ISO-8859-1".
    src.setEncoding(YourEncodingNameGoesHere);
    doc = docBuilder.parse(src);

    If not, you'll have to create a FilterInputStream that removes the bad
    characters (the invalid byte sequences) before the parser sees them,
    and replace your last line with:

    doc = docBuilder.parse(new YourFilterStream(new FileInputStream(fname)));
    Mike Schilling, Dec 12, 2003
    #2

  3. > Is the file in UTF-8?

    Yes, it's UTF-8, but some characters are invalid.

    > If not, you'll have to create a FilterInputStream that removes the bad
    > characters and replace your last line with:
    >
    > doc = docBuilder.parse(new YourFilterStream(new FileInputStream(fname)));


    Ok, I see. Thanks! I'll try that!

    Jens Martin Schlatter


    --
    "As a human you are too stupid, and as a pig your ears are too short."
    Norbert Gleissner of the Triple-D-Ranch, in de.rec.tiere.pferde
    Martin Schlatter, Dec 14, 2003
    #3
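
For readers landing on this thread: the filtering approach suggested in post #2 can also be sketched without a hand-written FilterInputStream. The example below is not code from the thread. It swaps in a different technique: a java.nio CharsetDecoder configured to replace malformed UTF-8 input, wrapped in an InputStreamReader and passed to the parser through an InputSource. The class name LenientUtf8Parse and the method parseLeniently are made up for illustration. It assumes the only problem is malformed UTF-8 byte sequences; characters that decode correctly but are not allowed in XML would still be rejected by the parser.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, named here only for illustration.
class LenientUtf8Parse {

    // Decodes the stream as UTF-8, substituting U+FFFD for malformed
    // byte sequences, then hands the resulting characters to the parser.
    static Document parseLeniently(InputStream in) throws Exception {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        Reader reader = new InputStreamReader(in, decoder);

        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // With a character stream, the parser reads the already-decoded
        // characters and does not decode the bytes itself.
        return builder.parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bytes = {'<', 'a', '>', 'x', (byte) 0xFF, 'y', '<', '/', 'a', '>'};
        Document doc = parseLeniently(new ByteArrayInputStream(bytes));
        // Prints x, then the replacement character U+FFFD, then y.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

In the original code, the last line would become something like doc = LenientUtf8Parse.parseLeniently(new FileInputStream(fname)). Using CodingErrorAction.IGNORE instead of REPLACE drops the bad bytes rather than marking them, which is closer to "skipping" but hides where data was lost. One caveat: the Java UTF-8 decoder does not strip a leading byte order mark, so a file that starts with one may need it skipped separately.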
