SAX succeeds, but StAX fails

Discussion in 'Java' started by Kai Schlamp, Mar 6, 2008.

  1. Kai Schlamp

    Kai Schlamp Guest

    Hi!

    I tried to parse PubMed (a biomedical article database) with both SAX
    and StAX. StAX failed, but I am not sure why (see the exception
    below).
    Why does SAX succeed while StAX fails?
    The XML document seems to be fine (see
    http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933&retmode=xml)
    Any suggestions?

    Kai

    StAX example:
    String address = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933&retmode=xml";
    URL url = new URL(address);

    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser =
        factory.createXMLStreamReader(url.openConnection().getInputStream());

    while (parser.hasNext()) {
        switch (parser.getEventType()) {
            // event handling omitted
        }
        parser.next();
    }

    Error message:
    javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,39]
    Message: A '(' character or an element type is required in the
    declaration of element type "PubMedPubDate".

    SAX example:
    SAXParserFactory parserFactory = SAXParserFactory.newInstance();
    parserFactory.setValidating(true);
    parserFactory.setNamespaceAware(true);
    SAXParser parser = parserFactory.newSAXParser();
    parser.parse(url.openConnection().getInputStream(), new PubmedEFetchHandler());

    (PubmedEFetchHandler is a simple DefaultHandler with some debugging
    output).
     
    Kai Schlamp, Mar 6, 2008
    #1
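A minimal sketch of what a debugging handler like the one described above might look like. The class name DebugHandler is a hypothetical stand-in for PubmedEFetchHandler (whose source is not shown in the thread), and the inline XML is a made-up sample, not real PubMed output. The sketch also overrides warning() and error(), because DefaultHandler silently ignores both unless they are overridden, so a validating SAX parse can look successful even when problems were reported.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the PubmedEFetchHandler mentioned in the post.
class DebugHandler extends DefaultHandler {
    final List<String> elements = new ArrayList<>();
    final List<String> problems = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements.add(qName);
        System.out.println("START_ELEMENT: " + qName);
    }

    // DefaultHandler does nothing for warnings and recoverable errors,
    // so record them here to make them visible.
    @Override
    public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        problems.add("error: " + e.getMessage());
    }

    // Parses an in-memory document with the same factory settings as the post.
    static DebugHandler parse(String xml, boolean validating) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating);
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        DebugHandler handler = new DebugHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document, not real PubMed output.
        String xml = "<PubmedArticleSet><PMID>11748933</PMID></PubmedArticleSet>";
        DebugHandler handler = parse(xml, true);
        System.out.println("elements: " + handler.elements);
        System.out.println("problems: " + handler.problems);
    }
}
```

If the problems list is non-empty after a parse that appeared to succeed, the SAX parser did notice something and merely reported it as recoverable.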

  2. Kai Schlamp

    GArlington Guest

    On Mar 6, 12:57 pm, Kai Schlamp <> wrote:
    > Hi!
    >
    > I tried to parse PubMed (a biomedical article database) with both SAX
    > and StAX. StAX failed, but I am not sure why (see the exception
    > below).
    > Why does SAX succeed while StAX fails?
    > The XML document seems to be fine (see http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11...)


    As far as I can see, this request DOES NOT return valid XML (or any
    XML at all).

     
    GArlington, Mar 6, 2008
    #2

  3. Kai Schlamp

    Kai Schlamp Guest

    Kai Schlamp, Mar 6, 2008
    #3
  4. Kai Schlamp

    Kai Schlamp Guest

    Ok, I checked the new link again and the problem remains. When I click
    the link and it opens in Firefox, it is indeed not XML.
    But when you then press the "Go To" button (the green button to the right
    of the URL input field), the valid XML appears. I am not sure why
    this happens, but it has nothing to do with my original
    problem. It seems to be a minor Firefox problem.


    On Mar 6, 17:49, Kai Schlamp <> wrote:
    > It seems to be a posting conversion error (I am posting through Google
    > Groups).
    > The link in your message doesn't contain the retmode=xml anymore.
    > Please try this URL: www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933&...
    > It should generate valid XML.
     
    Kai Schlamp, Mar 6, 2008
    #4
  5. Kai Schlamp

    GArlington Guest

    On Mar 6, 5:01 pm, Kai Schlamp <> wrote:
    > Ok, I checked the new link again and the problem remains. When I click
    > the link and it opens in Firefox, it is indeed not XML.
    > But when you then press the "Go To" button (the green button to the right
    > of the URL input field), the valid XML appears. I am not sure why
    > this happens, but it has nothing to do with my original
    > problem. It seems to be a minor Firefox problem.
    >
    > On Mar 6, 17:49, Kai Schlamp <> wrote:
    >
    > > It seems to be a posting conversion error (I am posting through Google
    > > Groups).
    > > The link in your message doesn't contain the retmode=xml anymore.
    > > Please try this URL: www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933&...
    > > It should generate valid XML.


    OK, I tried accessing it with IE and it worked the first time. I thought
    I had tried it in IE yesterday too, but...
    I fetched your URL and parsed it (with my own methods) and it works,
    so I suspect that there is a problem with StAX...
    The only thing I can suggest is this: dump what you get from your
    URL BEFORE you try to parse it, and then dump the data at each step
    until you reach your error. This will help you find where the
    problem first rears its ugly head...
     
    GArlington, Mar 7, 2008
    #5
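The "dump before you parse" suggestion above can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the thread: the class name DumpThenParse and its helper methods are hypothetical, and main() uses a small inline document in place of the PubMed response so that it runs without network access. The idea is to read the stream fully into a byte array once, print it, and then hand the parser exactly the same bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class; the names are not from the thread.
class DumpThenParse {

    // Reads the whole stream into memory so it can be both dumped and parsed.
    static byte[] readAll(InputStream in) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Parses the buffered bytes with StAX and counts START_ELEMENT events.
    static int countStartElements(byte[] xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for url.openConnection().getInputStream(); made-up sample data.
        InputStream source = new ByteArrayInputStream(
                "<PubmedArticleSet><PMID>11748933</PMID></PubmedArticleSet>"
                        .getBytes(StandardCharsets.UTF_8));

        byte[] data = readAll(source);
        System.out.println(new String(data, StandardCharsets.UTF_8)); // the dump
        System.out.println("start elements: " + countStartElements(data));
    }
}
```

Because the parser sees the buffered copy, a parse error can be compared directly against the dumped text at the reported row and column.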
  6. Kai Schlamp

    Kai Schlamp Guest

    I still have the same problem with StAX. I dumped the output of the
    URL before parsing it, and it seems to be fine and well-formed.
    But parsing with StAX still gives me an exception on the very first
    loop iteration (SAX seems to work fine).
    Below is a small test class. Can someone explain to me why this
    happens?
    I also tried copying the output of the URL into a file and parsing it
    directly from disk, but that didn't solve the problem.
    Perhaps I should try another StAX provider. I found one on the
    net named Woodstox. Are there any more? What is the default
    implementation? An Apache project?

    The error output of the below test class:

    START_DOCUMENT: 1.0
    beforeNext
    javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,39]
    Message: A '(' character or an element type is required in the
    declaration of element type "PubMedPubDate".
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
    at StaxTester.main(StaxTester.java:49)

    The test class:

    import java.net.URL;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxTester {

        public static void main(String[] args) {
            try {
                String address = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=11748933";
                //String address = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=stem+cells+AND+free+fulltext[filter]";
                URL url = new URL(address);

                XMLInputFactory factory = XMLInputFactory.newInstance();
                XMLStreamReader parser =
                    factory.createXMLStreamReader(url.openConnection().getInputStream());

                while (parser.hasNext()) {
                    switch (parser.getEventType()) {
                        case XMLStreamConstants.START_DOCUMENT:
                            System.out.println("START_DOCUMENT: " + parser.getVersion());
                            break;

                        case XMLStreamConstants.END_DOCUMENT:
                            System.out.println("END_DOCUMENT: ");
                            parser.close();
                            break;

                        case XMLStreamConstants.NAMESPACE:
                            System.out.println("NAMESPACE: " + parser.getNamespaceURI());
                            break;

                        case XMLStreamConstants.START_ELEMENT:
                            System.out.println("START_ELEMENT: " + parser.getLocalName());
                            break;

                        case XMLStreamConstants.CHARACTERS:
                            if (!parser.isWhiteSpace())
                                System.out.println("CHARACTERS: " + parser.getText());
                            break;

                        case XMLStreamConstants.END_ELEMENT:
                            System.out.println("END_ELEMENT: " + parser.getLocalName());
                            break;

                        default:
                            break;
                    }
                    System.out.println("beforeNext");
                    parser.next();
                    System.out.println("afterNext");
                }

                /** SAX succeeds. Why that? */
                // SAXParserFactory parserFactory = SAXParserFactory.newInstance();
                // parserFactory.setValidating(true);
                // parserFactory.setNamespaceAware(true);
                // SAXParser parser = parserFactory.newSAXParser();
                // parser.parse(url.openConnection().getInputStream(), new PubmedEFetchHandler());
                //
            }
            catch (Exception e) {
                e.printStackTrace();
            }

        }

    }
     
    Kai Schlamp, Mar 12, 2008
    #6
  7. On Mar 6, 8:57 am, Kai Schlamp <> wrote:
    > Hi!
    >
    > I tried to parse PubMed (a biomedical article database) with both SAX
    > and StAX. StAX failed, but I am not sure why (see the exception
    > below).
    > Why does SAX succeed while StAX fails?
    > The XML document seems to be fine (see http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11...)
    > Any suggestions?
    >


    ...

    >             String address = "http://www.ncbi.nlm.nih.gov/entrez/
    > eutils/efetch.fcgi?db=pubmed&id=11748933&retmode=xml";
    >             URL url = new URL(address);


    ...

    > Error message:
    > javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,39]
    > Message: A '(' character or an element type is required in the
    > declaration of element type "PubMedPubDate".


    The XML document itself is fine, but non-validating due to problems in
    the DTD; StAX by default attempts to validate input documents. SAX is
    ignoring the DTD associated with the XML document, and therefore
    doesn't notice that the DTD is invalid.

    -o
     
    Owen Jacobson, Mar 12, 2008
    #7
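Since the error message points at a DTD declaration, one thing that could be tried is switching off DTD processing on the StAX factory through the standard XMLInputFactory.SUPPORT_DTD property. The sketch below is only an illustration: whether it avoids the exact PubMed failure was not confirmed in this thread, the class name DtdOffExample is hypothetical, and the sample document with its missing-example.dtd reference is made up. Note that with DTD support off, entities declared in the DTD are no longer available to the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; not from the thread.
class DtdOffExample {

    // Parses the document with DTD support disabled and returns element names.
    static List<String> elementNames(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Standard StAX property "javax.xml.stream.supportDTD".
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Made-up document whose DOCTYPE points at a DTD that does not exist.
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE PubmedArticleSet SYSTEM \"missing-example.dtd\">"
                + "<PubmedArticleSet><PMID>11748933</PMID></PubmedArticleSet>";
        System.out.println(elementNames(xml));
    }
}
```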
  8. Kai Schlamp

    Kai Schlamp Guest

    On Mar 12, 10:27 pm, Owen Jacobson <> wrote:
    > On Mar 6, 8:57 am, Kai Schlamp <> wrote:
    >
    > > Hi!

    >
    > > I tried to parse PubMed (a biomedical article database) with SAX and
    > > also StAX. The last one failed, but I am not sure why (see Exception
    > > below).
    > > Why does SAX succeed and StAX don't?
    > > The XML document seems to be fine (see http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11...)
    > > Any suggestions?

    >
    > ...
    >
    > > String address = "http://www.ncbi.nlm.nih.gov/entrez/
    > > eutils/efetch.fcgi?db=pubmed&id=11748933&retmode=xml";
    > > URL url = new URL(address);

    >
    > ...
    >
    > > Error message:
    > > javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,39]
    > > Message: A '(' character or an element type is required in the
    > > declaration of element type "PubMedPubDate".

    >
    > The XML document itself is fine, but non-validating due to problems in
    > the DTD; StAX by default attempts to validate input documents. SAX is
    > ignoring the DTD associated with the XML document, and therefore
    > doesn't notice that the DTD is invalid.
    >
    > -o


    Thanks for the answer.
    So disabling DTD validation should solve the problem?
    I tried
    factory.setProperty("javax.xml.stream.isValidating", false);
    (which is the default, as stated in the Javadoc), but it didn't
    solve the problem either.

    Another thing: I just tried the Woodstox implementation (I just added
    it to the classpath), and everything works fine (even without changing
    any property). So it seems that there is a specific problem with the
    reference implementation.
     
    Kai Schlamp, Mar 12, 2008
    #8
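When switching providers by changing the classpath, it can help to confirm which StAX implementation XMLInputFactory.newInstance() actually picked up and what its relevant defaults are. The following is a small sketch using only standard StAX calls. The class name StaxProviderInfo is hypothetical, and the printed class name depends on the JDK and on what is on the classpath.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical diagnostic class; not from the thread.
class StaxProviderInfo {

    // Summarises the factory's implementation class and two standard properties.
    static String describe(XMLInputFactory factory) {
        return factory.getClass().getName()
                + " isValidating=" + factory.getProperty(XMLInputFactory.IS_VALIDATING)
                + " supportDTD=" + factory.getProperty(XMLInputFactory.SUPPORT_DTD);
    }

    public static void main(String[] args) {
        // The output varies by provider, so no expected value is shown here.
        System.out.println(describe(XMLInputFactory.newInstance()));
    }
}
```

Running this once with and once without the Woodstox jar on the classpath shows whether the provider really changed between the two test runs.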