Pluggability of SAX parsers into DOM in JAXP

Discussion in 'XML' started by erik_midtskogen@anntaylor.com, Nov 29, 2006.

  1. Guest

    Hi Folks,

    I'm writing a general-purpose HTML screen-scraping framework in Java
    (scrape new web sites without writing new code, yada yada...), and I
    want to use the JAXP DOM API along with XPath and XSLT for most of my
    business logic. I actually hope to make this an open-source project if
    I can ever get it to some reasonable level of usability.

    My problem is that, since the slurry pumped out by most web sites bears
    only the faintest resemblance to HTML--let alone XML--I need to use a
    special-purpose SAX parser that is intentionally not fully SAX
    compliant (since it accepts malformed documents).

    I already know how to set the system property for an arbitrary SAX
    parser when programming to the SAX API (i.e. when calling
    SAXParserFactory.newInstance()), and I also know how to specify an
    arbitrary DocumentBuilderFactory when using DOM. So, how do I specify
    the SAX parser that I want DOM to use "behind the scenes"?

    My expectation was that the JAXP DOM implementation should be a client
    of the JAXP SAX implementation. I could be wrong about this, though.
    I'm looking at the code now, and although it's a bit hard to follow
    (and my Eclipse debugger bugs out at just the wrong moment), it appears
    as if the default JAXP DocumentBuilderFactory is hard-coded to use an
    org.apache.xerces.parsers.XML11Configuration as a SAX parser. Weird.
    I could be mistaken about this, but if it's true, then this is not my
    idea of pluggability.

    So here's where I am so far: I wrote a custom SAXParserFactory to
    create an instance of my custom SAX parser, and I plugged it in and
    tested it out using the SAX API and it worked just fine. But then when
    I tried using the DOM API for my XPath/XSLT processing, specifying my
    custom SAXParserFactory as before, I found that the JAXP DOM
    implementation did not use the SAXParserFactory I had specified, and so
    obviously, didn't use the SAX parser I wanted.
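
    A minimal sketch of the lookup behaviour described above, assuming the
    JDK's bundled JAXP implementation. The two property keys are the
    standard JAXP ones; the class name com.example.MySaxParserFactory is
    made up for illustration and does not exist. The point is only that
    the DOM factory is resolved under its own key, so setting the SAX key
    has no visible effect on DOM parsing:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FactoryLookupSketch {
    // Standard JAXP lookup keys; each factory is resolved under its own key.
    static final String SAX_KEY = "javax.xml.parsers.SAXParserFactory";
    static final String DOM_KEY = "javax.xml.parsers.DocumentBuilderFactory";

    /**
     * Points the SAX key at a hypothetical class, parses through the DOM API,
     * and returns the root element name. The previous value is restored afterwards.
     */
    static String parseViaDomWithSaxKeySet(String xml) throws Exception {
        String previous = System.getProperty(SAX_KEY);
        // Hypothetical class name; if the DOM path consulted this key,
        // the factory lookup would fail because the class cannot be found.
        System.setProperty(SAX_KEY, "com.example.MySaxParserFactory");
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } finally {
            if (previous == null) {
                System.clearProperty(SAX_KEY);
            } else {
                System.setProperty(SAX_KEY, previous);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Show which DOM factory the JAXP lookup resolves to on this JVM.
        System.out.println(DOM_KEY + " -> "
                + DocumentBuilderFactory.newInstance().getClass().getName());
        System.out.println(parseViaDomWithSaxKeySet("<html><body/></html>"));
    }
}
```

    Other JAXP implementations may wire things differently, so treat this
    as an observation about one implementation rather than a guarantee.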

    I could try building my own DocumentBuilderFactory, but that looks like
    an awful lot of work just to plug in a SAX parser. Does anyone here
    know of an easier way?

    Much thanks in advance.
    , Nov 29, 2006
    #1

  2. wrote:
    > I could try building my own DocumentBuilderFactory, but that looks like
    > an awful lot of work just to plug in a SAX parser. Does anyone here
    > know of an easier way?


    There are many off-the-shelf construct-a-DOM-from-a-SAX-stream
    implementations. Shouldn't be hard to find one if you do a bit of
    websearching. Plug in a SAX parser and a generic DOM implementation and
    push the button.

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
    Joseph Kesselman, Nov 29, 2006
    #2

  3. Guest

    Thanks Joe,

    Actually, I have tried SAX2DOM from the Xalan project. It works, but
    this utility seems to want to add namespaces to my DOM, and can't turn
    this feature off. Correct though the namespaces may be, they add
    needless complexity to the required XPath expressions and XSLT files
    that are used to configure the framework to scrape a site. I'm trying
    to make my framework as easy to use as possible.

    Also, I like the idea of sticking to the standard SAX and DOM APIs
    because I want to keep my options as open as possible by programming to
    interfaces instead of implementation classes. But if there is no easy
    way of setting a system property to tell the standard JAXP DOM
    implementation what SAX parser to use without making a big project out
    of it, then I guess I'll go back to converting the SAX stream to a DOM
    programmatically.

    Thanks,
    --Erik


    , Nov 29, 2006
    #3
  4. wrote:
    > Actually, I have tried SAX2DOM from the Xalan project. It works, but
    > this utility seems to want to add namespaces to my DOM, and can't turn
    > this feature off. Correct though the namespaces may be, they add
    > needless complexity to the required XPath expressions and XSLT files
    > that are used to configure the framework to scrape a site. I'm trying
    > to make my framework as easy to use as possible.


    SAX2DOM shouldn't be adding namespaces unless the namespaces are present
    in the SAX input -- in which case leaving them out is Absolutely
    Incorrect; you'd be changing the meaning of the document (since the
    namespaces are part of the document's semantics) and this bad practice
    *WILL* eventually turn around and bite your kneecaps off.

    Everything should be as simple as possible... but not simpler!

    > But if there is no easy
    > way of setting a system property to tell the standard JAXP DOM
    > implementation what SAX parser to use


    The JAXP DOM path may not be using a SAX parser under the covers -- for
    example, Xerces drives both SAX and DOM output off a lower-level
    representation -- so there really isn't a plug-in point that maps to
    what you're asking for. Using a separate SAX-driven DOM builder really
    is likely to be the most portable solution. It's a pretty simple piece
    of code, and since it's based entirely on the SAX and DOM specs it's
    highly portable.
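
    One way to sketch such a SAX-driven DOM build with nothing but the
    standard JAXP APIs is an identity Transformer that reads from a
    SAXSource and writes to a DOMResult. This is a substitute technique,
    not the SAX2DOM utility mentioned earlier, and the class and method
    names below are illustrative. Any XMLReader can be passed in,
    including a lenient HTML-oriented one; defaultReader() just uses the
    JDK parser so the example runs on its own:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDomSketch {
    /** Drives the given SAX reader over the input and collects the events into a DOM. */
    static Document build(XMLReader reader, InputSource input) throws Exception {
        // A Transformer created without a stylesheet performs an identity copy.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(reader, input), result);
        Node node = result.getNode();
        return (node instanceof Document) ? (Document) node : node.getOwnerDocument();
    }

    /** Stand-in reader from the JDK; swap in any other XMLReader implementation here. */
    static XMLReader defaultReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    /** Convenience wrapper for parsing a string with the stand-in reader. */
    static Document buildFromString(String xml) throws Exception {
        return build(defaultReader(), new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildFromString("<html><body><p>hi</p></body></html>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

    Because the reader is handed in explicitly, no system property or
    factory lookup is involved in choosing the SAX parser.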

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
    Joseph Kesselman, Nov 29, 2006
    #4
  5. Guest

    Hi Joe,

    OK, I guess I'll go back to programmatically performing the conversion
    with a utility. I haven't yet figured out for sure where the
    namespaces are actually coming from. I'll have to look into it.

    While I agree with you that stripping namespaces out would have
    problematic consequences if I were parsing general-purpose xml (and if
    I cared about the element type in which a certain bit of data was
    found), in this particular case it really is safe to ignore them
    because of the nature of what I'm doing. I'm parsing html to scrape
    out textual data. Namespaces aren't normally used in html--in fact,
    not even in xhtml--to distinguish one element type from another. You
    could conceivably use namespaces in xhtml, but there would be no
    practical purpose in doing so. If you did so in a way that assigned an
    element to a namespace other than http://www.w3c.org/TR/xhtml1 (or
    something like that), no user agent would know what to do with it.

    Even if namespaces were customarily used by web browsers to distinguish
    between elements (such as might happen with inline SVG content), it
    still might not make a difference to me because I don't actually care
    what element type the data comes from. I'm really just using XPath and
    XSLT as a more powerful alternative to fishing stuff out of the stream
    using Perl scripting with regular expressions.

    I'm generally pretty anal about this type of thing. Sloppiness and
    ignorance in technical matters drive me crazy. It's one reason I hate
    Microsoft. But in this case, it's more important to me that users of
    my framework be able to write XPath expressions into the configuration
    files without having to specify the same namespace prefix in all their
    location steps. As long as I can write an XPath expression to identify
    navigational elements and XSLT templates to scrape out the content, I'm
    happy.

    Thanks for your help.
    --Erik

    , Nov 29, 2006
    #5
  6. wrote:
    > Namespaces aren't normally used in html


    HTML is based on SGML, which doesn't have the concept of namespaces.
    XHTML is based on XML, which does.

    > could conceivably use namespaces in xhtml, but there would be no
    > practical purpose in doing so.


    That's absolutely incorrect. Namespaces are essential when XHTML is
    intermixed with other vocabularies -- MathML, SVG, and so on. That's
    becoming more common.

    For that reason, the XHTML elements themselves need to be in the correct
    namespace (http://www.w3c.org/TR/xhtml1, as you pointed out).

    Yes, it may not matter in your particular case. Or it may not matter
    _yet_, which I submit is likely to be a more accurate statement unless
    this is throw-away code.

    > But in this case, it's more important to me that users of
    > my framework be able to write XPath expressions into the configuration
    > files without having to specify the same namespace prefix in all their
    > location steps.


    Alternative suggestion: Use an XPath 2.0/XSLT 2.0 implementation, where
    the concept of default namespace is meaningful. That would let your
    users leave out prefixes yet still get results which are completely
    correct per the standards.
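
    For comparison, here is a rough sketch of the XPath 1.0 behaviour that
    motivates the suggestion, using the javax.xml.xpath API bundled with
    the JDK (which implements XPath 1.0, so the 2.0 default-namespace
    feature itself is not shown). The prefix "h" and the class name are
    arbitrary choices for the example. Unprefixed steps only match
    elements in no namespace, so namespaced XHTML needs either a prefix
    bound through a NamespaceContext or a local-name() test:

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathNamespaceSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Parses a string into a namespace-aware DOM. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Evaluates an XPath 1.0 expression to a string, with prefix "h" bound to XHTML. */
    static String eval(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                if (prefix == null) throw new IllegalArgumentException("null prefix");
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "h" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("h").iterator()
                        : Collections.<String>emptyList().iterator();
            }
        });
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html xmlns=\"" + XHTML_NS + "\"><body><p>hi</p></body></html>");
        System.out.println("[" + eval(doc, "/html/body/p") + "]");          // unprefixed steps
        System.out.println("[" + eval(doc, "/h:html/h:body/h:p") + "]");    // prefixed steps
        System.out.println("[" + eval(doc, "//*[local-name()='p']") + "]"); // namespace-blind test
    }
}
```

    The local-name() form ignores namespaces altogether, which is
    convenient for scraping but gives up the distinction the namespaces
    were carrying.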




    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Nov 30, 2006
    #6
  7. Joe Kesselman schrieb:
    > For that reason, the XHTML elements themselves need to be in the correct
    > namespace (http://www.w3c.org/TR/xhtml1, as you pointed out).


    The namespace URI for XHTML 1.x is <http://www.w3.org/1999/xhtml>.

    --
    Johannes Koch
    In te domine speravi; non confundar in aeternum.
    (Te Deum, 4th cent.)
    Johannes Koch, Nov 30, 2006
    #7
  8. Johannes Koch wrote:
    > The namespace URI for XHTML 1.x is <http://www.w3.org/1999/xhtml>.


    Blush. Yes. Sorry; copied that from the question and didn't stop to
    recheck it. That's what I get for posting in a hurry...


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Nov 30, 2006
    #8
  9. Guest

    Hi Joe,

    Thank you for your clarifications and suggestion. I will definitely
    look into using XPath/XSLT 2.0. I was looking for a way of
    incorporating a default namespace into XPath expressions and XSLT
    transforms, but was surprised to discover that this concept hadn't been
    addressed in the previous version. It is definitely my preference to
    have the capability of dealing with namespaces in my framework if this
    can be done without making it harder to use for the 99% of the cases
    where namespaces are irrelevant.

    Thanks,
    --Erik

    , Dec 1, 2006
    #9
