Highly optimized business-rule validation of very large XML documents possible?

Discussion in 'XML' started by Mike, Nov 21, 2003.

  1. Mike

    Mike Guest

    Related to another topic I just posted, I wanted to discuss ways to optimize
    the validation of very large (>100MB) XML documents.

    First, I have no idea if something like this already exists; it may even be
    the typical implementation for all I know.

    At any rate, it occurs to me that the set of business rules that need to be
    validated against an XML document represent a limited set of nodes at any
    given time (while parsing through the document). For example, if there is a
    parent<->child node dependency, then only the pertinent information related
    to those nodes needs to be kept in memory. Once the dependency has been
    resolved (by validating the rule), the memory associated with those nodes
    could then be freed. In this way, large documents could be validated
    efficiently, by only storing information related to dependencies, and
    immediately freeing memory once the dependency is resolved.

    I don't have a lot of practical XML experience. But I've read, for example,
    that using a SAX parser can be difficult in cases where you need to maintain
    a lot of "state" information. So, what I'm asking is whether there is a
    general solution to this problem, rather than having application-specific
    code to handle the "state" of dependencies?
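    As a concrete (and entirely hypothetical) sketch of the idea, here is a SAX
    handler that keeps state only for one invented rule -- "every <order> must
    contain at least one <item>" -- and frees that state the moment the rule is
    resolved:

```python
# Hypothetical sketch: rule state lives only while the relevant
# subtree is open, and is discarded as soon as the rule is checked.
import xml.sax

class OrderRuleHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.item_count = None   # state exists only inside an <order>
        self.violations = []

    def startElement(self, name, attrs):
        if name == "order":
            self.item_count = 0          # allocate rule state
        elif name == "item" and self.item_count is not None:
            self.item_count += 1

    def endElement(self, name):
        if name == "order":
            if self.item_count == 0:     # dependency resolved: check rule
                self.violations.append("empty <order>")
            self.item_count = None       # free rule state immediately

def check(xml_text):
    handler = OrderRuleHandler()
    xml.sax.parseString(xml_text.encode(), handler)
    return handler.violations
```

    For example, `check("<root><order><item/></order><order/></root>")` would
    report one violation for the empty order. The rule name and element names
    above are invented for illustration.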

    It seems to me that rule dependencies could be represented by a "graph",
    similar in some ways to how the Java garbage collector tracks
    references. And, like the garbage
    collector, memory could be freed once there are no more "references" to a
    particular dependency. The dependencies themselves would be something like
    "threads" that connect nodes. Larger threads would require more memory.
    Further optimization might be achieved by determining if the dependency
    threads are better suited for a depth-first or breadth-first traversal, or
    some combination.

    In my other post, I ask about whether XML Schema can be used for validation
    of rules like this, or if there are other solutions. In the context of this
    post, does XML Schema or any other method support any of the concepts I talk
    about above?

    If XML Schema is unable to handle rules like this, and there is no other
    available solution, does it make sense that something based on XPath might
    work? I'm wondering if the XPath expressions could be used to represent the
    dependencies (as in what to keep in memory), and then something else would
    actually evaluate the dependency.

    Thanks for any help/suggestions/comments,

    Mike
    Mike, Nov 21, 2003
    #1

  2. In article <>,
    Mike <> wrote:

    % I don't have a lot of practical XML experience. But I've read, for example,
    % that using a SAX parser can be difficult in cases where you need to maintain
    % a lot of "state" information. So, what I'm asking is whether there is a

    It's not difficult, it's just that you have to maintain the state information.
    The parser won't do it for you.

    % general solution to this problem, rather than having application-specific
    % code to handle the "state" of dependencies?

    There are SAX parsers which validate against DTDs. They presumably have
    general ways of doing this without saving unneeded nodes. If you find
    a SAX parser which validates against some other schema mechanism, then
    they probably do it reasonably memory-efficiently. I don't think there's
    anything in any of the common schema languages which precludes this.

    % If XML Schema is unable to handle rules like this, and there is no other
    % available solution, does it make sense that something based on XPath might
    % work?

    XPath more-or-less implies that you build a tree of some sort in memory,
    so it works against what you want.
    --

    Patrick TJ McPhee
    East York Canada
    Patrick TJ McPhee, Nov 21, 2003
    #2

  3. Alain Frisch

    Alain Frisch Guest

    Patrick TJ McPhee, in message (comp.text.xml:58709), wrote:
    > There are SAX parsers which validate against DTDs. They presumably have
    > general ways of doing this without saving unneeded nodes. If you find
    > a SAX parser which validates against some other schema mechanism, then
    > they probably do it reasonably memory-efficiently. I don't think there's
    > anything in any of the common schema languages which precludes this.


    DTD is particularly easy to validate on a SAX stream, because of the
    determinism condition on content-model regular expressions (and the
    ID/IDREF constraints are completely orthogonal, and also easy to
    check). XML Schema is slightly more complex, but still, AFAICT, you
    can implement validation without using complex tree automata (because
    its regular expressions are actually regular expressions on tag
    names, not types). I guess the memory complexity is linear in the
    depth of the document.
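    As a rough illustration of that depth-linear claim (my sketch, not code
    from the thread): a streaming pass only ever needs a stack of the
    currently-open elements, so working state grows with document depth, not
    document size.

```python
# Sketch: track the open-element stack during a streaming parse and
# discard each subtree as soon as it is finished.
import io
import xml.etree.ElementTree as ET

def max_depth(xml_text):
    stack, deepest = [], 0
    for event, elem in ET.iterparse(io.StringIO(xml_text),
                                    events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            deepest = max(deepest, len(stack))
        else:
            stack.pop()
            elem.clear()   # free the subtree we just finished
    return deepest
```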

    > XPath more-or-less implies that you build a tree of some sort in memory,
    > so it works against what you want.


    What are your arguments? Is this only an intuition? A large subset of
    XPath can be rewritten to avoid backward axis, and can be evaluated with a
    top-down left-to-right strategy, compatible with stream evaluation.
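    A toy sketch of that strategy (an invented mini-syntax restricted to the
    child axis, nothing like a complete XPath engine): match a forward-only
    path such as /doc/item/name against the stack of open tags while streaming.

```python
# Sketch: stream-evaluate a child-axis-only path expression without
# building the document tree; the path syntax here is invented.
import io
import xml.etree.ElementTree as ET

def stream_select(xml_text, path):
    steps = path.strip("/").split("/")
    stack, hits = [], []
    for event, elem in ET.iterparse(io.StringIO(xml_text),
                                    events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == steps:          # full path matched at this element
                hits.append(elem.text)
            stack.pop()
            elem.clear()                # keep memory bounded by depth
    return hits
```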
    Alain Frisch, Nov 22, 2003
    #3
  4. In article <bpnacc$12e5$>,
    Alain Frisch <> wrote:
    % Patrick TJ McPhee, in message (comp.text.xml:58709), wrote:

    % > XPath more-or-less implies that you build a tree of some sort in memory,
    % > so it works against what you want.
    %
    % What are your arguments? Is this only an intuition?

    Granted `more-or-less implies' sounds a bit like I'm stating a thesis, but
    I'm not categorically saying that it has to be that way. I would not be
    at all surprised to find that one could write an XPath implementation which
    works on a stream. Presumably, you'd want to get all the expressions to
    evaluate up front, so that you don't have to run over the stream once
    for each query, but that's just housekeeping.

    My statement is based a little on XPath's theoretical use of a tree as
    its document representation, and more significantly on my ignorance of
    any XPath implementation which works on anything but a DOM tree. Some
    of the databases keep the tree (of some sort) on disk rather than in
    main memory, but I don't know of an implementation that works
    against a stream. The OP did not sound like someone who was interested
    in writing an XPath implementation to solve his or her problems, so the
    available implementations are an issue.



    --

    Patrick TJ McPhee
    East York Canada
    Patrick TJ McPhee, Nov 23, 2003
    #4
  5. Bob Foster

    Bob Foster Guest

    Nobody could answer your question without knowing what kind of "business
    rules" you want to check.

    But if the answer is you want to be able to check an arbitrary set of
    XPath-based assertions, then there is only one well-known schema language
    that can handle it - Schematron - and AFAIK all Schematron implementations
    keep the entire document tree in memory (for starters). This doesn't
    necessarily mean you couldn't handle a 100MB document, but that you should
    expect to use a lot of memory to do it.
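    A minimal illustration of that point (a hedged sketch -- the rule and
    document are invented, and ElementTree supports only a small XPath subset,
    unlike real Schematron): assertions are evaluated over a fully parsed
    in-memory tree.

```python
# Sketch of Schematron-style checking: the whole document tree is
# held in memory and each rule is an assertion over selected nodes.
import xml.etree.ElementTree as ET

def assert_rules(xml_text, rules):
    """rules: list of (context_path, test_fn, message)."""
    root = ET.fromstring(xml_text)        # entire tree in memory
    failures = []
    for context_path, test, message in rules:
        for node in root.findall(context_path):
            if not test(node):
                failures.append(message)
    return failures

# Invented rule: every <order> must contain an <item>.
rules = [(".//order", lambda n: n.find("item") is not None,
          "order must contain an item")]
```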

    If you determine Schematron would meet your needs, I'd recommend trying some
    test cases with, at least, an XSL-based implementation using Saxon as the
    XSL processor. Saxon works very hard (in its tiny tree mode) to keep a
    compact representation of the document. The other consideration is execution
    time, which might also be considerable.

    To answer your question from a theoretical perspective, several authors have
    claimed that all XPath 1.0 expressions can be evaluated in a single
    top-down, left-to-right pass through a document. A google search will turn
    them up. If this is true, and if the context of every XPath expression in a
    schema can be determined statically, it should be possible to do your
    "business rule" checking in a single pass. Temporary storage would still be
    required to validate assertions that call for global or regional
    document-knowledge, like IDs or identity constraints, but except for
    pathological cases that should be relatively small compared to the entire
    document. More important (memory is cheap) it should be fast.
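    To illustrate the "temporary storage" point, here is a speculative sketch
    (not taken from any implementation): checking ID uniqueness in one
    streaming pass, where the only retained state is the set of IDs seen so
    far, which is small relative to the whole document.

```python
# Sketch: global ID-uniqueness checked in a single streaming pass.
# A real implementation would also discard finished elements; the
# point here is that only the 'seen' set must persist.
import io
import xml.etree.ElementTree as ET

def duplicate_ids(xml_text):
    seen, dupes = set(), []
    for _, elem in ET.iterparse(io.StringIO(xml_text), events=("start",)):
        eid = elem.get("id")
        if eid is not None:
            if eid in seen:
                dupes.append(eid)
            seen.add(eid)
    return dupes
```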

    AFAIK there aren't any of those bad puppies out there, but the expert on
    this subject is Rick Jelliffe, whom you can reach through
    http://www.ascc.net/xml/resource/schematron/schematron.html.

    Bob Foster


    Bob Foster, Nov 23, 2003
    #5
