how to detect that one XML is a subset of another one?

Discussion in 'XML' started by Phlip, Apr 21, 2010.

  1. Phlip

    Phlip Guest

    XMLers:

    How, using XSL or some similar generic tool, can we detect that this
    XML...

    <b> <c d='e' /> </b>

    ....is a subset of this one?

    <ig> <b> <nore> <c d='e' ig='nore'> ig </c> </nore> </b> </ig>

    Note we detect success when c is anywhere inside b, not just its
    immediate child, and c contains at least one attribute d='e'.

    Historically, the best attack I could figure out required recursing
    thru the objects in the reference XML, and for each node creating an
    XPath specifying each node contained certain arguments, and had a
    descendant:: which specified the next inner node.

    For extra credit, this item...

    <b> <c/> <d/> </b>

    ....would NOT match this...

    <b> <d/> <c/> </b>

    ....because the nodes' document order matters.

    So what's the best algorithm here?
     
    Phlip, Apr 21, 2010
    #1
    1. Advertisements

  2. Phlip

    Peter Flynn Guest

    XPath: ig/b[descendant::c[@d='e']]

    eg
    <xsl:template match="ig">
    <xsl:if test="b[descendant::c[@d='e']]">
    ...something...
    </xsl:if>
    <xsl:template match="b">
    <xsl:if test="c/following-sibling::d">
    ...something...
    </xsl:if>
    </xsl:template>

    or, if there mustn't be anything between c and d,

    <xsl:template match="b">
    <xsl:if test="c/following-sibling::*[1][name()='d']">
    ...something...
    </xsl:if>
    </xsl:template>

    ///Peter
     
    Peter Flynn, Apr 22, 2010
    #2
    1. Advertisements

  3. Phlip

    Phlip Guest

    Peter Flynn wrote:

    Txbut!
    Now make it generic, such that /any/ reference XML goes into a
    Generator G, and then G matches the sample XML as if it used such
    XPaths.

    I can write G; I have written one in Ruby and one in Python so far. I
    am asking, if XML is a real format and these XSL things are _almost_
    real languages, am I overlooking some clever meta-pattern in XSL, so I
    can write G in the system that was intended for it? (And I don't mean
    G produces a stack of XPaths; I mean it behaves like them.)

    Another way to state the problem: How to 'diff' two XMLs and return
    only the nodes that differ. If only the sample contains diffs (the
    equivalent of many > ticks and no < ticks on a command-line diff of a
    normal text file), then the reference XML is a subset of the sample.
    So how to write the diff? (_Without_ pretty-printing the XML and
    running a command-line diff - that wouldn't detect nesting correctly!)

    Apologies for the confusing question. The purpose of G is to write
    this assertion:

    self.assert_xml(sample, lambda x:
    x.b(
    x.c(d='e'),
    x.f() # must be after c
    )
    )

    The assertion is easy to write, it skips irrelevant intermediate
    nodes, it skips irrelevant attributes, it detects nesting correctly,
    and it detects relative document order. This would make _unit_ tests
    on generated HTML absurdly easy to write! User-developers can then
    change the HTML art and layout freely, without damaging the tests that
    detect where business logic injects the correct values.
     
    Phlip, Apr 22, 2010
    #3
  4. Not hard to do if you do it in a proper programming language. IBM's
    Alphaworks site had a particularly nice XML diff/merge for a while, but
    it relied upon a nonstandard feature of IBM's original DOM. It could be
    recreated, though, and similar tools do still exist from other sources.

    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
     
    Joe Kesselman, Apr 23, 2010
    #4
  5. Re doing it in XSLT... XSLT is oriented toward tasks where there is only
    one "current node" at a time, and is a nonprocedural language. You
    could make a diff, but doing so would _not_ be easy.

    I'd suggest coding directly against DOM, or against a custom data model.
    XSLT/XQuery aren't intended to be the best answer for every possible
    task; they're just good "swiss army knives" which handle most of the
    most common requirements.
     
    Joe Kesselman, Apr 23, 2010
    #5
  6. Phlip

    Phlip Guest

    Here y'all go, in a proper programming language!

    The complete code, in Python via lxml, is below my sig. The highlights
    are...

    for node in doc.xpath('//*'):
    nodes = [self._node_to_predicate(a) for a in
    node.xpath('ancestor-or-self::*')]
    path = '//' + '/descendant::'.join(nodes)
    node = self.assert_xml(sample, path, **kw)
    location = len(node.xpath('preceding::*'))
    self.assertTrue(doc_order <= location, 'Node out of order!
    ' + path)
    doc_order = location

    That converts each node into an XPath representing its
    characteristics, complete with sufficient predicates and descendant::
    axes to help the XPath skip over irrelevancies.

    The XPaths look like these:

    //XMLPayRequest
    //XMLPayRequest/descendant::RequestData
    //XMLPayRequest/descendant::RequestData/
    descendant::Vendor[ contains(text(), "LOGIN") ]
    //XMLPayRequest/descendant::RequestData/
    descendant::partner[ contains(text(), "PayPal") ]
    //XMLPayRequest/descendant::RequestData/descendant::Transactions
    ....

    The trick with the preceding::* axis seems to count the nodes between
    the head and the current one, so comparing their counts seems to
    enforce document order.

    For extra credit, someone could think of a clever, recursive way to
    cram all that into one humongous XPath. And someone could also find
    away to replace all the "%s" strings with replacement tokens, so
    embedded " marks cause no trouble.

    This method trivially builds each predicate:

    def _node_to_predicate(self, node):
    path = node.tag

    for key, value in node.attrib.items():
    path += '[ contains(@%s, "%s") ]' % (key, value)

    if node.text:
    path += '[ contains(text(), "%s") ]' % node.text

    return path

    So I just need reassurance that I'm indeed at the industry's coal
    face, here, and that nobody else has solved this in some humiliatingly
    trivial way that I should use instead!

    --
    Phlip
    http://zeekland.zeroplayer.com/Pigleg_Too/1

    def _xml_to_tree(self, xml, forgiving=False):
    from lxml import etree
    self._xml = xml

    if not isinstance(xml, basestring):
    self._xml = str(xml)
    return xml

    if '<html' in xml[:200]:
    parser = etree.HTMLParser(recover=forgiving)
    return etree.HTML(str(xml), parser)
    else:
    parser = etree.XMLParser(recover=forgiving)
    return etree.XML(str(xml))

    def assert_xml(self, xml, xpath, **kw):
    'Check that a given extent of XML or HTML contains a given
    XPath, and return its first node'

    if hasattr(xpath, '__call__'):
    return self.assert_xml_tree(xml, xpath, **kw)

    tree = self._xml_to_tree(xml, forgiving=kw.get('forgiving',
    False))
    nodes = tree.xpath(xpath)
    self.assertTrue(len(nodes) > 0, xpath + ' should match ' +
    self._xml)
    node = nodes[0]
    if kw.get('verbose', False): self.reveal_xml(node)
    return node

    def assert_xml_tree(self, sample, block, **kw):
    from lxml.builder import ElementMaker
    doc = block(ElementMaker())
    doc_order = -1

    for node in doc.xpath('//*'):
    nodes = [self._node_to_predicate(a) for a in
    node.xpath('ancestor-or-self::*')]
    path = '//' + '/descendant::'.join(nodes)
    node = self.assert_xml(sample, path, **kw)
    location = len(node.xpath('preceding::*'))
    self.assertTrue(doc_order <= location, 'Node out of order!
    ' + path)
    doc_order = location

    def _node_to_predicate(self, node):
    path = node.tag

    for key, value in node.attrib.items():
    path += '[ contains(@%s, "%s") ]' % (key, value)

    if node.text:
    path += '[ contains(text(), "%s") ]' % node.text

    return path
     
    Phlip, Apr 23, 2010
    #6
  7. Phlip

    Peter Flynn Guest

    It's actually not quite the same as your original question, but:

    onsgmls -wxml xml.dec a.xml >a.esis
    onsgmls -wxml xml.dec b.xml >b.esis
    diff a.esis b.esis

    ///Peter
     
    Peter Flynn, Apr 23, 2010
    #7
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.