lxml, comparing nodes

code_berzerker

I'd like to know if there is any built-in mechanism in lxml that lets
you check equality of two nodes from separate documents. I'd like it
to ignore attribute order and so on. It would be even better if there
were a built-in method for checking equality of whole documents (ignoring
document order). Please let me know if you know of such a method or an
existing script. I don't like reinventing the wheel :)
 
Sebastian \lunar\ Wiesner

code_berzerker said:
I'd like to know if there is any built-in mechanism in lxml that lets
you check equality of two nodes from separate documents. I'd like it
to ignore attribute order and so on. It would be even better if there
were a built-in method for checking equality of whole documents (ignoring
document order). Please let me know if you know of such a method or an
existing script. I don't like reinventing the wheel :)

Did you try the equality operator?
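
For anyone trying that suggestion: a small sketch of what `==` does on elements. It uses the standard library's `xml.etree.ElementTree` as a stand-in (an assumption made so the snippet runs without extra packages); as far as I know, lxml elements also compare by identity rather than by content, but check that against your lxml version.

```python
import xml.etree.ElementTree as ET

# Two separately parsed elements with the same content,
# differing only in attribute order.
a = ET.fromstring('<item id="1" kind="x">text</item>')
b = ET.fromstring('<item kind="x" id="1">text</item>')

same_object = (a == a)                    # an element equals itself
same_content = (a == b)                   # identity comparison, not structural
attribs_match = (a.attrib == b.attrib)    # mapping comparison ignores order

print(same_object, same_content, attribs_match)
```

So the operator alone does not answer the question, although comparing the `attrib` mappings already takes care of attribute order.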
 
Stefan Behnel

code_berzerker said:
I'd like to know if there is any built-in mechanism in lxml that lets
you check equality of two nodes from separate documents.

No, because, as you state yourself, equality is not something that everyone
defines the same way.

I'd like it
to ignore attribute order and so on. It would be even better if there
were a built-in method for checking equality of whole documents (ignoring
document order). Please let me know if you know of such a method or an
existing script. I don't like reinventing the wheel :)

Your requirements for a single Element are simple enough to write it in three
to five lines of Python code (depending on your definition of equality).
Checking this equality recursively is another two to three lines. Not complex
enough to be considered a wheel in the first place.

Stefan
 
code_berzerker

Your requirements for a single Element are simple enough to write it in three
to five lines of Python code (depending on your definition of equality).
Checking this equality recursively is another two to three lines. Not complex
enough to be considered a wheel in the first place.

Forgive my ignorance as I am new to both Python and lxml ;)
 
Fredrik Lundh

code_berzerker said:
Forgive my ignorance as I am new to both Python and lxml ;)

off the top of my head (untested):

>>> def equal(a, b):
...     if a.tag != b.tag or a.attrib != b.attrib:
...         return False
...     if a.text != b.text or a.tail != b.tail:
...         return False
...     if len(a) != len(b):
...         return False
...     if any(not equal(a, b) for a, b in zip(a, b)):
...         return False
...     return True

this should work for arbitrary ET implementations (lxml, xml.etree, ET,
etc). tweak as necessary.

</F>
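
A hedged usage sketch for the function above, with one possible "tweak": text and tail are compared after stripping whitespace, so pretty-printed and compact documents compare equal. The helper `_norm` and the sample documents are made up for illustration, and the standard library's `xml.etree.ElementTree` is used so the snippet is self-contained; the same calls should work with `lxml.etree`.

```python
import xml.etree.ElementTree as ET

def _norm(text):
    # Hypothetical helper: treat None and whitespace-only text alike.
    return (text or "").strip()

def equal(a, b):
    # Same tag and same attributes (mapping comparison ignores order).
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    # Whitespace-insensitive comparison of text and tail (the tweak).
    if _norm(a.text) != _norm(b.text) or _norm(a.tail) != _norm(b.tail):
        return False
    # Same number of children, compared pairwise in document order.
    if len(a) != len(b):
        return False
    return all(equal(x, y) for x, y in zip(a, b))

doc1 = ET.fromstring('<root a="1" b="2"><child>hi</child><other/></root>')
doc2 = ET.fromstring('<root b="2" a="1">\n  <child>hi</child>\n  <other/>\n</root>')
doc3 = ET.fromstring('<root a="1" b="2"><other/><child>hi</child></root>')

print(equal(doc1, doc2))  # attribute order and indentation differ
print(equal(doc1, doc3))  # sibling order differs
```

Note that this version still depends on sibling order, because children are paired up with `zip`.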
 
code_berzerker

off the top of my head (untested):
 >>> def equal(a, b):
...     if a.tag != b.tag or a.attrib != b.attrib:
...         return False
...     if a.text != b.text or a.tail != b.tail:
...         return False
...     if len(a) != len(b):
...         return False
...     if any(not equal(a, b) for a, b in zip(a, b)):
...         return False
...     return True

this should work for arbitrary ET implementations (lxml, xml.etree, ET,
etc).  tweak as necessary.

</F>

Thanks for the help. That's inspiring, though not exactly what I need, because
ignoring document order is a requirement (ignoring changes in the order of
different siblings of the same type, etc). I plan to try something
like this:

def xmlCmp(xmlStr1, xmlStr2):
    et1 = etree.XML(xmlStr1)
    et2 = etree.XML(xmlStr2)

    queue = []
    tmpq = deque([et1])
    tmpq2 = deque([et2])

    while tmpq:
        el = tmpq.popleft()
        tmpq.extend(el)
        queue.append(el.tag)

    while queue:
        el = queue.pop()
        foundEl = findMatchingElem(el, et2)
        if foundEl:
            et1.remove(el)
            tmpq2.remove(foundEl)
        else:
            return False

    if len(tmpq2) == 0:
        return True
    else:
        return False


def findMatchingElem(el, eTree):
    for elem in eTree:
        if elemCmp(el, elem):
            return elem
    return None


def elemCmp(el1, el2):
    pass # yet to be implemented ;)
 
Stefan Behnel

code_berzerker said:
Thanks for the help. That's inspiring, though not exactly what I need, because
ignoring document order is a requirement (ignoring changes in the order of
different siblings of the same type, etc). I plan to try something
like this:

def xmlCmp(xmlStr1, xmlStr2):
    et1 = etree.XML(xmlStr1)
    et2 = etree.XML(xmlStr2)

    queue = []
    tmpq = deque([et1])
    tmpq2 = deque([et2])

    while tmpq:
        el = tmpq.popleft()
        tmpq.extend(el)
        queue.append(el.tag)

    while queue:
        el = queue.pop()
        foundEl = findMatchingElem(el, et2)
        if foundEl:
            et1.remove(el)
            tmpq2.remove(foundEl)
        else:
            return False

    if len(tmpq2) == 0:
        return True
    else:
        return False

If document order doesn't matter, try sorting the elements of each level in
the two documents by some arbitrary deterministic key, such as (tag name,
text, attr count, whatever), and then compare them in order, instead of trying
to find matches in multiple passes. itertools.groupby() might be your friend here.

Stefan
 
code_berzerker

If document order doesn't matter, try sorting the elements of each level in
the two documents by some arbitrary deterministic key, such as (tag name,
text, attr count, whatever), and then compare them in order, instead of trying
to find matches in multiple passes. itertools.groupby() might be your friend here.

I think that sorting multiple times by each attribute will cost more
than what I've come up with:

from lxml import etree
import re

# matches positional predicates such as "[2]" in a path from getpath()
idxRE = re.compile(r"(\[\d*\])")


def xmlEqual(xmlStr1, xmlStr2):
    et1 = etree.XML(xmlStr1)
    et2 = etree.XML(xmlStr2)

    # flat lists of all elements, in document order
    let1 = list(et1.iter())
    let2 = list(et2.iter())

    if len(let1) != len(let2):
        return False

    # match every element of the first document against a not yet
    # matched element of the second one
    while let1:
        el = let1.pop(0)
        foundEl = findMatchingElem(el, let2)
        if foundEl is None:
            return False
        let2.remove(foundEl)
    return True


def findMatchingElem(el, eList):
    for elem in eList:
        if elemsEqual(el, elem):
            return elem
    return None


def elemsEqual(el1, el2):
    if el1.tag != el2.tag or el1.attrib != el2.attrib:
        return False
    # no requirement for text checking for now
    #if el1.text != el2.text or el1.tail != el2.tail:
    #    return False
    # compare the paths from the root, ignoring sibling positions
    path1 = el1.getroottree().getpath(el1)
    path2 = el2.getroottree().getpath(el2)
    path1 = idxRE.sub("", path1)
    path2 = idxRE.sub("", path2)
    if path1 != path2:
        return False

    return True

Notice that if the documents are in exactly the same order, each element is
compared only once!
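
For documents that are not in the same order, the matching loop above may scan the second list repeatedly. A possible alternative, sketched under assumptions: since the element test only looks at tag, attributes and the index-free path, the same result should follow from counting those signatures in each document and comparing the counts. The names `signatures` and `xml_equal_unordered` are made up, and the path is built by hand from tag names with the standard library's `xml.etree.ElementTree` rather than with lxml's `getpath()`, so namespaced tags will look different from what lxml prints.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def signatures(root):
    """Count (index-free path, attributes) signatures of all elements."""
    counts = Counter()
    stack = [(root, "")]
    while stack:
        elem, parent_path = stack.pop()
        # Path of tag names from the root, without positional indexes.
        path = parent_path + "/" + elem.tag
        counts[(path, frozenset(elem.attrib.items()))] += 1
        stack.extend((child, path) for child in elem)
    return counts

def xml_equal_unordered(xml1, xml2):
    # Equal multisets of signatures means every element finds a partner.
    return signatures(ET.fromstring(xml1)) == signatures(ET.fromstring(xml2))

same = xml_equal_unordered(
    '<r><a k="1"/><b><c/></b><a k="2"/></r>',
    '<r><a k="2"/><a k="1"/><b><c/></b></r>',
)
moved = xml_equal_unordered(
    '<r><a k="1"/><b><c/></b><a k="2"/></r>',
    '<r><a k="1"/><b/><c/><a k="2"/></r>',
)
regrouped = xml_equal_unordered(
    '<r><a><b/><b/></a><a/></r>',
    '<r><a><b/></a><a><b/></a></r>',
)
print(same, moved, regrouped)
```

The third case is worth noting: because only the path is compared, children that move between same-named parents are not detected. The same appears to hold for the element-by-element matching above, so whether that is acceptable depends on your definition of equality.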
 
Stefan Behnel

code_berzerker said:
If document order doesn't matter, try sorting the elements of each level in
the two documents by some arbitrary deterministic key, such as (tag name,
text, attr count, whatever), and then compare them in order, instead of trying
to find matches in multiple passes. itertools.groupby() might be your friend here.

I think that sorting multiple times by each attribute will cost more
than I've managed to do: [...]
    let1 = [x for x in et1.iter()]
    let2 = [x for x in et2.iter()]
[...]
    while let1:
        el = let1.pop(0)
        foundEl = findMatchingElem(el, let2)
        if foundEl is None:
            return False
        let2.remove(foundEl)
    return True

def findMatchingElem(el, eList):
    for elem in eList:
        if elemsEqual(el, elem):
            return elem
    return None
[...]
Notice that if the documents are in exactly the same order, each element is
compared only once!

Not in your code.

Stefan
 
code_berzerker

Not in your code.

Not sure what you mean, but I tested it, and so far for every pair of
documents with the same order of elements the number of comparisons was
equal to the number of nodes.
 
Stefan Behnel

code_berzerker said:
Not sure what you mean, but I tested it, and so far for every pair of
documents with the same order of elements the number of comparisons was
equal to the number of nodes.

Sorry, missed the "let2.remove(foundEl)" line.

Stefan
 
