lxml, comparing nodes

code_berzerker · Jul 23, 2008

I'd like to know if there is any built-in mechanism in lxml that lets
you check the equality of two nodes from separate documents. I'd like it
to ignore attribute order and so on. It would be even better if there
was a built-in method for checking the equality of whole documents (ignoring
document order). Please let me know if you know of such a method or an
existing script. I don't like reinventing the wheel.

Sebastian "lunar" Wiesner · Jul 23, 2008

Did you try the equality operator?
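
For reference, a quick check suggests the plain == operator does not do what the question asks. Element objects in the standard library's xml.etree.ElementTree define no structural equality, so == falls back to object identity; as far as I know lxml elements behave the same way, but that is an assumption worth testing on your version. A minimal sketch using only the standard library (the sample document is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Two separately parsed copies of the same (made-up) document.
doc1 = ET.fromstring('<item id="1" kind="x">text</item>')
doc2 = ET.fromstring('<item id="1" kind="x">text</item>')

same_object = doc1 == doc1   # an element is equal to itself
structural = doc1 == doc2    # distinct objects with identical content

print(same_object, structural)  # True False
```

So a content-based comparison has to be written by hand.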

Stefan Behnel · Jul 23, 2008

code_berzerker said:
I'd like to know if there is any built in mechanism in lxml that lets
you check equality of two nodes from separate documents.

No, because, as you state yourself, equality is not something that everyone
defines the same way.

I'd like it
to ignore attribute order and so on. It would be even better if there
was a built-in method for checking the equality of whole documents (ignoring
document order). Please let me know if you know of such a method or an
existing script. I don't like reinventing the wheel.

Your requirements for a single Element are simple enough to write it in three
to five lines of Python code (depending on your definition of equality).
Checking this equality recursively is another two to three lines. Not complex
enough to be considered a wheel in the first place.

Stefan

code_berzerker · Jul 23, 2008

Your requirements for a single Element are simple enough to write it in three
to five lines of Python code (depending on your definition of equality).
Checking this equality recursively is another two to three lines. Not complex
enough to be considered a wheel in the first place.

Forgive my ignorance, as I am new to both Python and lxml.

Fredrik Lundh · Jul 23, 2008

code_berzerker said:
Forgive my ignorance as I am new to both Python and lxml

off the top of my head (untested):

>>> def equal(a, b):
...     if a.tag != b.tag or a.attrib != b.attrib:
...         return False
...     if a.text != b.text or a.tail != b.tail:
...         return False
...     if len(a) != len(b):
...         return False
...     if any(not equal(a, b) for a, b in zip(a, b)):
...         return False
...     return True

this should work for arbitrary ET implementations (lxml, xml.etree, ET,
etc). tweak as necessary.

</F>
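
To see what this comparison does and does not ignore, here is a small usage sketch. It restates the function for the standard library's xml.etree.ElementTree (with the final any(...) written as an equivalent all(...)); the three sample documents are hypothetical. Because attrib is a mapping, attribute order is ignored, while the order of child elements still matters.

```python
import xml.etree.ElementTree as ET

def equal(a, b):
    """Recursive, order-sensitive comparison of two elements."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if a.text != b.text or a.tail != b.tail:
        return False
    if len(a) != len(b):
        return False
    # Compare children pairwise, in document order.
    return all(equal(x, y) for x, y in zip(a, b))

# Hypothetical sample documents.
base = ET.fromstring('<r a="1" b="2"><x/><y/></r>')
attrs_swapped = ET.fromstring('<r b="2" a="1"><x/><y/></r>')
children_swapped = ET.fromstring('<r a="1" b="2"><y/><x/></r>')

print(equal(base, attrs_swapped))     # True
print(equal(base, children_swapped))  # False
```

Note that text and tail are compared exactly, so two documents that differ only in indentation whitespace would count as different unless the strings are normalised first.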

code_berzerker · Jul 24, 2008

Fredrik Lundh said:
off the top of my head (untested):

>>> def equal(a, b):
...     if a.tag != b.tag or a.attrib != b.attrib:
...         return False
...     if a.text != b.text or a.tail != b.tail:
...         return False
...     if len(a) != len(b):
...         return False
...     if any(not equal(a, b) for a, b in zip(a, b)):
...         return False
...     return True

this should work for arbitrary ET implementations (lxml, xml.etree, ET,
etc). tweak as necessary.

</F>

Thanks for the help. That's inspiring, though not exactly what I need, because
ignoring document order is a requirement (ignoring changes in the order of
different siblings of the same type, etc.). I plan to try something
like this:

from collections import deque
from lxml import etree

def xmlCmp(xmlStr1, xmlStr2):
    et1 = etree.XML(xmlStr1)
    et2 = etree.XML(xmlStr2)

    queue = []
    tmpq = deque([et1])
    # every element of the second document is a match candidate
    tmpq2 = deque(et2.iter())

    # breadth-first walk collecting the elements of the first document
    while tmpq:
        el = tmpq.popleft()
        tmpq.extend(el)
        queue.append(el)

    # pair each element with a not yet used match from the second document
    while queue:
        el = queue.pop()
        foundEl = findMatchingElem(el, tmpq2)
        if foundEl is not None:
            tmpq2.remove(foundEl)
        else:
            return False

    return len(tmpq2) == 0

def findMatchingElem(el, candidates):
    for elem in candidates:
        if elemCmp(el, elem):
            return elem
    return None

def elemCmp(el1, el2):
    pass  # yet to be implemented

Stefan Behnel · Jul 24, 2008

If document order doesn't matter, try sorting the elements of each level in
the two documents by some arbitrary deterministic key, such as (tag name,
text, attr count, whatever), and then compare them in order, instead of trying
to find matches in multiple passes. itertools.groupby() might be your friend here.

Stefan
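
One way to read this suggestion is sketched below. Instead of the (tag name, text, attr count) key mentioned above, it uses a recursive canonical key covering the whole subtree, which is a swapped-in variation chosen so that ties between similar siblings cannot change the result. The names canonical and equal_unordered are made up for this sketch, and it uses the standard library's xml.etree.ElementTree; tail text is ignored and element text is stripped, both of which are assumptions about what "equal" should mean.

```python
import xml.etree.ElementTree as ET

def canonical(el):
    """Order-independent key for an element and its whole subtree."""
    return (
        el.tag,
        tuple(sorted(el.attrib.items())),
        (el.text or "").strip(),
        # Sorting the children's keys makes sibling order irrelevant.
        tuple(sorted(canonical(child) for child in el)),
    )

def equal_unordered(a, b):
    """True if both trees hold the same elements, in any sibling order."""
    return canonical(a) == canonical(b)

# Hypothetical sample documents.
doc1 = ET.fromstring('<r><a n="1"/><b>hi</b><a n="2"/></r>')
doc2 = ET.fromstring('<r><a n="2"/><a n="1"/><b>hi</b></r>')
doc3 = ET.fromstring('<r><a n="2"/><a n="3"/><b>hi</b></r>')

print(equal_unordered(doc1, doc2))  # True
print(equal_unordered(doc1, doc3))  # False
```

Each level is sorted once, so the work per level is a single sort rather than a repeated search for matches.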

code_berzerker · Jul 25, 2008

If document order doesn't matter, try sorting the elements of each level in
the two documents by some arbitrary deterministic key, such as (tag name,
text, attr count, whatever), and then compare them in order, instead of trying
to find matches in multiple passes. itertools.groupby() might be your friend here.

I think sorting multiple times, once per attribute, would cost more
than what I've managed to put together:

import re

from lxml import etree

# matches positional predicates such as "[2]" in an element path
idxRE = re.compile(r"(\[\d*\])")

def xmlEqual(xmlStr1, xmlStr2):
    et1 = etree.XML(xmlStr1)
    et2 = etree.XML(xmlStr2)

    let1 = [x for x in et1.iter()]
    let2 = [x for x in et2.iter()]

    if len(let1) != len(let2):
        return False

    while let1:
        el = let1.pop(0)
        foundEl = findMatchingElem(el, let2)
        if foundEl is None:
            return False
        # a matched element must not be matched a second time
        let2.remove(foundEl)
    return True

def findMatchingElem(el, eList):
    for elem in eList:
        if elemsEqual(el, elem):
            return elem
    return None

def elemsEqual(el1, el2):
    if el1.tag != el2.tag or el1.attrib != el2.attrib:
        return False
    # no requirement for text checking for now
    #if el1.text != el2.text or el1.tail != el2.tail:
    #    return False
    # compare the paths from the root with sibling indexes stripped
    path1 = el1.getroottree().getpath(el1)
    path2 = el2.getroottree().getpath(el2)
    path1 = idxRE.sub("", path1)
    path2 = idxRE.sub("", path2)
    if path1 != path2:
        return False

    return True

Notice that if the documents are in exactly the same order, each element is
compared only once!
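
The claim about comparison counts can be checked with a simplified model of the same pairing loop. The sketch below is an assumption-laden stand-in, not the code above: it uses the standard library's xml.etree.ElementTree, matches on tag and attributes only (no path check), and the name count_comparisons and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def count_comparisons(xml1, xml2):
    """Pair elements first-match-first; return (matched, comparisons)."""
    left = list(ET.fromstring(xml1).iter())
    right = list(ET.fromstring(xml2).iter())
    comparisons = 0
    for el in left:
        for i, cand in enumerate(right):
            comparisons += 1
            if el.tag == cand.tag and el.attrib == cand.attrib:
                del right[i]  # a candidate can be used only once
                break
        else:
            # No candidate matched this element.
            return False, comparisons
    return not right, comparisons

in_order = "<r><a/><b/><c/><d/></r>"
reversed_children = "<r><d/><c/><b/><a/></r>"

print(count_comparisons(in_order, in_order))           # (True, 5)
print(count_comparisons(in_order, reversed_children))  # (True, 11)
```

With identical order, the number of comparisons equals the number of elements. With the children reversed it grows towards n * (n + 1) / 2, which is the case where sorting each level first would likely pay off.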

Stefan Behnel · Jul 25, 2008

code_berzerker said:
I think sorting multiple times, once per attribute, would cost more
than what I've managed to put together: [...]
    let1 = [x for x in et1.iter()]
    let2 = [x for x in et2.iter()]
[...]
    while let1:
        el = let1.pop(0)
        foundEl = findMatchingElem(el, let2)
        if foundEl is None:
            return False
        let2.remove(foundEl)
    return True

def findMatchingElem(el, eList):
    for elem in eList:
        if elemsEqual(el, elem):
            return elem
    return None [...]
Notice that if the documents are in exactly the same order, each element is
compared only once!

Not in your code.

Stefan

code_berzerker · Jul 25, 2008

Not in your code.

Stefan

Not sure what you mean, but I tested it, and so far every pair of documents
with the same order of elements needed a number of comparisons equal to the
number of nodes.

Stefan Behnel · Jul 25, 2008

code_berzerker said:
Not sure what you mean, but I tested it, and so far every pair of documents
with the same order of elements needed a number of comparisons equal to the
number of nodes.

Sorry, missed the "let2.remove(foundEl)" line.

Stefan
