XML diff/merge standard

A

Andreas Kasparek

Hola!

I'm preparing my master thesis about a XML Merge Tool implementation and was
wondering if there is any open standard for XML diff regarding topics like:

- is a diff result computed on the ordered or unordered xml node tree of
the compared documents?
- what identifiers/criteria should be used by default to match elements of
the same type in different documents?
- should a diff tool consider move operations or only insert/delete
(besides update)?
- is there any common default behavior that a user would expect from a xml
diff/merge tool?


I've searched the web (especially the W3C site) and newsgroups but couldn't
find anything useful (or I was too blind). I was told this topic is a common
one in this newsgroup but I didn't find anything reasonable here either.

There are a lot of diff programs and papers on the web but none of them seem
to reference or follow any standard. It's not a question that most tools
have to be configured properly by the user to fit the special structure and
semantic of one's specific xml files, so that it will yield optimal results.
But is there a standard behavior (say like in sorting a list in alphabetical
order) that one might expect? There are a lot of xml files out there and so
there have to be users of xml diff/merge tools who surely expect some
specific behavior from these tools, or aren't they? So has anyone taken the
time to write a draft or something about how to handle xml diffs/merges in a
default way?

I don't want help how to implement such a tool or what algorithms there are
but I'm looking for some recommendations on how such a tool should behave in
a default setting without regarding any user preferences.

Any pointers to some (official?) documents/websites/books/groups/conferences
would be nice. Thanks!


Hasta lügo
Andreas


P.S.: Excuse my unqualified use of OE as Newsreader, but I'm constrained to
it here. Oh, and this posting is merely a private affair, nothing which
represents the owner of my sender domain in any kind :)
 
?

=?ISO-8859-15?Q?R=E9mi_Peyronnet?=

Hi,
I'm preparing my master thesis about a XML Merge Tool implementation and was
wondering if there is any open standard for XML diff regarding topics like:
[...Snip...]

As these topics are really dependant of what the user needs and what the
diff software choose to implement, I cannot see how it could be
normalized. (but some standards on the output format exist : XDL (Used
by Microsoft Diff & Patch), DUL (Used by diffxml))
There are a lot of diff programs and papers on the web but none of them seem
to reference or follow any standard. It's not a question that most tools
have to be configured properly by the user to fit the special structure and
semantic of one's specific xml files, so that it will yield optimal results.

Most of the diff tools I have seen are quite generic and do not depend
on the xml structure of the files. But the user have to choose the right
tools / options (spaces, orders,....) for his needs. This is quite the
same thing with the classic "diff" tool, which has many options.

Hth,
 
A

Andreas Kasparek

Hola!

Rémi Peyronnet said:
As these topics are really dependant of what the user needs and what the
diff software choose to implement, I cannot see how it could be
normalized.
Surely things like the ordering of elements or what attributes should be
regarded as IDs are highly dependant on the user's documents, but aren't
there any assumptions on i.e. whether elements could have moved between
subtrees or just have been deleted/inserted, whether an element is matching
one from the other document because it has the same attribute or because it
bears the same child-nodes? I tried some free xml diff tools and often I got
different results without preconfiguring them. And in some cases I really
wondered why the diff result was as it was and not any other way.
I think it would be nice to have some common default behavior one might
expect, to base the own configuration decisions on. How could I know how to
set my preferences if I have to play around with a tool at first to find out
its normal mode of operation?

Lets look at a short example:

Doc1:

<root>
<node>
<subnode>
Text
</subnode>
</node>
<node foo="bar">
Text
</node>
</root>

----------------------

Doc2:

<root>
<node foo="bar">
<subnode>
Text
<subnode>
</node>
<node>
Text
</node>
</root>


So what two elements from both documents are considered to be matching? Is
it the first node of both docs (and respectively the second also), because
they comprise a similar subtree of (sub-/text-)nodes? Or is it the second
node of Doc1 to the first node of Doc2, because they contain the same
attribute (and that the same value, maybe even marked as ID in any schema)
each? And if the ordering of the node-elements has been reversed, then was
the subnode part moved from the none-foo node to the foo-node (meaning it is
in fact the same subnode element and not just a similar one)? Or was it
deleted on the one node and inserted to the other node (is there any case
where that matters?)?

This is a really simple and straightforward example, I'm sure one could
construct more complex structures to show some kind of ambiguity that can't
be resolved as easily.


(but some standards on the output format exist : XDL (Used
by Microsoft Diff & Patch), DUL (Used by diffxml))
And at least a handful others :)

Most of the diff tools I have seen are quite generic and do not depend
on the xml structure of the files. But the user have to choose the right
tools / options (spaces, orders,....) for his needs. This is quite the
same thing with the classic "diff" tool, which has many options.
Ok, that's right.


Thanks!

Hasta lügo
Andreas
 
?

=?ISO-8859-15?Q?R=E9mi_Peyronnet?=

Hi Andreas,
Surely things like the ordering of elements or what attributes should be
regarded as IDs are highly dependant on the user's documents, but aren't
there any assumptions on i.e. whether elements could have moved between
subtrees or just have been deleted/inserted, whether an element is matching
one from the other document because it has the same attribute or because it
bears the same child-nodes? I tried some free xml diff tools and often I got
different results without preconfiguring them.

Of course it would be possible to define a default behaviour of a xml
diff tool, but I cannot see what it could improve :

In my mind, what decided authors to write different free xml diff tools
is that they did not find any other that matches their needs ; there is
too many ways to diff xml files according to your needs :
- a systematic diff (as does microsoft's one and xmldiff), which
reports addition, deletion,... Quite useful with a software, but quite
unusable straight by a human.
- a "minimal differences" diff, (as does ssddiff). Usefull for text
edition (xhtml & co)
- a diff specialized for large xml files with the same structure, (as
does libxmldiff).

The three respond to different needs, and therefore implements
completely different algorithms. If you define a standart behaviour,
that implies at least that all implementations can cope with this
behaviour, ie. implements the standart algorithm.

The dream would be that all implementation implement all the possible
algorithms according all the needs a user could have, but well, that
seems somewhat unrealistic :)
And in some cases I really
wondered why the diff result was as it was and not any other way.
How could I know how to
set my preferences if I have to play around with a tool at first to find out
its normal mode of operation?

Read the documentation ? :)
Lets look at a short example: [Snip]

This is a really simple and straightforward example, I'm sure one could
construct more complex structures to show some kind of ambiguity that can't
be resolved as easily.

Quite true, that proves that xml diff introduces many more
user-dependant kind of needs than a classic diff.
 
E

erich.schubert

Hi,
Note that I'm the author of ssddiff, so my opinion is biased.
- a "minimal differences" diff, (as does ssddiff). Usefull for text
edition (xhtml & co)

Other tools also try to do minimal differences; except usually an
approximation is preferable due to speed and memory reasons. the
ssddiff prototype also has an option to do that but this isn't really
optimized yet (I still have some ideas to improve quality in the fast
mode);
ssddiff also has three different output formats, one is basically a
merge, one is an xupdate diff script (similar to other xml diff
applications, but a rather well-defined standard; whereas many xml diff
applications use a proprietary format, sometimes not even xml itself,
that may or may not have actual patch applications for it...) and the
third one is labelling both source documents so a third party can
further process the "best match" to e.g. produce another output format.

ssddiff is NOT optimal for xhtml. Because changes in XHTML are often
within "strings" of the document (sorry for using the programmers
vocabulary, and not the XML vocabulary: cdata); on the other hand it
can work very well on some cases of XHTML because this format has a
very flexible structure (and the structure is an important part of the
information), whereas in an address book, the structure is next to
meaningless, and except for the ordering a pure classic LCS diff does
the job just fine.

So the big difference with ssddiff is that it tries to do a *semantic*
diff on the structure, whereas other diff applications basically do a
text diff, on tokens instead of lines. (granted, some also do an
inside-out matching and such)
The ssddiff approach could (not the current prototype, but a
straighforward extension of it) also match data in two different xml
formats, or RDF data. Or even diff XML and RDF data.
It doesn't rely on the document to be a tree structure.
(For best results, the user has to specify which relations in the
document contain relevant information)

to the original questions:
- is a diff result computed on the ordered or unordered xml node tree of
the compared documents?

If you specify that the following-sibling::* relation is part of the
structural information, matching pays respect to the ordering of
children. Note that in XML by default the ordering is significant,
although in many applications it is not; there is no "ignore-ordering"
attribute in XML, this has to be handled by applications.
Also you must differentiate between ignoring the ordering in
calculating the diff or in calculating the edit script. By default
ssddiff will ignore ordering of children (unless you give an
appropriate xpath), the xupdate and merge output writers however will
try to reconstruct the original ordering.
- what identifiers/criteria should be used by default to match elements of
the same type in different documents?

This depends very much on your application. You have data where there
are rarely ever changes within string parts, and you can have data
where XML is a mere container and all the difference happens in there.
The ssddiff approach isn't of much use in cases where the structure of
the documents is very restricted and fixed (such as relational
databases exported to XML); the prototype is next to useless since it
only support string equalty.
Other implementations might want to add substring or levenstein
matches.
- should a diff tool consider move operations or only insert/delete
(besides update)?

This depends a lot on what you allow as insert and delete operations,
and what your user base is. If you make diffs for patching XML files,
an insert/delete only diff may be useful.
if the diff is to be read by humans, moves are very useful.
If you do not allow subtree insertions and deletions (note that if I
allow subtree insertions and deletions I can replace a document in just
two operations!), a move can safe much resources.
On merges, moves are also much more useful (when detected correctly),
but can also cause confusion (when an independent deletion and
insertion are incorrectly made a move)
- is there any common default behavior that a user would expect from a xml
diff/merge tool?

It just works?
Well, it depends a lot on wheter the intended audience for the file is
a human reader or not.
Have a look at the harry potter books example in the ssddiff
publication at eXtreme Markup Languages 2005 (available online)
together with slide 27 in
http://ssddiff.alioth.debian.org/talk-extreme-markup-languages.pdf
(the colors make it easier to read; but the explanaition was mostly
given orally, so use the publication itself for the comments)
This is an example where there is no way for a computer to decide which
is best, he has to rely on additional information by the user (such as:
the ISBN of a book is it's unique identifier and should never change);
sometimes this could be derived from DTD/Schema information.

Regards,
Erich Schubert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top