XSLT Compare two documents and output differences

super.raddish

Greetings,

I am relatively new to what I would call advanced XSLT/XPath and I
am after some advice from those in the know. I am attempting to figure
out a mechanism within XSLT to compare two source documents and output
the node-sets which are "different" (changed or new) to new XML files
using xsl:result-document.

To describe the problem I have provided some example data below along
with a portion of my current XSLT. I have changed the meaning of
the data to make it less specific to my project, just in case the
suggestions we get here prove useful to others.

OK, so the problem is as follows:

- We have a source document "SourceData.xml" containing a catalogue
of "Fish" provided for us by a partner so that we can update our
internal databases.

- The process requires that we take each <datarecord> node and parse
it into our internal format using our naming conventions.

- We also have to perform a replacement against their "location"
element, which does not map to our "habitat" values. I have done this
by loading a lookup file called "DataMapping.xml" into a global
variable. I then assign an xsl:key to the @clientname attribute of the
<entry> element. When I need to get the value I grab the client's value
into a variable, switch to the lookup document's context using the
xsl:for-each trick and then perform a lookup using key(x,y).

- Each <datarecord> node in the Source will produce a new XML file
containing a single <updateRecord> element with our structure beneath.

All of this works fine (oddly enough) and we have been quite impressed
with how XSLT handles all this. HOWEVER, we have just been told that
the partner who supplies our Source XML is not able to filter the
records they send us to only contain those new or recently modified;
in fact they have to send us pretty much their entire database. There
is no option for them to change this and, to make matters worse, the
source file could grow to upwards of 50,000 records, making it over
120MB.

I have been asked to look at ways to compare the previous day's Source
XML against the one coming in and output only those records which are
new or have changed. I am currently doing this in the code wrapping the
XSLT transformation, but it's going to get real slow when there are
50k records.

The rules are:

- Both documents will have an identical structure
- Both documents will have ~95% the same content
- The source document <datarecord> has a compound key to make it
unique: <species> + <subspecies>
- A modified record is one with any change to the values of
the elements within its <datarecord>
- A new record is obviously one not found in the previous day's XML
- We want either to produce a single XML file containing the new or
modified records *OR* to incorporate the required XSLT into our current
GenerateDataSegments.xsl

I have been thinking about loading one document as the source and
then using document() to load the previous file (its name passed as a
global param), but frankly I'm a little lost as to how to attack it
after that.

If the answer is that there is no decent way of doing this in XSLT
without killing the load on the machine, does anyone know of a fully
automatable command-line tool or service that can do the "compare and
output differences" bit? Open source or commercial is fine by me. For
the record, I'm currently using the latest build of Saxon-B.


<!-- SourceData.xml -->

<?xml version="1.0" encoding="UTF-8"?>
<main>
<datarecord>
<species>23</species>
<subspecies>23</subspecies>
<location>Pacific</location>
<name>Blue Bopper Fish</name>
</datarecord>
<datarecord>
<species>23</species>
<subspecies>25</subspecies>
<location>Indian</location>
<name>Purple Bopper Fish</name>
</datarecord>
<datarecord>
<species>17</species>
<subspecies>3</subspecies>
<location>Atlantic</location>
<name>Ringed Oaf Fish</name>
</datarecord>
...
</main>


<!-- DataMapping.xml -->

<?xml version="1.0" encoding="UTF-8"?>
<mapping>
<mapsection name="oceans">
<entry clientname="Pacific" internalname="Pacific Ocean" />
<entry clientname="Atlantic" internalname="Atlantic Ocean" />
<entry clientname="Indian" internalname="Indian Ocean" />
<entry clientname="Southern" internalname="Southern Ocean" />
</mapsection>
</mapping>


<!-- GenerateDataSegments.xsl -->

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
  <xsl:param name="outputPath" />
  <xsl:variable name="dataMapping" select="document('DataMapping.xml')" />
  <xsl:key name="oceans" match="mapsection[@name='oceans']/entry"
      use="@clientname" />
  <xsl:template match="/">
    <xsl:for-each select="main/datarecord">
      <xsl:result-document
          href="file:///{$outputPath}-{count(ancestor::node()|preceding::*)}.xml">
        <updateRecord>
          <family><xsl:value-of select="species" /></family>
          <genus><xsl:value-of select="subspecies" /></genus>
          <habitat>
            <xsl:variable name="clientHabitat" select="location" />
            <!-- Switch context to the mapping document so key() searches it -->
            <xsl:for-each select="$dataMapping">
              <xsl:value-of select="key('oceans', $clientHabitat)/@internalname" />
            </xsl:for-each>
          </habitat>
          <fullname><xsl:value-of select="name" /></fullname>
        </updateRecord>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>


<!-- PreviousSourceData.xml - Missing one record and value changed in
another-->

<?xml version="1.0" encoding="UTF-8"?>
<main>
<datarecord>
<species>23</species>
<subspecies>25</subspecies>
<location>Southern</location>
<name>Purple Bopper Fish</name>
</datarecord>
<datarecord>
<species>17</species>
<subspecies>3</subspecies>
<location>Atlantic</location>
<name>Ringed Oaf Fish</name>
</datarecord>
...
</main>

Thanks in advance for your time and assistance,

Al
 
delirio

You could try using the node assertion mechanics of XSLT Unit
(http://xsltunit.org/#notEqual):

<xsltu:test id="test-title">
  <xsl:call-template name="xsltu:assertEqual">
    <xsl:with-param name="id" select="'full-value'"/>
    <xsl:with-param name="nodes1">
      <xsl:apply-templates
          select="document('library.xml')/library/book[isbn='0836217462']/title"/>
    </xsl:with-param>
    <xsl:with-param name="nodes2">
      <h1>Being a Dog Is a Full-Time Job</h1>
    </xsl:with-param>
  </xsl:call-template>
</xsltu:test>
 
super.raddish

I am trying not to use any extensions. I ended up using the following,
which works perfectly.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" />
  <xsl:param name="fileCurrentPath" />
  <xsl:param name="filePreviousPath" />
  <xsl:variable name="fileCurrent" select="document($fileCurrentPath, /)" />
  <xsl:variable name="filePrevious" select="document($filePreviousPath, /)" />
  <xsl:template match="/">
    <main>
      <xsl:apply-templates select="$fileCurrent//datarecord" mode="addedchanged" />
    </main>
  </xsl:template>
  <xsl:template match="//datarecord" mode="addedchanged">
    <xsl:variable name="varSpecies" select="species" />
    <xsl:variable name="varSubspecies" select="subspecies" />
    <xsl:choose>
      <!-- A record with the same compound key exists in the previous file -->
      <xsl:when test="$filePrevious//datarecord[species=$varSpecies][subspecies=$varSubspecies]">
        <!-- Copy it only if its string value has changed -->
        <xsl:if test="not(.=$filePrevious//datarecord[species=$varSpecies][subspecies=$varSubspecies])">
          <xsl:copy-of select="." />
        </xsl:if>
      </xsl:when>
      <!-- No match in the previous file: the record is new -->
      <xsl:otherwise>
        <xsl:copy-of select="." />
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
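
One thing to watch at 50,000 records: unless the processor optimises it, each $filePrevious//datarecord[...] predicate scans the whole previous document, so the work grows roughly with the product of the two record counts. Below is a minimal sketch of the same comparison using xsl:key on the compound key. It assumes an XSLT 2.0 processor (the three-argument key() is 2.0 only) and assumes the current file is supplied as the principal input rather than through fileCurrentPath. The key name recordByKey and the '|' separator are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- URI of the previous day's file; the principal input is the current file -->
  <xsl:param name="filePreviousPath"/>
  <xsl:variable name="filePrevious" select="document($filePreviousPath, /)"/>

  <!-- Index records by the compound key. '|' is an assumed separator
       that never occurs in species or subspecies values. -->
  <xsl:key name="recordByKey" match="datarecord"
           use="concat(species, '|', subspecies)"/>

  <xsl:template match="/">
    <main>
      <xsl:for-each select="main/datarecord">
        <!-- Look up the matching record in the previous document via the key -->
        <xsl:variable name="previous"
            select="key('recordByKey', concat(species, '|', subspecies), $filePrevious)"/>
        <!-- Copy the record when it is new or when its string value differs -->
        <xsl:if test="empty($previous) or not(. = $previous)">
          <xsl:copy-of select="."/>
        </xsl:if>
      </xsl:for-each>
    </main>
  </xsl:template>
</xsl:stylesheet>
```

For the sample data in the first post (leaving out the elided records), this should copy the 23/23 record as new and the 23/25 record as changed, and skip 17/3. Whether the key actually pays off at 120MB would need measuring on the real files.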
 
 
delirio

What I meant was that the XSL code in xsltu already has a template
that compares nodes, any node to any other node. You may want to reuse
that template. This is not an extension; they just replaced the
conventional xsl prefix with xsltu for practical reasons, but the code
underneath is conventional XSL...

Have a look at the code.
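
If an XSLT 2.0 processor is available (Saxon-B is one), the built-in deep-equal() function gives a similar node-to-node comparison without importing any templates. This is a different technique from the XSLTUnit template, not a copy of it. The following is a rough sketch, assuming the same main/datarecord layout as the sample data, the current file as the principal input, and a hypothetical filePreviousPath parameter naming the previous day's file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameter: URI of the previous day's file -->
  <xsl:param name="filePreviousPath"/>
  <xsl:variable name="filePrevious" select="document($filePreviousPath, /)"/>

  <xsl:template match="/">
    <main>
      <xsl:for-each select="main/datarecord">
        <xsl:variable name="species" select="species"/>
        <xsl:variable name="subspecies" select="subspecies"/>
        <!-- Record with the same compound key in the previous file, if any -->
        <xsl:variable name="previous"
            select="$filePrevious/main/datarecord[species = $species][subspecies = $subspecies]"/>
        <!-- deep-equal() is false when $previous is empty (new record)
             or when any element name or value differs (changed record) -->
        <xsl:if test="not(deep-equal(., $previous))">
          <xsl:copy-of select="."/>
        </xsl:if>
      </xsl:for-each>
    </main>
  </xsl:template>
</xsl:stylesheet>
```

deep-equal() compares element names, attributes and text node by node, so a value that moves from one child element to another counts as a change, which a plain string-value comparison with = can in principle miss. Whitespace-only text nodes between elements take part in the comparison too unless they are stripped, so two records that differ only in indentation would be reported as changed.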
 
