Comparing huge XML Files

junnuthala

Hello,

Can someone please suggest a Perl module for comparing huge XML
files?

I tried XML::SemanticDiff, but it takes a very long time to load the
XML file's nodes, elements, and attributes into a hash.

Any suggestions would be really appreciated.

Thank you
-Venkat
 
xhoster

junnuthala said:
Hello,

Can someone please suggest a Perl module for comparing huge XML
files?

How about just an ordinary text-based file diff on them?

I tried XML::SemanticDiff, but it takes a very long time to load the
XML file's nodes, elements, and attributes into a hash.

Is it CPU-bound, or is it swapping itself silly?

Comparing huge XML files requires a huge amount of work, and
that takes a huge amount of resources, which often translates into
long processing times. I doubt that finding another Perl module is
going to change that, unless it makes use of additional constraints on
the organization of the XML files to be compared, which you haven't given
us any information about.

Xho
 
Jürgen Exner

How about just an ordinary text-based file diff on them?

Been there, done that; not a good idea.
XML is format-free, while text-based diff tools flag every swapped
attribute and every additional line break or space as a difference. That
makes them useless for comparing XML files.

jue
 
xhoster

Jürgen Exner said:
Been there, done that; not a good idea.
XML is format-free, while text-based diff tools flag every swapped
attribute and every additional line break or space as a difference.
That makes them useless for comparing XML files.

While this is true of XML files in general, it is not obvious that the OP
wants to compare XML files in general. He probably wants to compare very
specific XML files, and it may be reasonable to expect that two XML files
generated by the same software will not have arbitrarily swapped parameters
or arbitrary changes in the whitespace layout. Only the OP can tell us
whether this is true. In the absence of feedback to the contrary, I
think it is at least worth a try.

Xho
 
Tad McClellan

Jürgen Exner said:
Been there, done that; not a good idea.
XML is format-free, while text-based diff tools flag every swapped
attribute and every additional line break or space as a difference. That
makes them useless for comparing XML files.


... unless you normalize them first.
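
That idea can be sketched in a few lines of core Perl. This is only a crude normalizer, not real canonicalization (a proper canonicalizer, such as XML::LibXML's `toStringC14N`, also handles attribute order, CDATA, and entities); it assumes both files come from well-behaved generators that differ only in whitespace:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Crude normalizer: collapse insignificant whitespace and put one
# tag per line so an ordinary text diff lines up.  It does NOT
# handle CDATA, comments, or attribute order -- a sketch only.
sub normalize_xml {
    my ($xml) = @_;
    $xml =~ s/>\s+</></g;        # drop whitespace between tags
    $xml =~ s/\s+/ /g;           # collapse whitespace runs in text
    $xml =~ s/^\s+|\s+$//g;      # trim leading/trailing whitespace
    $xml =~ s/></>\n</g;         # one tag per line
    return $xml;
}

my $a = "<root>\n  <item id='1'>foo</item>\n</root>\n";
my $b = "<root><item id='1'>foo</item></root>";
print normalize_xml($a) eq normalize_xml($b) ? "same\n" : "differ\n";
```

Writing each normalized file out and running plain diff on the results then reports only real structural differences.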
 
junnuthala

While this is true of XML files in general, it is not obvious that the OP
wants to compare XML files in general. He probably wants to compare very
specific XML files, and it may be reasonable to expect that two XML files
generated by the same software will not have arbitrarily swapped parameters
or arbitrary changes in the whitespace layout. Only the OP can tell us
whether this is true. In the absence of feedback to the contrary, I
think it is at least worth a try.


I want to compare XML files generated by two different programs that
should produce the same format. So I cannot take it for granted that the
whitespace will be identical, which means I cannot use plain text diff tools.
 
John Bokma

junnuthala said:
I want to compare XML files generated by two different softwares, but
should be in same format. So I cannot take it for granted that all the
white spaces will be the same, so I cannot use text diff tools.

diff --help

-i     --ignore-case               Consider upper- and lower-case to be the same.
-w     --ignore-all-space          Ignore all white space.
-b     --ignore-space-change       Ignore changes in the amount of white space.
-B     --ignore-blank-lines        Ignore changes whose lines are all blank.
-I RE  --ignore-matching-lines=RE  Ignore changes whose lines all match RE.

etc.
 
junnuthala

I don't want to use the Unix diff.

I want to parse the XML elements, attributes, and text into a
tree or a hash and then compare those.

I tried XML::SemanticDiff, but it takes a lot of time to read the
XML file into a hash.
 
junnuthala

Thanks for all the replies.

But for a 6MB XML file with more than 300,000 elements, the XML::Parser
module takes almost 35 minutes to produce the parsed tree.

Any suggestions as to why XML::Parser takes so long to parse a
moderately large file?

-Venkat
 
xhoster

junnuthala said:
Thanks for all the replies.

But for a 6MB XML file with more than 300,000 elements,

That is only 20 bytes per element. It is hard to imagine XML with any
degree of sophistication using so little space per element.
the XML::Parser
module takes almost 35 minutes to produce the parsed tree.

This is an absurdly long time. I can parse two 3.6MB files having 100,000
elements each in under a minute (2GHz P4, Linux), using about 100MB of
memory to do so. Either you are swapping badly, you have an ancient
machine, or there is something pathological about your setup.
Any suggestions as to why XML::Parser takes so long to parse a
moderately large file?

I think parts of XML::Parser can fall back to pure Perl if some of
the binary libraries it needs are not installed. Make sure it is not doing
this, and make sure it is not running out of real RAM.

Xho
 
junnuthala

Jamie said:

You might want to use it in event-driven mode. I don't know about 300,000
elements, but I've seen it saw through very large XML documents at blazing
speed using the event-driven model (especially in cases where you're only
interested in the attributes, but that's probably not the case here).

Here's a hint for speed: Only use the callbacks you actually need.

Listening to an event causes the parser to jump out of its compiled code and
into your Perl code. Leaving callbacks undefined (unless you really need them)
avoids this step.

Try taking a pass at it without any callbacks turned on, then introduce your
callbacks to find the bottlenecks.

If you really need in-memory trees, could you maybe break the document down
into several smaller ones? It *might* be faster to invent your own tree
structures in this case, something optimized for read-only access. (I've done
this before, but it's kind of time-consuming; it's really only useful in
extreme cases, like if you need to compare over and over and over.)

Jamie


The bottleneck is not in XML::Parser when I use the "Stream" style, which
returns all the tags in XML format itself.

But when I use the "Tree" style and it processes each tag, I see a much
longer delay.

I guess I have to use the "Stream" style and write my own callback functions
for startTag, endTag, startDocument, and endDocument.

Does anyone have any suggestions on what would be the fastest way?
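
A minimal version of that event-driven pass might look like the following sketch. It assumes XML::Parser (with its expat backend) is installed, and note that XML::Parser's handler names are Start, End, and Char rather than startTag/endTag. Instead of building a tree, it records a flat, attribute-order-independent signature of the document; comparing two such signatures avoids the memory cost of the Tree style:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Collect a flat "signature" of the document: element names with
# sorted attributes, plus trimmed text.  Two documents that differ
# only in whitespace or attribute order get the same signature.
my @sig;

my $p = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $elem, %attr) = @_;
            # sort attribute names so attribute order does not matter
            my $attrs = join ' ', map { qq{$_="$attr{$_}"} } sort keys %attr;
            push @sig, $attrs ? "<$elem $attrs>" : "<$elem>";
        },
        End  => sub { push @sig, "</$_[1]>" },
        Char => sub {
            my ($expat, $text) = @_;
            $text =~ s/^\s+|\s+$//g;       # ignore surrounding whitespace
            push @sig, $text if length $text;
        },
    },
);

$p->parse("<root><item id='1'>foo</item></root>");
print "$_\n" for @sig;
```

Running the same pass over both files and diffing the two signature lists gives a structural comparison without ever holding a full tree in memory.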
 
