My experience with XML::DOM VS XML::LibXML

  • Thread starter Jahagirdar Vijayvithal S

Jahagirdar Vijayvithal S

I have a script which
1> opens a pipe to a program (tethereal) that reads in a binary file (400 MB) and dumps out XML data (XXX GB),
2> grabs chunks of data within <packet>.....</packet> tags (approx. 4
to 20 K each),
3> parses the XML, and
4> post-processes based on fields and attributes in the XML document.

Initially I used XML::DOM and found that my memory consumption
increased constantly, filling up the entire RAM and swap space before
crashing (approx. 32 GB RAM and 160+ GB swap consumed).
After switching to XML::LibXML and replacing the XML::DOM constructs with their
equivalents, I find that my worst-case memory consumption remains below
2 GB each (RAM and swap) and the average is around 50 MB each.

While my problem is solved, I am curious to know whether there are any
known issues which caused the above problems.

The code fragment I used is below:
-----------------------------Code-----------------
use XML::DOM;
my $parser = XML::DOM::Parser->new();
open XML ,"$tethereal -r $pcapfile -T pdml|" or die "Cant open a simple pipe? go smoke one!";
while(<XML>){
    #print;
    if(my $range=/<packet>/.../<\/packet>/){
        if($range==1){
            $data="<JVS_PARSER>";
        }
        $data="$data $_";
        if ($range=~/E0/){
            $data="$data</JVS_PARSER>";
            ......... various calls to function getval and other processing stuff
        }
    }
}

sub getval(){
    my ($data,$name,$attribute)=@_;
    return unless defined($data);
    my $doc = $parser->parse($data);#Should be parsing this just once outside the function call
    foreach my $element ($doc->getElementsByTagName('field')){
        if ($element->getAttribute('name') eq $name){
            return $element->getAttribute($attribute);
        }
    }
    return -1; #Error
}

--------------------------End Code-------------------------
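For comparison, here is a minimal sketch of what an XML::LibXML version of getval might look like. The actual replacement code is not shown in this thread, so the sample chunk and the choice to pass an already parsed document instead of a string are assumptions; only parse_string, getElementsByTagName and getAttribute are relied on from XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Return one attribute of the first <field> whose name matches, or undef.
sub getval {
    my ($doc, $name, $attribute) = @_;
    return unless defined $doc;
    foreach my $element ($doc->getElementsByTagName('field')) {
        if ($element->getAttribute('name') eq $name) {
            return $element->getAttribute($attribute);
        }
    }
    return;    # not found
}

# Hypothetical chunk, shaped like the data collected between the packet tags.
my $data = <<'EO_XML';
<JVS_PARSER>
<packet>
<field name="num" show="1"/>
<field name="len" show="114"/>
</packet>
</JVS_PARSER>
EO_XML

my $doc = $parser->parse_string($data);    # parse once per chunk
print getval($doc, 'len', 'show'), "\n";
```

XML::LibXML frees the document when the last reference to it goes away, so no explicit cleanup call appears in this sketch.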

Regards
Jahagirdar Vijayvithal S
--
 
robic0

I have a script where I am
1> Opening a pipe to a program which reads in a binary file(400MB) and dumps out XML data(XXX GB's) (tethereal)
2> Grabing chunk's of data within tags <packet>.....</packet> (approx 4
to 20 K)
So the data within tags contains the xml you want to parse?
3> parsing the XML within the packet tags
4> post processing based on fields and attributes in the XML document. which document?

Initially I used XML::DOM and found that my memory consumption
constantly increased filling up the entire RAM and SWAP space before
crashing(approx 32GB RAM and 160+ GB Swap consumed).
Switching to XML:LibXML and replacing the XML::DOM constructs with their
equivalent I find that my worst case Memory consumption remains below
2GB each (RAM and swap) and average is around 50 MB each.
use SAX, nodes will fall off the ends of the earth
While my problem is solved I am curious to know wether there is any
known Issues which caused the above problems?

code fragment used by me is as below
-----------------------------Code-----------------
use XML::DOM;
my $parser = XML::DOM::Parser->new();
open XML ,"$tethereal -r $pcapfile -T pdml|" or die "Cant open a simple pipe? go smoke one!";
while(<XML>){
#print;
if(my $range=/<packet>/.../<\/packet>/){
^ ^^^
bad Perl? (...) capture?
what would happen if:
<packet>adsfbasdfbabf</packet><packet>adsfbasdfbabf
</packet><packet>adsfbasdfbabf</packet>
<packet>adsfbasdfbabf
if($range==1){
will it even get here if $range == 0?
$data="<JVS_PARSER>";
}
$data="$data $_";
what are you grabbing here? i thought you just needed packet data?
if ($range=~/E0/){
$data="$data</JVS_PARSER>";
........ various calls to function getval and Other processingstuff
}
}
}

sub getval(){ sub getval(???){ ## ?
my ($data,$name,$attribute)=@_;
return unless defined($data);
my $doc = $parser->parse($data);#Should be parsing this just once outside the function call
foreach my $element ($doc->getElementsByTagName('field')){
if ($element->getAttribute('name') eq $name){
return $element->getAttribute($attribute);
}
}
return -1; #Error
}

--------------------------End Code-------------------------

Regards
Jahagirdar Vijayvithal S

Don't know how you got anything to work on this.
You would be better off using SAX to start with.
I know the whole XXX gigabyte thing attracts attention to your
question, but you should give more details.
 
robic0

use SAX, nodes will fall off the ends of the earth

These are good general-purpose, high-performance alternatives to
building nodes for XML:

use XML::Xerces;
use XML::Parser::Expat;
use XML::Simple;
$XML::Simple::PREFERRED_PARSER = 'XML::Parser';

Once you set up a template for SAX processing,
you can use it on any XML no matter what the structure.
And I'm talking about a quad-terabyte XML file that
you can process with 512 megs of RAM.
 
Jahagirdar Vijayvithal S

* robic0 said:
So the data within tags contains the xml you want to parse? Right.
within the packet tags right again
which document?
Oops, sorry for the confusion. I meant based on elements and attributes.
One additional observation is that XML::DOM would crash after
processing around 6K such packets, whereas XML::LibXML was able to
handle more than .6G packets.
The only things that changed were the lines
use XML::DOM;
my $parser = XML::DOM::Parser->new();
and the function getval();

Since the changes only replaced these lines with their equivalents
in XML::LibXML, I was wondering what the problem could be. My
initial guess is some sort of memory leak......
use SAX, nodes will fall off the ends of the earth
^ ^^^
bad Perl? (...) capture?
bad perl: why?
.... => range operator with evaluation of RHS pattern in the next cycle.
and $range capturing the sequence number.
what would happen if:
<packet>adsfbasdfbabf</packet><packet>adsfbasdfbabf
</packet><packet>adsfbasdfbabf</packet>
<packet>adsfbasdfbabf
</packet>
While theoretically possible, the tool converting the binary data to XML
always dumps out one element per line, e.g.
<packet>
<proto name="geninfo" pos="0" showname="General information" size="114">
<field name="num" pos="0" show="1" showname="Number" value="1" size="114"/>
<field name="len" pos="0" show="114" showname="Packet Length" value="72" size="114"/>
<field name="caplen" pos="0" show="114" showname="Captured Length" value="72" size="114"/>
<field name="timestamp" pos="0" show="Sep 30, 2005 11:34:22.158787000" showname="Captured Time" value="1128060262.158787000" size="114"/>
</proto>
.....
Thats the funny thing about xml, you can edit it in notepad.
will it even get here if $range == 0?
no, refer perldoc perlop for range operator.
what are you grabbing here? i thought you just needed packet data?
Thats right. I am reading each line from XML, as long as it falls
between the said:
sub getval(???){ ## ?
prototype declared at top of file (not included in code snippet)
Don't know how you got anything to work on this.
I hope the comments above would clarify some of your doubts.
You would be better off if you use SAX to start with.
While I have been following the SAX vs ... debate in the other thread, I was
already familiar with XML::DOM and had a code base using DOM to build
on, which made it quick to get working code that
performs the actual processing I am interested in.
I know the whole XXX gigabyte attracts attention to you
question but you should give more details.

Regards
Jahagirdar Vijayvithal S
--
 
Ala Qumsieh

robic0 said:
On Thu, 1 Dec 2005 10:49:38 +0530, Jahagirdar Vijayvithal S
[snip]
while(<XML>){
#print;
if(my $range=/<packet>/.../<\/packet>/){

^ ^^^
bad Perl? (...) capture?

No, this is good Perl. This is not a regexp, but a range operator.
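
To make the behaviour concrete, here is a small sketch of the three-dot range (flip-flop) operator in scalar context. The input lines are made up for illustration; the sequence number and the "E0" suffix on the last line of each range are as described in perlop.

```perl
use strict;
use warnings;

# Hypothetical input: one element per line.
my @lines = (
    "<pdml>\n",
    "<packet>\n",
    "<field name=\"num\" show=\"1\"/>\n",
    "</packet>\n",
    "<packet>\n",
    "<field name=\"num\" show=\"2\"/>\n",
    "</packet>\n",
    "</pdml>\n",
);

my @chunks;
my $data = '';
for (@lines) {
    # In scalar context ... is the flip-flop operator: false until the left
    # pattern matches, then true up to and including the line where the
    # right pattern matches.
    if (my $range = /<packet>/ ... /<\/packet>/) {
        $data = '' if $range == 1;    # sequence number 1 marks the first line
        $data .= $_;
        if ($range =~ /E0$/) {        # "E0" is appended on the last line
            push @chunks, $data;
        }
    }
}
print scalar(@chunks), " chunks collected\n";
```

Lines outside a <packet> ... </packet> range, such as the <pdml> wrapper here, never make the condition true and are skipped.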

--Ala
 
robic0

Oh you know what, I must be going blind. Let me look at this whole
thing again. Give me a few minutes.
 
robic0

Oops sorry for the confusion I meant based on elements and attributes

one additional observation is that the XML::DOM would crash after
processing around 6K such packets where as XML::LibXML was able to
handle more than .6G packets
since the only thing that changed was the line
use XML::DOM;
my $parser = XML::DOM::Parser->new();
and the function getval();

and the changes were that of replacing these lines by their equivalent
in XML::LibXML I was wondering what could be the problem. while my
initial guess is that of some sort of memory leak......
bad perl: why?
... => range operator with evaluation of RHS pattern in the next cycle.
and $range capturing the sequence number.
While theoritically possible the tool converting the binary data to XML
always dumps out one element per line e.g.
<packet>
<proto name="geninfo" pos="0" showname="General information" size="114">
<field name="num" pos="0" show="1" showname="Number" value="1" size="114"/>
<field name="len" pos="0" show="114" showname="Packet Length" value="72" size="114"/>
<field name="caplen" pos="0" show="114" showname="Captured Length" value="72" size="114"/>
<field name="timestamp" pos="0" show="Sep 30, 2005 11:34:22.158787000" showname="Captured Time" value="1128060262.158787000" size="114"/>
</proto>

Quite a few attributes there. Looks good so far. Let me
look it over for a while.
 
robic0

* robic0 <robic0> wrote: [snip]
^ ^^^
bad Perl? (...) capture?
bad perl: why?
... => range operator with evaluation of RHS pattern in the next cycle.
and $range capturing the sequence number.
OK, I think I get how you're using the range operator here (from perlop):

It is false as long as its left operand is false.
Once the left operand is true, the range operator stays true until
the right operand is true, AFTER which the range operator becomes
false again.
[snip]
If you don't want it to test the right operand till the next
evaluation, as in sed,
just use three dots (``...'') instead of two.

- check your other stuff, brb...
 
robic0

But this is scary. you had better hope there's only one element
per line..... still checking
 
robic0

Forgot to include this comment last time:
In SAX you don't have to preprocess like you're doing here.
You could start parsing without doing this, even though you're
looking for inner XML structures. That's not necessary with SAX.
But using DOM, that's the way you have to do it, to reduce the
memory consumption of taking the "whole" into consideration.
........ more to come.
 
robic0

Again forgot. Let me give you an example. You have a 4 gig XML
file. You only want to extract data in random 4K slices of XML.
What you are trying to do with preprocessing here (to reduce
the DOM overhead) is done automatically with SAX stream
processing. You just set the handlers to filter the data you
want into an array of structures or a database. If, as in your case,
the aggregate XML is extremely large and you're only interested
in a few blocks of XML (which are themselves very large), only SAX
can do that. You set up simple handlers (callbacks) for start
tag, end tag, content (and, by the way, for every other W3C
construct). Glean your tag data.

In your case the data you're interested in has a container tag
"<packet>" (I won't go into the case of unclosed nested tags)
which spans a series of tags:

<packet>
<proto name="geninfo" pos="0" showname="General information"
size="114">
<field name="num" pos="0" show="1" showname="Number" value="1"
size="114"/>
<field name="len" pos="0" show="114" showname="Packet Length"
value="72" size="114"/>
<field name="caplen" pos="0" show="114" showname="Captured Length"
value="72" size="114"/>
<field name="timestamp" pos="0" show="Sep 30, 2005 11:34:22.158787000"
showname="Captured Time" value="1128060262.158787000" size="114"/>
</proto>
.....
</packet>

Once the "<packet>" start tag is flagged, you can start collecting
tag data (attributes and content) into a structure that can be
serialized out to a database. You do this until the end tag
"</packet>" is reached.

SAX is event driven, totally serialized. It's so fast you can't
believe it.
In fact your example is so simple I could write the data structures
and the custom part of the template model in less than a day,
including creating database records. The only limit would be
the amount of hard drive space.

You have to learn SAX, especially on such a large model. The rudiments
of the event handlers can for the most part be templated for all
data models you want to extract.
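
As a rough sketch of the start/end handler approach described above, using XML::Parser's handler interface (an Expat-based stream API rather than SAX proper). The sample XML is shortened from the example above, and keeping only each field's show attribute is an assumption made for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my (@packets, $current);

my $parser = XML::Parser->new(
    Handlers => {
        # Called for every start tag, whatever its name.
        Start => sub {
            my ($expat, $tag, %attr) = @_;
            if ($tag eq 'packet') {
                $current = {};    # start collecting a new record
            }
            elsif ($tag eq 'field' && $current) {
                $current->{ $attr{name} } = $attr{show};
            }
        },
        # Called for every end tag.
        End => sub {
            my ($expat, $tag) = @_;
            if ($tag eq 'packet') {
                push @packets, $current;    # or process it and throw it away
                undef $current;
            }
        },
    },
);

$parser->parse(<<'EO_XML');
<pdml>
<packet>
<proto name="geninfo" pos="0" showname="General information" size="114">
<field name="num" pos="0" show="1" showname="Number" value="1" size="114"/>
<field name="len" pos="0" show="114" showname="Packet Length" value="72" size="114"/>
</proto>
</packet>
</pdml>
EO_XML

print "packet $_->{num}: length $_->{len}\n" for @packets;
```

The two handlers are generic: they fire for every tag and branch on the tag name, so the number of handlers does not grow with the number of distinct tags. parse also accepts an open filehandle, which is how a pipe could be streamed without collecting chunks first.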

If you want a good template, or will be doing more of this,
I could provide it to you on a contract basis. I've done a lot of
SAX and know the ins and outs extremely well.

With SAX you have granularity down to the &amp; level.
Let me know what you decide.

Hope this helps....
gluck

dr
 
robic0

On Fri, 02 Dec 2005 22:36:28 -0800, robic0 wrote:

Again forgot. Let me give you an example. You have a 4 gig xml
file. You only want to extract data in random, 4k slices xml.
What you are trying to do with preprocessing here (to reduce)
the dom overhead is done automatically with SAX stream
processing. You just set the handlers to filter the data you
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Actually, a clarification here. You can set any handler you want.
To that end, when you have flagged that you are in the right
location, you can set custom handlers (your subs) to process
new incoming tags, etc. The ways to do this programmatically
could include an array of custom handlers. It's very simple,
and in actuality there is no reason not to have content
wrap. In other words, there is NO way to fool good code at all.
It's about as simple as it can be.... Let me know -gluck
 
robic0

If you want me to write the core or the whole of your project, I can do
that too. But I know you guys are programmers. You could just use a good
master to work from...
 
Jahagirdar Vijayvithal S

* robic0 said:
Forgot to include this comment last time:
In SAX you don't have to preprocess like your doing here.
You could start parsing without doing this, even though your
looking for inner xml structures. Thats not necessary with SAX.
But using dom, thats the way you have to do it, to reduce the
memory consumption by taking the "whole" into consideration.
....... more to come.
One of the reasons I did not go for SAX was that it would increase the code
size. For example, currently the actual code for parsing the XML is within the if
(/<packet>/.../<\/packet>/) block and the getval function (<45 lines). The rest of the
code does the actual work for which the script was written.
I have not written code using SAX before, but from what I understand I
would have to write start/end tag handlers for each and every possible tag,
which in my case runs to hundreds of tags.

Regards
Jahagirdar Vijayvithal S
 
Jahagirdar Vijayvithal S

* robic0 said:
Again forgot. Let me give you an example. You have a 4 gig xml
file. You only want to extract data in random, 4k slices xml.
What you are trying to do with preprocessing here (to reduce)
the dom overhead is done automatically with SAX stream
processing. You just set the handlers to filter the data you
want to an array of structures or database. If in your case
the aggregate xml is extremely large, and your only interrested
in a few blocks of xml (which itself is very large) only SAX
can do that. You set up simple handlers (callbacks) for start
tag, end tag, content (and by the way for every other w3c
constuct). Glean you tag data.

In your case the data your interrested in has a container tag
"<packet>" (I won't go into the case of unclosed nested tags)
which spans a series of tags:
A few additional details on what I am trying to do: I am trying to build
an engine to collect statistics on packet sequences of interest. Each XML
stream may contain up to a million packets, each packet enclosed in a
<packet>
<proto name="geninfo" pos="0" showname="General information"
size="114">
<field name="num" pos="0" show="1" showname="Number" value="1"
size="114"/>
<field name="len" pos="0" show="114" showname="Packet Length"
value="72" size="114"/>
<field name="caplen" pos="0" show="114" showname="Captured Length"
value="72" size="114"/>
<field name="timestamp" pos="0" show="Sep 30, 2005 11:34:22.158787000"
showname="Captured Time" value="1128060262.158787000" size="114"/>
</proto>
....
</packet>

Once the "<packet>" start tag is flagged, you can start collecting
tag data (attributes and content) into a structure that can be
serialized out to a database. You do this until the end tag
"</packet>" is reached.
Isn't this similar to what I get from my use of DOM?

Regards
Jahagirdar Vijayvithal S
 
Jahagirdar Vijayvithal S

* robic0 said:
If you want me to write the core or whole of you project, I can do
that too. But I know you guys are programmers. Just could use a good
master to work from...
Thanks for the offer. I have been writing Perl scripts on and off for
the past 5+ years to automate most of my code generation and parsing
work.
I already have working code.
I started this thread because I saw something which to me indicated a
bug/memory leak in the XML::DOM implementation, so I wanted to know the
opinion of others here on it.


Regards
Jahagirdar Vijayvithal S
 
A. Sinan Unur

The answer seems to lie in the documentation:

http://search.cpan.org/~enno/libxml-enno-1.02/lib/XML/Parser/Expat.pod

release

There are data structures used by XML::Parser::Expat that have circular
references. This means that these structures will never be garbage
collected unless these references are explicitly broken. Calling this
method breaks those references (and makes the instance unusable.)

http://search.cpan.org/~enno/libxml-enno-1.02/lib/XML/Parser.pod

DESCRIPTION

This module provides ways to parse XML documents. It is built on top of
XML::Parser::Expat, which is a lower level interface to James Clark's
expat library. Each call to one of the parsing methods creates a new
instance of XML::Parser::Expat which is then used to parse the document.

It looks like the XML::Parser::Expat instances used to parse each chunk
are not being released. One can see this by running:

Finally:

http://search.cpan.org/~enno/libxml-enno-1.02/lib/XML/DOM.pm

SYNOPSIS

# Avoid memory leaks - cleanup circular references for garbage collection

$doc->dispose;

So compare:

#!/usr/bin/perl

use strict;
use warnings;

use XML::DOM;

my $xml = <<EO_XML;
<person>
<name>John Doe</name>
<phone>21212121215</phone>
<address> 75 Rue de Marshall
Boulevard Attarde
453 France</address>
</person>
EO_XML

while( 1 ) {
    my $p = XML::DOM::Parser->new;
    $p->parse($xml);
}
__END__

Run this and wait for your computer to run out of memory.

However, add the dispose call:

#!/usr/bin/perl

use strict;
use warnings;

use XML::DOM;

my $xml = <<EO_XML;
<person>
<name>John Doe</name>
<phone>21212121215</phone>
<address> 75 Rue de Marshall
Boulevard Attarde
453 France</address>
</person>
EO_XML

while( 1 ) {
    my $p = XML::DOM::Parser->new;
    my $doc = $p->parse($xml);
    $doc->dispose;
}
__END__
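
Applied to the loop from the original post, the same fix might look like the sketch below: parse each chunk once, pull out the values needed, then call dispose before moving on to the next chunk. The helper name field_attr and the sample chunk are hypothetical.

```perl
use strict;
use warnings;
use XML::DOM;

my $parser = XML::DOM::Parser->new;

# Look up one attribute of the first <field> whose name matches, or undef.
sub field_attr {
    my ($doc, $name, $attribute) = @_;
    for my $element ($doc->getElementsByTagName('field')) {
        return $element->getAttribute($attribute)
            if $element->getAttribute('name') eq $name;
    }
    return;    # not found
}

# Hypothetical stand-in for one chunk collected from the pipe.
my $chunk = <<'EO_XML';
<JVS_PARSER>
<packet>
<field name="num" show="1"/>
<field name="len" show="114"/>
</packet>
</JVS_PARSER>
EO_XML

my $doc = $parser->parse($chunk);    # parse once per chunk
my $num = field_attr($doc, 'num', 'show');
my $len = field_attr($doc, 'len', 'show');
$doc->dispose;                       # break the circular references

print "packet $num, length $len\n";
```

The values are copied into plain scalars before dispose is called, since the document and its nodes should not be used afterwards.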
 
robic0

A. Sinan Unur said:
The answer seems to lie in the documentation:

http://search.cpan.org/~enno/libxml-enno-1.02/lib/XML/Parser/Expat.pod

release

There are data structures used by XML::Parser::Expat that have circular
references. This means that these structures will never be garbage
collected unless these references are explicitly broken. Calling this
method breaks those references (and makes the instance unusable.)
And is there some reason this can't be done?

## Parse xml and integrity check
open (SAMP, "name.xml") || die;
my $parser = new XML::Parser::Expat;
$parser->setHandlers('Start' => \&stag_h,
                     'End'   => \&etag_h,
                     'Char'  => \&cdata_h);
## and, if you want it:
## $parser->setHandlers('Comment' => \&comment_h);
## .... etc...
eval {$parser->parse(*SAMP)};
if ($@) {
    ## xml integrity failed
    $@ =~ s/^[\x20\n\t]+//; $@ =~ s/[\x20\n\t]+$//;
    print "$@\n";
}
close(SAMP);
$parser->release;
 

