REXML ... performance & memory usage ...

Jeff Wood · Nov 3, 2006

Wow ... I am trying to use REXML to parse through an 8.8Mb xml file ...
it currently eats almost 800Mb of ram before it seems to do anything ...

Does anybody have any tips on getting REXML to run faster and/or smaller
???

I know it's slow just because it's pure ruby ... and there's a lot going
on ... but ... I can sit here for many minutes just waiting for ANY
console output showing that it's actually gotten to the first
root.elements.each( xpath_expr ) iteration ...

Hints/Tips are/would be VERY much appreciated.

Thanks in advance.

jd

Tom Werner · Nov 3, 2006

Jeff said:
Does anybody have any tips on getting REXML to run faster and/or
smaller ???

If having a pure ruby parser is not a requirement and you're on *nix,
then you can get great performance out of:

http://libxml.rubyforge.org/

It uses libxml2 for the parsing, and as such is quite speedy.

Tom

Jeff Wood · Nov 4, 2006

Tom said:
If having a pure ruby parser is not a requirement and you're on *nix,
then you can get great performance out of:

http://libxml.rubyforge.org/

It uses libxml2 for the parsing, and as such is quite speedy.

Tom

I had to make two fixes to the source to get things to compile.

ruby_xml_parser.c & ruby_xml_document.c both needed to have #include
"stdarg.h" added ... the compiler wasn't happy about trying to deal
with the va_list data type without it.

But, it's compiling now ... just thought I'd pass the information along
for ya.

After modifying my script to use the libxml binding ... it's sitting @
about 220M used instead of 800+M ... ( better ) ... and does only take
10-20 seconds to start iterating over data ...

So, thank you for the pointer ...

jd

David Vallner · Nov 4, 2006

Jeff said:
Wow ... I am trying to use REXML to parse through an 8.8Mb xml file ...
it currently eats almost 800Mb of ram before it seems to do anything ...

At that file size, I'd also start thinking of swallowing the bitter
pill and using stream / pull parsing instead of tree parsing.
Even with a C parser, building up a DOM is not going to do much good
for performance if you need to do processing at that scale more than
seldom. But then again, there's the premature optimization quote that
says to wait with that just yet.
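
For anyone wondering what pull parsing looks like in practice, here is a minimal sketch using the pull parser that ships with REXML. The sample document and the `count_elements` helper are made up for illustration, since the thread never shows the real data.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical sample data; the real 8.8Mb file is not shown in the thread.
SAMPLE_XML = <<~XML
  <catalog>
    <item id="1"><name>alpha</name></item>
    <item id="2"><name>beta</name></item>
    <item id="3"><name>gamma</name></item>
  </catalog>
XML

# Counts elements called +name+ by pulling one event at a time.
# No document tree is built, so memory use should stay roughly flat
# however large the input is. +source+ may be a String or an open File.
def count_elements(source, name)
  parser = REXML::Parsers::PullParser.new(source)
  count = 0
  while parser.has_next?
    event = parser.pull
    # For a start-element event, event[0] is the tag name.
    count += 1 if event.start_element? && event[0] == name
  end
  count
end

puts count_elements(SAMPLE_XML, 'item')
```

The trade-off is that you only ever see one event at a time, so any context you need (parent element, position and so on) has to be tracked by hand.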

David Vallner

Devin Mullins · Nov 4, 2006

Jeff said:
After modifying my script to use the libxml binding ... it's sitting @
about 220M used instead of 800+M ... ( better ) ... and does only take
10-20 seconds to start iterating over data ...

WOW.

You might try optimizing your XPath query. I'm no expert at this (or
even knowledgeable), but I did find in the past that changing the XPath
sometimes made a drastic difference in performance.
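
One change that typically helps is replacing a descendant search (`//name`) with an explicit path, so the engine does not have to consider every node in the tree. The sketch below assumes a made-up document layout; Jeff's actual XPath expression is not shown in the thread, so treat this only as an illustration of the kind of rewrite meant.

```ruby
require 'rexml/document'

# Hypothetical document; the element names are invented for illustration.
doc = REXML::Document.new(<<~XML)
  <catalog>
    <section>
      <item id="1"/>
      <item id="2"/>
    </section>
    <section>
      <item id="3"/>
    </section>
  </catalog>
XML

# Descendant search: every node in the tree is a candidate.
broad = REXML::XPath.match(doc, '//item')

# Explicit path: only walks catalog -> section -> item.
narrow = REXML::XPath.match(doc, '/catalog/section/item')

broad_ids = broad.map { |item| item.attributes['id'] }
narrow_ids = narrow.map { |item| item.attributes['id'] }
puts narrow_ids.inspect
puts broad_ids.sort == narrow_ids.sort
```

Both queries select the same elements here; whether the explicit form is noticeably faster depends on the document and is worth measuring.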

Devin

Chilkat Software · Nov 4, 2006

Jeff,

I recently ported the (freeware) Chilkat XML parser to Ruby, but it only runs
on Windows. I'm curious to see how it performs in comparison. Do you have
a simple example w/ data that I can convert to Chilkat XML? I'll be happy
to write the code...

Best Regards,
Matt Fausey

Robert Klemme · Nov 4, 2006

David said:
At that file size, I'd also start thinking of swallowing the bitter
pill and using stream / pull parsing instead of tree parsing.
Even with a C parser, building up a DOM is not going to do much good
for performance if you need to do processing at that scale more than
seldom. But then again, there's the premature optimization quote that
says to wait with that just yet.

I would not necessarily call that premature optimization. If these
kinds of files are to be parsed frequently and if only a portion of them
needs extracting then I would also go down the stream parser road.

From my experience stream parsers are also appropriate if you have to
transform the XML tree of a document into some other object structure.
IMHO the coding effort for transforming a DOM into another object tree
vs. doing the same with the stream approach is quite equivalent. And
runtime wise you save yourself one whole tree traversal by going stream.
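
As a rough illustration of building objects straight from the stream, here is a sketch using REXML's stream listener interface. The `Item` struct, the `ItemBuilder` class and the sample XML are hypothetical; nothing in the thread describes the real target structure.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical target structure, invented for this example.
Item = Struct.new(:id, :name)

# Builds Item objects directly from stream events; no DOM is created.
class ItemBuilder
  include REXML::StreamListener

  attr_reader :items

  def initialize
    @items = []
    @current = nil   # Item being filled in while inside <item>
    @in_name = false # true while inside <name>
  end

  def tag_start(name, attrs)
    case name
    when 'item' then @current = Item.new(attrs['id'], +'')
    when 'name' then @in_name = !@current.nil?
    end
  end

  def text(data)
    # Text may arrive in several chunks, so append rather than assign.
    @current.name << data if @in_name
  end

  def tag_end(name)
    case name
    when 'name' then @in_name = false
    when 'item'
      @items << @current
      @current = nil
    end
  end
end

xml = '<catalog><item id="1"><name>alpha</name></item>' \
      '<item id="2"><name>beta</name></item></catalog>'

builder = ItemBuilder.new
REXML::Document.parse_stream(xml, builder)
builder.items.each { |item| puts "#{item.id}: #{item.name}" }
```

The listener goes from events to finished objects in one pass, which is the saved tree traversal mentioned above.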

Kind regards

robert

Ross Bamford · Nov 4, 2006

I had to make two fixes to the source to get things to compile.

ruby_xml_parser.c & ruby_xml_document.c both needed to have #include
"stdarg.h" added ... the compiler wasn't happy about trying to deal
with the va_list data type without it.

It's a good job I try to keep up with happenings on ruby-talk. Thanks
for posting about this - it's fixed in CVS now.

Also, given your input data, you might be interested to know that I'm
currently working on a developmental branch for libxml-ruby 0.4, which
includes a new, faster SAX callback interface (among many other
changes). The branch name is DEV_0_4, and it's getting to be quite
stable now.

Also, we have a mailing list:

http://rubyforge.org/mail/?group_id=494

Thanks again,

Tomasz Wegrzanowski · Nov 8, 2006

Wow ... I am trying to use REXML to parse through an 8.8Mb xml file ...
it currently eats almost 800Mb of ram before it seems to do anything ...

Does anybody have any tips on getting REXML to run faster and/or smaller
???

I know it's slow just because it's pure ruby ... and there's a lot going
on ... but ... I can sit here for many minutes just waiting for ANY
console output showing that it's actually gotten to the first
root.elements.each( xpath_expr ) iteration ...

Hints/Tips are/would be VERY much appreciated.

magic/xml has an extremely convenient stream parsing interface.
It's based on REXML so it's pretty slow, but it handles XML files
hundreds of MBs in size using just a few MBs of memory.

The idea is simple - you give it a block, and the block
keeps getting incomplete subtrees. The block can either
complete the current subtree (reading all its children into memory)
or descend into it.

It's something like:

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete! # Read all children of <page>...</page> node
  t = node[:@title] # :@title is a child
  i = node[:@id] # :@id is another child
  print "#{i}: #{t}\n"
}

A short tutorial at http://zabor.org/taw/magic_xml/tutorial.html

I think subtree-based parsers are a great tradeoff between
convenience of read-everything parsers and low memory use
of stream-based parsers. Deciding inside a block seems
much more natural than predefining matched tags (like
in Perl's XML::Twig).
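
If you want to try the subtree idea without installing anything, it can be approximated with REXML's pull parser alone. This is only a sketch of the concept, not how magic/xml is implemented; `each_twig` and the sample data are invented for the example.

```ruby
require 'rexml/document'
require 'rexml/parsers/pullparser'

# Yields each element called +twig_name+ as a small, complete REXML::Element.
# Only one such subtree is held in memory at a time.
# Entity handling is glossed over; this is a sketch, not magic/xml itself.
def each_twig(source, twig_name)
  parser = REXML::Parsers::PullParser.new(source)
  current = nil # innermost open element of the twig being built, if any

  while parser.has_next?
    event = parser.pull
    if event.start_element?
      name = event[0]
      attrs = event[1]
      if current
        current = current.add_element(name, attrs)
      elsif name == twig_name
        current = REXML::Element.new(name)
        current.add_attributes(attrs)
      end
    elsif event.text?
      current.add_text(event[0]) if current
    elsif event.end_element? && current
      if current.parent
        current = current.parent # step back up inside the twig
      else
        yield current            # twig is complete, hand it over
        current = nil
      end
    end
  end
end

# Hypothetical input shaped like the <page> example above.
xml = <<~XML
  <wiki>
    <page><id>1</id><title>Ruby</title></page>
    <page><id>2</id><title>REXML</title></page>
  </wiki>
XML

lines = []
each_twig(xml, 'page') do |page|
  lines << "#{page.elements['id'].text}: #{page.elements['title'].text}"
end
puts lines
```

Inside the block you get the convenience of a normal element with children, while everything outside the matched subtrees is skipped without being stored.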

Enjoy

Jeff Wood · Nov 8, 2006

Tomasz said:
Wow ... I am trying to use REXML to parse through an 8.8Mb xml file ...
it currently eats almost 800Mb of ram before it seems to do anything ...

Does anybody have any tips on getting REXML to run faster and/or smaller
???

I know it's slow just because it's pure ruby ... and there's a lot going
on ... but ... I can sit here for many minutes just waiting for ANY
console output showing that it's actually gotten to the first
root.elements.each( xpath_expr ) iteration ...

Hints/Tips are/would be VERY much appreciated.

magic/xml has an extremely convenient stream parsing interface.
It's based on REXML so it's pretty slow, but it handles XML files
hundreds of MBs in size using just a few MBs of memory.

The idea is simple - you give it a block, and the block
keeps getting incomplete subtrees. The block can either
complete the current subtree (reading all its children into memory)
or descend into it.

It's something like:

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete! # Read all children of <page>...</page> node
  t = node[:@title] # :@title is a child
  i = node[:@id] # :@id is another child
  print "#{i}: #{t}\n"
}

A short tutorial at http://zabor.org/taw/magic_xml/tutorial.html

I think subtree-based parsers are a great tradeoff between
convenience of read-everything parsers and low memory use
of stream-based parsers. Deciding inside a block seems
much more natural than predefining matched tags (like
in Perl's XML::Twig).

Enjoy

Thanks for the tip, I'll have to take a look...

jd

Marcus Bristav · Nov 9, 2006

I think subtree-based parsers are a great tradeoff between
convenience of read-everything parsers and low memory use
of stream-based parsers. Deciding inside a block seems
much more natural than predefining matched tags (like
in Perl's XML::Twig).

Back in the world of j... there are these libs (nux and dom4j and
probably more). They let you stream parse and register callbacks to
xpath expressions. Whenever a registered xpath is encountered it
invokes the callback for that xpath using a dom object (not w3c
DOM...) for the complete sub tree. This is very convenient and raises
the abstraction a bit (the xpath part) from what seems to be your
approach. They don't allow full xpath but only those parts that make
sense in this context.
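
To give a feel for the callback-per-path style in Ruby, here is a small sketch on top of REXML's pull parser. It only understands plain absolute paths such as "/wiki/page", not real XPath, and it says nothing about how nux or dom4j work internally; `PathDispatcher` and the sample data are hypothetical.

```ruby
require 'rexml/document'
require 'rexml/parsers/pullparser'

# Dispatches complete subtrees to callbacks registered under simple
# absolute paths. Hypothetical class, written only for this illustration.
class PathDispatcher
  def initialize
    @handlers = {}
  end

  # Register a block to be called with each element found at +path+.
  def on(path, &block)
    @handlers[path] = block
  end

  def parse(source)
    parser = REXML::Parsers::PullParser.new(source)
    stack = []    # names of the currently open elements
    current = nil # element under construction inside a matched subtree
    handler = nil # callback for the subtree being built

    while parser.has_next?
      event = parser.pull
      if event.start_element?
        stack.push(event[0])
        if current
          current = current.add_element(event[0], event[1])
        elsif (handler = @handlers['/' + stack.join('/')])
          current = REXML::Element.new(event[0])
          current.add_attributes(event[1])
        end
      elsif event.text?
        current.add_text(event[0]) if current
      elsif event.end_element?
        stack.pop
        next unless current
        if current.parent
          current = current.parent
        else
          handler.call(current) # subtree complete
          current = nil
        end
      end
    end
  end
end

titles = []
dispatcher = PathDispatcher.new
dispatcher.on('/wiki/page') { |page| titles << page.elements['title'].text }
dispatcher.parse('<wiki><page><title>Ruby</title></page>' \
                 '<note><title>skip</title></note></wiki>')
puts titles.inspect
```

Registering paths up front keeps the matching logic out of the handler blocks, which is the extra bit of abstraction described above.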

Anyways, look into it, it's very nice.

/Marcus

ps. I think XML processing tools suck quite a bit in Ruby (I love
Ruby...). You cannot do high performance processing in a cross
platform way (as far as I know). It's libxml on *nix or MSXML on win (since
REXML sucks performance wise). It's kind of sad. Is it impossible to
make libxml/libxslt work on Windows?

David Vallner · Nov 9, 2006

Marcus said:
Back in the world of j...

*groan*
*facedesk*
*moan*

Right, can ANYONE explain this braindead fad to me?

Hint: No matter what some of the more loudmouthed bloggers would like to
insinuate in the massive ongoing circlejerk of FUD (from both the Ruby
and the Java side of things):

A) There is no conspiracy of panicking Java (yes, that IS the word)
developers desperately trying to eradicate Ruby in fear for their jobs

B) Having more advanced development tools doesn't increase your penis
size nor girth

C) Being able to code without advanced development tools doesn't
increase your penis size nor girth

D) Blog commenters that swoon over keypress count comparisons aren't
visionaries that have Seen The Truth, they're hapless muppets without
much attention span and too much time on their hands, people that get
actual work can tell what's completely irrelevant to actual practice and
so much waste of webspace and bandwidth

E) Ruby won't kill Java, Java won't kill Ruby, C# won't kill Java, Ruby
won't kill Python, Ajax won't kill the desktop, ActiveRecord won't kill
Hibernate, Rails won't kill Rife, Rife won't kill Rails...

F) No matter how long, or with which fervency you'll compare apples to
oranges, they won't taste equally good to all people

</rant>

Now, is there any chance the general audience of this mailing list will
ever be able to mention other programming languages for the sake of
comparison without in some way indicating revilement of such or
reluctance to do so?

David Vallner

PS: I wonder how many people will see this considering points B and C
are likely to send spam filters into a hissy fit.

Paul Battley · Nov 9, 2006

If having a pure ruby parser is not a requirement and you're on *nix,
then you can get great performance out of:

http://libxml.rubyforge.org/

It uses libxml2 for the parsing, and as such is quite speedy.

I can vouch for that. I changed a bit of slow code from using REXML to
libxml, with fairly minor alterations. The work didn't take long, and
it made a huge difference:

REXML: 0.539 seconds
libxml: 0.012 seconds

Paul.

Ross Bamford · Nov 9, 2006

B) Having more advanced development tools doesn't increase your penis
size nor girth

Dammit! Another six-hundred quid down the drain...
