Ruby 1.8.1 REXML performance

S

Steven Jenkins

I have a script that uses REXML to stream parse an XML file and load a
MySQL database from it. The file is about 65000 lines, 2.2 MB. With Ruby
1.8.0, it completes in about 25 seconds. I just installed Ruby 1.8.1,
and the performance is at least a factor of 20 worse.

This is on a 1 GHz P3 running Gentoo Linux. I don't have head-to-head
comparison numbers, but I'm observing the same effect on a 3.2 GHz P4
running RedHat 9. The script runs for 16 minutes when it should have
completed in a few seconds.

I've commented out the database calls, so I'm just measuring the
Ruby/REXML performance.

Any ideas?
 
S

Steven Jenkins

Here's a little more data. I've stripped the application down to a
bare-bones stream parser:

---------------
#!/usr/bin/env ruby

require "rexml/document"
require "rexml/streamlistener"

class Handler
include REXML::StreamListener

def initialize
end

def tag_start(name, attrs)
end

def tag_end(name)
end

def text(t)
end

def cdata(d)
end
end

REXML::Document.parse_stream($stdin, Handler.new)
---------------

I'm running the tests on two different machines because it's a pain to
switch minor versions on the same machine. But I have run tests on the
same machine to verify the results.

To parse a 65,000 line XML file:

machine1 machine2
1 GHz 3.2 GHz
Ruby 1.8.0 Ruby 1.8.1
17 seconds 515 seconds

Steve
 
S

Steven Jenkins

Quick update--I'll post numbers later. It appears that the REXML stream
parser in Ruby 1.8.1 is worse than O(n) in the number of input lines (or
tags). Ruby 1.8.0 appears to be pretty close to O(n) (as it should).

I hope soon I won't be the only person in this thread :).

Steve
 
D

Daniel Berger

Steven Jenkins said:
Quick update--I'll post numbers later. It appears that the REXML stream
parser in Ruby 1.8.1 is worse than O(n) in the number of input lines (or
tags). Ruby 1.8.0 appears to be pretty close to O(n) (as it should).

I hope soon I won't be the only person in this thread :).

Steve

Steve,

One thing you can do is run your code through the profiler (ruby -r
profile yourcode.rb) and show us the results. This could help us
identify a potential bottleneck in REXML and/or Ruby itself.

Regards,

Dan
 
S

Steven Jenkins

Daniel said:
One thing you can do is run your code through the profiler (ruby -r
profile yourcode.rb) and show us the results.

I've been running lots of tests. Unfortunately, the profiler adds a
large O(n) component to the runtime, so the nonlinear behavior doesn't
emerge until larger n. (n is the number of tags in the input XML file.)

I have a lot of data now and I'll have to organize it a little. The
method call counts look fine--they are all linear in n (with a small
constant offset.) One thing that does jump out, however, is that as n
goes from 1000 to 50000, the time per call for String#strip goes
increases by factor of about 10. For n = 1000, String#strip accounted
for about 6% of the total runtime. For n = 50000, it's about 30%.
(Without the profiler, the percentages would be even higher.) That must
mean String#strip is being passed longer arguments for larger n.

String#strip is called exactly the same number of times as:

REXML::parsers::BaseParser#pull
REXML::parsers::BaseParser#has_next?
REXML::parsers::BaseParser#empty?
REXML::IOSource#empty?
REXML::Source#empty?

IOSource#empty? and Source#empty both call String#strip directly, so I
suspect that's where the problem is.

Steve
 
S

Steven Jenkins

I've repeated all the tests with Ruby 1.8.0 and it sure looks like
String#strip is the problem.

To summarize, I have a bare REXML stream parser that reads its input
file and does nothing. I'm handing it an input file that looks like this

<?xml version="1.0" ?>
<multistatus>
<response>
<dsref>0</dsref>
</response>
<response>
<dsref>1</dsref>
</response>
...
</multistatus>

I'm varying the number of <response> elements in the file from 100 to
50000 and calling that n. In Ruby 1.8.0, the runtime is roughly
proportional to n, as it should be. In 1.8.1, the total runtime is
greater (which is bad but not pathological) and worse than linear in n
(which is pathological).

Here's the profile data for strip over all values of n:

time/call % total runtime
min max min max

Ruby 1.8.0 .01 .02 .82 2.48
Ruby 1.8.1 .10 1.21 5.08 29.04

In both cases, the number of calls is proportional to n, although strip
is called 2.5 times more often in 1.8.0 than 1.8.1. In 1.8.1 (only),
there is a strong increasing trend in the time per call as n increases.

I don't suspect the problem is in strip itself. I suspect the problem is
that REXML creates a string that grows as more input is read and hands
it repeatedly to strip. I looked briefly at the REXML code and didn't
see anything obvious. (Other than the fact that calling strip on a
buffer just to see whether it has any tokens in it is inefficient. That
just makes it slow, however, not nonlinear--unless the buffer is growing.)

I hope this is enough for someone to diagnose and fix the problem. I'll
send the data to anyone who wants it.

Steve
 
L

Lothar Scholz

Hello Steven,

Sunday, February 1, 2004, 6:17:27 PM, you wrote:

SJ> I've repeated all the tests with Ruby 1.8.0 and it sure looks like
SJ> String#strip is the problem.

SJ> To summarize, I have a bare REXML stream parser that reads its input
SJ> file and does nothing. I'm handing it an input file that looks like this

SJ> <?xml version="1.0" ?>
SJ> <multistatus>
SJ> <response>
SJ> <dsref>0</dsref>
SJ> </response>
SJ> <response>
SJ> <dsref>1</dsref>
SJ> </response>
SJ> ...
SJ> </multistatus>

SJ> I'm varying the number of <response> elements in the file from 100 to
SJ> 50000 and calling that n. In Ruby 1.8.0, the runtime is roughly
SJ> proportional to n, as it should be. In 1.8.1, the total runtime is
SJ> greater (which is bad but not pathological) and worse than linear in n
SJ> (which is pathological).

SJ> Here's the profile data for strip over all values of n:

SJ> time/call % total runtime
SJ> min max min max

SJ> Ruby 1.8.0 .01 .02 .82 2.48
SJ> Ruby 1.8.1 .10 1.21 5.08 29.04

SJ> In both cases, the number of calls is proportional to n, although strip
SJ> is called 2.5 times more often in 1.8.0 than 1.8.1. In 1.8.1 (only),
SJ> there is a strong increasing trend in the time per call as n increases.

SJ> I don't suspect the problem is in strip itself. I suspect the problem is
SJ> that REXML creates a string that grows as more input is read and hands
SJ> it repeatedly to strip. I looked briefly at the REXML code and didn't
SJ> see anything obvious. (Other than the fact that calling strip on a
SJ> buffer just to see whether it has any tokens in it is inefficient. That
SJ> just makes it slow, however, not nonlinear--unless the buffer is growing.)

SJ> I hope this is enough for someone to diagnose and fix the problem. I'll
SJ> send the data to anyone who wants it.

You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ? Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.

Yes, no doubt, i'm not an open source fan.
 
M

Martin Weber

(...)
You are the only one talking with you in this thread. Do you really
think that someone else will do your job ?

His job ? it's part of the base system after all.
Do it for free ?

So we should ask for money if you come here and ask for help ?
(.. brainless babble stripped ..)

Yes, no doubt, i'm not an open source fan.

Go use Java on solaris then. (http://www.archub.org/javamemo.txt).
And pay Matz for ruby.

Best regards,

-Martin
 
P

paul vudmaska

Lothar said:
Hello Steven,

Sunday, February 1, 2004, 6:17:27 PM, you wrote:

SJ> I've repeated all the tests with Ruby 1.8.0 and it sure looks like
SJ> String#strip is the problem.

SJ> To summarize, I have a bare REXML stream parser that reads its input
SJ> file and does nothing. I'm handing it an input file that looks like this

SJ> <?xml version="1.0" ?>
SJ> <multistatus>
SJ> <response>
SJ> <dsref>0</dsref>
SJ> </response>
SJ> <response>
SJ> <dsref>1</dsref>
SJ> </response>
SJ> ...
SJ> </multistatus>

SJ> I'm varying the number of <response> elements in the file from 100 to
SJ> 50000 and calling that n. In Ruby 1.8.0, the runtime is roughly
SJ> proportional to n, as it should be. In 1.8.1, the total runtime is
SJ> greater (which is bad but not pathological) and worse than linear in n
SJ> (which is pathological).

SJ> Here's the profile data for strip over all values of n:

SJ> time/call % total runtime
SJ> min max min max

SJ> Ruby 1.8.0 .01 .02 .82 2.48
SJ> Ruby 1.8.1 .10 1.21 5.08 29.04

SJ> In both cases, the number of calls is proportional to n, although strip
SJ> is called 2.5 times more often in 1.8.0 than 1.8.1. In 1.8.1 (only),
SJ> there is a strong increasing trend in the time per call as n increases.

SJ> I don't suspect the problem is in strip itself. I suspect the problem is
SJ> that REXML creates a string that grows as more input is read and hands
SJ> it repeatedly to strip. I looked briefly at the REXML code and didn't
SJ> see anything obvious. (Other than the fact that calling strip on a
SJ> buffer just to see whether it has any tokens in it is inefficient. That
SJ> just makes it slow, however, not nonlinear--unless the buffer is growing.)

SJ> I hope this is enough for someone to diagnose and fix the problem. I'll
SJ> send the data to anyone who wants it.

You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ? Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.

Yes, no doubt, i'm not an open source fan.
Tho i think this thread would be more relevant on the rexml ml, I have
been reading the thread and it does interest me (I've not been
responding because i prob could not help). I think you've been dilegent
and helpful in pointing out either a bug or performance hit. I did not
think you were asking for anything for free or whatever - besides, to
me, that's what these ml's are for, in any case. Good luck in
finding/helping find the culprit for the performance problem.
 
S

Steven Jenkins

Lothar said:
You are the only one talking with you in this thread.

Other people are reading it. You read it. It gets archived.
Do you really think that someone else will do your job ? Do it for free ?

It's not my job. I observed something interesting and I reported it.
Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.

Thanks for the advice. I reported this problem Saturday morning US
Pacific time. It is now Sunday morning. In the intervening day others
and I have (1) confirmed my observations, (2) carried on an off-list
discussion of the problem, and (3) applied a patch developed by kapheine
that completely solves the problem.

And we all did it for free. It was even fun.
Yes, no doubt, i'm not an open source fan.

So what?

Steve
 
G

Gavin Sinclair

Thanks for the advice. I reported this problem Saturday morning US
Pacific time. It is now Sunday morning. In the intervening day others
and I have (1) confirmed my observations, (2) carried on an off-list
discussion of the problem, and (3) applied a patch developed by kapheine
that completely solves the problem.

Well done. I was just about to send a private note thanking you for
your diligence and good investigation and reporting. Lothar's poor
public comment, however, has compelled me to make my comment public.

So thank you :)

Cheers,
Gavin
 
R

Rich

Open source == Learn it yourself by trying and get it right if you can,
with as much help as others are willing to give you in your pursuit.

That is one reason why I've loved Ruby so much. The community _is_ a
community, not just a collection of individuals with similar traits.

I don't think I have a clue about programming yet, but there are those who
are willing to help me get a clue.

I will forever be grateful to those who help - no matter how large or small
their help might end up being.
You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ? Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.

Yes, no doubt, i'm not an open source fan.

No doubt I _am_ an open source and Ruby fan.

-Rich
 
T

Tim Hunter

That is one reason why I've loved Ruby so much. The community _is_ a
community, not just a collection of individuals with similar traits.

Here's something I never thought I'd get from working on an O/S project:
I'm in the U.S., but I've communicated via email with Ruby users in Spain,
Germany, England, Austrailia, and Japan. We're not just a community, but a
_global_ community.
 
M

Michael Vondung

You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ?

Hmm, that's the third or fifth message I am reading from you today,
and they all were negative and/or rude. You don't seem to like Ruby or
its community much. Someone asked you, in another thread, if you are a
troll. I believe the question was on target.
Yes, no doubt, i'm not an open source fan.

Quite apparently not one of decent manners, either.

M.
 
D

Dick Davies

* Sam Roberts said:
Quoteing (e-mail address removed), on Wed, Feb 04, 2004 at 12:35:02AM +0900:

Which is particularly weird, given that Lothar is the principle of a
company which claims to be developing IDEs for open source scripting
languages (betas expected at end of the year...)!

Like Ruby:

http://www.ruby-ide.com/arachno_ruby.php

Maybe its a troll impersonating him?
Or he has MPD?
 
S

Sean Russell

Steven Jenkins said:


Steve,

Sorry for the delay. I've been really distracted for the past few
months with this and that. I'll check your workaround, and your bug
report. I've got a ton of work to do with REXML; one reason I've been
taking so long to fix things is that the performance issue with 1.8.x
is really bothering me; the memory usage of REXML is much worse under
1.8 than it is under 1.6, and I've been blocked trying to figure that
out.

Anyway, sorry -- to everybody -- for being such a laggard. Things
should be better now.

--- SER
 
S

Steven Jenkins

Sean said:
Sorry for the delay. I've been really distracted for the past few
months with this and that.

I know how that goes. Fortunately, Zachary Landau already had a
workaround; I just had to discover it.

Good luck with the 1.8 work. I use REXML in a couple of simple but
important applications. I'm happy with it overall. Performance, as you
note, could be better, but it's not a show-stopper.

Steve
 
S

Sean Russell

Steven Jenkins said:
Good luck with the 1.8 work. I use REXML in a couple of simple but
important applications. I'm happy with it overall. Performance, as you
note, could be better, but it's not a show-stopper.

Yeah, well... I think REXML is at the point where I'm squeezing
optimizations out of it. Many of the optimizations I've tried have
made things worse, and I end up backing them out. I'm not out of
ideas yet, but I doubt that we'll ever see an order of magnitude
improvement, until I write an optional back-end that makes use of a
native parsing library.

One thing you can do, though, is drop out of the advanced API and use
the light parser. The light parser API is pretty limited, at the
moment; it'll be a lot more useful when I get it hooked into XPath,
but it should nearly double the speed of parsing. It does this by
eschewing the heavy, traditional, REXML tree objects. The price is
that you don't get all of the neat helper functions; at that point,
you're dealing with pretty raw data. The API isn't well documented
yet, but it works.

--- SER
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,776
Messages
2,569,603
Members
45,190
Latest member
ClayE7480

Latest Threads

Top