Ruby/Odeum vs. Lucene Performance

Discussion in 'Ruby' started by Zed A. Shaw, May 15, 2005.

  1. Zed A. Shaw

    Zed A. Shaw Guest

    Hi All,

    At the risk of starting a major flame war and giving Java player-haters
    more fuel for their ire, I've done a performance comparison between
    Ruby/Odeum and Lucene:

    http://www.zedshaw.com/projects/ruby_odeum/performance.html

    Please don't take this as a "Java sucks Ruby rulez" posting, or that
    I've done any sort of scientific analysis here. I'm a professional Java
    developer so I'm agnostic about the language wars. It's simply intended
    to answer a common question I get related to Ruby/Odeum.

    For the people who can't or won't read, the test is informal and shows
    that Ruby/Odeum is about 10 times faster when doing a search.

    Comments are welcome.

    Zed A. Shaw
     
    Zed A. Shaw, May 15, 2005
    #1

  2. ES

    ES Guest

    On 15/5/2005, "Zed A. Shaw" <> wrote:
    >Hi All,
    >
    >At the risk of starting a major flame war and giving Java player-haters
    >more fuel for their ire, I've done a performance comparison between
    >Ruby/Odeum and Lucene:
    >
    >http://www.zedshaw.com/projects/ruby_odeum/performance.html
    >
    >Please don't take this as a "Java sucks Ruby rulez" posting, or that
    >I've done any sort of scientific analysis here. I'm a professional Java
    >developer so I'm agnostic about the language wars. It's simply intended
    >to answer a common question I get related to Ruby/Odeum.
    >
    >For the people who can't or won't read, the test is informal and shows
    >that Ruby/Odeum is about 10 times faster when doing a search.
    >
    >Comments are welcome.


    Woo hoo! Java sux0r, Ruby rules!

    >Zed A. Shaw


    E

    P.S. Good library :)

    --
    template<typename duck>
    void quack(duck& d) { d.quack(); }
     
    ES, May 15, 2005
    #2

  3. Robert Feldt

    Robert Feldt Guest

    > >For the people who can't or won't read, the test is informal and shows
    > >that Ruby/Odeum is about 10 times faster when doing a search.
    > >

    Even though performance comparisons are really hard to get right/fair
    this is a very nice indication. I will probably use Ruby/Odeum in an
    ongoing project.

    Thanks,

    Robert
     
    Robert Feldt, May 16, 2005
    #3
  4. In message <>, Zed A. Shaw
    <> writes
    >For the people who can't or won't read, the test is informal and shows
    >that Ruby/Odeum is about 10 times faster when doing a search.


    You should be able to compare the Ruby/JVM startup times by writing
    minimal apps for each that are effectively

    void main()
    {
    }

    Run each 1000 times and compare.

    I think 5 times is far too few when you are relying on the OS to load
    stuff etc. You should discard the first time, as all subsequent times
    will most likely bring your DLLs/SOs from cache.
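
    A minimal sketch of the measurement suggested above, assuming Ruby 2.4
    or later. The helper names (launch_time, startup_samples, mean) are
    illustrative, and "Empty" in the usage comment is a hypothetical,
    already compiled Java class with an empty main method. It launches a
    command repeatedly, times each launch from the outside, and drops the
    first timing:

```ruby
require "rbconfig"

# Wall-clock seconds taken to launch `command` and wait for it to exit.
# Output is discarded so terminal I/O does not distort the timing.
def launch_time(command)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  system(*command, out: File::NULL, err: File::NULL)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Launch `command` `runs` times and drop the first `discard` timings,
# since those are the launches most likely to load files from disk
# rather than from the OS cache.
def startup_samples(command, runs:, discard: 1)
  Array.new(runs) { launch_time(command) }.drop(discard)
end

# Arithmetic mean of a list of timings (floats).
def mean(samples)
  samples.sum / samples.size
end

# Hypothetical usage: an empty Ruby program versus an empty Java class.
#   ruby_times = startup_samples([RbConfig.ruby, "-e", ""], runs: 1000)
#   java_times = startup_samples(["java", "Empty"], runs: 1000)
#   puts mean(ruby_times), mean(java_times)
```

    Discarding exactly one run is the assumption made in this post; a
    different warm-up count can be passed through the discard parameter.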

    For what it's worth, on my 1GHz Athlon Windows XP box when I run Java
    Performance Validator and Ruby Performance Validator, I get the
    impression that Ruby startup time is longer than JVM startup time. But
    then again there is all the time of the injected stub from JPV/RPV as
    well, and maybe the RPV stub is taking longer, not Ruby.

    If Ruby startup time is longer than JVM startup time, that means you are
    doing an even better job than you thought :). This wouldn't surprise me
    as the Java String class is not built in, it's JIT'd, whereas in Ruby the
    string support is built in.

    Stephen
    --
    Stephen Kellett
    Object Media Limited
    Computer Consultancy, Software Development
    Windows C++, Java, Assembler, Performance Analysis, Troubleshooting
     
    Stephen Kellett, May 16, 2005
    #4
  5. Zed A. Shaw

    Zed A. Shaw Guest

    On Mon, 2005-05-16 at 20:00 +0900, Stephen Kellett wrote:
    > In message <>, Zed A. Shaw
    > <> writes
    > >For the people who can't or won't read, the test is informal and shows
    > >that Ruby/Odeum is about 10 times faster when doing a search.

    >
    > You should be able to compare the Ruby/JVM startup times by writing
    > minimal apps for each that are effectively
    >
    > void main()
    > {
    > }
    >
    > Run each 1000 times and compare.
    >

    Actually, I have a confession to make in that I anticipated this and set
    a trap. :)

    The first thing is that there's no statistical basis for "1000 times".
    You actually want to run the test several times in a series of sample
    runs and then determine the common ramp-up time from a cold start.
    Otherwise you'll never know if the few times you ran your "1000 times"
    test were just flukes or not.

    The second thing is that your simple main() for both systems actually
    isn't the "start-up time" since there is complexity in the class loader,
    hotspot JIT compilers, Ruby source translation, etc. All you are
    testing is the time it takes to load your one little main function.

    The actual way to test without the JVM and Ruby start-up times is to do
    the timing inside the JVM rather than outside. In other words, have a
    test case that just runs 1000 times and measure either the total time to
    do the one run, or average and standard deviation of each measurement.
    Again, when you do the test this way you have to figure out the common
    ramp-up time for the system so that you can remove them from the test
    case as outliers later.
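
    A rough sketch of the in-process approach described above, written in
    Ruby (the same idea applies inside the JVM). The helper names and the
    fixed warmup count are assumptions for illustration; in practice the
    ramp-up period would have to be determined from the data rather than
    guessed:

```ruby
# Time the given block `iterations` times from inside the running
# interpreter, so process start-up is excluded from every measurement.
# The first `warmup` timings are taken but thrown away.
def time_iterations(iterations, warmup: 0)
  timings = Array.new(warmup + iterations) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  timings.drop(warmup)
end

# Arithmetic mean of the timings.
def mean(samples)
  samples.sum / samples.size.to_f
end

# Sample standard deviation (divides by n - 1).
def standard_deviation(samples)
  m = mean(samples)
  Math.sqrt(samples.sum { |s| (s - m)**2 } / (samples.size - 1))
end

# Hypothetical usage, where `search` stands in for the operation under test:
#   timings = time_iterations(1000, warmup: 50) { search("ruby") }
#   puts mean(timings), standard_deviation(timings)
```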

    But, of course all of this would take way too long. I'll let Lucene
    folks go through that pain if they feel the need. :)

    > I think 5 times is far too few when you are relying on the OS to load
    > stuff etc. You should discard the first time, as all subsequent times
    > will most likely bring your DLLs/SOs from cache.
    >

    You're right in a way, but your idea that it is only the "first time"
    isn't quite right as the ramp-up period can vary between runs.

    FYI, I did the mean of 5 samples after running a few to get rid of
    ramp-up. I just "eye-balled" the ramp-up, so don't quote me on the
    validity at all.

    Also, there's solid statistics behind only doing a few samples, but I
    didn't use any of those techniques. I believe entire industries have
    been founded on papers with only 3 samples. :)

    > For what it's worth, on my 1GHz Athlon Windows XP box when I run Java
    > Performance Validator and Ruby Performance Validator, I get the
    > impression that Ruby startup time is longer than JVM startup time. But
    > then again there is all the time of the injected stub from JPV/RPV as
    > well, and maybe the RPV stub is taking longer, not Ruby.
    >

    Interesting.

    > If Ruby startup time is longer than JVM startup time, that means you are
    > doing an even better job than you thought :). This wouldn't surprise me
    > as the Java String class is not built in, it's JIT'd, whereas in Ruby the
    > string support is built in.
    >

    I don't know, the JVM JIT really punishes command line tools to death,
    and it's such a pain to turn it off. JIT rocks for long running
    processes, but a test like this is probably being seriously punished.
    That's why I was a bit sheepish about the 10 times faster claim without
    specifically saying that I wanted to include start-up time since I'm
    writing a CLI tool.

    Thanks for the comments.

    Zed
     
    Zed A. Shaw, May 17, 2005
    #5
  6. In message <>, Zed A. Shaw
    <> writes
    >The first thing is that there's no statistical basis for "1000 times".


    There is. The error is smaller. If you don't believe me you need to
    examine why pollsters always ask at least 1000 potential voters their
    opinion. The error rate is +/- 3% with a sample size of approx 1000
    voters. Ask 10 people and predict the election result and your error
    will be much greater than 3%. The pollsters are in it to make money
    predicting outcomes. If they could get away with 5 or 10 samples, they
    would. It would be more profitable. They don't do it that way.
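
    The +/- 3% figure can be checked with the usual normal-approximation
    formula for the margin of error of a proportion. This is a sketch under
    two assumptions not stated above: a 95% confidence level (z = 1.96) and
    the worst case of a 50/50 split. The function name is illustrative:

```ruby
# Margin of error for a proportion estimated from n respondents, using
# the normal approximation z * sqrt(p * (1 - p) / n).
# proportion = 0.5 is the worst case; z = 1.96 corresponds to 95%.
def margin_of_error(n, proportion: 0.5, z: 1.96)
  z * Math.sqrt(proportion * (1 - proportion) / n)
end

puts format("n = 1000: +/- %.1f%%", 100 * margin_of_error(1000))
puts format("n = 10:   +/- %.1f%%", 100 * margin_of_error(10))
```

    With these assumptions a sample of 1000 gives roughly +/- 3.1%, and a
    sample of 10 gives roughly +/- 31%.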

    Also, having written timing analysis programs and deliberately written
    in options to allow me to run the test once, 10 times, a million times,
    whatever, I notice that the more tests you run, the more the errors from
    the fast ones and the slow ones get averaged out, and the closer you get
    to what the real result is.

    I disagree with you.

    >You actually want to run the test several times in a series of sample
    >runs and then determine the common ramp-up time from a cold start.


    Well if you want to do that, I hope your cold start includes a reboot of
    the machine. You don't want anything in the cache.

    >Otherwise you'll never know if the few times you ran your "1000 times"
    >test were just flukes or not.


    I think you misunderstand me. I mean you need to run your test 1000
    times, not put your test in a loop for 1000 times and run it a few
    times. If you are doing that from a cold-start (after boot up) I can see
    why you wouldn't want to do that :). If you are just doing it from the
    command line, wrap it in a shell script to execute ruby/jvm 1000 times.

    The number doesn't have to be 1000, it needs to be something large
    enough to make the error small enough to be discountable. You decide
    what is discountable for your purposes.

    >Also, there's solid statistics behind only doing a few samples, but I
    >didn't use any of those techniques. I believe entire industries have
    >been founded on papers with only 3 samples. :)


    I think you are mistaken. Back to the pollsters again...
    You are voting liberal, he is voting conservative and she is voting
    Labour. So what's the result of the election? :)

    Stephen
    (For American readers, replace with Ralph Nader, Republican and
    Democrat).
    --
    Stephen Kellett
    Object Media Limited
    Computer Consultancy, Software Development
    Windows C++, Java, Assembler, Performance Analysis, Troubleshooting
     
    Stephen Kellett, May 17, 2005
    #6
  7. Stephen Kellett wrote:
    > In message <>, Zed A. Shaw
    > <> writes
    >
    >> The first thing is that there's no statistical basis for "1000 times".

    >
    > There is. The error is smaller. If you don't believe me you need to
    > examine why pollsters always ask at least 1000 potential voters their
    > opinion. The error rate is +/- 3% with a sample size of approx 1000
    > voters. Ask 10 people and predict the election result and your error
    > will be much greater than 3%. The pollsters are in it to make money
    > predicting outcomes. If they could get away with 5 or 10 samples, they
    > would. It would be more profitable. They don't do it that way.


    True (mostly), but irrelevant. Those statistics apply to problems of
    estimating proportions, but this isn't one.

    Characterizing performance of systems like this can be expressed as a
    simple linear regression problem:

    t = a + bx + e

    where

    t = runtime
    a = fixed overhead (startup, teardown, etc.)
    b = runtime per 'size' unit
    x = size of request or returned data
    e = random error

    Choose N values of x and observe their corresponding t values. Estimate
    a and b using standard regression techniques.

    The "goodness" (i.e., the variance) of the estimates of a and b depends
    on the variance of e and the value of N. If var(e) is small, you can get
    good estimates of a and b with small N. In particular, if var(e) = 0,
    you can get perfect estimates of a and b with N = 2.
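
    As an illustration of the estimation step, here is a small ordinary
    least-squares sketch in Ruby. The function name fit_line and the sample
    numbers are made up for the example; the formulas are the standard
    closed-form estimates for t = a + bx + e:

```ruby
# Ordinary least-squares estimates of a (fixed overhead) and b (runtime
# per size unit) from paired observations xs (sizes) and ts (runtimes).
def fit_line(xs, ts)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_t = ts.sum / n
  sxx = xs.sum { |x| (x - mean_x)**2 }
  sxt = xs.zip(ts).sum { |x, t| (x - mean_x) * (t - mean_t) }
  b = sxt / sxx
  a = mean_t - b * mean_x
  [a, b]
end

# With no random error (var(e) = 0), two observations recover a and b
# up to floating-point rounding. The values 0.5 and 0.002 are made up.
xs = [100, 1000]
ts = xs.map { |x| 0.5 + 0.002 * x }
a, b = fit_line(xs, ts)
puts "a = #{a.round(6)}, b = #{b.round(6)}"
```

    With noisy measurements the same function applies; the estimates then
    carry a variance that depends on var(e) and N, as described above.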

    If I needed 1000 samples to get good estimates of performance of an
    information system, I'd stop trying to overcome that with large numbers
    and figure out why randomness plays such a large role in the performance
    of my system.

    Steve
     
    Steven Jenkins, May 17, 2005
    #7
  8. In message <>, Steven Jenkins
    <> writes
    >and figure out why randomness plays such a large role in the performance
    >of my system.


    The execution of other programs running on your multi-process capable
    computer. Hence my rationale.

    Stephen
    --
    Stephen Kellett
    Object Media Limited
    Computer Consultancy, Software Development
    Windows C++, Java, Assembler, Performance Analysis, Troubleshooting
     
    Stephen Kellett, May 17, 2005
    #8
  9. Stephen Kellett wrote:
    > In message <>, Steven Jenkins
    > <> writes
    >
    >> and figure out why randomness plays such a large role in the performance
    >> of my system.

    >
    > The execution of other programs running on your multi-process capable
    > computer. Hence my rationale.


    Your rationale invoked an analogy to estimation of proportions, which,
    as I noted, does not apply to the problem at hand. The mere existence of
    random disturbances does not imply the need to run 1000 tests.

    The right way to do it is to calculate the variances, or better still,
    the confidence intervals of the performance estimates. Pardon my
    bluntness, but anything else is hand-waving.
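
    For a concrete picture, a 95% confidence interval for a mean runtime
    from a handful of samples can be computed along these lines. This is a
    sketch that assumes roughly normal errors and uses a small hard-coded
    table of two-sided Student's t critical values; the names and the
    sample timings in the usage comment are illustrative, not measurements
    from this thread:

```ruby
# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_95 = {
  1 => 12.706, 2 => 4.303, 3 => 3.182, 4 => 2.776, 5 => 2.571,
  6 => 2.447, 7 => 2.365, 8 => 2.306, 9 => 2.262, 10 => 2.228
}.freeze

# 95% confidence interval [low, high] for the mean of `samples`.
# Only sample counts from 2 to 11 are covered by the table above.
def confidence_interval_95(samples)
  n = samples.size
  critical = T_95.fetch(n - 1) do
    raise ArgumentError, "table covers 2 to 11 samples, got #{n}"
  end
  mean = samples.sum / n.to_f
  variance = samples.sum { |s| (s - mean)**2 } / (n - 1)
  half_width = critical * Math.sqrt(variance / n)
  [mean - half_width, mean + half_width]
end

# Hypothetical usage with five made-up timings in seconds:
#   low, high = confidence_interval_95([0.91, 0.88, 0.95, 0.90, 0.93])
#   puts "mean runtime lies in [#{low.round(3)}, #{high.round(3)}] at 95%"
```

    If the intervals for two systems do not overlap, even a small number of
    samples is enough to tell them apart; if they do overlap, that is the
    signal to collect more.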

    Steve
     
    Steven Jenkins, May 17, 2005
    #9