Performance issues with large files -- ruby vs. python :)

Discussion in 'Ruby' started by sa 125, Nov 27, 2008.

  1. sa 125

    sa 125 Guest

    Hi all -

    I'm new to ruby after working with python for a while. My work is
    performing data mining and doing some web dev for my company. We
    recently started looking at rails, and I wanted to see if it's worth
    migrating some of my code from python to ruby. So much for the intro.

    I've re-written a script that extracts data from a very large csv file
    (~8 million rows or so, almost 1 GB in size). It does so by iterating
    through the rows and building a tree-like hash (dict in python) in
    memory that will later be written to our DB. The structure is something
    like this:

    {date =>
    { company =>
    { product => [array of relevant info] } } }

    This way I only get the data I need, sort it on the fly, and speed up
    the process. The idea is to propagate down the keys if some field has
    duplicate values -- I hope that makes some sense.. Anyway, I copied my
    python code, and basically translated it to ruby.

    This is where it got interesting -- after solving all the quirks and
    getting it to run, it appeared to be super slow compared to python. I
    mean nearly 30 minutes in ruby vs. under 2 minutes in python. It's
    worth mentioning that I used the psyco module in python, and that I did
    my testing on win-xp.

    Since I'm a newbie and **really** don't want to spark a python/ruby
    talk-back war (I actually like ruby a lot from what I've seen so far), I
    was just wondering if there's something I might have missed, like a
    psyco equivalent module for ruby or something else to narrow the gap.

    I'd appreciate feedback - thanks!
     
    sa 125, Nov 27, 2008
    #1

  2. It would be nice if you could give us some hints about which libraries
    you are using and what your code looks like. If we are talking about
    such intensive tasks, it might be that you are getting something wrong
    about memory handling and such.

    But, to try a shot: do you use the standard csv-library or do you use
    fastercsv?

    Regards,
    Florian Gilcher
     
    Florian Gilcher, Nov 27, 2008
    #2

  3. sa 125

    sa 125 Guest

    I can't really put the code here since it's on the company's intranet. I
    use the fastercsv and mysql libraries. Basically I want to grab the
    first and last record of every date/company/product combo, and store
    its row info.

    The core processing is done through if statements

    @main_hash = {}
    csv = FasterCSV.open(file_path, "r", :headers => true)

    #...code below is in loop: for row in csv...

    if not @main_hash.keys.member?(date)
      @main_hash[date] = {}
      @main_hash[date][company] = {}
      @main_hash[date][company][prod] = {}
      @main_hash[date][company][prod] = row_values
    else
      if not @main_hash[date].keys.member?(company)
        @main_hash[date][company] = {}
        @main_hash[date][company][prod] = {}
        @main_hash[date][company][prod] = row_values
      else
        if not @main_hash[date][company].keys.member?(prod)
          @main_hash[date][company][prod] = {}
          @main_hash[date][company][prod] = row_values
        end
      end
    end

    # loop ends

    This is basically the part of the code that runs slow. I keep track of
    progress in percentage (file position / file size) throughout the loop.
    I should mention I extract the row values into loop variables, like
    date/company/prod using the csv headers: date = row['Date'], etc.

    the row_values variable is an array containing all the relevant
    parameters from the row. The @main_hash variable obviously takes up some
    memory. There are a couple of if-statements, but not much else. That's
    pretty much all I can think about. Thanks!
     
    sa 125, Nov 27, 2008
    #3
  4. brabuhr

    brabuhr Guest

    # Same logic, testing the hash lookup directly instead of scanning keys:
    unless main_hash[date]
      main_hash[date] = {}
      main_hash[date][company] = {}
      main_hash[date][company][prod] = row_values
    else
      unless main_hash[date][company]
        main_hash[date][company] = {}
        main_hash[date][company][prod] = row_values
      else
        unless main_hash[date][company][prod]
          main_hash[date][company][prod] = row_values
        end
      end
    end

    # Shorter, with ||=:
    main_hash[date] ||= {}
    main_hash[date][company] ||= {}
    main_hash[date][company][prod] ||= row_values

    # Caching the intermediate hashes:
    d = main_hash[date] ||= {}
    c = d[company] ||= {}
    p = c[prod] ||= row_values

    # As a one-liner:
    ((@main_hash[date] ||= {})[company] ||= {})[prod] ||= row_values
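
    Note that the ||= forms keep only the first row seen for each
    date/company/product combination, while the original question asks for
    the first and the last record. A minimal sketch of one way to keep both,
    assuming a made-up helper name (record_row) and a :first/:last layout
    that are not from the original script:

```ruby
# Hypothetical helper (not from the thread): stores the first and the most
# recent row seen for each date/company/product combination.
def record_row(main_hash, date, company, prod, row_values)
  products = ((main_hash[date] ||= {})[company] ||= {})
  # The first row for this product creates the entry; later rows reuse it.
  entry = (products[prod] ||= { :first => row_values })
  # Overwritten on every row, so it ends up holding the last row seen.
  entry[:last] = row_values
  main_hash
end

# Made-up sample rows, for illustration only.
main_hash = {}
record_row(main_hash, '2008-11-27', 'Acme', 'widget', [1, 'open'])
record_row(main_hash, '2008-11-27', 'Acme', 'widget', [2, 'mid'])
record_row(main_hash, '2008-11-27', 'Acme', 'widget', [3, 'close'])
p main_hash['2008-11-27']['Acme']['widget']
```

    This still costs only a few hash lookups per row, so it should not
    change the performance picture discussed in this thread.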
     
    brabuhr, Nov 27, 2008
    #4
  5. brabuhr

    brabuhr Guest

    Simple benchmark:
    http://gist.github.com/29792
     
    brabuhr, Nov 27, 2008
    #5
  6. DO NOT use unless...else. It's the most confusing conditional construct
    ever. Use if and flop the bodies, or just use if !condition.

    - Charlie
     
    Charles Oliver Nutter, Nov 27, 2008
    #6
  7. @main_hash.keys will create an array of all the keys, which will be
    expensive when it's large, and member? will do a linear search, which is
    also expensive. You can replace with:

    if not @main_hash.has_key?(date)

    or more simply

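    if not @main_hash[date]

    In the first branch, the third line (@main_hash[date][company][prod] = {})
    does nothing, because it's replaced by the fourth line. I think all you
    need is:

    @main_hash[date] = { company => { prod => row_values } }

    You can rewrite the keys.member? checks for company and prod in the
    same way as above too.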

    But looking at this, I think you can replace *all* this code with just
    the following three lines:

    @main_hash[date] ||= {}
    @main_hash[date][company] ||= {}
    @main_hash[date][company][prod] ||= row_values

    Note that a ||= b is the same as a = a || b, which will assign b to a
    only if a is nil or false.

    There is also the ruby profiler, which you can turn on/off where needed,
    or just run the whole lot with ruby -rprofile (beware: makes your code
    run *much* slower)
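
    To see the cost of keys.member? versus has_key? described at the top of
    this post, a small benchmark along these lines may help. It is only a
    sketch: the helper names, key format, and sizes are made up, and the
    timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

# Hypothetical helper names, for illustration only.

# Builds an array of all keys, then scans it linearly on every call.
def slow_member?(hash, key)
  hash.keys.member?(key)
end

# Asks the hash directly, without building an intermediate array.
def fast_member?(hash, key)
  hash.has_key?(key)
end

# Made-up data: 5,000 date-like string keys.
hash = {}
5_000.times { |i| hash["date-#{i}"] = {} }

# A mix of keys that exist and keys that do not.
probes = Array.new(500) { |i| "date-#{i * 20}" }

Benchmark.bm(14) do |bm|
  bm.report('keys.member?') { probes.each { |k| slow_member?(hash, k) } }
  bm.report('has_key?')     { probes.each { |k| fast_member?(hash, k) } }
end
```

    With millions of rows and a check on every row, the gap between the two
    grows with the number of keys already stored.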

    Regards,

    Brian.
     
    Brian Candler, Nov 27, 2008
    #7
  8. Robert Dober

    Robert Dober Guest

    I have never heard you shout before, *unless* I am mistaken ;).
    I always felt that unless is just the human readable way of saying if
    not, but accountants do not have any taste (or something like that ;).

    Anyway, here I would rather apply Ruby's "on steroids" construct (citing
    Dave Thomas), "case".

    case
    when not h....
    ....
    when not h....
    ....
    etc.etc
    else
    ....
    end

    I am not sure if faster code is generated but I would guess so.

    All that said, I wonder if your code could not benefit from this kind of
    initialization:

    @main_hash = Hash::new{ |h, k| h[k] = Hash::new{ |h, k| h[k] = {}

    I let *you* close all those braces ;)
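
    With the braces closed, that initialization might look like the sketch
    below. The block parameter names and the sample keys are made up for
    illustration.

```ruby
# Each missing date gets a hash whose missing companies in turn get a
# plain hash of products.
main_hash = Hash.new do |dates, date|
  dates[date] = Hash.new { |companies, company| companies[company] = {} }
end

# No explicit setup of the intermediate levels is needed any more.
main_hash['2008-11-27']['Acme']['widget'] ||= ['row', 'values']

# A later row for the same combination leaves the stored value alone.
main_hash['2008-11-27']['Acme']['widget'] ||= ['ignored', 'later row']

p main_hash
```

    One design caveat: merely reading main_hash[some_date] for a date that
    is not present will create an empty entry for it, so use has_key? when
    you only want to test for existence.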

    HTH
    Robert

    --
    Ne baisse jamais la tête, tu ne verrais plus les étoiles.

    Robert Dober ;)
     
    Robert Dober, Nov 27, 2008
    #8
  9. brabuhr

    brabuhr Guest

    I don't think the confusion issue was simply unless, but rather
    pairing unless with else.

    # not confusing
    unless a == b
    #foo
    end

    # confusing
    unless a == b
    #foo
    else
    #bar
    end

    # not confusing
    if a == b
    #bar
    else
    #foo
    end

    # or
    if a != b
    #foo
    else
    #bar
    end
     
    brabuhr, Nov 27, 2008
    #9
  10. Robert Dober

    Robert Dober Guest

    Hmm, that indeed is a little more misleading, sorry for not getting
    this, but the ellipsis led me astray ;).
     
    Robert Dober, Nov 27, 2008
    #10
  11. Not entirely on topic, but the biggest issue I've had with
    "unless...else" is that you can't do "unless...elsif...else".

    I somewhat frequently find myself adding conditionals for extra corner
    cases, and in an "unless...else" block, that means re-writing the
    whole thing as an "if not...else" block anyway.

    -Josh
     
    Joshua Ballanco, Nov 27, 2008
    #11
  12. Monika Moser

    Monika Moser Guest


    Hi,
    I have a very similar problem: large data files and a CSV import with
    fastercsv that is much slower than an implementation in C++.
    I was wondering whether you have gained any interesting insights about
    that in the meantime which you would like to share :).
    Thanks,
    Monika
     
    Monika Moser, May 13, 2009
    #12
  13. Also, which Ruby version are you using? In my experience, Ruby 1.9.1
    is significantly faster than all 1.8 versions and it also has better
    memory usage characteristics.

    Kind regards

    robert
     
    Robert Klemme, May 13, 2009
    #13
  14. sa 125

    sa 125 Guest

    Hi Monika,

    it's been a while since I had that problem, and I ended up sticking with
    python for that particular issue. However, I found that reading the csv
    file with simple file IO is much faster than using FasterCSV. It will
    not be as convenient as fastercsv, but if speed is what you're going
    for, it just might do.
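
    A minimal sketch of that plain file IO approach, not the original
    script: the helper name and the sample data are made up, and it assumes
    the file has a header line and no quoted fields, embedded commas, or
    embedded newlines.

```ruby
require 'tempfile'

# Hypothetical helper: yields each data row as an array of strings.
# Only safe for simple CSV without quoting or embedded commas/newlines.
def each_simple_row(path)
  File.open(path, 'r') do |file|
    header = file.gets                 # read and skip the header line
    return if header.nil?              # empty file: nothing to yield
    file.each_line do |line|
      yield line.chomp.split(',', -1)  # -1 keeps trailing empty fields
    end
  end
end

# Made-up sample data written to a temporary file.
sample = Tempfile.new(['sample', '.csv'])
sample.write("Date,Company,Product,Value\n")
sample.write("2008-11-27,Acme,widget,10\n")
sample.write("2008-11-27,Acme,gadget,\n")
sample.close

rows = []
each_simple_row(sample.path) { |fields| rows << fields }
p rows
```

    Fields are addressed by position here rather than by header name, which
    is part of the convenience given up in exchange for speed.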

    I must add that I haven't tried ruby 1.9 yet, which is said to have a
    much faster interpreter than 1.8 - though I don't know if it's
    compatible with the fastercsv library. If you end up finding out about
    it, please let me know how it works out.
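
    On the compatibility question: FasterCSV was adopted as the standard
    csv library in Ruby 1.9, so on 1.9 and later the same parser is
    available through require 'csv'. The sketch below uses that library
    with modern keyword syntax; the 'Company' and 'Product' header names
    and the sample rows are assumptions, only 'Date' appears earlier in the
    thread.

```ruby
require 'csv'

# Made-up sample data; the real file has millions of rows.
data = <<~CSV_TEXT
  Date,Company,Product,Value
  2008-11-27,Acme,widget,10
  2008-11-27,Acme,widget,12
  2008-11-28,Acme,gadget,7
CSV_TEXT

main_hash = {}
CSV.parse(data, headers: true) do |row|
  date    = row['Date']
  company = row['Company']   # assumed header name
  prod    = row['Product']   # assumed header name
  # Keep the first row seen for each date/company/product combination.
  ((main_hash[date] ||= {})[company] ||= {})[prod] ||= row.fields
end

p main_hash
```

    For a file on disk, CSV.foreach(path, headers: true) reads row by row
    in the same way without loading the whole file into memory.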

    Thanks!
     
    sa 125, May 14, 2009
    #14
  15. You may want to try JRuby; many people who have moved to JRuby have done
    so explicitly because its performance characteristics with large data
    sets are very good.

    - Charlie
     
    Charles Oliver Nutter, May 14, 2009
    #15
  16. James Gray

    James Gray Guest

    Well, I take it to be pretty obvious that FasterCSV (written in Ruby)
    isn't going to be as fast as a C/C++ parser. I believe Ruby has a
    C-based parser though, if you want to go that way:

    http://rubyforge.org/projects/simplecsv

    You might also try FasterCSV's latest code, not yet released but in
    version control. It has a new parser that can be faster for some
    things.

    James Edward Gray II
     
    James Gray, May 14, 2009
    #16
