Performance issues with large files -- ruby vs. python :)

Discussion in 'Ruby' started by sa 125, Nov 27, 2008.

  1. sa 125

    sa 125 Guest

    Hi all -

    I'm new to ruby after working with python for a while. My work is
    performing data mining and doing some web dev for my company. We
    recently started looking at rails, and I wanted to see if it's worth
    migrating some of my code from python to ruby. So much for the intro.

    I've re-written a script that extracts data from a very large csv file
    (~8 million rows or so, almost 1 GB in size). It does so by iterating
    through the rows and building a tree-like hash (dict in python) in
    memory that will later be written to our DB. The structure is something
    like this:

    {date =>
    { company =>
    { product => [array of relevant info] } } }

    This way I only get the data I need, sort it on the fly, and speed up
    the process. The idea is to propagate down the keys if some field has
    duplicate values -- I hope that makes some sense.. Anyway, I copied my
    python code, and basically translated it to ruby.

    This is where it got interesting -- after solving all the quirks and
    getting it to run, it appeared to be super slow compared to python. I
    mean nearly 30 min in ruby vs under 2 minutes in python code. It's
    worth mentioning that I used the psyco module in python, and that I did
    my testing on win-xp.

    Since I'm a newbie and **really** don't want to spark a python/ruby
    talk-back war (I actually like ruby a lot from what I've seen so far), I
    was just wondering if there's something I might have missed, like a
    psyco equivalent module for ruby or something else to narrow the gap.

    I'd appreciate feedback - thanks!
    --
    Posted via http://www.ruby-forum.com/.
     
    sa 125, Nov 27, 2008
    #1

  2. Florian Gilcher

    It would be nice if you could give us some hints on which libraries you
    are using and what your code looks like. If we are talking about such
    intensive tasks, it might be that you are getting something wrong about
    memory handling and such.

    But, to take a shot: do you use the standard csv library or do you use
    fastercsv?

    Regards,
    Florian Gilcher
     
    Florian Gilcher, Nov 27, 2008
    #2

  3. sa 125

    sa 125 Guest

    Re: Performance issues with large files -- ruby vs. python :

    I can't really put the code here since it's on the company's intranet. I
    use the fastercsv and mysql libraries. Basically I want to grab the
    first and last record of every date/company/product combo, and store
    its row info.

    The core processing is done through if statements

    @main_hash = {}
    csv = FasterCSV.open(file_path, "r", :headers => true)

    #...code below is in loop: for row in csv...

    if not @main_hash.keys.member?(date)
      @main_hash[date] = {}
      @main_hash[date][company] = {}
      @main_hash[date][company][prod] = {}
      @main_hash[date][company][prod] = row_values
    else
      if not @main_hash[date].keys.member?(company)
        @main_hash[date][company] = {}
        @main_hash[date][company][prod] = {}
        @main_hash[date][company][prod] = row_values
      else
        if not @main_hash[date][company].keys.member?(prod)
          @main_hash[date][company][prod] = {}
          @main_hash[date][company][prod] = row_values
        end
      end
    end

    # loop ends

    This is basically the part of the code that runs slow. I keep track of
    progress in percentage (file position / file size) throughout the loop.
    I should mention I extract the row values into loop variables, like
    date/company/prod using the csv headers: date = row['Date'], etc.

    The row_values variable is an array containing all the relevant
    parameters from the row. The @main_hash variable obviously takes up some
    memory. There are a couple of if-statements, but not much else. That's
    pretty much all I can think about. Thanks!

     
    sa 125, Nov 27, 2008
    #3
  4. sa 125

    Guest

    Re: Performance issues with large files -- ruby vs. python :

    On Thu, Nov 27, 2008 at 9:39 AM, sa 125 <> wrote:
    > The core processing is done through if statements
    >
    > #...code below is in loop: for row in csv...
    >
    > if not @main_hash.keys.member?(date)
    > @main_hash[date] = {}
    > @main_hash[date][company] = {}
    > @main_hash[date][company][prod] = {}
    > @main_hash[date][company][prod] = row_values
    > else
    > if not @main_hash[date].keys.member?(company)
    > @main_hash[date][company] = {}
    > @main_hash[date][company][prod] = {}
    > @main_hash[date][company][prod] = row_values
    > else
    > if not @main_hash[date][company].keys.member?(prod)
    > @main_hash[date][company][prod] = {}
    > @main_hash[date][company][prod] = row_values
    > end
    > end
    > end
    >
    > # loop ends


    # 1. Test the looked-up value directly instead of scanning the key list:
    unless main_hash[date]
      main_hash[date] = {}
      main_hash[date][company] = {}
      main_hash[date][company][prod] = row_values
    else
      unless main_hash[date][company]
        main_hash[date][company] = {}
        main_hash[date][company][prod] = row_values
      else
        unless main_hash[date][company][prod]
          main_hash[date][company][prod] = row_values
        end
      end
    end

    # 2. The same logic written with ||= :
    main_hash[date] ||= {}
    main_hash[date][company] ||= {}
    main_hash[date][company][prod] ||= row_values

    # 3. Reusing the intermediate hashes instead of looking them up again:
    d = main_hash[date] ||= {}
    c = d[company] ||= {}
    p = c[prod] ||= row_values

    # 4. As a single expression:
    ((@main_hash[date] ||= {})[company] ||= {})[prod] ||= row_values
     
    , Nov 27, 2008
    #4
  5. sa 125

    Guest

    Re: Performance issues with large files -- ruby vs. python :

    On Thu, Nov 27, 2008 at 10:27 AM, <> wrote:
    > On Thu, Nov 27, 2008 at 9:39 AM, sa 125 <> wrote:
    >> if not @main_hash.keys.member?(date)
    >> @main_hash[date] = {}

    >
    > main_hash[date] ||= {}


    Simple benchmark:
    http://gist.github.com/29792
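
The contents of the linked gist are not reproduced in this thread. As a rough idea of what such a comparison could look like, here is a minimal sketch that times the keys.member? approach against the ||= approach. The row data, sizes, and method names are made up for illustration, and the timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

# Synthetic rows of the form [date, company, product, values].
# All names and sizes here are assumptions, not data from the thread.
ROWS = Array.new(10_000) do |i|
  ["2008-#{i % 1_000}", "company#{i % 7}", "prod#{i % 3}", [i]]
end

# Original approach: Hash#keys builds a new array on every call and
# member? then scans that array linearly.
def build_with_keys_scan(rows)
  main = {}
  rows.each do |date, company, prod, values|
    main[date] = {} unless main.keys.member?(date)
    main[date][company] = {} unless main[date].keys.member?(company)
    main[date][company][prod] = values unless main[date][company].keys.member?(prod)
  end
  main
end

# Suggested approach: direct hash lookups combined with ||=.
def build_with_or_assign(rows)
  main = {}
  rows.each do |date, company, prod, values|
    ((main[date] ||= {})[company] ||= {})[prod] ||= values
  end
  main
end

slow = fast = nil
t_slow = Benchmark.realtime { slow = build_with_keys_scan(ROWS) }
t_fast = Benchmark.realtime { fast = build_with_or_assign(ROWS) }

puts format('keys.member?: %.3fs', t_slow)
puts format('||=         : %.3fs', t_fast)
puts "same result: #{slow == fast}"
```

Both methods keep the first row seen for each date/company/product combination, so the resulting hashes should be identical; only the lookup cost differs.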
     
    , Nov 27, 2008
    #5
  6. Re: Performance issues with large files -- ruby vs. python :

    wrote:
    > unless main_hash[date]
    > main_hash[date] = {}
    > main_hash[date][company] = {}
    > main_hash[date][company][prod] = row_values
    > else
    > unless main_hash[date][company]
    > main_hash[date][company] = {}
    > main_hash[date][company][prod] = row_values
    > else
    > unless main_hash[date][company][prod]
    > main_hash[date][company][prod] = row_values
    > end
    > end
    > end


    DO NOT use unless...else. It's the most confusing conditional construct
    ever. Use if and flop the bodies, or just use if !condition.

    - Charlie
     
    Charles Oliver Nutter, Nov 27, 2008
    #6
  7. Re: Performance issues with large files -- ruby vs. python :

    sa 125 wrote:
    > if not @main_hash.keys.member?(date)


    @main_hash.keys will create an array of all the keys, which will be
    expensive when it's large, and member? will do a linear search, which is
    also expensive. You can replace with:

    if not @main_hash.has_key?(date)

    or more simply

    if not @main_hash[date]

    > @main_hash[date] = {}
    > @main_hash[date][company] = {}
    > @main_hash[date][company][prod] = {}
    > @main_hash[date][company][prod] = row_values


    The third line does nothing, because it's replaced by the fourth line. I
    think all you need is:

    @main_hash[date] = { company => { prod => row_values } }

    > else
    > if not @main_hash[date].keys.member?(company)
    > @main_hash[date][company] = {}
    > @main_hash[date][company][prod] = {}
    > @main_hash[date][company][prod] = row_values
    > else
    > if not @main_hash[date][company].keys.member?(prod)
    > @main_hash[date][company][prod] = {}
    > @main_hash[date][company][prod] = row_values
    > end
    > end
    > end


    You can rewrite this as above too.

    But looking at this, I think you can replace *all* this code with just
    the following three lines:

    @main_hash[date] ||= {}
    @main_hash[date][company] ||= {}
    @main_hash[date][company][prod] ||= row_values

    Note that a ||= b is the same as a = a || b, which will assign b to a
    only if a is nil or false.
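
To make the behaviour of the three ||= lines concrete, here is a small sketch on made-up rows (the dates, company and product names are hypothetical). It also shows one caveat that follows from the rule above: ||= replaces a stored false or nil as well as a missing key.

```ruby
main_hash = {}

# Hypothetical sample rows: [date, company, prod, row_values]
rows = [
  ['2008-11-27', 'acme', 'widget', [1]],
  ['2008-11-27', 'acme', 'widget', [2]], # same combination: ignored
  ['2008-11-27', 'acme', 'gadget', [3]]
]

rows.each do |date, company, prod, row_values|
  main_hash[date] ||= {}
  main_hash[date][company] ||= {}
  main_hash[date][company][prod] ||= row_values
end

p main_hash

# Caveat: a stored false (or nil) counts as "missing" for ||=.
flags = { 'seen' => false }
flags['seen'] ||= true
p flags
```

Only the first row for each combination is kept, which matches what the original nested if statements did.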

    > This is basically the part of the code that runs slow. I keep track of
    > progress in percentage (file position / file size) throughout the loop.


    There is also the ruby profiler, which you can turn on/off where needed,
    or just run the whole lot with ruby -rprofile (beware: makes your code
    run *much* slower)

    Regards,

    Brian.
     
    Brian Candler, Nov 27, 2008
    #7
  8. sa 125

    Robert Dober Guest

    Re: Performance issues with large files -- ruby vs. python :

    On Thu, Nov 27, 2008 at 4:57 PM, Charles Oliver Nutter
    <> wrote:

    > DO NOT use unless...else. It's the most confusing conditional construct
    > ever. Use if and flop the bodies, or just use if !condition.


    I have never heard you shout before, *unless* I am mistaken ;).
    I always felt that unless is just the human readable way of saying if
    not, but accountants do not have any taste (or something like that ;).

    Anyway, here I would rather apply Ruby's "on steroids" construct (citing
    Dave Thomas): "case".

    case
    when not h....
    ....
    when not h....
    ....
    etc.etc
    else
    ....
    end

    I am not sure if faster code is generated but I would guess so.

    All that said, I wonder if your code could not benefit from this kind of
    initialization:

    @main_hash = Hash::new{ |h, k| h[k] = Hash::new{ |h, k| h[k] = {}

    I let *you* close all those accolades (braces) ;)
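
One possible way to close those braces is sketched below. It uses a local variable and made-up keys rather than the poster's @main_hash, so treat it as an illustration of Hash.new with a default block, not as the poster's exact code.

```ruby
# Each missing date gets a hash whose missing companies get a plain hash.
main_hash = Hash.new do |dates, date|
  dates[date] = Hash.new { |companies, company| companies[company] = {} }
end

# The intermediate levels are created on demand.
main_hash['2008-11-27']['acme']['widget'] ||= [1]
main_hash['2008-11-27']['acme']['widget'] ||= [2] # ignored, already set

p main_hash

# Caveat: merely reading a missing key creates it, so use key? to test.
before = main_hash.key?('2008-11-28')
main_hash['2008-11-28']
after = main_hash.key?('2008-11-28')
puts "key present before read: #{before}, after read: #{after}"
```

The design trade-off is that the per-row code shrinks to a single assignment, while existence checks have to go through key? to avoid creating empty entries by accident.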

    HTH
    Robert

    --
    Ne baisse jamais la tête, tu ne verrais plus les étoiles.
    (Never lower your head; you would no longer see the stars.)

    Robert Dober ;)
     
    Robert Dober, Nov 27, 2008
    #8
  9. sa 125

    Guest

    Re: Performance issues with large files -- ruby vs. python :

    On Thu, Nov 27, 2008 at 11:47 AM, Robert Dober <> wrote:
    > On Thu, Nov 27, 2008 at 4:57 PM, Charles Oliver Nutter
    > <> wrote:
    >
    >> DO NOT use unless...else. It's the most confusing conditional construct
    >> ever. Use if and flop the bodies, or just use if !condition.

    >
    > I have never heard you shout before, *unless* I am mistaken ;).
    > I always felt that unless is just the human readable way of saying if
    > not, but accountants do not have any taste (or something like that ;).


    I don't think the confusion issue was simply unless, but rather
    pairing unless with else.

    # not confusing
    unless a == b
    #foo
    end

    # confusing
    unless a == b
    #foo
    else
    #bar
    end

    # not confusing
    if a == b
    #bar
    else
    #foo
    end

    # or
    if a != b
    #foo
    else
    #bar
    end
     
    , Nov 27, 2008
    #9
  10. sa 125

    Robert Dober Guest

    Re: Performance issues with large files -- ruby vs. python :

    On Thu, Nov 27, 2008 at 6:12 PM, <> wrote:
    > On Thu, Nov 27, 2008 at 11:47 AM, Robert Dober <> wrote:
    >> On Thu, Nov 27, 2008 at 4:57 PM, Charles Oliver Nutter
    >> <> wrote:
    >>
    >>> DO NOT use unless...else. It's the most confusing conditional construct
    >>> ever. Use if and flop the bodies, or just use if !condition.

    >>
    >> I have never heard you shout before, *unless* I am mistaken ;).
    >> I always felt that unless is just the human readable way of saying if
    >> not, but accountants do not have any taste (or something like that ;).

    >
    > I don't think the confusion issue was simply unless, but rather
    > pairing unless with else.

    Hmm, that indeed is a little more misleading. Sorry for not getting
    this, but the ellipsis led me astray ;).
     
    Robert Dober, Nov 27, 2008
    #10
  11. Re: Performance issues with large files -- ruby vs. python :

    On Nov 27, 2008, at 12:29 PM, Robert Dober wrote:

    > On Thu, Nov 27, 2008 at 6:12 PM, <> wrote:
    >> On Thu, Nov 27, 2008 at 11:47 AM, Robert Dober <
    >> > wrote:
    >>> On Thu, Nov 27, 2008 at 4:57 PM, Charles Oliver Nutter
    >>> <> wrote:
    >>>
    >>>> DO NOT use unless...else. It's the most confusing conditional
    >>>> construct
    >>>> ever. Use if and flop the bodies, or just use if !condition.
    >>>
    >>> I have never heard you shout before, *unless* I am mistaken ;).
    >>> I always felt that unless is just the human readable way of saying
    >>> if
    >>> not, but accountants do not have any taste (or something like
    >>> that ;).

    >>
    >> I don't think the confusion issue was simply unless, but rather
    >> pairing unless with else.

    > Hmm, that indeed is a little more misleading. Sorry for not getting
    > this, but the ellipsis led me astray ;).


    Not entirely on topic, but the biggest issue I've had with
    "unless...else" is that you can't do "unless...elsif...else".

    I somewhat frequently find myself adding conditionals for extra corner
    cases, and in an "unless...else" block, that means re-writing the
    whole thing as an "if not...else block" anyway.
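
As a small illustration of that rewrite: Ruby's unless accepts an else branch but no elsif, so adding a corner case means switching to if. The method name and return values below are hypothetical.

```ruby
# Hypothetical helper, for illustration only.
def describe(a, b)
  if a != b
    'different'
  elsif a.nil? # the extra corner case that unless...else cannot express
    'both nil'
  else
    'equal'
  end
end

puts describe(1, 2)
puts describe(nil, nil)
puts describe(1, 1)
```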

    -Josh
     
    Joshua Ballanco, Nov 27, 2008
    #11
  12. sa 125

    Monika Moser Guest

    Hi,
    I do have a very similar problem: large data files and csv-import with
    fastercsv which is much slower than an implementation in c++.
    I was wondering if you have gained any interesting insights in the
    meantime that you would like to share :).
    Thanks,
    Monika
     
    Monika Moser, May 13, 2009
    #12
  13. 2008/11/27 Florian Gilcher <>:
    > It would be nice if you could give us some hints on which libraries you
    > are using and what your code looks like. If we are talking about such
    > intensive tasks, it might be that you are getting something wrong about
    > memory handling and such.
    >
    > But, to take a shot: do you use the standard csv library or do you use
    > fastercsv?


    Also, which Ruby version are you using? In my experience, Ruby 1.9.1
    is significantly faster than all 1.8 versions, and it also has better
    memory usage characteristics.

    Kind regards

    robert


    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, May 13, 2009
    #13
  14. sa 125

    sa 125 Guest

    > I was wondering if you have some interesting insights about that
    > meanwhile which you would like to share :).
    > Thanks,
    > Monika


    Hi Monika,

    It's been a while since I had that problem, and I ended up sticking with
    python for that particular issue. However, I found that reading the csv
    file with simple file IO

    File.open('path/to/file', 'r').each do |row|
      # do stuff
    end

    is much faster than using FasterCSV. It will not be as convenient as
    fastercsv, but if speed is what you're going for it just might do.
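
A minimal sketch of that plain file IO approach follows. The column layout (Date, Company, Product, then the values) and the sample data are assumptions, since the real file is not shown in the thread. Note that a bare split(',') does not handle quoted fields or embedded commas, which is part of the convenience given up by skipping FasterCSV.

```ruby
require 'tempfile'

# Hypothetical sample file standing in for the real CSV.
csv = Tempfile.new(['sample', '.csv'])
csv.write(<<~DATA)
  Date,Company,Product,Value
  2008-11-27,acme,widget,1
  2008-11-27,acme,widget,2
  2008-11-28,acme,gadget,3
DATA
csv.close

main_hash = {}

# File.foreach reads one line at a time, so the file is never fully in memory.
File.foreach(csv.path).with_index do |line, index|
  next if index.zero? # skip the header line

  # Naive parsing: assumes no quoted fields and no embedded commas.
  date, company, prod, *row_values = line.chomp.split(',')
  ((main_hash[date] ||= {})[company] ||= {})[prod] ||= row_values
end

csv.unlink
p main_hash
```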

    I must add that I haven't tried ruby 1.9 yet, which is said to have a
    much faster interpreter than 1.8 - though I don't know if it's
    compatible with the fastercsv library. If you end up finding out about
    it, please let me know how it works out.

    Thanks!

     
    sa 125, May 14, 2009
    #14
  15. Monika Moser wrote:
    > Hi,
    > I do have a very similar problem: large data files and csv-import with
    > fastercsv which is much slower than an implementation in c++.
    > I was wondering if you have some interesting insights about that meanwhile
    > which you would like to share :).


    You may want to try JRuby; many people who have moved to JRuby have done
    so explicitly because its performance characteristics with large data
    sets are very good.

    - Charlie
     
    Charles Oliver Nutter, May 14, 2009
    #15
  16. sa 125

    James Gray Guest

    On May 13, 2009, at 3:49 AM, Monika Moser wrote:

    > Hi,
    > I do have a very similar problem: large data files and csv-import with
    > fastercsv which is much slower than an implementation in c++.
    > I was wondering if you have some interesting insights about that
    > meanwhile
    > which you would like to share :).


    Well, I take it to be pretty obvious that FasterCSV (written in Ruby)
    isn't going to be as fast as a C/C++ parser. I believe Ruby has a
    C-based parser though, if you want to go that way:

    http://rubyforge.org/projects/simplecsv

    You might also try FasterCSV's latest code, not yet released but in
    version control. It has a new parser that can be faster for some
    things.

    James Edward Gray II
     
    James Gray, May 14, 2009
    #16