*Fast* way to process large files line by line

Discussion in 'Ruby' started by Devesh Agrawal, Nov 15, 2006.

  1. Hi Folks,

    I am using ruby to analyse a huge (around 60G) amount of my networking
    experiment data. Let me briefly describe my technique: I have to read
    around 40 files (of around 1.5G each) named f1, f2, ... Each file fi
    contains traceroutes to lots of destinations at different times, i.e. a
    file is basically a list of traceroutes launched from a given src (src =
    filename) at different times. I want to get a structure like the
    following: (list of all traceroutes from *all* src's at time 1), (list
    of all traceroutes from *all* src's at time 2)... and so on.

    For this I am using the following pseudocode:

    outputfile.open
    open all files f1..fn
    while !(all files are at eof)
      (f1..fn).each { |f|
        next if f.eof?
        line = f.readline
        # parse the line and get a structure P out of it
        # put P into a hashtable: H[P.time] << P

        # check for eof conditions on f

        if H has more than k keys   # i.e. it has become very large
          H.keys.sort.each { |t|
            outputfile << Marshal.dump(H[t])
            H.delete(t)
          }
        end
      }
    end
    close all files

    (Btw, I can't use an array instead of a hashtable H, as the P.time values
    read across all files needn't be the same.)

    This is performing miserably SLOW. I have the following questions:

    i. How fast is f.readline? I want to use the maximum buffering
    possible for the largest speed gains. In Ruby, how do I set the buffer
    size? I looked through io.c, and it seems that readline essentially uses
    getc (stopping when it gets a newline). How can I set the buffer size for
    the underlying libc FILE*? Oh btw, each line is approx 200-400 bytes.

    ii. Marshal.dump is also very slow. Is there an alternative? YAML is
    even worse.

    iii. Is it bad to have around 40-50 files open at the same time?

    iv. The program does use a lot of memory, but not so much: it uses around
    30-40 percent of the 1G of RAM on the machine. So I think paging in/out
    is not a problem.

    v. Would coding the readline part in C using RubyInline offer me speed
    advantages?

    vi. I am thinking of trying the following to reduce the time it takes;
    I would very much welcome your comments:

    a. Remove Marshal.dump [I don't need to strictly serialize objects,
    only dump the data and read it back] and replace it with some string
    form which is more compact. Actually, is it possible to have something
    like fixed-length structures as in C? For example, I would want P to be
    like this: struct P { char foo[100]; int a[100]; }. That way I think
    the IO would be faster, as I could just dump a fixed number of bytes to a
    file.

    b. Try to reduce the memory consumption by reducing k further,
    so that the program doesn't page in/out.

    c. Can someone point me to good sample code for reading a file line
    by line in C and then putting it into a ruby hashtable?
    d. How much of the slowness is due to the fact that it is ruby and not
    C?

    To give you an idea of how slow this actually is: just reading all the
    files line by line takes around 8-9 hrs, whereas the above thing easily
    takes 5-6 days! And I am quite unable to run a profiler on my code as it
    is just too slow.

    I would be very grateful for your comments, and particularly if you have
    any suggestions/experience on doing this in a fast way.

    --Devesh Agrawal



    --
    Posted via http://www.ruby-forum.com/.
     
    Devesh Agrawal, Nov 15, 2006
    #1

  2. On 15/11/06, Devesh Agrawal <> wrote:
    > [original post quoted in full; snipped]


    Could you not parallelise the processing of each file? Perhaps using
    something like starfish (http://www.rufy.com/starfish/doc/)?

    Farrel
     
    Farrel Lifson, Nov 15, 2006
    #2

  3. Hi,
    Farrel Lifson wrote:
    > On 15/11/06, Devesh Agrawal <> wrote:
    >> [snip]

    >
    > Could you not parallelise the processing of each file? Perhaps using
    > something like starfish (http://www.rufy.com/starfish/doc/)?


    Did you mean parallelizing across multiple files, or parallelizing the
    processing of one file?

    Yes and No. But this involves me getting deeper into a description of my
    problem:

    Each file has traceroutes at lots of times t1, t2, .... The objective is
    to collect all traceroutes that happened at a given time into one
    structure. Hence I could do something like this: using ruby threads (or
    whatever), read each file and store it into one *common* hashtable, and
    then sync the hashtable to disk in case it has grown large enough.

    I was more hoping for some kind of a fast way to read lines, using say
    mmap or something like that. I read a few posts about how using mmap
    helped someone else. I will look into this starfish; I rejected ruby
    threads because, unlike pthreads, they aren't really true threads.
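
    (For reference, one way to cut down on per-line read overhead without
    mmap is to read in big chunks and split into lines yourself. A minimal
    sketch; the chunk size and file name are only illustrative:)

        CHUNK = 4 * 1024 * 1024   # 4 MB per read; tune for your machine

        def each_line_chunked(path, chunk = CHUNK)
          File.open(path, 'rb') do |f|
            leftover = ''
            while (buf = f.read(chunk))
              buf = leftover + buf
              lines = buf.split("\n", -1)
              leftover = lines.pop            # possibly incomplete last line
              lines.each { |line| yield line }
            end
            yield leftover unless leftover.empty?
          end
        end

        each_line_chunked('f1') { |line| }    # the parse step would go here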

    Thanks for replying. Is there something wrong or inherently slow with
    the things I am doing? Can it be sped up?

    > Farrel



    --
    Posted via http://www.ruby-forum.com/.
     
    Devesh Agrawal, Nov 15, 2006
    #3
  4. On 15/11/06, Devesh Agrawal <> wrote:
    > [quoted text snipped]


    I think this

        if (H has more than k keys ? (ie has it become very large))
          H.keys.sort{|t|
            outputfile << Marshal.dump(H[t])
            H.delete(t)
          }
        end

    can just be changed to

        if (H has more than k keys ?)
          outputfile << Marshal.dump(H)
        end
        H = {}

    Farrel
     
    Farrel Lifson, Nov 15, 2006
    #4
  5. On 15/11/06, Farrel Lifson <> wrote:
    > [quoted text snipped]


    Whoops! Make that

        if (H has more than k keys ?)
          outputfile << Marshal.dump(H)
          H = {}
        end
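
    (If it helps, the hashes dumped back-to-back like that can be read back
    one at a time with Marshal.load, which consumes exactly one object from
    the stream. A rough sketch; the file name is only illustrative:)

        File.open('outputfile.dat', 'rb') do |io|
          until io.eof?
            h = Marshal.load(io)   # reads one dumped Hash
            # merge or process h here
          end
        end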
     
    Farrel Lifson, Nov 15, 2006
    #5
  6. On Nov 15, 2006, at 11:21 AM, Devesh Agrawal wrote:

    > [original post and pseudocode snipped]
    >
    > This is performing miserably SLOW. I have the following questions:


    Have you profiled? Where is your time really coming from?

    Repost with a profile and then we can give some real suggestions.

    > i. How fast is f.readline? I want to use the maximum buffering
    > possible for the largest speed gains. In Ruby, how do I set the buffer
    > size? I looked through io.c, and it seems that readline essentially
    > uses getc (stopping when it gets a newline). How can I set the buffer
    > size for the underlying libc FILE*? Oh btw, each line is approx
    > 200-400 bytes.


    I seriously doubt that this is your choke-point.

    > ii. Marshal.dump is also very slow. Is there an alternative, Yaml is
    > even worse.


    Marshal.dump is pretty fast, probably as fast as you're going to get
    for a serialization format. _why did some benchmarks back in the day
    and it beat out the other P languages.

    That said, why are you even using it? Why not just add raw strings?
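
    Something along these lines, say (the field layout here is invented for
    illustration; use whatever your parse step actually produces):

        # one tab-separated line per record instead of a marshalled object
        def dump_record(io, p)
          io.puts [p.time, p.src, p.hops.join(',')].join("\t")
        end

        def load_record(line)
          time, src, hops = line.chomp.split("\t")
          [time.to_f, src, hops.split(',')]
        end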

    > v. Would coding the readline part in C using RubyInline offer me speed
    > advantages?


    No.

    (or, very unlikely)

    > vi. I am thinking of trying the following to reduce the time it takes;
    > I would very much welcome your comments:


    Profile, profile, profile.

    > a. Remove Marshal.dump [I don't need to strictly serialize objects,
    > only dump the data and read it back] and replace it with some string
    > form which is more compact. Actually, is it possible to have something
    > like fixed-length structures as in C? For example, I would want P to be
    > like this: struct P { char foo[100]; int a[100]; }. That way I think
    > the IO would be faster, as I could just dump a fixed number of bytes
    > to a file.


    Yes, do this, simpler is better.

    Try #pack and #unpack.
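
    For instance, a record shaped roughly like the struct P above could be
    handled with a fixed-length template (the template and sizes here are
    only an illustration; it assumes exactly 100 ints per record):

        TEMPLATE    = 'a100N100'        # 100-byte NUL-padded string + 100 32-bit uints
        RECORD_SIZE = 100 + 100 * 4     # 500 bytes per record

        def write_record(io, foo, ints)
          io.write([foo, *ints].pack(TEMPLATE))
        end

        def read_record(io)
          raw = io.read(RECORD_SIZE) or return nil
          fields = raw.unpack(TEMPLATE)
          [fields.first, fields[1..-1]]   # foo still carries its NUL padding
        end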

    > b. Try to reduce the memory consumption of this by reducing k
    > further so as the program doesn't page in/out.


    You already said it isn't paging...

    > c. Can someone point me to a good sample code for reading a file
    > line by line in C and then putting it into a ruby hashtable ?.


    No. Profile, profile, profile.

    > d. How much of the slowness is due to the fact that it is ruby
    > and not C ?


    We can't tell you without a profile. Profile, profile, profile.

    > To give you an idea of how slow this actually is: just reading all the
    > files line by line takes around 8-9 hrs, whereas the above thing easily
    > takes 5-6 days! And I am quite unable to run a profiler on my code as
    > it is just too slow.


    Lies.

    Use a reduced dataset with ruby-prof or zenprofile.

    You know nothing without a profile.
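
    For example, with ruby-prof (process_one_file stands in for whatever
    your reduced run actually is):

        require 'ruby-prof'

        result = RubyProf.profile do
          process_one_file('f1.small')   # a cut-down dataset, not the full 1.5G
        end

        RubyProf::FlatPrinter.new(result).print(STDOUT)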

    > I would be very grateful for your comments, and particularly if you
    > have
    > any suggestions/experience on doing this in a fast way.


    Profile it, you can't make sane changes without one.

    --
    Eric Hodel - - http://blog.segment7.net
    This implementation is HODEL-HASH-9600 compliant

    http://trackmap.robotcoop.com
     
    Eric Hodel, Nov 15, 2006
    #6
  7. On 15/11/06, Devesh Agrawal <> wrote:
    > [quoted text snipped]


    Also try running your code with some sample data through the ruby
    profiler (just run 'ruby -rprofile yourcode.rb') and it should give
    you an idea of where your program is spending its time.

    Farrel
     
    Farrel Lifson, Nov 15, 2006
    #7
  8. On 11/15/06, Devesh Agrawal <> wrote:

    > [quoted text snipped]


    Maybe it's possible to preprocess the files with grep or something
    similar (like splitting into hour-long slices, etc.), using ruby just
    for the sorting and merging - that way you could make it parallel, and
    besides, grep&co are supposed to be much faster.
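
    (A similar pre-split can also be done with a short ruby pass, for what
    it's worth. The sketch below guesses that the timestamp is the first
    whitespace-separated field on each line:)

        # split one big trace file into hour-sized bucket files
        buckets = Hash.new { |h, hour| h[hour] = File.open("bucket_#{hour}.txt", 'a') }

        File.foreach(ARGV[0]) do |line|
          t = line.split(' ', 2).first.to_f
          buckets[(t / 3600).to_i] << line
        end

        buckets.each_value { |f| f.close }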
     
    Jan Svitok, Nov 15, 2006
    #8
  9. Devesh Agrawal wrote:
    > I am using ruby to analyse a huge (around 60G) amount of my networking
    > experiment data. Let me briefly describe my technique: I have to read
    > around 40 files (of around 1.5G each) named f1,f2 ... .Each file fi
    > contains traceroutes to lots of destinations at different times. I.E a
    > file is basically a list of traceroutes launched from a given src (src =
    > filename) launched at diff times. I want to get a structure like
    > following: (list of all traceroutes from *all* src's at time 1), (list
    > of all traceroutes from *all* src's at time 2)... and so on.


    > //Btw I can't use an array instead of a hashtable H, as the P.time's
    > read across all files needn't be same.
    >
    > This is performing miserbly SLOW. I have the following questions:


    First I have a question: why do you read those files in parallel in the
    first place?

    > i. How fast is f.readline ?. I want to use the maximum buffering
    > possible for largest speed gains. In ruby how do I set the buffer size.
    > I looked through io.c, and it seems that readline essentially uses getc
    > (stopping when it gets a newline). How can I set the buffer size for the
    > underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.
    >
    > ii. Marshal.dump is also very slow. Is there an alternative, Yaml is
    > even worse.


    No, Marshal is actually pretty fast. It may be due to the other IO you
    do or because of the data you write.

    > iii. Is it bad to have around 40-50 files opened at the same time ?.


    No, but reading from all those files in /parallel/ is. It is of course
    platform dependent how the IO subsystem deals with that, but chances are
    that the disk heads have to move back and forth between all the files.

    > iv. The program does use a lot of memory but not so much, around 30-40
    > pc of 1G ram machine is used by it. So I think paging in/out is not a
    > problem.


    It is better to not believe but know that paging is not an issue.

    > v. Would coding the readline part in C using RubyInline offer me speed
    > advantages?


    Very unlikely.

    > vi. I am thinking of trying the following to reduce the time it takes,
    > I would very much welcome your comments:
    >
    > a. Remove Marshal.dump [I don't need to strictly serialize objects,
    > only dump the data and read it back] and replace it with some string
    > form which is more compact. Actually is it possible to have something
    > like fixed length structures like in C: Example I would want P to be
    > like this: Struct P{ char foo[100], int a[100]} ?. So this way I think
    > the IO would be faster as I could just dump a fixed number of bytes to a
    > file.


    Don't do the writing to a temp file while you are reading. This puts
    even more burden on your IO subsystem. Btw, what filesystem do you use?
    You don't happen to be on SUSE with ReiserFS, do you?

    > b. Try to reduce the memory consumption of this by reducing k further
    > so as the program doesn't page in/out.


    Without /knowing/ that paging is an issue this does not make sense.

    > c. Can someone point me to a good sample code for reading a file line
    > by line in C and then putting it into a ruby hashtable ?.
    > d. How much of the slowness is due to the fact that it is ruby and not
    > C ?


    Here's my assessment: you do not have a programming language problem but
    a design problem. Reading from multiple large files at the same time is
    inefficient.

    > To give you an idea of how slow this is actually: Just reading all the
    > files
    > line by line takes around 8-9 hrs.


    You need 8 hours just for reading 60GB? That's 2.1MB/s - this seems
    unrealistically slow. Is this old hardware? Do you have drive
    problems? Are there any other disk intensive tasks going on (a busy DB
    or web server) on that machine?
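
    (For reference, the arithmetic behind that figure:)

        bytes   = 60 * 1024 ** 3             # 60 GB read
        seconds = 8 * 3600                   # in roughly 8 hours
        puts bytes / seconds / (1024.0 ** 2) # => ~2.1 MB/s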

    > Whereas the above thing easily takes
    > 5-6
    > days !!. And I am quite unable to run profile on my code as it is just
    > too slow.


    Clearly.

    > I would be very grateful for your comments, and particularly if you have
    > any suggestions/experience on doing this in a fast way.


    Here's my advice: rewrite the program to read those files sequentially
    because that is likely faster on most systems. Remove the temp file
    writing while files are read. And find out why your IO system is
    performing so badly.
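
    In outline, something like this shape (parse is a placeholder for your
    actual line parser, and the file names are illustrative):

        Dir['f*'].each do |src|
          h = Hash.new { |hash, t| hash[t] = [] }
          File.foreach(src) do |line|
            p = parse(line)
            h[p.time] << p
          end
          # only write once this source has been read completely
          File.open("#{src}.by_time", 'wb') { |out| out << Marshal.dump(h) }
        end
        # a second pass can then merge the per-source dumps time by time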

    Kind regards

    robert
     
    Robert Klemme, Nov 15, 2006
    #9
  10. Hi Everyone, thanks for replying.

    First, a couple of stupid things that I am doing, then a couple more
    questions.

    Stupid things:
    i. I parse each line, but that essentially amounts to splitting it and
    such. So maybe I would be better off using scanf kind of stuff, and
    also compiling regexps once and reusing them for each line. I heard that
    helps.

    ii. Going through my hash H to delete keys is rather stupid. A better
    way is to just dump it and trash it.

    iii. When I said it takes 8-9 hours, I meant 8-9 hours to read each
    line and parse it (with my rather inefficient parse function). And I
    did this one file at a time, in sequence.

    iv. I will definitely try to profile my code. I did use -rprofile, but
    that kind of revealed the obvious: that most of my time was spent in
    the file-reading loop > marshalling > parsing. Plus it was rather slow
    for even 100MB of stuff.

    Oh and btw, the disk I am using is an LVM mapped ext3 local disk. The
    server runs Centos. And I think it is pretty well managed.

    Now my questions:

    i. I was earlier doing this: read each file in sequence, and then
    write out a similar hash containing data from one src for each time.
    Even with this hash, I used to dump it to temporary files (one per hash
    key = time instant) after doing this for every file (src). I
    would then merge all the data in the temporary files into what I want.
    This was slow (but in retrospect not as slow as what I am doing now by
    reading in parallel). The reason I thought opening multiple files and
    reading from them line by line would be faster is because I wouldn't have
    to open/close all those thousands of temporary files.
    I still don't see the reason why doing things in parallel would screw
    things up, as my assumption is that the disk head is pretty wild anyway,
    since the system involves a lot of IO between context switches.

    ii. Also, why do you think that writing another file in the midst of
    reading one (or more) is a bad idea? The only way I can avoid
    this is to chunk the files up into smaller units and then process them,
    writing their results onto temp files and then proceeding further
    with the other files. Which one do you think is the better thing to do?

    I will now try the following (based on your suggestions):
    i. Avoid heavyweight regexps in parsing lines.
    ii. Avoid marshalling/dumping.
    iii. Focus on doing things a file at a time.
    iv. If all else fails, I will try using something like starfish etc.

    Btw, will using something like an mmap extension for ruby speed things
    up for me?

    Thanks a lot to all of you. I am sorry if I sound lame, but I am pretty
    new to using ruby.


    On Nov 15, 4:24 pm, Robert Klemme <> wrote:
    > [quoted text snipped]
     
    devesh, Nov 15, 2006
    #10
  11. On Thu, 16 Nov 2006, Devesh Agrawal wrote:

    > [original post snipped]

    use threads:


    harp:~ > ls -ltarh data/big
    -rw-rw-r-- 1 ahoward ahoward 251M Nov 15 16:41 data/big

    harp:~ > wc -l data/big
    7131456 data/big

    harp:~ > head data/big
    1163633749.75535 = 0.0877033759844916
    1163633749.75565 = 0.913142160532852
    1163633749.75569 = 0.604664544000001
    1163633749.75571 = 0.233515496811041
    1163633749.75574 = 0.221557802561386
    1163633749.75577 = 0.241982349948893
    1163633749.7558 = 0.190149667316971
    1163633749.75583 = 0.0827503931446325
    1163633749.75586 = 0.656736160359522
    1163633749.75588 = 0.222579171509354


    using my threaded program i can process these lines at a rate of about


    harp:~ > ruby a.rb data/big 42 42
    rate : 0.0 lines/second
    rate : 14094.9956719788 lines/second
    rate : 21059.6378674178 lines/second
    rate : 22742.0758748341 lines/second
    rate : 20527.6435560232 lines/second
    rate : 16541.8249949497 lines/second
    rate : 18295.1759919063 lines/second
    rate : 19648.3251130277 lines/second


    so, let's say about 20,000 lines per second - so it'd take about 5 minutes to
    process my 250mb file. i realize yours is more complex, but lets say you
    should be able to do 1gb data on a reasonably fast machine in something like
    20-30 minutes.


    btw. mmap won't help (much) since you are just doing a sequential read
    of ALL of the data anyhow...



    here's the code



    harp:~ > cat a.rb

    class LineProducer
      require 'thread'
      EOF = nil

      class EOFQueue < ::Queue
        def push arg
          if arg == EOF
            class << self; def pop() EOF end; end
          end
        ensure
          return super
        end
      end

      def initialize path, bufsize
        @path = path
        @bufsize = Integer bufsize
        @lines = EOFQueue.new
        @thread = new_reader
      end

      def join() @thread.join end

      def new_reader
        Thread.new do
          Thread.current.abort_on_exception = true
          buf = ''
          open(@path) do |f|
            while((buf2 = f.read @bufsize))
              buf << buf2
              lines = buf.scan %r/^.*$\n?/
              if lines.last =~ %r/$\n/
                buf = ''
              else
                buf = lines.pop
              end
              @lines.push lines
            end
          end
          @lines.push EOF
        end
      end

      def new_consumer &b
        Thread.new{
          Thread.current.abort_on_exception = true
          while((lines = @lines.pop)); b[ lines ]; end
        }
      end
    end

    class ThreadSafeHash
      require 'sync'
      instance_methods.each{|m| undef_method m unless m[%r/__/]}
      def initialize
        @hash = {}
        @sync = Sync.new
      end
      def method_missing m, *a, &b
        @sync.synchronize{ @hash.send m, *a, &b }
      end
      def respond_to? m, *a, &b
        @sync.synchronize{ @hash.respond_to? m, *a, &b }
      end
    end


    path = ARGV.shift or abort 'no path'
    n = Integer(ARGV.shift || 42)
    hn = Integer(ARGV.shift || 42)
    mb = 2 ** 20

    lines_producer = LineProducer.new path, mb
    tsh = ThreadSafeHash.new

    start = Time.now.to_f
    elapsed = lambda{|start| Time.now.to_f - start.to_f}
    progress = Thread.new{ loop{ puts "rate : #{ tsh.size / elapsed[start] } lines/second"; sleep 2 } }

    parse = lambda{|line| line.split(%r/=/).map{|word| word.strip}}

    threads = Array.new(n).map{
      lines_producer.new_consumer do |lines|
        h = {} and lines.each do |line|
          k, v = parse[ line ]
          h[k] = v
        end
        tsh.update h
      end
    }


    lines_producer.join
    threads.map{|t| t.join}



    regards.



    -a
    --
    my religion is very simple. my religion is kindness. -- the dalai lama
     
    , Nov 15, 2006
    #11
  12. Devesh Agrawal wrote:
    > [original post quoted in full; snipped]

    Basically what you are describing here looks like a poorly specified
    database load followed by queries. Rather than trying to process 60 GB
    of data from a bunch of flat files in and out of data structures in an
    interpreted dynamic garbage collected language, what I would recommend
    is the following:

    1. Pick up the book "Data Crunching" from Pragmatic Programmers. Learn
    about databases, tables and queries.
    2. Choose a convenient database. I think for something this size, your
    options are Microsoft SQL Server Express and PostgreSQL, unless you can
    afford Oracle. This is way too big for MS Access and possibly for MySQL,
    although I haven't tried anything on this scale with MySQL, just SQL
    Server and PostgreSQL.
    3. Write a *simple* line-by-line Ruby script to just parse your data and
    insert the rows into a database. I haven't looked at traceroute data
    recently and I don't quite understand the queries you'd want, so I can't
    give you hints on database design/table structure other than that you
    need one for a problem of this magnitude.
    4. Once you've got the data loaded into the database, there are quite a
    few tools you can use to make the data mining and querying easier.
    Again, since I don't really understand your objectives, I can't pick one
    for you. If you're in the mood to learn Rails, you can probably do some
    interesting things that way.

    I do a lot of stuff like this. Usually I massage the data with Perl into
    a CSV format, then load it manually into PostgreSQL. After that, I do my
    data mining with R using ODBC. But you can do all of it in Ruby. The key
    is to use an industrial strength database for 60 GB worth of data you're
    trying to mine.
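
    As a rough sketch of step 3, with today's pg gem (the table layout and
    the parse step below are only placeholders; the ruby-postgres bindings
    of the day would look a little different):

        require 'pg'

        conn = PG.connect(dbname: 'traceroutes')
        conn.exec('CREATE TABLE IF NOT EXISTS hops (
                     src text, dst text, ts timestamptz,
                     hop_no integer, rtt_ms double precision)')

        File.foreach(ARGV[0]) do |line|
          src, dst, ts, hop_no, rtt = parse(line)   # parse() is a placeholder
          conn.exec_params(
            'INSERT INTO hops (src, dst, ts, hop_no, rtt_ms) VALUES ($1, $2, $3, $4, $5)',
            [src, dst, ts, hop_no, rtt])
        end

    In practice a bulk COPY load would likely be far faster than row-by-row
    INSERTs, but the shape is the same.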

    --
    M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
    http://borasky-research.blogspot.com/

    If God had meant for carrots to be eaten cooked, He would have given rabbits fire.
     
    M. Edward (Ed) Borasky, Nov 16, 2006
    #12
  13. On 15.11.2006 23:35, devesh wrote:
    > Hi Everyone, Thanks for replying.
    >
    > First a couple of stupid things that I am doing, then a couple of more
    > questions.
    >
    > Stupid things:
    > i. I parse each line, but that essentially amounts to spliting it and
    > such. So maybe I should be better off using scanf kind of stuff and
    > also compiling regexps once and using them for each line. I heard that
    > helps.


    Using regexps inline is the most efficient way in Ruby. You can find
    numerous benchmark postings in this group.
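
    I.e. something along these lines, with the literal pattern written right
    at the point of use (the field layout is invented and path stands for one
    of your files; adjust to your actual lines):

        File.foreach(path) do |line|
          next unless line =~ /\A(\d+\.\d+)\s+(\S+)\s+(\d+)\s+([\d.]+)/
          time, dst, hop, rtt = $1.to_f, $2, $3.to_i, $4.to_f
          # ... use the fields ...
        end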

    > ii. Going thru my hash H to delete keys is rather stupid. A better way
    > is to just dump it and trash it.


    I don't exactly understand what you mean, but that's probably because I
    still do not have a clear picture of the processing you are doing.

    > iii. When I said it takes 8-9 hours, I meant 8-9 hours to read each
    > line and parse it (with my rather non efficient parse function). And I
    > did this one file at a time in sequence
    >
    > iv. I will definitely try to profile my code. I did use -rprofile but
    > that kind of revealed the obvious that most of my time was spent in
    > reading file loop > marshalling > parsing. Plus it was rather slow for
    > even 100MB of stuff.


    I don't think you have a CPU issue - this looks rather like an IO issue.

    > Oh and btw, the disk I am using is an LVM mapped ext3 local disk. The
    > server runs Centos. And I think it is pretty well managed.


    Sounds good.

    > Now my questions:
    >
    > i. I was earlier doing this: Read each file in sequence, and then
    > write out a similar hash, containing data from one src for each time.
    > Even with this hash, I used to dump it to temporary files (one per hash
    > key = time instant) and then after doing this for every file (src's). I
    > would then merge all this data in the temporary files into what I want.
    > This was slow, (but in retrospect not as slow as what I am doing now by
    > reading in parrallel). The reason I though opening multiple files and
    > reading from them line by line would be faster is cos I wouldn't have
    > to open/close all those thousands of temporary files.


    Wait, did you say "thousands of temporary files"? Wow, *that* is likely
    going to kill performance.

    > I still don't see the reason why doing things in parrallel would screw
    > things up ? As my assumption is that the disk head is anyway pretty
    > wild as the system will involve a lot of IO b/w context switches.


    That entirely depends on the other processing going on on that machine.
    And there is no point in making it worse by accessing multiple files
    at the same time.

    > ii. Also why do you think that writing another file in the midst of
    > reading (one or more) one is a bad idea ?. The only way I can avoid
    > this is to chunk up files into smaller units and then process them,
    > writing their results onto temp files and then proceeding along further
    > with other files. Which one do you think is a better thing to do ?


    Writing is usually slower than reading. But it entirely depends on what
    you actually do.

    > I will now try the following (based on your suggestions):
    > i. Avoid heavy weight regexps in parsing lines


    Not necessarily a good idea - especially since it seems your task is IO
    bound. You can easily verify this with "vmstat 5" while you are doing
    your processing.

    > ii. Avoid marshaling/dumping


    This in itself is not slow. But if you do this into multiple small
    files you'll have problems - but not because of Marshal being slow but
    because of the IO.

    > iii. Focus on doing things a file at a time.
    > iv. If all else fails, I will try using something like sharkfish etc.
    >
    > Btw will using something like an mmap extension for ruby speed things
    > up for me ?.


    I don't believe so because your problem seems to be the IO access pattern.

    > Thanks a lot to all of you. I am sorry if I sound lame, but I am pretty
    > new to using ruby.


    Can you describe in more detail what you are actually doing? Kind of:
    are you combining data from multiple files? How long do you need temp
    data? What do you output - things like that.

    It may actually turn out that Ed's advice to use a RDBMS is the best
    solution because that gives you efficient IO. But to confirm or reject
    that I would have to have a better understanding what business problem
    you are trying to solve.

    Cheers

    robert
     
    Robert Klemme, Nov 16, 2006
    #13
  14. Devesh Agrawal wrote:

    > Btw will using something like an mmap extension for ruby speed things
    > up for me ?.
    >


    Can't harm to try. I use it for uploading large files and it helps
    there.

    Unfortunately, operating systems don't always behave like you think they
    ought to when doing something like this. Get to know your profiler and
    brush up on your data processing techniques. When you hit a speed
    problem, you *have* to start to understand what the machine is doing
    with each high-level instruction you give it and how they all interact
    with each other.

    NeilW
     
    Neil Wilson, Nov 16, 2006
    #14
  15. On 16.11.2006 10:51, Neil Wilson wrote:
    >> Btw will using something like an mmap extension for ruby speed things
    >> up for me ?.

    >
    > Can't harm to try. I use it for uploading large files and it helps
    > there.


    I beg to differ: it can actually harm to apply an optimization measure
    without making sure that it actually solves the problem. As long as you
    do not know what makes your program slow, trying out measures is pretty
    much a waste of time.

    > Unfortunately Operating systems don't always behave like you think they
    > ought to when doing something like this. Get to know your profiler and
    > brush up on your data processing techniques. When you hit a speed
    > problem, you *have* to start and understand what the machine is doing
    > with each high level instruction you give it and how they all interact
    > with each other.


    Exactly. I'd start with vmstat to find out whether we are IO bound or
    CPU bound.

    Regards

    robert
     
    Robert Klemme, Nov 16, 2006
    #15
  16. Robert Klemme wrote:
    > [quote of the previous message snipped]

    Just for the sake of amusement, I did a traceroute to see what the
    output looks like:

    $ traceroute -n rubyforge.org
    traceroute to rubyforge.org (205.234.109.18), 30 hops max, 38 byte packets
    1 192.168.0.1 4.080 ms 2.845 ms 1.932 ms
    2 * * *
    3 68.87.218.17 9.540 ms 7.114 ms 7.187 ms
    4 68.87.216.29 8.025 ms * 9.980 ms
    5 12.116.188.9 21.930 ms 21.446 ms 21.619 ms
    6 12.123.12.126 23.060 ms 21.566 ms 21.739 ms
    MPLS Label=31697 CoS=0 TTL=1 S=1
    7 12.123.13.185 22.617 ms 22.220 ms 20.848 ms
    8 206.24.211.1 21.786 ms 22.516 ms 20.589 ms
    9 204.70.194.50 30.330 ms 28.985 ms 204.70.192.113 106.454 ms
    MPLS Label=202112 CoS=0 TTL=1 S=1
    10 204.70.192.85 62.240 ms 204.70.192.134 87.983 ms 86.297 ms
    MPLS Label=509440 CoS=0 TTL=1 S=1
    11 204.70.192.50 64.302 ms 204.70.192.25 104.547 ms 204.70.192.50
    65.496 ms
    MPLS Label=190688 CoS=0 TTL=1 S=1
    12 204.70.192.69 81.608 ms 208.173.10.201 105.080 ms 204.70.192.69
    82.155 ms
    MPLS Label=131056 CoS=0 TTL=1 S=1
    13 208.173.50.154 104.882 ms 136.592 ms 204.70.192.62 95.377 ms
    14 205.234.109.18 101.528 ms !<10> 206.24.238.218 100.340 ms 100.945 ms
    15 205.234.109.18 100.920 ms !<10> 208.173.50.154 100.550 ms
    205.234.109.18 100.802 ms !<10>

    So ... we have *one* traceroute from point a (my system) to point b
    (rubyforge.org). I'm assuming the original poster has some simple shell
    script that does a loop forever around something like "date; traceroute
    -n ip.add.re.ss; sleep 300" to get a log of timestamped traceroutes
    every five minutes. Then he wants to collect hundreds of these lines
    from many files (60 GB total) into a huge in-core Ruby structure, then
    make some kind of report from it. It got so big from the naive approach
    that he started marshalling it out of core in chunks, leading to the
    code he posted here.

    A little arithmetic: that single traceroute above is about 1000 bytes.
    60 GB is 60 million times 1000 bytes, which means our collection of 60
    GB of traceroute data has about 60 million individual traceroutes,
    assuming they are as long as my off-the-wall traceroute over the open
    internet to rubyforge. Obviously they will be much smaller if done on a
    well-designed corporate intranet, which means there are many more of them.

    In any event, it looks to me like he is trying to reinvent some network
    monitoring wheel for which there are undoubtedly both expensive
    commercial packages and wonderful full-featured open source tools
    available. I wrote a lot of code like this ten years ago (in ksh, awk
    and Perl 4) when such tools weren't available. Now they are, and there
    are Ruby interfaces to most of them. So in addition to buying and
    learning from "Data Crunching", I'm going to recommend the original
    poster head over to

    http://oss.oetiker.ch/rrdtool/

    and do some window shopping. IIRC there is a Ruby interface to RRDTool,
    although I've forgotten whether it's currently actively maintained -- it
    wasn't a year or so ago when I first started investigating Ruby. And
    there are definitely tools to migrate a "native" round robin database
    into an "ordinary" relational database.

    --
    M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
    http://borasky-research.blogspot.com/

    If God had meant for carrots to be eaten cooked, He would have given rabbits fire.
     
    M. Edward (Ed) Borasky, Nov 16, 2006
    #16
  17. On 16.11.2006 16:42, M. Edward (Ed) Borasky wrote:
    > It got so big from the naive approach
    > that he started marshalling it out of core in chunks, leading to the
    > code he posted here.


    Which code? I saw only the pseudo code from the first posting. Is
    there something that hasn't made it through the gateway again?

    > In any event, it looks to me like he is trying to reinvent some network
    > monitoring wheel for which there are undoubtedly both expensive
    > commercial packages and wonderful full-featured open source tools
    > available. I wrote a lot of code like this ten years ago (in ksh, awk
    > and Perl 4) when such tools weren't available. Now they are, and there
    > are Ruby interfaces to most of them.


    This makes perfectly sense to me.

    Kind regards

    robert
     
    Robert Klemme, Nov 16, 2006
    #17
  18. Robert Klemme wrote:
    > On 16.11.2006 16:42, M. Edward (Ed) Borasky wrote:
    >> It got so big from the naive approach that he started marshalling it
    >> out of core in chunks, leading to the code he posted here.

    >
    > Which code? I saw only the pseudo code from the first posting. Is
    > there something that hasn't made it through the gateway again?

    I was referring to the pseudo-code.

    --
    M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
    http://borasky-research.blogspot.com/

    If God had meant for carrots to be eaten cooked, He would have given rabbits fire.
     
    M. Edward (Ed) Borasky, Nov 17, 2006
    #18
  19. On Fri, 17 Nov 2006, M. Edward (Ed) Borasky wrote:

    > If God had meant for carrots to be eaten cooked, He would have given rabbits
    > fire.


    ed. that quote is awesome.

    -a
    --
    my religion is very simple. my religion is kindness. -- the dalai lama
     
    , Nov 17, 2006
    #19
