*Fast* way to process large files line by line

Devesh Agrawal

Hi Folks,

I am using Ruby to analyse a huge (around 60 GB) amount of my networking
experiment data. Let me briefly describe my technique: I have to read
around 40 files (of around 1.5 GB each) named f1, f2, .... Each file fi
contains traceroutes to lots of destinations at different times, i.e. a
file is basically a list of traceroutes launched from a given src (src =
filename) at different times. I want to get a structure like the
following: (list of all traceroutes from *all* srcs at time 1), (list of
all traceroutes from *all* srcs at time 2)... and so on.

For this I am using the following pseudocode:

outputfile.open
open all files f1..fn
while (!(all files have eof))
  (f1..fn).each{|f|
    next if f.eof
    line = f.readline
    parse the line, and get a structure P out of it
    put P into a hashtable: H[P.time] << P

    check for eof conditions on f

    if (H has more than k keys? (i.e. has it become very large))
      H.keys.sort.each{|t|
        outputfile << Marshal.dump(H[t])
        H.delete(t)
      }
    end
  }
end
close all files

Btw, I can't use an array instead of a hashtable H, as the P.time values
read across all files needn't be the same.
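
In actual Ruby the skeleton looks roughly like this (parse_line, the record
fields and the threshold K are simplified stand-ins for my real code):

  K = 10_000
  Record = Struct.new(:time, :src, :hops)

  def parse_line(line, src)
    time, rest = line.split(' ', 2)        # placeholder for the real parsing
    Record.new(time, src, rest)
  end

  files = Dir['f*'].map { |path| File.open(path) }
  h = Hash.new { |hash, t| hash[t] = [] }

  File.open('merged.dump', 'wb') do |out|
    until files.all? { |f| f.eof? }
      files.each do |f|
        next if f.eof?
        p = parse_line(f.readline, File.basename(f.path))
        h[p.time] << p

        if h.size > K
          h.keys.sort.each do |t|
            out << Marshal.dump(h[t])
            h.delete(t)
          end
        end
      end
    end
    h.keys.sort.each { |t| out << Marshal.dump(h[t]) }   # flush whatever is left
  end
  files.each { |f| f.close }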

This is performing miserably SLOW. I have the following questions:

i. How fast is f.readline? I want to use the maximum buffering
possible for the largest speed gains. In Ruby, how do I set the buffer size?
I looked through io.c, and it seems that readline essentially uses getc
(stopping when it gets a newline). How can I set the buffer size for the
underlying libc FILE*? Oh, btw, each line is approx 200-400 bytes.

ii. Marshal.dump is also very slow. Is there an alternative? YAML is
even worse.

iii. Is it bad to have around 40-50 files open at the same time?

iv. The program does use a lot of memory, but not that much: around 30-40
percent of a 1 GB RAM machine is used by it. So I think paging in/out is not
a problem.

v. Would coding the readline part in C using RubyInline offer me speed
advantages?

vi. I am thinking of trying the following to reduce the time it takes;
I would very much welcome your comments:

a. Remove Marshal.dump [I don't need to strictly serialize objects,
only dump the data and read it back] and replace it with some string
form which is more compact. Actually, is it possible to have something
like fixed-length structures as in C? For example, I would want P to be
like this: struct P { char foo[100]; int a[100]; }. That way I think
the IO would be faster, as I could just dump a fixed number of bytes to a
file.

b. Try to reduce the memory consumption of this by reducing k further,
so that the program doesn't page in/out.

c. Can someone point me to good sample code for reading a file line
by line in C and then putting it into a Ruby hashtable?
d. How much of the slowness is due to the fact that it is Ruby and not
C?

To give you an idea of how slow this actually is: just reading all the
files line by line takes around 8-9 hrs, whereas the above thing easily
takes 5-6 days!! And I am quite unable to run a profiler on my code as it
is just too slow.

I would be very grateful for your comments, and particularly if you have
any suggestions/experience on doing this in a fast way.

--Devesh Agrawal
 
Farrel Lifson

Could you not parallelise the processing of each file? Perhaps using
something like Starfish (http://www.rufy.com/starfish/doc/)?

Farrel
 
Devesh Agrawal

Hi,
Farrel said:
Could you not parallelise the processing of each file? Perhaps using
something like Starfish (http://www.rufy.com/starfish/doc/)?

Did you mean parallelising across multiple files or parallelising the
processing of one file?

Yes and no. But this involves me going deeper into a description of my
problem:

Each file has traceroutes at lots of times t1, t2.... The objective is to
collect all traceroutes that happened at a given time into one structure.
Hence I could do something like this: using Ruby threads (or whatever),
read each file and store it into one *common* hashtable, and then sync
the hashtable to disk whenever it has grown large enough.

I was more hoping for some kind of fast way to read lines, using say
mmap or something like that. I read a few posts about how using mmap
helped someone else. I will look into this Starfish; I rejected Ruby
threads as, unlike pthreads, they aren't really true threads.

Thanks for replying. Is there something wrong or inherently slow with
the things I am doing? Can it be sped up?
 
Farrel Lifson

Hi,

I think this

  if (H has more than k keys? (i.e. has it become very large))
    H.keys.sort.each{|t|
      outputfile << Marshal.dump(H[t])
      H.delete(t)
    }
  end

can just be changed to

  if (H has more than k keys?)
    outputfile << Marshal.dump(H)
  end
  H = {}

Farrel
 
Farrel Lifson

Whoops! Make that

  if (H has more than k keys?)
    outputfile << Marshal.dump(H)
    H = {}
  end
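
In real Ruby the dump-and-reset is just (K being whatever threshold you
choose, and 'output.dump' standing in for your output file):

  if h.size > K
    outputfile.write(Marshal.dump(h))
    h = Hash.new { |hash, t| hash[t] = [] }   # fresh hash instead of deleting keys one by one
  end

And since Marshal.load reads exactly one dumped object from an IO, you can
get the chunks back later with a simple loop:

  File.open('output.dump', 'rb') do |f|
    chunks = []
    chunks << Marshal.load(f) until f.eof?
    # each element is one of the dumped hashes; merge them however you need
  end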
 
Eric Hodel

Have you profiled? Where is your time really coming from?

Repost with a profile and then we can give some real suggestions.

i. How fast is f.readline? I want to use the maximum buffering
possible for the largest speed gains. In Ruby, how do I set the buffer size?
I looked through io.c, and it seems that readline essentially uses getc
(stopping when it gets a newline). How can I set the buffer size for the
underlying libc FILE*? Oh, btw, each line is approx 200-400 bytes.

I seriously doubt that this is your choke-point.

ii. Marshal.dump is also very slow. Is there an alternative? YAML is
even worse.

Marshal.dump is pretty fast, probably as fast as you're going to get
for a serialization format. _why did some benchmarks back in the day
and it beat out the other P languages.

That said, why are you even using it? Why not just add raw strings?

v. Would coding the readline part in C using RubyInline offer me speed
advantages?

No.

(or, very unlikely)

vi. I am thinking of trying the following to reduce the time it takes;
I would very much welcome your comments:

Profile, profile, profile.

a. Remove Marshal.dump [I don't need to strictly serialize objects,
only dump the data and read it back] and replace it with some string
form which is more compact. Actually, is it possible to have something
like fixed-length structures as in C? For example, I would want P to be
like this: struct P { char foo[100]; int a[100]; }. That way I think
the IO would be faster, as I could just dump a fixed number of bytes to
a file.

Yes, do this, simpler is better.

Try #pack and #unpack.
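
For example, something like this (the record layout here - a 100-byte
destination plus ten integer RTTs - is invented, just to show the shape):

  # fixed 140-byte binary record: 'A100' pads/truncates the string to 100
  # bytes, 'N10' packs ten 32-bit big-endian unsigned ints
  record = [dst, *rtts].pack('A100N10')
  out.write record

  # reading it back
  dst, *rtts = io.read(140).unpack('A100N10')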

b. Try to reduce the memory consumption of this by reducing k
further, so that the program doesn't page in/out.

You already said it isn't paging...

c. Can someone point me to good sample code for reading a file
line by line in C and then putting it into a Ruby hashtable?

No. Profile, profile, profile.

d. How much of the slowness is due to the fact that it is Ruby
and not C?

We can't tell you without a profile. Profile, profile, profile.

To give you an idea of how slow this actually is: just reading all the
files line by line takes around 8-9 hrs, whereas the above thing
easily takes 5-6 days!! And I am quite unable to run a profiler on my
code as it is just too slow.

Lies.

Use a reduced dataset with ruby-prof or zenprofile.

You know nothing without a profile.
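
Something along these lines on a few megabytes of input will show you where
the time actually goes (a sketch; process_sample is a stand-in for your real
code, and ruby-prof's API may differ slightly between versions):

  require 'ruby-prof'

  result = RubyProf.profile do
    process_sample('small_slice.txt')   # run the real code on a cut-down input
  end

  RubyProf::FlatPrinter.new(result).print(STDOUT)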

I would be very grateful for your comments, and particularly if you
have any suggestions/experience on doing this in a fast way.

Profile it, you can't make sane changes without one.
 
Farrel Lifson

Also try running your code with some sample data through the Ruby
profiler (just run 'ruby -rprofile yourcode.rb') and it should give
you an idea of where your program is spending its time.

Farrel
 
Jan Svitok

Maybe it's possible to preprocess the files with grep or something
similar (like splitting them into hour-long slices, etc.), using Ruby just
for the sorting and merging - that way you could make it parallel, and
besides, grep & co. are supposed to be much faster.
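
A rough sketch of the splitting step in Ruby, in case it helps (the
timestamp extraction is a placeholder for the real line format):

  # bucket every line of one source file into per-hour slice files
  slices = Hash.new { |h, hour| h[hour] = File.open("slice_#{hour}.txt", 'a') }

  File.foreach('f1') do |line|
    ts = line[/\d+(\.\d+)?/] or next   # hypothetical: first number on the line is the timestamp
    slices[(ts.to_f / 3600).floor].puts line
  end

  slices.each_value { |f| f.close }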
 
Robert Klemme

Devesh said:
I am using Ruby to analyse a huge (around 60 GB) amount of my networking
experiment data. Let me briefly describe my technique: I have to read
around 40 files (of around 1.5 GB each) named f1, f2, .... Each file fi
contains traceroutes to lots of destinations at different times, i.e. a
file is basically a list of traceroutes launched from a given src (src =
filename) at different times. I want to get a structure like the
following: (list of all traceroutes from *all* srcs at time 1), (list of
all traceroutes from *all* srcs at time 2)... and so on.

Btw, I can't use an array instead of a hashtable H, as the P.time values
read across all files needn't be the same.

This is performing miserably SLOW. I have the following questions:

First I have a question: why do you read those files in parallel in the
first place?

i. How fast is f.readline? I want to use the maximum buffering
possible for the largest speed gains. In Ruby, how do I set the buffer size?
I looked through io.c, and it seems that readline essentially uses getc
(stopping when it gets a newline). How can I set the buffer size for the
underlying libc FILE*? Oh, btw, each line is approx 200-400 bytes.

ii. Marshal.dump is also very slow. Is there an alternative? YAML is
even worse.

No, Marshal is actually pretty fast. It may be due to the other IO you
do or because of the data you write.

iii. Is it bad to have around 40-50 files open at the same time?

No, but reading from all those files in /parallel/ is. It is of course
platform dependent how the IO subsystem deals with that, but chances are
that the disk heads have to move back and forth between all the files.

iv. The program does use a lot of memory, but not that much: around 30-40
percent of a 1 GB RAM machine is used by it. So I think paging in/out is
not a problem.

It is better to not believe but to know that paging is not an issue.

v. Would coding the readline part in C using RubyInline offer me speed
advantages?

Very unlikely.

vi. I am thinking of trying the following to reduce the time it takes;
I would very much welcome your comments:

a. Remove Marshal.dump [I don't need to strictly serialize objects,
only dump the data and read it back] and replace it with some string
form which is more compact. Actually, is it possible to have something
like fixed-length structures as in C? For example, I would want P to be
like this: struct P { char foo[100]; int a[100]; }. That way I think
the IO would be faster, as I could just dump a fixed number of bytes to a
file.

Don't do the writing to a temp file while you are reading. That puts
even more burden on your IO subsystem. Btw, what filesystem do you use?
You don't happen to be on SUSE with ReiserFS, do you?

b. Try to reduce the memory consumption of this by reducing k further,
so that the program doesn't page in/out.

Without /knowing/ that paging is an issue this does not make sense.

c. Can someone point me to good sample code for reading a file line
by line in C and then putting it into a Ruby hashtable?
d. How much of the slowness is due to the fact that it is Ruby and not
C?

Here's my assessment: you do not have a programming language problem but
a design problem. Reading from multiple large files at the same time is
inefficient.

To give you an idea of how slow this actually is: just reading all the
files line by line takes around 8-9 hrs.

You need 8 hours just for reading 60 GB? That's about 2.1 MB/s - this seems
unrealistically slow. Is this old hardware? Do you have drive
problems? Are there any other disk-intensive tasks going on (a busy DB
or web server) on that machine?

Whereas the above thing easily takes 5-6 days!! And I am quite unable to
run a profiler on my code as it is just too slow.

Clearly.

I would be very grateful for your comments, and particularly if you have
any suggestions/experience on doing this in a fast way.

Here's my advice: rewrite the program to read those files sequentially
because that is likely faster on most systems. Remove the temp file
writing while files are read. And find out why your IO system is
performing so badly.
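
A rough sketch of what I mean, assuming one file's worth of parsed records
fits comfortably in memory (parse_line stands in for your real parser):

  Dir['f*'].sort.each do |path|
    h = Hash.new { |hash, t| hash[t] = [] }

    # read this one file front to back, with nothing else touching the disk
    File.foreach(path) do |line|
      p = parse_line(line)
      h[p.time] << p
    end

    # only now write this source's result out in one go
    File.open("#{path}.bytime", 'wb') do |out|
      h.keys.sort.each { |t| out << Marshal.dump([t, h[t]]) }
    end
  end

You would still need a merge pass over the 40 per-source outputs afterwards,
but every pass stays sequential.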

Kind regards

robert
 
devesh

Hi everyone, thanks for replying.

First, a couple of stupid things that I am doing, then a couple more
questions.

Stupid things:
i. I parse each line, but that essentially amounts to splitting it and
such. So maybe I would be better off using scanf-kind of stuff, and also
compiling regexps once and using them for each line. I heard that helps
(a small sketch of what I mean follows below, after this list).

ii. Going through my hash H to delete keys is rather stupid. A better way
is to just dump it and trash it.

iii. When I said it takes 8-9 hours, I meant 8-9 hours to read each
line and parse it (with my rather inefficient parse function). And I
did this one file at a time, in sequence.

iv. I will definitely try to profile my code. I did use -rprofile, but
that kind of revealed the obvious: most of my time was spent in the
file-reading loop > marshalling > parsing. Plus it was rather slow for
even 100 MB of stuff.
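
For (i), something like this is what I have in mind (the pattern is a
made-up stand-in for my real traceroute line format, and h/path are as in
my pseudocode):

  # a regexp literal without interpolation is compiled only once, so an
  # inline literal and a constant cost the same
  LINE_RE = /^(\d+\.\d+)\s+(\S+)\s+(\S+)/   # "time dst rtt"

  File.foreach(path) do |line|
    next unless line =~ LINE_RE
    time, dst, rtt = $1, $2, $3.to_f
    h[time] << [dst, rtt]
  end

Or just a plain String#split if the fields are whitespace-separated.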

Oh, and btw, the disk I am using is an LVM-mapped ext3 local disk. The
server runs CentOS. And I think it is pretty well managed.

Now my questions:

i. I was earlier doing this: read each file in sequence, and then
write out a similar hash, containing data from one src for each time.
Even with this hash, I used to dump it to temporary files (one per hash
key = time instant), and then after doing this for every file (src), I
would merge all the data in the temporary files into what I want.
This was slow (but in retrospect not as slow as what I am doing now by
reading in parallel). The reason I thought opening multiple files and
reading from them line by line would be faster is because I wouldn't have
to open/close all those thousands of temporary files.
I still don't see the reason why doing things in parallel would screw
things up. My assumption is that the disk head is pretty wild anyway, as
the system will involve a lot of IO between context switches.

ii. Also, why do you think that writing another file in the midst of
reading (one or more) files is a bad idea? The only way I can avoid
this is to chunk up the files into smaller units and then process them,
writing their results to temp files and then proceeding further with
the other files. Which one do you think is the better thing to do?

I will now try the following (based on your suggestions):
i. Avoid heavyweight regexps in parsing lines
ii. Avoid marshalling/dumping
iii. Focus on doing things a file at a time.
iv. If all else fails, I will try using something like Starfish etc.

Btw, will using something like an mmap extension for Ruby speed things
up for me?

Thanks a lot to all of you. I am sorry if I sound lame, but I am pretty
new to Ruby.


 
ara.t.howard

use threads:


harp:~ > ls -ltarh data/big
-rw-rw-r-- 1 ahoward ahoward 251M Nov 15 16:41 data/big

harp:~ > wc -l data/big
7131456 data/big

harp:~ > head data/big
1163633749.75535 = 0.0877033759844916
1163633749.75565 = 0.913142160532852
1163633749.75569 = 0.604664544000001
1163633749.75571 = 0.233515496811041
1163633749.75574 = 0.221557802561386
1163633749.75577 = 0.241982349948893
1163633749.7558 = 0.190149667316971
1163633749.75583 = 0.0827503931446325
1163633749.75586 = 0.656736160359522
1163633749.75588 = 0.222579171509354


using my threaded program i can process these lines at a rate of about


harp:~ > ruby a.rb data/big 42 42
rate : 0.0 lines/second
rate : 14094.9956719788 lines/second
rate : 21059.6378674178 lines/second
rate : 22742.0758748341 lines/second
rate : 20527.6435560232 lines/second
rate : 16541.8249949497 lines/second
rate : 18295.1759919063 lines/second
rate : 19648.3251130277 lines/second


so, let's say about 20,000 lines per second - so it'd take about 5 minutes to
process my 250mb file. i realize yours is more complex, but let's say you
should be able to do 1gb of data on a reasonably fast machine in something
like 20-30 minutes.


btw. mmap won't help (much) since you are just doing a sequential read of ALL
of the data anyhow...



here's the code



harp:~ > cat a.rb

class LineProducer
  require 'thread'
  EOF = nil

  # a Queue that, once EOF has been pushed, answers EOF to every further pop
  class EOFQueue < ::Queue
    def push arg
      if arg == EOF
        class << self; def pop() EOF end; end
      end
    ensure
      return super
    end
  end

  def initialize path, bufsize
    @path = path
    @bufsize = Integer bufsize
    @lines = EOFQueue.new
    @thread = new_reader
  end

  def join() @thread.join end

  # reader thread: reads the file in bufsize chunks and pushes arrays of
  # complete lines onto the queue
  def new_reader
    Thread.new do
      Thread.current.abort_on_exception = true
      buf = ''
      open(@path) do |f|
        while((buf2 = f.read @bufsize))
          buf << buf2
          lines = buf.scan %r/^.*$\n?/
          if lines.last =~ %r/$\n/
            buf = ''
          else
            buf = lines.pop
          end
          @lines.push lines
        end
      end
      @lines.push EOF
    end
  end

  # each consumer thread pops arrays of lines and hands them to the block
  # until EOF comes through
  def new_consumer &b
    Thread.new{
      Thread.current.abort_on_exception = true
      while((lines = @lines.pop)); b[ lines ]; end
    }
  end
end

# a plain Hash with every call serialized through a Sync lock
class ThreadSafeHash
  require 'sync'
  instance_methods.each{|m| undef_method m unless m[%r/__/]}
  def initialize
    @hash = {}
    @sync = Sync.new
  end
  def method_missing m, *a, &b
    @sync.synchronize{ @hash.send m, *a, &b }
  end
  def respond_to? m, *a, &b
    @sync.synchronize{ @hash.respond_to? m, *a, &b }
  end
end


path = ARGV.shift or abort 'no path'
n = Integer(ARGV.shift || 42)
hn = Integer(ARGV.shift || 42)
mb = 2 ** 20

lines_producer = LineProducer.new path, mb
tsh = ThreadSafeHash.new

start = Time.now.to_f
elapsed = lambda{|start| Time.now.to_f - start.to_f}
progress = Thread.new{ loop{ puts "rate : #{ tsh.size / elapsed[start] } lines/second"; sleep 2 } }

parse = lambda{|line| line.split(%r/=/).map{|word| word.strip}}

threads = Array.new(n).map{
  lines_producer.new_consumer do |lines|
    h = {} and lines.each do |line|
      k, v = parse[ line ]
      h[k] = v
    end
    tsh.update h
  end
}


lines_producer.join
threads.map{|t| t.join}



regards.



-a
 
M. Edward (Ed) Borasky

Basically, what you are describing here looks like a poorly specified
database load followed by queries. Rather than trying to process 60 GB
of data from a bunch of flat files in and out of data structures in an
interpreted, dynamic, garbage-collected language, what I would recommend
is the following:

1. Pick up the book "Data Crunching" from Pragmatic Programmers. Learn
about databases, tables and queries.
2. Choose a convenient database. I think for something this size, your
options are Microsoft SQL Server Express and PostgreSQL, unless you can
afford Oracle. This is way too big for MS Access and possibly for MySQL,
although I haven't tried anything on this scale with MySQL, just SQL
Server and PostgreSQL.
3. Write a *simple* line-by-line Ruby script to just parse your data and
insert the rows into a database. I haven't looked at traceroute data
recently and I don't quite understand the queries you'd want, so I can't
give you hints on database design/table structure other than that you
need one for a problem of this magnitude.
4. Once you've got the data loaded into the database, there are quite a
few tools you can use to make the data mining and querying easier.
Again, since I don't really understand your objectives, I can't pick one
for you. If you're in the mood to learn Rails, you can probably do some
interesting things that way.

I do a lot of stuff like this. Usually I massage the data with Perl into
a CSV format, then load it manually into PostgreSQL. After that, I do my
data mining with R using ODBC. But you can do all of it in Ruby. The key
is to use an industrial strength database for 60 GB worth of data you're
trying to mine.
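
For step 3, the loader can be very plain. A rough sketch using the
PostgreSQL 'pg' gem (the table layout and parse_line are invented
placeholders - design the schema around the queries you actually need):

  require 'pg'

  db = PG.connect(dbname: 'traces')
  db.exec(<<-SQL)
    CREATE TABLE IF NOT EXISTS traceroutes (
      src text,
      ts  double precision,
      hop integer,
      ip  inet,
      rtt double precision)
  SQL
  db.prepare('ins', 'INSERT INTO traceroutes VALUES ($1, $2, $3, $4, $5)')

  Dir['f*'].each do |path|
    File.foreach(path) do |line|
      rec = parse_line(line) or next                  # placeholder parser
      db.exec_prepared('ins', [path, rec.time, rec.hop, rec.ip, rec.rtt])
    end
  end

For 60 GB you would really want COPY rather than row-by-row INSERTs, but
this shows the shape of it.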
 
Robert Klemme

i. I parse each line, but that essentially amounts to splitting it and
such. So maybe I would be better off using scanf-kind of stuff, and also
compiling regexps once and using them for each line. I heard that helps.

Using regexps inline is the most efficient way in Ruby. You can find
numerous benchmark postings in this group.

ii. Going through my hash H to delete keys is rather stupid. A better way
is to just dump it and trash it.

I don't exactly understand what you mean, but that's probably because I
still do not have a clear picture of the processing you are doing.

iii. When I said it takes 8-9 hours, I meant 8-9 hours to read each
line and parse it (with my rather inefficient parse function). And I
did this one file at a time, in sequence.

iv. I will definitely try to profile my code. I did use -rprofile, but
that kind of revealed the obvious: most of my time was spent in the
file-reading loop > marshalling > parsing. Plus it was rather slow for
even 100 MB of stuff.

I don't think you have a CPU issue - this looks rather like an IO issue.

Oh, and btw, the disk I am using is an LVM-mapped ext3 local disk. The
server runs CentOS. And I think it is pretty well managed.

Sounds good.

Now my questions:

i. I was earlier doing this: read each file in sequence, and then
write out a similar hash, containing data from one src for each time.
Even with this hash, I used to dump it to temporary files (one per hash
key = time instant), and then after doing this for every file (src), I
would merge all the data in the temporary files into what I want.
This was slow (but in retrospect not as slow as what I am doing now by
reading in parallel). The reason I thought opening multiple files and
reading from them line by line would be faster is because I wouldn't have
to open/close all those thousands of temporary files.

Wait, did you say "thousands of temporary files"? Wow, *that* is likely
going to kill performance.

I still don't see the reason why doing things in parallel would screw
things up. My assumption is that the disk head is pretty wild anyway, as
the system will involve a lot of IO between context switches.

That entirely depends on the other processing going on on that machine.
And there is no point in making it worse by accessing multiple files
at the same time.

ii. Also, why do you think that writing another file in the midst of
reading (one or more) files is a bad idea? The only way I can avoid
this is to chunk up the files into smaller units and then process them,
writing their results to temp files and then proceeding further with
the other files. Which one do you think is the better thing to do?

Writing is usually slower than reading. But it entirely depends on what
you actually do.

I will now try the following (based on your suggestions):
i. Avoid heavyweight regexps in parsing lines

Not necessarily a good idea - especially since it seems your task is IO
bound. You can easily verify that with "vmstat 5" while your processing
is running.

ii. Avoid marshalling/dumping

This in itself is not slow. But if you do this into multiple small
files you'll have problems - not because of Marshal being slow but
because of the IO.

iii. Focus on doing things a file at a time.
iv. If all else fails, I will try using something like Starfish etc.

Btw, will using something like an mmap extension for Ruby speed things
up for me?

I don't believe so, because your problem seems to be the IO access pattern.

Thanks a lot to all of you. I am sorry if I sound lame, but I am pretty
new to Ruby.

Can you describe in more detail what you are actually doing? Kind of:
are you combining data from multiple files? How long do you need temp
data? What do you output - things like that.

It may actually turn out that Ed's advice to use an RDBMS is the best
solution, because that gives you efficient IO. But to confirm or reject
that I would have to have a better understanding of what business problem
you are trying to solve.

Cheers

robert
 
Neil Wilson

Btw, will using something like an mmap extension for Ruby speed things
up for me?

Can't harm to try. I use it for uploading large files and it helps
there.

Unfortunately, operating systems don't always behave like you think they
ought to when doing something like this. Get to know your profiler and
brush up on your data processing techniques. When you hit a speed
problem, you *have* to start to understand what the machine is doing
with each high-level instruction you give it and how they all interact
with each other.

NeilW
 
Robert Klemme

Can't harm to try. I use it for uploading large files and it helps
there.

I beg to differ: it can actually harm to apply an optimization measure
without making sure that it actually solves the problem. As long as you
do not know what makes your program slow, trying out measures is pretty
much a waste of time.

Unfortunately, operating systems don't always behave like you think they
ought to when doing something like this. Get to know your profiler and
brush up on your data processing techniques. When you hit a speed
problem, you *have* to start to understand what the machine is doing
with each high-level instruction you give it and how they all interact
with each other.

Exactly. I'd start with vmstat to find out whether we are IO bound or
CPU bound.

Regards

robert
 
M. Edward (Ed) Borasky

Just for the sake of amusement, I did a traceroute to see what the
output looks like:

$ traceroute -n rubyforge.org
traceroute to rubyforge.org (205.234.109.18), 30 hops max, 38 byte packets
1 192.168.0.1 4.080 ms 2.845 ms 1.932 ms
2 * * *
3 68.87.218.17 9.540 ms 7.114 ms 7.187 ms
4 68.87.216.29 8.025 ms * 9.980 ms
5 12.116.188.9 21.930 ms 21.446 ms 21.619 ms
6 12.123.12.126 23.060 ms 21.566 ms 21.739 ms
MPLS Label=31697 CoS=0 TTL=1 S=1
7 12.123.13.185 22.617 ms 22.220 ms 20.848 ms
8 206.24.211.1 21.786 ms 22.516 ms 20.589 ms
9 204.70.194.50 30.330 ms 28.985 ms 204.70.192.113 106.454 ms
MPLS Label=202112 CoS=0 TTL=1 S=1
10 204.70.192.85 62.240 ms 204.70.192.134 87.983 ms 86.297 ms
MPLS Label=509440 CoS=0 TTL=1 S=1
11 204.70.192.50 64.302 ms 204.70.192.25 104.547 ms 204.70.192.50
65.496 ms
MPLS Label=190688 CoS=0 TTL=1 S=1
12 204.70.192.69 81.608 ms 208.173.10.201 105.080 ms 204.70.192.69
82.155 ms
MPLS Label=131056 CoS=0 TTL=1 S=1
13 208.173.50.154 104.882 ms 136.592 ms 204.70.192.62 95.377 ms
14 205.234.109.18 101.528 ms !<10> 206.24.238.218 100.340 ms 100.945 ms
15 205.234.109.18 100.920 ms !<10> 208.173.50.154 100.550 ms
205.234.109.18 100.802 ms !<10>

So ... we have *one* traceroute from point a (my system) to point b
(rubyforge.org). I'm assuming the original poster has some simple shell
script that does a loop forever around something like "date; traceroute
-n ip.add.re.ss; sleep 300" to get a log of timestamped traceroutes
every five minutes. Then he wants to collect hundreds of these lines
from many files (60 GB total) into a huge in-core Ruby structure, then
make some kind of report from it. It got so big from the naive approach
that he started marshalling it out of core in chunks, leading to the
code he posted here.

A little arithmetic: that single traceroute above is about 1000 bytes.
60 GB is 60 million times 1000 bytes, which means our collection of 60
GB of traceroute data has about 60 million individual traceroutes,
assuming they are as long as my off-the-wall traceroute over the open
internet to rubyforge. Obviously they will be much smaller if done on a
well-designed corporate intranet, which means there are many more of them.

In any event, it looks to me like he is trying to reinvent some network
monitoring wheel for which there are undoubtedly both expensive
commercial packages and wonderful full-featured open source tools
available. I wrote a lot of code like this ten years ago (in ksh, awk
and Perl 4) when such tools weren't available. Now they are, and there
are Ruby interfaces to most of them. So in addition to buying and
learning from "Data Crunching", I'm going to recommend the original
poster head over to

http://oss.oetiker.ch/rrdtool/

and do some window shopping. IIRC there is a Ruby interface to RRDTool,
although I've forgotten whether it's currently actively maintained -- it
wasn't a year or so ago when I first started investigating Ruby. And
there are definitely tools to migrate a "native" round robin database
into an "ordinary" relational database.
 
Robert Klemme

It got so big from the naive approach
that he started marshalling it out of core in chunks, leading to the
code he posted here.

Which code? I saw only the pseudocode from the first posting. Is
there something that hasn't made it through the gateway again?

In any event, it looks to me like he is trying to reinvent some network
monitoring wheel for which there are undoubtedly both expensive
commercial packages and wonderful full-featured open source tools
available. I wrote a lot of code like this ten years ago (in ksh, awk
and Perl 4) when such tools weren't available. Now they are, and there
are Ruby interfaces to most of them.

This makes perfect sense to me.

Kind regards

robert
 
M. Edward (Ed) Borasky

Robert said:
Which code? I saw only the pseudocode from the first posting. Is
there something that hasn't made it through the gateway again?

I was referring to the pseudocode.
 
