my ruby code won't go as fast as my perl code

Dave Burt

I realise I'm doing this a perlish way, but my question is, is it possible
to do this operation in Ruby in a time more comparable to what the Perl
version's getting? (That's about 4 seconds; my Ruby code runs in about 17
seconds over the same data set, which is far smaller than the production
data set.)

Basically, we have CSV files with a date like 31-DEC-03 23:59:59 as the
first field (always in order), and the task is to grab into an array (to
later process further) just the parts of each file that fall after a given
date.

The main slow bit seems to be the string concatenation and comparison
(...+$4+$5+$6 >= start_date).

################################################################

#!perl

$start_date = '20040000000000'; # "yyyymmddhhmmss"
$dir = "data";

@months = qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);
%mm = ();
for ($i = 0; $i < 12; $i++) {
    $mm{$months[$i]} = sprintf('%.2d', $i)
}
undef @months;

@a = ();

opendir DIR, $dir;
while ($_ = readdir DIR) {
    next if /^\./; # skip dotfiles
    open IN, "$dir/$_";
    while (<IN>) {
        /^(\d\d)-(\w\w\w)-(\d\d) (\d\d):(\d\d):(\d\d)/;
        $cc = ($3 ge '87' ? '19' : '20');
        if ("$cc$3$mm{$2}$1$4$5$6" ge $start_date) {
            while (<IN>) {
                push @a, $_;
            }
        }
    }
    close IN;
}
closedir DIR;

$t = time - $t;
print "Read " . scalar(@a) . " lines in $t seconds$/"; # 4 seconds

$t = time;
open OUT, ">perl.out";
print OUT @a;
$t = time - $t;
print "Wrote in $t seconds$/"; # 3 seconds

################################################################

#!ruby

mm = Hash.new
i = '00'
%w(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC).each do |mmm|
  mm[mmm] = i = i.succ
end

start_date = '20040000000000' # "yyyymmddhhmmss"
dir = "data"

date_regex = /^(\d\d)-(\w\w\w)-(\d\d) (\d\d):(\d\d):(\d\d)/
a = []
t = Time.new
reading = false

Dir.open(dir).each do |file|
  next if file[0] == ?. # skip dotfiles
  reading = false
  File.open(dir + '/' + file).each_line do |line|
    reading ||= (date_regex =~ line &&
                 (($3 >= '87' ? '19' : '20') + $3 + mm[$2] + $1 + $4 + $5 + $6 >= start_date))
    a << line if reading
  end
end

t = Time.new - t
puts "Read #{a.size} lines in #{t} seconds" # 17 seconds

t = Time.new
File.open('ruby.out', 'w') do |f|
  f.print a.join
end
t = Time.new - t
puts "Wrote #{a.size} lines in #{t} seconds" # 3 seconds
 
Lennon Day-Reynolds

I'm not sure how the performance would compare, but if you really want
to write this in the "Ruby style," you might consider using the
DateTime class from the standard 'date' module.

Ex:

require 'date'
d = DateTime.parse('31-DEC-03 12:59:59')
d.year
=> 3
d.month
=> 12
d.hour
=> 12

(etc., etc.)
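
Used for the filter itself it might look like this (a sketch, untested
against the real data; note that two-digit years '03' and '04' parse as
years 3 and 4, which still compare correctly against each other, but a row
from 1987 would parse as year 87 and sort after year 4, so the century
fix-up from the original script would still be needed):

require 'date'

start_date = DateTime.parse('01-JAN-04 00:00:00') # parses as year 4
row_date = DateTime.parse('31-DEC-03 23:59:59')   # parses as year 3
row_date >= start_date # => false, so this row would be skipped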

In general, though, I usually feel that any time my Ruby code is
running within a constant factor of the time the equivalent Perl (or
Python) code takes, I'm probably not doing anything wrong. If I start
to see more troublesome scaling issues, though (i.e., exponential
runtime increases relative to input size), then there's probably
something that needs to be done to the code.

Lennon
 
gabriele renzi

I realise I'm doing this a perlish way, but my question is, is it possible
to do this operation in Ruby in a time more comparable to what the Perl
version's getting? (That's about 4 seconds; my Ruby code runs in about 17
seconds over the same data set, which is far smaller than the production
data set.)

not sure, maybe you can make it 4 lines instead of 17 by using csv.rb
:)

Probably this can be sped up, but Perl is often faster than Ruby.

PS
sorry, I'm going away in ten minutes, so no time to play :/
 
nobu.nokada

Hi,

At Thu, 15 Jul 2004 15:07:18 +0900,
Dave Burt wrote in [ruby-talk:106480]:
I realise I'm doing this a perlish way, but my question is, is it possible
to do this operation in Ruby in a time more comparable to what the Perl
version's getting? (That's about 4 seconds; my Ruby code runs in about 17
seconds over the same data set, which is far smaller than the production
data set.)

Which version of ruby do you use?
Dir.open(dir).each do |file|
  next if file[0] == ?. # skip dotfiles
  reading = false
  File.open(dir + '/' + file).each_line do |line|
    reading ||= (date_regex =~ line &&
                 (($3 >= '87' ? '19' : '20') + $3 + mm[$2] + $1 + $4 + $5 + $6 >= start_date))
    a << line if reading
  end
end

You open files but never close them; this may cause too frequent GC.
IO.foreach reads the file and closes it for you:

IO.foreach(dir + '/' + file) do |line|
  if reading
    a << line
  else
    reading = (date_regex =~ line &&
               (($3 >= '87' ? '19' : '20') + $3 + mm[$2] + $1 + $4 + $5 + $6 >= start_date))
  end
end
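
If File.open is preferred, the block form closes the handle
automatically; a sketch of the same loop:

File.open(dir + '/' + file) do |f|
  f.each_line do |line|
    # process the line; the file is closed when the block exits,
    # so no handles are left open waiting for GC
  end
end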
 
Ara.T.Howard

On Thu, 15 Jul 2004, Dave Burt wrote:

I realise I'm doing this a perlish way, but my question is, is it possible
to do this operation in Ruby in a time more comparable to what the Perl
version's getting?

can you send me some sample data?
-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
denis

Dave Burt said:
The main slow bit seems to be the string concatenation and comparison
(...+$4+$5+$6 >= start_date).

The addition operator always creates a new String object, so when you
chain a lot of additions you create a lot of temporary String objects.
You can avoid that by using the append method (<<):

(($3 >= '87' ? '19' : '20') << $3 << mm[$2] << $1 << $4 << $5 << $6 >= start_date)

Alternatively you can use string interpolation:

("#{$3 >= '87' ? '19' : '20'}#{$3}#{mm[$2]}#{$1}#{$4}#{$5}#{$6}" >= start_date)

or Array#join, like this:

([($3 >= '87' ? '19' : '20'), $3, mm[$2], $1, $4, $5, $6].join >= start_date)
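
To compare the approaches, something like this minimal benchmark sketch
could be used (made-up field values; absolute numbers will vary with
machine and Ruby version):

require 'benchmark'

parts = %w(20 04 07 15 12 34 56) # [cc, yy, mm, dd, hh, mm, ss], made-up values
n = 100_000

Benchmark.bm(14) do |bm|
  # each + allocates a fresh intermediate String
  bm.report('+ (concat)')    { n.times { parts[0] + parts[1] + parts[2] + parts[3] + parts[4] + parts[5] + parts[6] } }
  # << appends in place; dup the first element so it isn't mutated
  bm.report('<< (append)')   { n.times { parts[0].dup << parts[1] << parts[2] << parts[3] << parts[4] << parts[5] << parts[6] } }
  # interpolation builds the result in one buffer
  bm.report('interpolation') { n.times { "#{parts[0]}#{parts[1]}#{parts[2]}#{parts[3]}#{parts[4]}#{parts[5]}#{parts[6]}" } }
  bm.report('Array#join')    { n.times { parts.join } }
end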

I hope it helps

Denis
 
denis

Dave Burt said:
The main slow bit seems to be the string concatenation and comparison
(...+$4+$5+$6 >= start_date).

You could also try this (with string arrays):

# [yy, yy, mm, dd, hh, mm, ss] as strings
start_date = %w{20 04 00 00 00 00 00}
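
The comparison itself could then use Array#<=>, which compares element by
element; a sketch, assuming line, date_regex and mm from the original script:

if date_regex =~ line
  fields = [$3 >= '87' ? '19' : '20', $3, mm[$2], $1, $4, $5, $6]
  reading = ((fields <=> start_date) >= 0)
end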
 
Ara.T.Howard

minimize IO and use the fast stringscanner library:


~ > parse.rb csv/
Read 29696 lines in 11.699876 seconds
Wrote 29696 lines in 0.03556 seconds

~ > parse.pl csv/
Read 29696 lines in 7 seconds
Wrote in 0 seconds

~ > diff -u perl.out ruby.out


here's the code(s). note that your perl script had two bugs in it - times were
not reported correctly and the first line containing a valid starting date was
not written to file. the below assumes (like your code does) that the input is
sorted in ascending order (probably not a good assumption since it will fail
silently if not):


~ > cat parse.rb
#!/usr/bin/env ruby
require 'strscan'
dir = ARGV.shift

mm = Hash.new
i = '00'
%w(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC).each do |mmm|
  mm[mmm] = i = i.succ
end

start_date = '20040101000000' # "yyyymmddhhmmss"
date_regex = /^(\d\d)-(\w\w\w)-(\d\d) (\d\d):(\d\d):(\d\d).*$\n/o
anyline = %r/^.*$\n/o
a = []
t = Time.new
buf = nil

Dir.foreach(dir) do |path|
  next if path[0] == ?.

  buf = IO.read(File.join(dir, path))
  s = StringScanner.new buf

  while s.rest?
    if s.scan date_regex
      date = "#{ s[3] >= '87' ? '19' : '20' }#{ s[3] }#{ mm[s[2]] }#{ s[1] }#{ s[4] }#{ s[5] }#{ s[6] }"
      if date >= start_date
        a << s[0]
        a << s.scan(anyline) while s.rest?
      end
    else
      s.scan anyline
    end
  end
end

t = Time.now - t
puts "Read #{ a.size } lines in #{ t } seconds"
t = Time.now
File.open('ruby.out', 'w'){|f| a.each{|e| f.print e}}
t = Time.now - t
puts "Wrote #{ a.size } lines in #{ t } seconds"

~ > cat parse.pl
#!/usr/bin/env perl
$dir = shift;
$start_date = '20040000000000'; # "yyyymmddhhmmss"

@months = qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);
%mm = ();
for ($i = 0; $i < 12; $i++) {
    $mm{$months[$i]} = sprintf('%.2d', $i)
}
undef @months;

@a = ();
$t = time;

opendir DIR, $dir;
while ($_ = readdir DIR) {
    next if /^\./; # skip dotfiles
    open IN, "$dir/$_";
    while (<IN>) {
        /^(\d\d)-(\w\w\w)-(\d\d) (\d\d):(\d\d):(\d\d)/;
        $cc = ($3 ge '87' ? '19' : '20');
        if ("$cc$3$mm{$2}$1$4$5$6" ge $start_date) {
            push @a, $_;
            while (<IN>) {
                push @a, $_;
            }
        }
    }
    close IN;
}
closedir DIR;

$t = time - $t;
print "Read " . scalar(@a) . " lines in $t seconds$/";
$t = time;
open OUT, ">perl.out";
print OUT @a;
$t = time - $t;
print "Wrote in $t seconds$/";


i generated the data sets with this:

~ > cat gendata.rb
require 'fileutils'
dir = ARGV.shift
FileUtils.mkdir_p dir

t_start = Time.mktime(1987)
t_end = Time.now
delta_t = t_end - t_start
t_fmt = '%d-%b-%y %H:%M:%S' # like 31-DEC-03 23:59:59

1024.times do |fn|
  path = File.join dir, "#{ fn }.csv"
  open(path, 'w') do |f|
    time = t_start + rand(delta_t)
    1024.times do |lineno|
      row = time.strftime(t_fmt).upcase, rand(42), rand(42), rand(42), rand(42)
      f.puts(row.join(','))
      time += rand(42)
    end
  end
end

it makes 1024 files, each of 1024 lines containing ordered tuples of a format
like your input data.
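
since the parser assumes ascending order and fails silently otherwise, a
cheap sanity pass could guard against unsorted input; a sketch, reusing
date_regex and mm from above:

prev = nil
IO.foreach(path) do |line|
  next unless line =~ date_regex
  stamp = "#{ $3 >= '87' ? '19' : '20' }#{ $3 }#{ mm[$2] }#{ $1 }#{ $4 }#{ $5 }#{ $6 }"
  # rows must never go backwards in time
  raise "#{ path } is not sorted ascending" if prev && stamp < prev
  prev = stamp
end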


-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
Sean Russell

#!ruby
require 'csv'
require 'parsedate'

start_date = Time.local(*ParseDate.parsedate('1-APR-03 00:00:00'))
dir = "data"
a = []

t = Time.new
Dir.entries( dir ).each do |f|
  fpath = File.join( dir, f )
  if FileTest.file?( fpath )
    CSV.parse( fpath ) do |row|
      a << row if Time.local(*ParseDate.parsedate(row[0].to_s)) >= start_date
    end
  end
end
t = Time.new - t
puts "Read #{a.size} lines in #{t} seconds"

# It would be more efficient to output the data in the other loop, but I
# assume you're wanting to do some additional processing to it here.
t = Time.new
File.open( "ruby.out", "w" ) do |outfile|
  CSV::Writer.generate(outfile) do |csv|
    a.each {|r| csv.add_row r }
  end
end
t = Time.new - t
puts "Wrote #{a.size} lines in #{t} seconds"
 
Ernie

Dave Burt said:
I realise I'm doing this a perlish way, but my question is, is it possible
to do this operation in Ruby in a time more comparable to what the Perl
version's getting? (That's about 4 seconds; my Ruby code runs in about 17
seconds over the same data set, which is far smaller than the production
data set.)

Basically, we have CSV files with a date like 31-DEC-03 23:59:59 as the
first field (always in order), and the task is to grab into an array (to
later process further) just the parts of each file that fall after a given
date.

The main slow bit seems to be the string concatenation and comparison
(...+$4+$5+$6 >= start_date).

################################################################
I built 3 files, each with only 9 lines of data; one of the files had a line
that fails the regex test. I read the files and find the appropriate lines
1000 times in just under 4 seconds on a Pentium 850 Windows XP machine
running Ruby 1.8.1-12.

In the code below I use interpolation rather than concatenation, which
speeds things up a little (about 1.5 seconds in the trial). I also take
advantage of a couple of Ruby features: Array#delete_if to eliminate the
lines that fail the regex, and a function added to class Array that does a
binary search for the right place in the array. This eliminates some
searching through the file and will speed up your search. Of course, the
increase in speed will depend on how many lines exist in each file and how
many lines precede the one where you want to start.

You could write a function in Perl that does a binary search as well.

Since you have sorted dates to begin with, there is no reason not to do a
binary search. Please reply to the group with your results if you try this
on your data.

Ernie


class Array
  def findGE(start_date, date_regex, mm)
    starter = 0
    ender = self.length
    while true do
      pt = (ender - starter) / 2 + starter
      date_regex =~ self[pt]
      # if ($3>='87'?'19':'20')+$3+mm[$2]+$1+$4+$5+$6 >= start_date
      if "#{$3 >= '87' ? '19' : '20'}#{$3}#{mm[$2]}#{$1}#{$4}#{$5}#{$6}" >= start_date
        ender = pt
      else
        starter = pt
      end
      if (ender - starter) <= 1
        date_regex =~ self[starter]
        return starter if "#{$3 >= '87' ? '19' : '20'}#{$3}#{mm[$2]}#{$1}#{$4}#{$5}#{$6}" >= start_date
        date_regex =~ self[ender]
        return ender if "#{$3 >= '87' ? '19' : '20'}#{$3}#{mm[$2]}#{$1}#{$4}#{$5}#{$6}" >= start_date
        return false
      end
    end
  end
end

mm = Hash.new
i = '00'
%w(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC).each do |mmm|
  mm[mmm] = i = i.succ
end
start_date = '20040000000000'
date_regex = /^(\d\d)-(\w\w\w)-(\d\d) (\d\d):(\d\d):(\d\d)/
dir = "C:/dataTest"

t = Time.now.to_f
1000.times do
  a = []
  aFinal = []
  Dir.open(dir).each do |file|
    next if file[0] == ?. # skip dotfiles
    File.open(dir + '/' + file) {|f| a = f.readlines }
    a.delete_if {|line| not date_regex =~ line }
    z = a.findGE(start_date, date_regex, mm)
    aFinal = aFinal + a[z...a.length] if z
  end
end
tend = Time.now.to_f
puts "#{tend - t}"

Here is one file (the one with the bad line):

12-APR-98 21:59:59, aaaa,bbbb,cccc,dddd
30-JUL-99 20:05:35, cccc,ffff,gggg,hhhh
27-JAN-00 15:15:45, xxxx,ffff,cccc,dddd
28-FEB-01 12:30:20, zzzz,bbbb,dddd,gggg
31-DEC-03 23:59:59, xxxx,xxxxx,yyyyy,zzzzz
01-JAN-04 00:01:00, aaaa,bbbb,cccc,dddd
01-FEB-04 00:01:05, bbbb,cccc,dddd,xxxx
05-MAR-04 05:01:59, aaaa,bbbb,cccc,dddd
08-APR-04 05:15:35, aaaa,bbbb,cccc,xxxx
nnyy,aaa,bbb,ccc,xxx
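
For comparison, the same search can be written as a textbook lower bound (a
hypothetical helper, not a drop-in replacement for findGE; it assumes the
lines are sorted ascending and have already been filtered with delete_if, so
every line matches date_regex):

# returns the index of the first line whose timestamp is >= start_date,
# or lines.length if no line qualifies
def first_at_or_after(lines, start_date, date_regex, mm)
  lo, hi = 0, lines.length
  while lo < hi
    mid = (lo + hi) / 2
    date_regex =~ lines[mid]
    stamp = "#{$3 >= '87' ? '19' : '20'}#{$3}#{mm[$2]}#{$1}#{$4}#{$5}#{$6}"
    if stamp >= start_date
      hi = mid # mid qualifies; the answer is at mid or earlier
    else
      lo = mid + 1 # mid is too early; the answer is after mid
    end
  end
  lo
end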
 

Dave Burt

Thanks everyone for your input.

I'll post back in about a week when I've had a chance to try some of
this:
* csv.rb (may well make the program more legible - thanks gabriele renzi)
* making sure files aren't all opened and not closed (oops! thanks Nobu
Nakada)
* binary search (thanks Ernie)
* String#<< or interpolation to gather the match-bits (or maybe even string
arrays... thanks Denis et al.)
* strscan.rb (thanks Ara T. Howard)

I'm using the Windows package 1.8.1 (13), on a P4 running Win XP. The target
system, though, is an old crusty box, maybe a P1, running Windows NT.

Ara, for sample data, your generator does a pretty good job. Here are some
stats, in case you're interested:
* about 200 files (increasing very slowly; maybe 1 per year)
* roughly 1 record per file per hour
* oldest files are up to around 4 years old, and around 10MB
* that makes records about 300 bytes on average
* that makes about 35k records in those oldest files
* the records (comma-separated) consist of a date field (DD-MMM-YY HH:MM:SS)
and about 20 decimal fields
* my test runs are on about 10% of these files.
 
