Curious regexp behavior

D

Derek Lewis

On a whim, I just decided to try an experiment with regexps, to see how
they perform in two slightly different cases. I wanted to see how using
a single regexp object for many many evaluations performed compared to
using the regexp within the loop.

The scripts I wrote searched through a words file that is 234937 lines
long.

Here's the scripts I wrote, to clarify:
First one:

total = 0
File.open( 'words', 'r' ) { |file|
file.each_line { |line|
word = line.chomp
total +=1 if word =~ /[a-df-h][aeiou]{2}/
}
}
puts total

Second one:

rexp = /[a-df-h][aeiou]{2}/
total = 0
File.open( 'words', 'r' ) { |file|
file.each_line { |line|
word = line.chomp
total +=1 if word =~ rexp
}
}
puts total


I expected the second one to be slightly faster, but was surprised to
see that it was actually slightly slower. I ran each one about 10-15
times, and eyeballed an average. The results from each run after the
first were pretty consistant.

It's just a curiosity, but does anyone know what might cause them to be
'backwards' like that? :)

--
Derek Lewis

===================================================================
Java Web-Application Developer

Email : (e-mail address removed)
Cellular : 778.898.5825
Website : http://www.lewisd.com

"If you've got a 5000-line JSP page that has "all in one" support
for three input forms and four follow-up screens, all controlled
by "if" statements in scriptlets, well ... please don't show it
to me :). Its almost dinner time, and I don't want to lose my
appetite :)."
- Craig R. McClanahan
 
C

Charles Mills

Derek said:
On a whim, I just decided to try an experiment with regexps, to see how
they perform in two slightly different cases. I wanted to see how using
a single regexp object for many many evaluations performed compared to
using the regexp within the loop.

The scripts I wrote searched through a words file that is 234937 lines
long.

Here's the scripts I wrote, to clarify:
First one:

total = 0
File.open( 'words', 'r' ) { |file|
file.each_line { |line|
word = line.chomp
total +=1 if word =~ /[a-df-h][aeiou]{2}/
}
}
puts total

Second one:

rexp = /[a-df-h][aeiou]{2}/
total = 0
File.open( 'words', 'r' ) { |file|
file.each_line { |line|
word = line.chomp
total +=1 if word =~ rexp
}
}
puts total


I expected the second one to be slightly faster, but was surprised to
see that it was actually slightly slower. I ran each one about 10-15
times, and eyeballed an average. The results from each run after the
first were pretty consistant.

It's just a curiosity, but does anyone know what might cause them to be
'backwards' like that? :)
I'll wager a guess. In the first version Ruby knows that
'/[a-df-h][aeiou]{2}/' is a regexp. In the second one Ruby doesn't
know if 'rexp' is a variable or method, so it has to do 1 maybe 2 look
ups on every interation before it dispatches String#=~.
Also regexp's are immutable so Ruby doesn't allocate a new regexp on
every interation and storing the regexp has no effect in that regard.

-Charlie
 
E

Eric Hodel

--Apple-Mail-15--771973824
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII; format=flowed

First one:

total = 0
File.open( 'words', 'r' ) { |file|
file.each_line { |line|
word = line.chomp
total +=1 if word =~ /[a-df-h][aeiou]{2}/
^^^^ inline regexp (part of the AST)
}
}
puts total

Second one:

rexp = /[a-df-h][aeiou]{2}/
total = 0
File.open( 'words', 'r' ) { |file|
file.each_line { |line|
word = line.chomp
total +=1 if word =~ rexp ^^^^ variable lookup
}
}
puts total


I expected the second one to be slightly faster, but was surprised to
see that it was actually slightly slower. I ran each one about 10-15
times, and eyeballed an average. The results from each run after the
first were pretty consistant.

It's just a curiosity, but does anyone know what might cause them to be
'backwards' like that? :)

Inline regexps are much faster than a variable lookup then using the
methods on the Regexp object.

--
Eric Hodel - (e-mail address removed) - http://segment7.net
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04

--Apple-Mail-15--771973824
content-type: application/pgp-signature; x-mac-type=70674453;
name=PGP.sig
content-description: This is a digitally signed message part
content-disposition: inline; filename=PGP.sig
content-transfer-encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (Darwin)

iD8DBQFCEtoXMypVHHlsnwQRAh7aAJ46tcIOb0m2MDduBhOXkGpMgX5jTgCfYeqS
Go0t7KlyH3HliLg7xAtWbQE=
=aKt5
-----END PGP SIGNATURE-----

--Apple-Mail-15--771973824--
 
R

Ryan Davis

I expected the second one to be slightly faster, but was surprised to
see that it was actually slightly slower. I ran each one about 10-15
times, and eyeballed an average. The results from each run after the
first were pretty consistant.

It's just a curiosity, but does anyone know what might cause them to be
'backwards' like that? :)

Use ParseTree and you can see why!!!

<576> echo "a=/blah/; 's' =~ a" | parse_tree_show -f
(cut for readability)
[:lasgn, :a, [:lit, /blah/]],
[:call, [:str, "s"], :=~, [:array, [:lvar, :a]]]]]]]]
<577> echo "'s' =~ /blah/" | parse_tree_show -f
(cut for readability)
[:match3, [:lit, /blah/], [:str, "s"]]]]]]]

Basically, the inline regex avoids the lvar lookup and the call and
shoots straight into a match3 node. The lvar is probably not _that_
expensive, but method dispatch is not terribly cheap.
 
R

Robert Klemme

Derek Lewis said:
On a whim, I just decided to try an experiment with regexps, to see how
they perform in two slightly different cases. I wanted to see how using
a single regexp object for many many evaluations performed compared to
using the regexp within the loop.

The scripts I wrote searched through a words file that is 234937 lines
long.

Here's the scripts I wrote, to clarify:
First one:

total = 0
File.open( 'words', 'r' ) { |file|
file.each_line { |line|
word = line.chomp
total +=1 if word =~ /[a-df-h][aeiou]{2}/
}
}
puts total

Second one:

rexp = /[a-df-h][aeiou]{2}/
total = 0
File.open( 'words', 'r' ) { |file|
file.each_line { |line|
word = line.chomp
total +=1 if word =~ rexp
}
}
puts total


I expected the second one to be slightly faster, but was surprised to
see that it was actually slightly slower. I ran each one about 10-15
times, and eyeballed an average. The results from each run after the
first were pretty consistant.

It's just a curiosity, but does anyone know what might cause them to be
'backwards' like that? :)

Did you try the same with the matching reversed, i.e., "rexp =~ word"
instead of "word =~ rexp"? Did it make a difference?

Kind regards

robert
 
W

William Morgan

Excerpts from Ryan Davis's mail of 16 Feb 2005 (EST):
Use ParseTree and you can see why!!!

<576> echo "a=/blah/; 's' =~ a" | parse_tree_show -f
(cut for readability)
[:lasgn, :a, [:lit, /blah/]],
[:call, [:str, "s"], :=~, [:array, [:lvar, :a]]]]]]]]
<577> echo "'s' =~ /blah/" | parse_tree_show -f
(cut for readability)
[:match3, [:lit, /blah/], [:str, "s"]]]]]]]

Very nice answer.

Like the original poster, I found the behavior counterintuitive. Perhaps
this is because our assumptions come from the C model of the universe,
where more local variables is typically faster, and method dispatch is
not a problem.

I wonder what the merits of collecting equivalences like these to form
some kind of post-hoc parse-tree optimization would be. Probably not
great, but it might be fun.
 
D

Derek Lewis

Did you try the same with the matching reversed, i.e., "rexp =~ word"
instead of "word =~ rexp"? Did it make a difference?

Kind regards

robert

I did, actually, and it was very slightly faster. Still slower than an
inline regexp, however.

Thanks for the insightful answers, everyone. It quite interesting to
find out how your favorite programming language works inside.

--
Derek Lewis

===================================================================
Java Web-Application Developer

Email : (e-mail address removed)
Cellular : 778.898.5825
Website : http://www.lewisd.com

"If you've got a 5000-line JSP page that has "all in one" support
for three input forms and four follow-up screens, all controlled
by "if" statements in scriptlets, well ... please don't show it
to me :). Its almost dinner time, and I don't want to lose my
appetite :)."
- Craig R. McClanahan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

unsub 0
TF-IDF 1
A regexp? 2
Bug with fifos in Linux? 1
Ruby Weekly News 14th - 20th February 2005 4
My Boggle Solver: I welcome any tips and tricks! 3
Combining IOs and Strings? 1
regexp+hash problem 5

Members online

No members online now.

Forum statistics

Threads
473,756
Messages
2,569,535
Members
45,008
Latest member
obedient dusk

Latest Threads

Top