handling of regexp objects that aren't referenced by variables, arrays, tables or objects

ThomasW

Hi,

First of all, I have to say I'm relatively inexperienced with Ruby and
also new to regular expressions. This causes me some problems:

I'm parsing text files and am using a lot of regexps for this.
Initially I was doing something like this:

file.each_line { |line|
  if line =~ /^pattern[a]*/
    process_pattern_a(line)
  elsif line =~ /pat+e(rn)? b\s*$/
    process_pattern_b(line)
  # some more elsifs
  end
}

But this was really, really slow. My suspicion is that the regexp
objects are recreated and thrown away for every iteration. Storing
all patterns in a table and referencing them like

file.each_line { |line|
  if line =~ $line_patterns["pattern a"]
    process_pattern_a(line)
  elsif line =~ $line_patterns["pattern b"]
    process_pattern_b(line)
  # some more elsifs
  end
}

made things tremendously faster, but I'm not really keen on storing
every regular expression that occurs somewhere in my program in this
table or as a variable. This splits up code that I would like to have
at one place and can create variable clutter.[*]
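
For reference, the lookup-table approach can be sketched roughly as follows. This is only an illustration, not the actual program: the `keyword` variable, the `LINE_PATTERNS` constant and the sample input are hypothetical. The point is that every pattern, including one that uses interpolation, is built once before the loop starts.

```ruby
require "stringio"

# Hypothetical value that one pattern depends on; it is known before the loop.
keyword = "pattern"

# Build every regexp exactly once. The interpolation below runs here,
# at table-construction time, not once per line.
LINE_PATTERNS = {
  "pattern a" => /^#{keyword}[a]*/,
  "pattern b" => /pat+e(rn)? b\s*$/
}.freeze

# Stand-in for the real file.
file = StringIO.new("patterna\npatte b\nsomething else\n")

matches = []
file.each_line do |line|
  if line =~ LINE_PATTERNS["pattern a"]
    matches << :a
  elsif line =~ LINE_PATTERNS["pattern b"]
    matches << :b
  else
    matches << :none
  end
end

p matches
```

A constant is used here instead of a global only to keep the table next to the code that reads it; the lookup itself works the same way with `$line_patterns`.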

Is it the case that such "anonymous" objects like regexps (maybe also
strings?) are re-created whenever the code snippet they are defined in
is executed? If so, is there a convenient way of preventing this? Is
this only the case for regexps or also for strings and other objects?
(Why is it the case at all - I can't make any sense of it?) I would
like to learn how I can write Ruby code that is reasonably efficient
in this regard because the impact on execution time in the described
situation was so immense. (I'm currently using Ruby 1.9.1.)

Thanks!
Thomas W.


[*] I could perhaps also store the regexps and the functions to be
executed in a table, with the regexps as keys and the functions as
values, and iterate through it until a matching regexp key is found,
so that the function stored as its value can be executed. But
this is only possible in situations similar to the one described.
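
A minimal sketch of that footnote idea, assuming Ruby 1.9 or later (where hashes keep insertion order). The `HANDLERS` table, the `dispatch` method and the lambdas are hypothetical stand-ins for process_pattern_a and process_pattern_b:

```ruby
# Hypothetical handlers standing in for process_pattern_a / process_pattern_b.
HANDLERS = {
  /^pattern[a]*/     => ->(line) { "handled as a: #{line.chomp}" },
  /pat+e(rn)? b\s*$/ => ->(line) { "handled as b: #{line.chomp}" }
}.freeze

# Walk the table in insertion order and run the first handler whose
# regexp matches; return nil when nothing matches.
def dispatch(line)
  HANDLERS.each do |pattern, handler|
    return handler.call(line) if line =~ pattern
  end
  nil
end

puts dispatch("patterna\n")
puts dispatch("patte b\n")
p dispatch("something else\n")
```

Because the first matching key wins, the order of the entries plays the same role as the order of the elsif branches.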
 
Ehsanul Hoque

Is it the case that such "anonymous" objects like regexps (maybe also
strings?) are re-created whenever the code snippet they are defined in
is executed? If so, is there a convenient way of preventing this? Is
this only the case for regexps or also for strings and other objects?
(Why is it the case at all - I can't make any sense of it?) I would
like to learn how I can write Ruby code that is reasonably efficient
in this regard because the impact on execution time in the described
situation was so immense. (I'm currently using Ruby 1.9.1.)

Yes, indeed a new object is indeed created every time an anonymous object is
created. The only core object I know of for which this is not true is the
symbol, which is basically an immutable string. There may be others I'm not
aware of though. I suppose your code shows that there just might be a need
for the symbol equivalent of a regexp.
 
Thairuby ->a, b {a + b}

Is this OK? It still uses variables, though :(

file.each_line { |line|
  if line =~ (a ||= $line_patterns["pattern a"])
    process_pattern_a(line)
  elsif line =~ (b ||= $line_patterns["pattern b"])
    process_pattern_b(line)
  # some more elsifs
  end
}
 
Caleb Clausen

Yes, indeed a new object is indeed created every time an anonymous object is
created. The only core object I know of for which this is not true is the
symbol, which is basically an immutable string. There may be others I'm not
aware of though. I suppose your code shows that there just might be a need
for the symbol equivalent of a regexp.

Actually, I believe that regexp literals are created only once even if
they're executed multiple times. The exception to this would be when
you use #{} within a regexp... that forces ruby to not only create a
new object each time the regexp literal is executed, it has to
recompile the regexp each time.... and that is really slow. You can
bypass this behavior by using the o regexp option, but that only works
right if the value of the inclusion (what's inside #{}) is guaranteed
to be the same on each execution.

Thomas, are you using #{} within your regexps? If so, you should try
sticking an o on the end of each one; that will probably solve your
performance problem. For instance, write
  x =~ /foo#{bar}/o
instead of
  x =~ /foo#{bar}/
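
To make the difference visible, here is a small sketch that compares object identity for the three kinds of literal. The method names are made up for illustration, and the identity results describe what I would expect on MRI rather than something the language guarantees. It also shows the caveat with the o option: once the pattern has been built, later values of the interpolated expression are ignored.

```ruby
# A literal without interpolation.
def plain_literal
  /foo\d+/
end

# Interpolation without /o: the pattern is rebuilt on every call.
def rebuilt_each_time(suffix)
  /foo#{suffix}/
end

# Interpolation with /o: the pattern is built on the first call only.
def built_once(suffix)
  /foo#{suffix}/o
end

same_plain   = plain_literal.equal?(plain_literal)
same_rebuilt = rebuilt_each_time("bar").equal?(rebuilt_each_time("bar"))

first  = built_once("bar")
second = built_once("baz") # the new suffix is ignored because of /o

puts "plain literal reused:        #{same_plain}"
puts "interpolated literal reused: #{same_rebuilt}"
puts "with /o, second call gives:  #{second.inspect}"
```

So /o is only safe when the interpolated value really is the same on every execution; otherwise the stale pattern silently keeps being used.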
 
ThomasW

I think that's not quite what I meant. Of course, if I define the
same regular expression twice at different places, there would be two
regexp objects.

Actually, I believe that regexp literals are created only once even if
they're executed multiple times. The exception to this would be when
you use #{} within a regexp... that forces ruby to not only create a
new object each time the regexp literal is executed, it has to
recompile the regexp each time.... and that is really slow. You can
bypass this behavior by using the o regexp option, but that only works
right if the value of the inclusion (what's inside #{}) is guaranteed
to be the same on each execution.


Thanks so much! Your suspicion was right, I am indeed using #{} in
some of the regular expressions, and the o option does fix the issue.
And your explanation why the expressions would otherwise be recompiled
in every iteration is now very obvious to me.

Now my code is already a bit shorter :)!

Thomas W.
 
Gary Wright

I'm parsing text files and am using a lot of regexps for this.
Initially I was doing something like this:

file.each_line { |line|
  if line =~ /^pattern[a]*/
    process_pattern_a(line)
  elsif line =~ /pat+e(rn)? b\s*$/
    process_pattern_b(line)
  # some more elsifs
  end
}

This example is perfect for Ruby's case statement:

file.each_line { |line|
  case line
  when /^pattern[a]*/o
    process_pattern_a(line)
  when /pat+e(rn)? b\s*$/o
    process_pattern_b(line)
  # more when clauses
  else
    # handle no match
  end
}



Gary Wright
 
Thairuby ->a, b {a + b}

Thairuby said:
Is this OK? It still uses variables, though :(

file.each_line { |line|
  if line =~ (a ||= $line_patterns["pattern a"])
    process_pattern_a(line)
  elsif line =~ (b ||= $line_patterns["pattern b"])
    process_pattern_b(line)
  # some more elsifs
  end
}

I mistyped. It should be:

file.each_line { |line|
  if line =~ (a ||= /^pattern[a]*/)
    process_pattern_a(line)
  elsif line =~ (b ||= /pat+e(rn)? b\s*$/)
    process_pattern_b(line)
  # some more elsifs
  end
}

Is there an o option for strings? :)
 
ThomasW

I'm parsing text files and am using a lot of regexps for this.
Initially I was doing something like this:
file.each_line { |line|
  if line =~ /^pattern[a]*/
    process_pattern_a(line)
  elsif line =~ /pat+e(rn)? b\s*$/
    process_pattern_b(line)
  # some more elsifs
  end
}

This example is perfect for Ruby's case statement:

file.each_line { |line|
  case line
  when /^pattern[a]*/o
    process_pattern_a(line)
  when /pat+e(rn)? b\s*$/o
    process_pattern_b(line)
  # more when clauses
  else
    # handle no match
  end
}

Gary Wright

Thanks for that tip. I wasn't aware that this also works with regexp
matches. It's great that it does! By the way, is there anything
substantially different from an elsif chain, except for being slightly
less typing?

Thomas W.
 
Gary Wright

Thanks for that tip. I wasn't aware that this also works with regexp
matches. It's great that it does! By the way, is there anything
substantially different from an elsif chain, except for being slightly
less typing?

The semantics are the same in this case but I think the
case statement highlights the fact that you are doing a
sequence of matches against a single object, whereas the
standard if/then/else is a more general construct.
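
As a rough illustration of that equivalence, the two forms can be put side by side. The `classify_with_elsif` and `classify_with_case` methods and the symbols they return are hypothetical; they stand in for the calls to process_pattern_a and process_pattern_b. Each `when` clause tests the line with the regexp's `===` method, which matches the same way `=~` does here.

```ruby
# elsif chain: each branch names the object being tested.
def classify_with_elsif(line)
  if line =~ /^pattern[a]*/
    :a
  elsif line =~ /pat+e(rn)? b\s*$/
    :b
  else
    :none
  end
end

# case statement: the object is named once, each when clause is tried in order.
def classify_with_case(line)
  case line
  when /^pattern[a]*/ then :a
  when /pat+e(rn)? b\s*$/ then :b
  else :none
  end
end

["patterna", "patte b", "something else"].each do |line|
  puts "#{line.inspect}: elsif=#{classify_with_elsif(line)} case=#{classify_with_case(line)}"
end
```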

Gary Wright
 
Josh Cheek

Is this OK? It still uses variables, though :(

file.each_line { |line|
  if line =~ (a ||= $line_patterns["pattern a"])
    process_pattern_a(line)
  elsif line =~ (b ||= $line_patterns["pattern b"])
    process_pattern_b(line)
  # some more elsifs
  end
}

I mistyped. It should be:

file.each_line { |line|
  if line =~ (a ||= /^pattern[a]*/)
    process_pattern_a(line)
  elsif line =~ (b ||= /pat+e(rn)? b\s*$/)
    process_pattern_b(line)
  # some more elsifs
  end
}

Is there an o option for strings? :)

Unfortunately, I don't think this does anything, because a and b are
declared within the block, so while the scope is the same, the extent is
not. Essentially, a and b are no longer bound after each iteration of the
loop, so upon entering the next iteration they do not retain their
previously assigned values.

This can be illustrated:

"patterna\npatte b".each_line do |line|

  p line

  puts "defined?(a) => #{defined?(a).inspect}"
  puts "defined?(b) => #{defined?(b).inspect}"

  if line =~ (a ||= /^pattern[a]*/)
  elsif line =~ (b ||= /pat+e(rn)? b\s*$/)
  else
  end

  puts "defined?(a) => #{defined?(a).inspect}"
  puts "defined?(b) => #{defined?(b).inspect}" , ''

end
__END__

Which has the following output:
"patterna\n"
defined?(a) => nil
defined?(b) => nil
defined?(a) => "local-variable(in-block)"
defined?(b) => "local-variable(in-block)"

"patte b"
defined?(a) => nil
defined?(b) => nil
defined?(a) => "local-variable(in-block)"
defined?(b) => "local-variable(in-block)"

You can see that a and b were defined after the if statement for "patterna",
but were no longer defined before the if statement for "patte b".
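
A complementary check looks at the values themselves instead of at `defined?`. This is only a sketch: the `inside`, `outside` and `cached` names are made up, and `Object.new` is used in place of a regexp literal because a plain regexp literal may evaluate to the same object on every pass anyway, which would hide the effect.

```ruby
text = "patterna\npatte b\n"

# Variable first assigned inside the block: it starts out as nil on every
# iteration, so ||= runs its right-hand side each time.
inside = []
text.each_line do |_line|
  a ||= Object.new
  inside << a
end

# Variable declared before the block: the block closes over it,
# so the value assigned in the first iteration survives.
outside = []
cached = nil
text.each_line do |_line|
  cached ||= Object.new
  outside << cached
end

puts "inside the block:  same object? #{inside[0].equal?(inside[1])}"
puts "outside the block: same object? #{outside[0].equal?(outside[1])}"
```

So ||= only caches across iterations when the variable lives outside the block, which brings back the extra variable the original question wanted to avoid.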
 
Thairuby ->a, b {a + b}

Oh, I forgot the scope of local variable :(
Thank you very much for your explanation.
 
