brute force string search

M

Martin Pirker

Hi...

given: String of several Mb
problem: find the lines in String containing "xyz"


Idea 1:
String.scan(/.*xyz.*/) -> ~10s runtime

Idea 2:
String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)

Idea 3:
loop String.index("xyz",lastmatch+3)
loop results array and grep in match area for line
-> 0,5s


C extension the only faster option left? :)

Martin
 
B

Bertram Scharpf

Hi,

Am Montag, 10. Jan 2005, 09:51:22 +0900 schrieb Martin Pirker:
given: String of several Mb
problem: find the lines in String containing "xyz"


Idea 1:
String.scan(/.*xyz.*/) -> ~10s runtime

Idea 2:
String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)

Idea 3:
loop String.index("xyz",lastmatch+3)
loop results array and grep in match area for line
-> 0,5s

C extension the only faster option left? :)

What about this:

str = File.new( 'longfile').read
str.scan /.*/ do |x| puts x if x =~ /xyz/ end


Bertram
 
B

Bill Atkins

What about this:

str = File.new( 'longfile').read
str.scan /.*/ do |x| puts x if x =~ /xyz/ end

How is that any different from Idea 1?

To the OP, 0.5 seconds seems pretty fast to me. I doubt a C extension
would make much difference since index is already written in C and
probably already highly optimized.

Bill
 
C

Charles Mills

Martin said:
Hi...

given: String of several Mb
problem: find the lines in String containing "xyz"


Idea 1:
String.scan(/.*xyz.*/) -> ~10s runtime

Idea 2:
String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)

Idea 3:
loop String.index("xyz",lastmatch+3)
loop results array and grep in match area for line
-> 0,5s
Try the StringScanner library. It is very fast.
###
require 'strscan'

scanner = StringScanner.new(str, false)
match = scanner.scan(/.*xyz.*/)
###
you will have to change your regexp in order to skip lines not
containing 'xyz' since it always matches from the current position in
the string.
Docs: http://i.loveruby.net/en/prog/strscan.html
If you just want to count the number of lines you could use the regexp
look ahead and look behind instead of '.*'.
C extension the only faster option left? :)

It is an option, but it might not be much faster.

-Charlie
 
R

Robert Klemme

Martin Pirker said:
Hi...

given: String of several Mb
problem: find the lines in String containing "xyz"


Idea 1:
String.scan(/.*xyz.*/) -> ~10s runtime

Did you try the anchored version? Did it make a difference?
=> ["bxyz", "cxyz"]

Did you try the block form, i.e.,

s.scan(/^.*xyz.*$/) {|m| # work with m}
Idea 2:
String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)

Idea 3:
loop String.index("xyz",lastmatch+3)
loop results array and grep in match area for line
-> 0,5s


C extension the only faster option left? :)

Im not sure whether you will gain much, because certain things will have
to be done anyway: allocation of the result array, of the match strings
and expansion of the result array.

Kind regards

robert
 
R

Robert Klemme

Robert Klemme said:
Martin Pirker said:
Hi...

given: String of several Mb
problem: find the lines in String containing "xyz"


Idea 1:
String.scan(/.*xyz.*/) -> ~10s runtime

Did you try the anchored version? Did it make a difference?
=> ["bxyz", "cxyz"]

Did you try the block form, i.e.,

s.scan(/^.*xyz.*$/) {|m| # work with m}
Idea 2:
String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)

Idea 3:
loop String.index("xyz",lastmatch+3)
loop results array and grep in match area for line
-> 0,5s


C extension the only faster option left? :)

Im not sure whether you will gain much, because certain things will have
to be done anyway: allocation of the result array, of the match strings
and expansion of the result array.

Kind regards

robert

Here are other options:
s.inject([]){|ar,x| (x.chomp! ; ar << x) if /xyz/ =~ x; ar} => ["bxyz", "cxyz"]
s.inject([]){|ar,x| (x.chomp! ; ar << x) if x.include? "xyz"; ar}
=> ["bxyz", "cxyz"]

Cheers

robert
 
M

Michael C. Libby

Hi...

given: String of several Mb
problem: find the lines in String containing "xyz"


Idea 1:
String.scan(/.*xyz.*/) -> ~10s runtime

Idea 2:
String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)

Idea 3:
loop String.index("xyz",lastmatch+3)
loop results array and grep in match area for line
-> 0,5s

Did you try something like this?

irb(main):003:0> big_string.each do |x| puts x if x=~/xyz/ end
eeexyzfff
qqqqxyzqqq
=> "abc\neeexyzfff\naadfwe\nqqqqxyzqqq\n"
irb(main):004:0> big_string.split.grep(/xyz/)
=> ["eeexyzfff", "qqqqxyzqqq"]
irb(main):005:0>

Intuitively it seems to me that wildcards (especially unanchored) in REs
will slow them down.

-Michael
 
S

Stefan Lang

Michael said:
Did you try something like this?

irb(main):003:0> big_string.each do |x| puts x if x=~/xyz/ end

Perhaps faster with: ... if x.include? "xyz"
eeexyzfff
qqqqxyzqqq
=> "abc\neeexyzfff\naadfwe\nqqqqxyzqqq\n"
irb(main):004:0> big_string.split.grep(/xyz/)
=> ["eeexyzfff", "qqqqxyzqqq"]
irb(main):005:0>

Intuitively it seems to me that wildcards (especially unanchored) in REs
will slow them down.

-Michael
 
M

Martin Pirker

georgesawyer said:
"xyz".

This seems to be a real-world problem, and it seems the 'String of several
Mb' is relatively fixed. If so, here are some ideas.
[fascinating read]

This is why even when one finds a "good enough" solution it's still a good
idea to bounce the solution to other people - one never really knows
what useful comments come back :)


Slicing the data up gives an "overlap" problem.

After some thinking however it's possible to do some presorting of the
data, so a full search must be done at the beginning, but next findings
should be located in area around of last search hit.
e.g. Cutting the search from 100% of a String to specific 1% gives 100
fold improvement -> good :)

How to do that?
String.index accepts a offset to start from
String.rindex accepts a offset to end with.

So if e.g. I want to search 1Mb in the middle of a 100Mb String i need
a String.yetanotherindex("xyz",startoffset,endoffset), otherwise
performance sucks again.

Looking at string.c of Ruby, .index and .rindex appear to be
quite simple brute-force (that's why they are so fast)

I RFC adding an .index function to String which accepts 3 parameters:
(searchitem, startoffset, endoffset)
IMHO this would make an useful addition to stdlib shipped with Ruby.


Martin
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,575
Members
45,053
Latest member
billing-software

Latest Threads

Top