short regexp question

Fritzek

Hi folks

Short question on how to use regexps the right way.
Given is a="a[b]c"
I need b="a c"

I tried to split using (/(\[|\])/) to get ["a", "[", "b", "]", "c"]

b=a.strip.split(/(\[|\])/)

and then joined the bits together. This only works in a simple case
like "a[b]c", but "[b]" could occur multiple times.

I need something like: search for any "[b]" and substitute it with a
blank.

Thanks in advance

Fritzek
 
David A. Black

Hi --


b = a.delete("")


David

--
Rails training from David A. Black and Ruby Power and Light:
Intro to Ruby on Rails January 12-15 Fort Lauderdale, FL
Advancing with Rails January 19-22 Fort Lauderdale, FL *
* Co-taught with Patrick Ewing!
See http://www.rubypal.com for details and updates!
 
Fritzek

Hi David

Thanks for the quick answer. Your code only works if you know "b". I only
know the surrounding brackets "[" and "]"; the bit in between could be
anything. Sorry, I forgot to mention that.

Fritzek

 
Brian Candler

b = a.gsub(/\[b\]/,' ')

Also possibly useful for you:

a = "aaa[b]bbb[b]ccc"
bits = a.split(/\[b\]/)
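
For readers following along, here is a minimal sketch of what those two calls do when the tag is known in advance. The sample string and the constant names are assumptions made up for illustration, not taken from the thread.

```ruby
# Hypothetical sample input; the literal tag "[b]" is known in advance here.
SAMPLE = "aaa[b]bbb[b]ccc"

# gsub replaces every occurrence of the literal "[b]" with a single blank.
REPLACED = SAMPLE.gsub(/\[b\]/, ' ')

# split cuts the string at every "[b]" and drops the separators.
BITS = SAMPLE.split(/\[b\]/)

puts REPLACED   # prints: aaa bbb ccc
p BITS          # prints: ["aaa", "bbb", "ccc"]
```

Both patterns escape the brackets, since an unescaped "[b]" would be a character class matching a single "b".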
 
Fritzek

Hi Brian

thanks for your answer. as I stated to David, I just know about the
surrounding brackets not the bits between them.

Fritzek

 
Robert Klemme

Hi Sebastian

thanks for the solution. works perfect.

Fritzek

Fritzek said:
short question how to use regexp the right way
given is a="a[b]c"
I need b="a c"

"a[b]c".gsub(/\[.*?\]/, " ")


Not sure whether it makes a difference performance-wise, but I am always
reluctant to use reluctant quantifiers. I'd rather do

irb(main):003:0> "a[b]c".gsub /\[[^\]]*\]/, ' '
=> "a c"
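
As a rough illustration of the two patterns when the bracketed part is unknown and occurs more than once, the sketch below runs both on a made-up input. The input string, the constant names and the helper blank_out are hypothetical; a greedy variant is included only to show why some restriction on the middle part is needed.

```ruby
# Hypothetical input with two bracketed parts of unknown content.
INPUT = "a[b]c[anything]d"

LAZY    = /\[.*?\]/      # reluctant quantifier: stop at the first "]"
NEGATED = /\[[^\]]*\]/   # negated class: anything except "]", then "]"
GREEDY  = /\[.*\]/       # greedy: runs on to the last "]" in the string

# Replace every match of the given pattern with a single blank.
def blank_out(str, pattern)
  str.gsub(pattern, ' ')
end

puts blank_out(INPUT, LAZY)     # prints: a c d
puts blank_out(INPUT, NEGATED)  # prints: a c d
puts blank_out(INPUT, GREEDY)   # prints: a d
```

The lazy and negated-class patterns give the same result here; the greedy one swallows everything between the first "[" and the last "]".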

Kind regards

robert
 
Fritzek

Hi Robert

thanks for your objection, but could you briefly explain the
difference (for regexp dummies like me)?

Fritzek

 
Robert Klemme

2008/9/19 Fritzek said:
thanks for your objection, but could you briefly explain the
difference (for regexp dummies like me)?

Ideally you read "Mastering Regular Expressions" which explains such
topics very nicely.

I believe it is generally better to be more specific about what is to
match (mainly for robustness reasons). Also, with the reluctant
quantifier, for every character in the input a match against the next
sub-pattern needs to be tested, OR there needs to be backtracking to
find out whether there is a shorter match afterwards. Neither seems
very efficient. Granted, this is no hard evidence, but if you are
curious I suggest you do some benchmarks and read the book; it's
really good!
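
One concrete case where being specific helps, sketched under the assumption that the bracketed part may contain a line break: in Ruby, "." does not match a newline unless the /m flag is given, while a negated character class does. The sample string and constant names are made up for illustration.

```ruby
# Hypothetical input whose bracketed part spans two lines.
MULTILINE = "a[x\ny]c"

# "." does not match "\n" by default, so the lazy pattern finds nothing
# and the string comes back unchanged.
LAZY_RESULT = MULTILINE.gsub(/\[.*?\]/, ' ')

# The negated class matches any character except "]", newlines included.
NEGATED_RESULT = MULTILINE.gsub(/\[[^\]]*\]/, ' ')

# With the /m flag, "." also matches "\n" and the lazy pattern works too.
LAZY_M_RESULT = MULTILINE.gsub(/\[.*?\]/m, ' ')

p LAZY_RESULT      # prints: "a[x\ny]c"
p NEGATED_RESULT   # prints: "a c"
p LAZY_M_RESULT    # prints: "a c"
```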

Kind regards

robert
 
Robert Klemme

2008/9/19 Tod Beardsley said:
I wrote a quickie benchmark. CPU speed and compile options will
certainly influence your results.

http://snippets.dzone.com/posts/show/6098

Hm, it seems line 13 and 18 are identical. Where's the lazy quantifier?

Here's what I'd consider a better benchmark, as it covers the
scenarios I was talking about, especially with situations where there
is a second potential end point ("b" in this case):

robert@fussel /cygdrive/c/Temp
$ cat l.rb
#!/bin/env ruby

require 'benchmark'

REP = 1_000
LONG = 1_000

STRINGS = [
  ["short match", "ab"],
  ["short mismatch", "a"],
  ["long match", "a" * LONG + "b"],
  ["long mismatch", "a" * LONG],
  ["short match double", "abab"],
  ["long match double", "a" * LONG + "bb"],
  ["long match double long", "a" * LONG + "b" + "a" * LONG + "b"],
]

Benchmark.bmbm(6 + STRINGS.inject(0) {|m,(a,b)| a.length > m ? a.length : m }) do |b|
  STRINGS.each do |label, str|
    rep = /long mis/ =~ label ? 100 : 100_000

    b.report "neg " + label do
      rep.times { /a[^b]*b/ =~ str }
    end

    b.report "lazy " + label do
      rep.times { /a.*?b/ =~ str }
    end
  end
end
end

robert@fussel /cygdrive/c/Temp
$ ./l.rb
Rehearsal ---------------------------------------------------------------
neg short match               0.282000   0.000000   0.282000 (  0.288000)
lazy short match              0.297000   0.000000   0.297000 (  0.284000)
neg short mismatch            0.328000   0.000000   0.328000 (  0.341000)
lazy short mismatch           0.375000   0.000000   0.375000 (  0.366000)
neg long match                9.531000   0.000000   9.531000 (  9.982000)
lazy long match              12.625000   0.000000  12.625000 ( 12.764000)
neg long mismatch             4.672000   0.000000   4.672000 (  4.742000)
lazy long mismatch            6.297000   0.000000   6.297000 (  6.422000)
neg short match double        0.297000   0.000000   0.297000 (  0.291000)
lazy short match double       0.281000   0.000000   0.281000 (  0.287000)
neg long match double         9.406000   0.000000   9.406000 (  9.443000)
lazy long match double       12.500000   0.000000  12.500000 ( 12.592000)
neg long match double long    9.516000   0.000000   9.516000 (  9.642000)
lazy long match double long  12.547000   0.000000  12.547000 ( 12.745000)
----------------------------------------------------- total: 78.954000sec

                                  user     system      total        real
neg short match               0.312000   0.000000   0.312000 (  0.305000)
lazy short match              0.297000   0.000000   0.297000 (  0.301000)
neg short mismatch            0.375000   0.000000   0.375000 (  0.388000)
lazy short mismatch           0.359000   0.000000   0.359000 (  0.356000)
neg long match                9.344000   0.000000   9.344000 (  9.637000)
lazy long match              12.547000   0.000000  12.547000 ( 12.777000)
neg long mismatch             4.703000   0.000000   4.703000 (  4.783000)
lazy long mismatch            6.219000   0.000000   6.219000 (  6.242000)
neg short match double        0.297000   0.000000   0.297000 (  0.301000)
lazy short match double       0.297000   0.000000   0.297000 (  0.297000)
neg long match double         9.453000   0.000000   9.453000 (  9.531000)
lazy long match double       12.718000   0.000000  12.718000 ( 13.566000)
neg long match double long    9.407000   0.000000   9.407000 (  9.442000)
lazy long match double long  12.500000   0.000000  12.500000 ( 12.777000)

robert@fussel /cygdrive/c/Temp

Notice how lazy is roughly 30-35% slower for the longer strings, while
the short cases are practically identical.
Also, best intro to regular expressions ever:

http://www.regular-expressions.info/tutorial.html

Good ref!

Kind regards

robert
 
Tod Beardsley

Hm, it seems line 13 and 18 are identical. Where's the lazy quantifier?

grr curse my copy paste skills. fixed. thanks for paying attention,
Robert. Your bm test is, of course, much more useful.
 
Tod Beardsley

Anyway, I think the moral of this particular long-missing story is, if
you can regex test for smaller anchors first, you can then fail to
match much faster. IOW:

matched = false
if str.match(/b/)
  matched = true if str.match(/a[^b]*b/)
end
matched
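
The same idea can be wrapped in a small method, sketched below. The method name guarded_match? is made up for illustration, and =~ is used instead of .match only to keep the return value a plain boolean; whether the pre-check actually pays off would need measuring.

```ruby
# Hypothetical helper: cheap test for the end anchor "b" first, then the
# full pattern only if that test succeeds.
def guarded_match?(str)
  # Fail fast: without any "b" the full pattern cannot match.
  return false unless str =~ /b/

  # =~ returns an index or nil; !! turns that into true/false.
  !!(str =~ /a[^b]*b/)
end

p guarded_match?("ab")        # prints: true
p guarded_match?("a" * 1000)  # prints: false (rejected by the pre-check)
p guarded_match?("b")         # prints: false (has "b" but no "a" before it)
```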
 
Robert Klemme

2008/9/22 Tod Beardsley said:
Anyway, I think the moral of this particular long-missing story is, if
you can regex test for smaller anchors first, you can then fail to
match much faster. IOW:

matched = false
if str.match(/b/)
  matched = true if str.match(/a[^b]*b/)
end
matched

I am not sure. This approach is likely slower than a single fast RX -
at least if you expect matches most of the time. It all depends...

Kind regards

robert
 
Ezra Zygmuntowicz


Also keep in mind that =~ is generally a lot faster than .match since
match has to build the full MatchData object even if you do not use it.

Cheers-
-Ezra
 
Brian Candler

Also keep in mind that =~ is generally a lot faster than .match since
match has to build the full MatchData object even if you do not use it.

With =~ the MatchData can still be obtained from $~

Interestingly, not referencing the MatchData *does* give a big speed
improvement.

$ time ruby -e '5_000_000.times { /b/.match("abc") }'

real 0m28.699s
user 0m28.490s
sys 0m0.024s

$ time ruby -e '5_000_000.times { /b/ =~ "abc"; $~ }'

real 0m28.119s
user 0m27.910s
sys 0m0.024s

$ time ruby -e '5_000_000.times { /b/ =~ "abc" }'

real 0m14.311s
user 0m14.285s
sys 0m0.008s

$ ruby -v
ruby 1.8.6 (2008-03-03 patchlevel 114) [i686-linux]
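
For anyone repeating this on a current Ruby, here is a minimal timing sketch using the Benchmark module instead of the shell's time. It assumes Ruby 2.4 or later, because it also includes Regexp#match?, which did not exist in the 1.8.6 used above; match? returns only true or false and does not set $~. The iteration count and constant names are arbitrary choices for illustration, and no particular ratio between the three is implied.

```ruby
require 'benchmark'

N   = 200_000   # arbitrary iteration count, far smaller than the runs above
STR = "abc"
RE  = /b/

# Time each way of testing for a match; only the wall-clock time is kept.
RESULTS = {
  "match"  => Benchmark.realtime { N.times { RE.match(STR) } },   # builds MatchData
  "=~"     => Benchmark.realtime { N.times { RE =~ STR } },       # sets $~, returns index
  "match?" => Benchmark.realtime { N.times { RE.match?(STR) } },  # boolean only, Ruby >= 2.4
}

RESULTS.each { |label, secs| printf("%-7s %.4fs\n", label, secs) }
```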
 
