short regexp question

Fritzek · Sep 18, 2008

Hi folks

short question: how to use regexp the right way?
given is a="a[b]c"
I need b="a c"

tried to split using (/(\[|\])/) to get ["a", "[", "b", "]", "c"]

b=a.strip.split(/(\[|\])/)

and then joined the bits together. this only works in a simple case
like "a[b]c", but "[b]" could occur multiple times.

I need something like: search for any "[b]" and substitute it with a
blank.

Thanks in advance

Fritzek

David A. Black · Sep 18, 2008

Hi --

b = a.delete("")

David

--
Rails training from David A. Black and Ruby Power and Light:
Intro to Ruby on Rails January 12-15 Fort Lauderdale, FL
Advancing with Rails January 19-22 Fort Lauderdale, FL *
* Co-taught with Patrick Ewing!
See http://www.rubypal.com for details and updates!

Fritzek · Sep 18, 2008

Hi David

thanks for the quick answer. your code only works if you know "b". I only
know the surrounding brackets "[" and "]". The bit in between could be
anything. sorry, forgot to mention.

Fritzek

Brian Candler · Sep 18, 2008

Fritzek said:
I need something like search for any "[b]" and substitute with a
blank.

b = a.gsub(/\[b\]/,' ')

Brian Candler · Sep 18, 2008

Also possibly useful for you:

a = "aaa[b]bbb[b]ccc"
bits = a.split(/\[b\]/)
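A small sketch of how split behaves here, assuming the original string was "a[b]c" (the brackets appear to have been lost from the thread text). It shows why the first attempt returned the brackets as separate elements: a capture group in the pattern makes split keep the delimiters. The variable names are only for illustration.

```ruby
a = "a[b]c"

# Without a capture group, split drops the delimiters.
plain = a.split(/\[|\]/)           # ["a", "b", "c"]

# With a capture group, split also returns the matched delimiters.
with_delims = a.split(/(\[|\])/)   # ["a", "[", "b", "]", "c"]

# Splitting on the whole bracketed part and joining with a blank
# gives the same result as a gsub with the same pattern.
joined = "aaa[b]bbb[b]ccc".split(/\[b\]/).join(" ")   # "aaa bbb ccc"

p plain, with_delims, joined
```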

Fritzek · Sep 18, 2008

Hi Brian

thanks for your answer. as I stated to David, I just know about the
surrounding brackets not the bits between them.

Fritzek

Sebastian Hungerecker · Sep 18, 2008

Fritzek said:
short question: how to use regexp the right way?
given is a="a[b]c"
I need b="a c"

"a[b]c".gsub(/\[.*?\]/, " ")

HTH,
Sebastian
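Since the bracketed part can occur several times, here is a short sketch of why the quantifier needs to be lazy (.*?) rather than greedy (.*). The input string is made up for illustration and is not from the thread.

```ruby
# Hypothetical input with several bracketed parts.
text = "a[b]c[xyz]d"

# Lazy quantifier: each [...] group is matched and replaced on its own.
lazy = text.gsub(/\[.*?\]/, " ")

# Greedy quantifier: a single match runs from the first "[" to the last "]".
greedy = text.gsub(/\[.*\]/, " ")

puts lazy    # prints "a c d"
puts greedy  # prints "a d"
```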

Fritzek · Sep 18, 2008

Hi Sebastian

thanks for the solution. works perfect.

Fritzek

Robert Klemme · Sep 18, 2008

Sebastian Hungerecker said:
"a[b]c".gsub(/\[.*?\]/, " ")

Not sure whether it makes a difference performance-wise, but I am always
reluctant to use reluctant quantifiers. I'd rather do

irb(main):003:0> "a[b]c".gsub /\[[^\]]*\]/, ' '
=> "a c"

Kind regards

robert

Fritzek · Sep 19, 2008

Hi Robert

thanks for your objection, but could you shortly explain the
difference (for regexp dummies like me)?

Fritzek

Robert Klemme · Sep 19, 2008

2008/9/19 Fritzek said:
thanks for your objection, but could you shortly explain the
difference (for regexp dummies like me)?

Ideally you'd read "Mastering Regular Expressions", which explains such
topics very nicely.

I believe it is generally better to be more specific about what is to
match (mainly for robustness reasons). Also, with the reluctant
quantifier, the engine has to test the next sub pattern after every
character it consumes, OR it has to backtrack to find out whether there
is a shorter match. Neither seems very efficient. Granted, this is no
hard evidence, but if you are curious I suggest you do some benchmarks
and read the book; it's really good!

Kind regards

robert
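As one illustration of the robustness point, here is a minimal sketch comparing the two patterns from this thread. On single-line input they give the same result, but they differ when the bracketed part spans a line break, because "." does not match a newline in Ruby unless the /m flag is given. The sample strings are assumptions for illustration.

```ruby
lazy_re    = /\[.*?\]/      # reluctant quantifier
negated_re = /\[[^\]]*\]/   # negated character class

# On single-line input both patterns behave the same.
text = "a[b]c[d]e"
same = text.gsub(lazy_re, " ") == text.gsub(negated_re, " ")   # true

# "." stops at a newline by default, "[^\]]" does not.
multi = "a[b\nc]d"
lazy_multi = multi.gsub(lazy_re, " ")      # no match, string unchanged
neg_multi  = multi.gsub(negated_re, " ")   # "a d"

p same, lazy_multi, neg_multi
```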

Fritzek · Sep 19, 2008

Hi Robert

thanks for explanation and book hint. will search for it.
Fritzek

Tod Beardsley · Sep 19, 2008

I wrote a quickie benchmark. CPU speed and compile options will
certainly influence your results.

http://snippets.dzone.com/posts/show/6098

Also, best intro to regular expressions ever:

http://www.regular-expressions.info/tutorial.html

Robert Klemme · Sep 20, 2008

2008/9/19 Tod Beardsley said:
I wrote a quickie benchmark. CPU speed and compile options will
certainly influence your results.

http://snippets.dzone.com/posts/show/6098

Hm, it seems line 13 and 18 are identical. Where's the lazy quantifier?

Here's what I'd consider a better benchmark, as it covers the
scenarios I was talking about, especially with situations where there
is a second potential end point ("b" in this case):

robert@fussel /cygdrive/c/Temp
$ cat l.rb
#!/bin/env ruby

require 'benchmark'

REP = 1_000
LONG = 1_000

STRINGS = [
  ["short match", "ab"],
  ["short mismatch", "a"],
  ["long match", "a" * LONG + "b"],
  ["long mismatch", "a" * LONG],
  ["short match double", "abab"],
  ["long match double", "a" * LONG + "bb"],
  ["long match double long", "a" * LONG + "b" + "a" * LONG + "b"],
]

# The label column is as wide as the longest label plus the "lazy "/"neg " prefix.
Benchmark.bmbm(6 + STRINGS.inject(0) {|m,(a,b)| a.length > m ? a.length : m }) do |b|
  STRINGS.each do |label, str|
    # The long mismatch case is much slower, so it gets fewer repetitions.
    rep = /long mis/ =~ label ? 100 : 100_000

    # Negated character class
    b.report "neg " + label do
      rep.times { /a[^b]*b/ =~ str }
    end

    # Lazy (reluctant) quantifier
    b.report "lazy " + label do
      rep.times { /a.*?b/ =~ str }
    end
  end
end

robert@fussel /cygdrive/c/Temp
$ ./l.rb
Rehearsal ---------------------------------------------------------------
neg short match               0.282000   0.000000   0.282000 (  0.288000)
lazy short match              0.297000   0.000000   0.297000 (  0.284000)
neg short mismatch            0.328000   0.000000   0.328000 (  0.341000)
lazy short mismatch           0.375000   0.000000   0.375000 (  0.366000)
neg long match                9.531000   0.000000   9.531000 (  9.982000)
lazy long match              12.625000   0.000000  12.625000 ( 12.764000)
neg long mismatch             4.672000   0.000000   4.672000 (  4.742000)
lazy long mismatch            6.297000   0.000000   6.297000 (  6.422000)
neg short match double        0.297000   0.000000   0.297000 (  0.291000)
lazy short match double       0.281000   0.000000   0.281000 (  0.287000)
neg long match double         9.406000   0.000000   9.406000 (  9.443000)
lazy long match double       12.500000   0.000000  12.500000 ( 12.592000)
neg long match double long    9.516000   0.000000   9.516000 (  9.642000)
lazy long match double long  12.547000   0.000000  12.547000 ( 12.745000)
----------------------------------------------------- total: 78.954000sec

                                  user     system      total        real
neg short match               0.312000   0.000000   0.312000 (  0.305000)
lazy short match              0.297000   0.000000   0.297000 (  0.301000)
neg short mismatch            0.375000   0.000000   0.375000 (  0.388000)
lazy short mismatch           0.359000   0.000000   0.359000 (  0.356000)
neg long match                9.344000   0.000000   9.344000 (  9.637000)
lazy long match              12.547000   0.000000  12.547000 ( 12.777000)
neg long mismatch             4.703000   0.000000   4.703000 (  4.783000)
lazy long mismatch            6.219000   0.000000   6.219000 (  6.242000)
neg short match double        0.297000   0.000000   0.297000 (  0.301000)
lazy short match double       0.297000   0.000000   0.297000 (  0.297000)
neg long match double         9.453000   0.000000   9.453000 (  9.531000)
lazy long match double       12.718000   0.000000  12.718000 ( 13.566000)
neg long match double long    9.407000   0.000000   9.407000 (  9.442000)
lazy long match double long  12.500000   0.000000  12.500000 ( 12.777000)

robert@fussel /cygdrive/c/Temp

Notice how lazy is roughly a third slower for the longer strings, while
the short cases show no real difference.

Also, best intro to regular expressions ever:

http://www.regular-expressions.info/tutorial.html

Good ref!

Kind regards

robert

Tod Beardsley · Sep 22, 2008

Hm, it seems line 13 and 18 are identical. Where's the lazy quantifier?

grr curse my copy paste skills. fixed. thanks for paying attention,
Robert. Your bm test is, of course, much more useful.

Tod Beardsley · Sep 22, 2008

Anyway, I think the moral of this particular long-missing story is, if
you can regex test for smaller anchors first, you can then fail to
match much faster. IOW:

matched = false
if str.match(/b/)
  matched = true if str.match(/a[^b]*b/)
end
matched

Robert Klemme · Sep 22, 2008

2008/9/22 Tod Beardsley said:
Anyway, I think the moral of this particular long-missing story is, if
you can regex test for smaller anchors first, you can then fail to
match much faster. IOW:

matched =false
if str.match(/b/)
matched = true if str.match(/a[^b]*b/)
end
matched

I am not sure. This approach is likely slower than a single fast RX -
at least if you expect matches most of the time. It all depends...

Kind regards

robert
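To check which way it goes on a given machine, the two approaches can be put side by side. This is only a sketch: the method names two_step? and single? are made up for illustration, the string length and repetition count are arbitrary and much smaller than in the benchmark above, and results will depend on the Ruby version and its regex engine.

```ruby
require 'benchmark'

# Hypothetical helper: cheap pre-check for "b", then the full pattern.
def two_step?(str)
  return false unless str =~ /b/
  !!(str =~ /a[^b]*b/)
end

# Hypothetical helper: the full pattern only.
def single?(str)
  !!(str =~ /a[^b]*b/)
end

hit  = "a" * 500 + "b"   # pattern matches
miss = "a" * 500         # pattern cannot match

Benchmark.bm(14) do |bm|
  bm.report("two-step hit")  { 100.times { two_step?(hit) } }
  bm.report("single hit")    { 100.times { single?(hit) } }
  bm.report("two-step miss") { 100.times { two_step?(miss) } }
  bm.report("single miss")   { 100.times { single?(miss) } }
end
```

Both helpers return the same answer for any input; only the amount of work differs, so the comparison is about timing alone.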

Ezra Zygmuntowicz · Sep 22, 2008

Robert Klemme said:
I am not sure. This approach is likely slower than a single fast RX -
at least if you expect matches most of the time. It all depends...

Also keep in mind that =~ is generally a lot faster than .match since
match has to build the full MatchData object even if you do not use it.

Cheers-
-Ezra

Brian Candler · Sep 23, 2008

Ezra Zygmuntowicz said:
Also keep in mind that =~ is generally a lot faster than .match since
match has to build the full MatchData object even if you do not use it.

With =~ the MatchData can still be obtained from $~

Interestingly, not referencing the MatchData *does* give a big speed
improvement.

$ time ruby -e '5_000_000.times { /b/.match("abc") }'

real 0m28.699s
user 0m28.490s
sys 0m0.024s

$ time ruby -e '5_000_000.times { /b/ =~ "abc"; $~ }'

real 0m28.119s
user 0m27.910s
sys 0m0.024s

$ time ruby -e '5_000_000.times { /b/ =~ "abc" }'

real 0m14.311s
user 0m14.285s
sys 0m0.008s

$ ruby -v
ruby 1.8.6 (2008-03-03 patchlevel 114) [i686-linux]
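The three command lines above can also be run from a single script with the Benchmark library, which makes it easier to repeat the comparison on another Ruby version. This is a sketch with a much smaller iteration count than in the post (the constant N is an arbitrary choice); the timings on a current Ruby may look quite different from the 1.8.6 figures.

```ruby
require 'benchmark'

N = 200_000  # arbitrary, far fewer iterations than in the post

Benchmark.bm(10) do |bm|
  bm.report(".match")   { N.times { /b/.match("abc") } }
  bm.report("=~ + $~")  { N.times { /b/ =~ "abc"; $~ } }
  bm.report("=~ only")  { N.times { /b/ =~ "abc" } }
end

# =~ returns the match position; the MatchData is still reachable via $~.
pos = (/b/ =~ "abc")
md  = $~
p pos     # position of the match
p md[0]   # the matched text
```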
