regex, number of matches

Dr.Ruud · Sep 23, 2005

#!/usr/local/bin/perl -wC

use strict;

my @text;
$text[1] = "xxx xx xxx xxx";
$text[2] = "yyy yyyy yyy yyy yyy";

my ($chars, $words, $lines) = wc(@text);

print "Chars: $chars\n";
print "Words: $words\n";
print "Lines: $lines\n";

sub wc {
my @ret;
for (@_) {
if (defined) {
$ret[0] += length; # chars
$ret[1] += () = /\S+/g; # words
$ret[2] += 1; # lines
}
}
return @ret;
}

What I found hard to get, is the role of the '()' in the wc-words-line:

$ret[1] += () = /\S+/g; # words

After a while, I understood it as an anonymous array that is filled with
the matches, after which its length is used to increase the words-count.

The creating and filling of () seemed like a waste of cpu-cycles, so I
tried to find another way of counting the number of matches.

Destructive variant:

$ret[1] += s/\S+//g; # words

I settled for:

$ret[1] += 1 while /\S+/g; # words

Is there a better/nicer/smarter/directer way to return the number of
matches from a regex?

See also http://dev.perl.org/perl6/rfc/110.html

Brian Wakem · Sep 23, 2005

Dr.Ruud wrote:

$ret[1] += () = /\S+/g; # words

Destructive variant:

$ret[1] += s/\S+//g; # words

I settled for:

$ret[1] += 1 while /\S+/g; # words

Is there a better/nicer/smarter/directer way to return the number of
matches from a regex?

This was covered a few week ago in a thread titled 'Space (\s) count
problem'.

The fastest way is to substitute a match with itself.

I fired off an email to (e-mail address removed) with this suggestion
but it bounced.

John Bokma · Sep 23, 2005

Dr.Ruud said:
What I found hard to get, is the role of the '()' in the wc-words-line:

$ret[1] += () = /\S+/g; # words

it forces list context

After a while, I understood it as an anonymous array that is filled with
the matches, after which its length is used to increase the words-count.

The creating and filling of () seemed like a waste of cpu-cycles,

Maybe it's optimized away?

so I
tried to find another way of counting the number of matches.

Destructive variant:

$ret[1] += s/\S+//g; # words

I settled for:

$ret[1] += 1 while /\S+/g; # words

Did you benchmark those?

John W. Krahn · Sep 23, 2005

Dr.Ruud said:
What I found hard to get, is the role of the '()' in the wc-words-line:

$ret[1] += () = /\S+/g; # words

After a while, I understood it as an anonymous array that is filled with
the matches, after which its length is used to increase the words-count.

If it had been an anonymous array it would have been:

$ret[1] += @{[ /\S+/g ]}; # words

John

Dr.Ruud · Sep 23, 2005

John Bokma schreef:

Dr.Ruud:

What I found hard to get, is the role of the '()' in
$ret[1] += () = /\S+/g; # words

Click to expand...

it forces list context

Yes, I'm starting to get that.

Maybe it's optimized away?

Well, maybe compare it to:

$ret[1] += (@tmp = /\S+/g); # words

but if tmp is not used afterwards, that also can be optimized.

Destructive variant:

$ret[1] += s/\S+//g; # words

I settled for:

$ret[1] += 1 while /\S+/g; # words

Click to expand...

Did you benchmark those?

No, I'm not in a hurry (yet).

A new option for scalar mode would be the cleanest:

$ret[1] += /\S+/t; # words

Dr.Ruud · Sep 23, 2005

Brian Wakem:

The fastest way is to substitute a match with itself.

OK. I had tested that, but I hated the looks, because it doesn't explain
itself enough.

The notion that the 'fake substitution' operates 'at the C level' is
very convincing.

Did you also benchmark "s!\Q$kw\E!$&!g" ?

(You pay a price using &, see man perlre: $& is not so costly.)

John Bokma · Sep 23, 2005

Dr.Ruud said:
John Bokma schreef:

No, I'm not in a hurry (yet).

Problem with benchmarking is what today seems to be a bad choice, can
be a better choice tomorrow. An example: map in a void context was some
time ago an expensive operation. IIRC it has been optimized (note that I
don't say that "we" should all use map in a void context now

)

A new option for scalar mode would be the cleanest:

$ret[1] += /\S+/t; # words

and t means? (Tellen :-D). The questions are: how often is this going to be
used, and; is a new option really needed, or can we get away with what's
available now and some documentation?

Dr.Ruud · Sep 24, 2005

Abigail:

I was surprised the while variant was the clear winner -
assigning to an empty list is hardly faster than assigning to an
array.

That is what I expected.

And the code:

Thanks for that too.

How about:

sub2 => '$sub2 = 0; $sub2 += s/\S+/$&/g for @data;'

And how about the o-option (pre-compile), or doesn't that go with g?

Dr.Ruud · Sep 24, 2005

Brian Wakem:

Dr.Ruud:

$ret[1] += () = /\S+/g; # words

Destructive variant:

$ret[1] += s/\S+//g; # words

I settled for:

$ret[1] += 1 while /\S+/g; # words

Is there a better/nicer/smarter/directer way to return the number of
matches from a regex?

Click to expand...

This was covered a few week ago in a thread titled 'Space (\s) count
problem'.

The fastest way is to substitute a match with itself.

As Abigail showed, there will be a difference between

(1) s/$kw/$kw/g (add \Q and \E where needed)

and

(2) s/\S+/$&/g

and

(3) s/\S+//g

The loops of both (1) and (3) are more 'constant' so will need less
cycles than (2).
(is my guess)

John Bokma · Sep 24, 2005

Dr.Ruud said:
sub2 => '$sub2 = 0; $sub2 += s/\S+/$&/g for @data;'

And how about the o-option (pre-compile), or doesn't that go with g?

o is IIRC only useful in some rare cases when you use a variabele in the
regexp. Since your s/// is constant, I think it's compiled, optimized, etc.
at the compile stage of your script, but again IIRC.

Dr.Ruud · Sep 24, 2005

John Bokma:

Dr.Ruud:

A new option for scalar mode would be the cleanest:

$ret[1] += /\S+/t; # words

Click to expand...

and t means? (Tellen :-D).

s/t/=/ if parseable

Dr.Ruud · Sep 24, 2005

Abigail schreef:

[benchmark]
And the code:

There is a problem with the benchmark, because perlre(1) says:

WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in
the program, it has to provide them for every pattern match. This may
substantially slow your program. Perl uses the same mechanism to pro-
duce $1, $2, etc, so you also pay a price for each pattern that con-
tains capturing parentheses. (To avoid this cost while retaining the
grouping behaviour, use the extended regular expression "(?: ... )"
instead.) But if you never use $&, $` or $', then patterns without
capturing parentheses will not be penalized. So avoid $&, $', and $`
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price. As of 5.005, $& is not so costly as the other
two.

Dr.Ruud · Sep 24, 2005

Abigail schreef:

Dr.Ruud:

{} As Abigail showed, there will be a difference between
{}
{} (1) s/$kw/$kw/g (add \Q and \E where needed)
{}
{} and
{}
{} (2) s/\S+/$&/g
{}
{} and
{}
{} (3) s/\S+//g
{}
{}
{} The loops of both (1) and (3) are more 'constant' so will need
less {} cycles than (2).

I did not show that,

You're right. I first had 2 cases in there, and than added one and
changed another and left the top line as is.

and I do not know what you mean by (1) and (3)
being "more constant" and hence needing less cycles.

How little the regex changes for each iteration.

The "$&" part varies in each iteration, but an optimization for that is
feasible, since the search- and the replace-part can be detected as
equal. If that optimization kicks in with (2), than (1) < (2) < (3) in
"what needs to be done".

Case (3) is very much like (2), only the input gets changed, so that
will be the slowest.

Case (1) deals with a different situation: a keyword-count in stead of a
word-count.

Dr.Ruud · Sep 24, 2005

Abigail schreef:

Dr.Ruud:

There is a problem with the benchmark, because [of what] perlre(1)

Click to expand...

says

I'm fully aware of the penalty associated with $& and friends.
But why does that cause a problem with the benchmark?

My mistake again. I wrongly read the text about the "price for each
pattern that contains capturing parentheses" as also penalizing other
pattern matches.

I find it hard to think of a reason why the first use of $& should harm
all other pattern matches. And then why ()/$1 doesn't. Because they can
be handled in about the same way. I haven't looked into the Perl-source
yet, this is as good a reason as any.

I like to see the benchmark without the ()/$1 test, to see if the
remaining cases behave about the same. That might need absolute scores
next to the percentages, so running on a system that doesn't do much
else. I'll try to arrange that later today or tomorrow.

Tad McClellan · Sep 24, 2005

Dr.Ruud said:
Abigail schreef:

How little the regex changes for each iteration.

The regex *never* changes for (2) and (3), so surely they
must be "more constant"?

The "$&" part varies in each iteration,

The "$&" part is not in the regular expression portion of s///,
it is in the replacement (double-quotish) _string_ portion.

Tad McClellan · Sep 24, 2005

Dr.Ruud said:
I find it hard to think of a reason why the first use of $& should harm
all other pattern matches.

Because a whole bunch of characters must be stored for
every (successful) pattern match.

And then why ()/$1 doesn't.

Because a whole bunch of characters must be stored only for
the (successful) pattern matches that explicitly mention them.

One makes a lot of work for every pattern match, the other makes a
lot of work for only some pattern matches.

Optimizing away for "every" has to be a bigger win than opitimizing
away only for "some".

Dr.Ruud · Sep 25, 2005

Tad McClellan:

Dr.Ruud:

Because a whole bunch of characters must be stored for
every (successful) pattern match.

In the circumstances we were discussing, an offset plus length on the
input buffer is all you need there (and those are known values).

I think that the alternatives "s/(\S+)/$1/g" and "s/\S+/$&/g" don't need
behave differently.

Does something like "use caveats" exists, that warns against common
missing rewrites?

Dr.Ruud · Sep 25, 2005

Abigail schreef:

Dr.Ruud:

But it's not easy to determine those special circumstances. In general
an offset won't do as the string might change.
OK.

True, if there are no further references to $&. But that's a special
case, and perl doesn't spend CPU cycles to find out. You're welcome
to provide a patch, although with all the warnings against using $&
already in the documentation, I'm not sure the patch will be accepted.

Maybe as an example, coming with a general addition of optimization
types...
Or a syntax-coloring tool that highlights things that the documentation
warns against.
OK, let me read Apocalypse-5 first.

Dr.Ruud · Sep 25, 2005

Dr.Ruud schreef:

OK, let me read Apocalypse-5 first.

Looks good:
my $foo = +/.../; # numeric context, return count of matches

Dr.Ruud · Sep 25, 2005

Abigail schreef:

if you use $& [...], Perl doesn't know in advance to
which regex they will refer (you need to be able to solve
the halting problem to determine this).

I don't understand this yet. What is the cost of dealing with this per
regex?

[benchmark]
Rate sub list2 list while
sub 2619/s -- -23% -26% -61%
list2 3398/s 30% -- -4% -49%
list 3523/s 34% 4% -- -47%
while 6698/s 156% 97% 90% --

Rate sub list2 list while
sub 2559/s -- -23% -28% -44%
list2 3339/s 30% -- -6% -28%
list 3556/s 39% 6% -- -23%
while 4610/s 80% 38% 30% --

Note the difference in iteration rate of the while - it dropped almost
by a third.

Yep, (6698-4610)/6698 = 31%.

Note also that the 'sub', 'list' and 'list2' regexes run at almost the
same rate - this is because they use parenthesis (the list ones have
implicate parens because the /\S+/g is in list context, and doesn't
have parens itself). But the 'while /\S+/g' doesn't have parens, and
is penalized because of the $&.

Nice show again!

xs6-results (v5.8.6 built for i386-freebsd-64int)

Rate sub list2 list while
sub 2965/s -- -39% -43% -65%
list2 4874/s 64% -- -7% -42%
list 5245/s 77% 8% -- -38%
while 8404/s 183% 72% 60% --

Rate sub list2 list while
sub 2912/s -- -39% -43% -57%
list2 4738/s 63% -- -8% -30%
list 5143/s 77% 9% -- -24%
while 6796/s 133% 43% 32% --

(8404-6796)/8404 = 20%.

xs1-results (idem)

Rate sub list2 list while
sub 5357/s -- -23% -27% -52%
list2 6990/s 30% -- -4% -38%
list 7303/s 36% 4% -- -35%
while 11263/s 110% 61% 54% --

Rate sub list2 list while
sub 5397/s -- -21% -23% -33%
list2 6817/s 26% -- -2% -16%
list 6990/s 30% 3% -- -14%
while 8095/s 50% 19% 16% --

(11263-8095)/11263 = 28%

I tried some
subst-variants like "/()(\S+)/$2/g" and "/((\S+))/$2/g" and
while-variants like "/\S+\s*/" and "/\G\s*\S+/g"
just to see how they behave: no spectacular results.

Sort by number of characters	1	Nov 2, 2023
Help in hangman game	1	Jul 24, 2023
TF-IDF	1	Aug 19, 2021
Simplify Variable Number of Regex Groups	7	Jun 15, 2007
I would like to use awk to calculate the total number of records processed	1	Aug 25, 2022
number of starting tabs	15	Aug 16, 2012
Genetic algoritm generating the text	0	Aug 18, 2023
Get Number of regex matches	5	Dec 6, 2006

regex, number of matches

Dr.Ruud

Brian Wakem

John Bokma

John W. Krahn

Dr.Ruud

Dr.Ruud

John Bokma

Dr.Ruud

Dr.Ruud

John Bokma

Dr.Ruud

Dr.Ruud

Dr.Ruud

Dr.Ruud

Tad McClellan

Tad McClellan

Dr.Ruud

Dr.Ruud

Dr.Ruud

Dr.Ruud

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads