regex, number of matches

D

Dr.Ruud

#!/usr/local/bin/perl -wC

use strict;

my @text;
$text[1] = "xxx xx xxx xxx";
$text[2] = "yyy yyyy yyy yyy yyy";

my ($chars, $words, $lines) = wc(@text);

print "Chars: $chars\n";
print "Words: $words\n";
print "Lines: $lines\n";

sub wc {
my @ret;
for (@_) {
if (defined) {
$ret[0] += length; # chars
$ret[1] += () = /\S+/g; # words
$ret[2] += 1; # lines
}
}
return @ret;
}


What I found hard to get, is the role of the '()' in the wc-words-line:

$ret[1] += () = /\S+/g; # words

After a while, I understood it as an anonymous array that is filled with
the matches, after which its length is used to increase the words-count.


The creating and filling of () seemed like a waste of cpu-cycles, so I
tried to find another way of counting the number of matches.

Destructive variant:

$ret[1] += s/\S+//g; # words

I settled for:

$ret[1] += 1 while /\S+/g; # words

Is there a better/nicer/smarter/directer way to return the number of
matches from a regex?

See also http://dev.perl.org/perl6/rfc/110.html
 
B

Brian Wakem

Dr.Ruud wrote:

$ret[1] += () = /\S+/g; # words


Destructive variant:

$ret[1] += s/\S+//g; # words

I settled for:

$ret[1] += 1 while /\S+/g; # words

Is there a better/nicer/smarter/directer way to return the number of
matches from a regex?


This was covered a few week ago in a thread titled 'Space (\s) count
problem'.

The fastest way is to substitute a match with itself.

I fired off an email to (e-mail address removed) with this suggestion
but it bounced.
 
J

John Bokma

Dr.Ruud said:
What I found hard to get, is the role of the '()' in the wc-words-line:

$ret[1] += () = /\S+/g; # words

it forces list context :)
After a while, I understood it as an anonymous array that is filled with
the matches, after which its length is used to increase the words-count.

The creating and filling of () seemed like a waste of cpu-cycles,

Maybe it's optimized away?
so I
tried to find another way of counting the number of matches.

Destructive variant:

$ret[1] += s/\S+//g; # words

I settled for:

$ret[1] += 1 while /\S+/g; # words


Did you benchmark those?
 
J

John W. Krahn

Dr.Ruud said:
What I found hard to get, is the role of the '()' in the wc-words-line:

$ret[1] += () = /\S+/g; # words

After a while, I understood it as an anonymous array that is filled with
the matches, after which its length is used to increase the words-count.

If it had been an anonymous array it would have been:

$ret[1] += @{[ /\S+/g ]}; # words



John
 
D

Dr.Ruud

John Bokma schreef:
Dr.Ruud:
What I found hard to get, is the role of the '()' in
$ret[1] += () = /\S+/g; # words

it forces list context :)

Yes, I'm starting to get that.

Maybe it's optimized away?

Well, maybe compare it to:

$ret[1] += (@tmp = /\S+/g); # words

but if tmp is not used afterwards, that also can be optimized.

Destructive variant:

$ret[1] += s/\S+//g; # words

I settled for:

$ret[1] += 1 while /\S+/g; # words


Did you benchmark those?

No, I'm not in a hurry (yet).


A new option for scalar mode would be the cleanest:

$ret[1] += /\S+/t; # words
 
D

Dr.Ruud

Brian Wakem:

The fastest way is to substitute a match with itself.

OK. I had tested that, but I hated the looks, because it doesn't explain
itself enough.

The notion that the 'fake substitution' operates 'at the C level' is
very convincing.


Did you also benchmark "s!\Q$kw\E!$&!g" ?

(You pay a price using &, see man perlre: $& is not so costly.)
 
J

John Bokma

Dr.Ruud said:
John Bokma schreef:

No, I'm not in a hurry (yet).

:) Problem with benchmarking is what today seems to be a bad choice, can
be a better choice tomorrow. An example: map in a void context was some
time ago an expensive operation. IIRC it has been optimized (note that I
don't say that "we" should all use map in a void context now :) )
A new option for scalar mode would be the cleanest:

$ret[1] += /\S+/t; # words

and t means? (Tellen :-D). The questions are: how often is this going to be
used, and; is a new option really needed, or can we get away with what's
available now and some documentation?
 
D

Dr.Ruud

Abigail:
I was surprised the while variant was the clear winner -
assigning to an empty list is hardly faster than assigning to an
array.

That is what I expected.

And the code:

Thanks for that too.

How about:

sub2 => '$sub2 = 0; $sub2 += s/\S+/$&/g for @data;'

And how about the o-option (pre-compile), or doesn't that go with g?
 
D

Dr.Ruud

Brian Wakem:
Dr.Ruud:

$ret[1] += () = /\S+/g; # words


Destructive variant:

$ret[1] += s/\S+//g; # words

I settled for:

$ret[1] += 1 while /\S+/g; # words

Is there a better/nicer/smarter/directer way to return the number of
matches from a regex?


This was covered a few week ago in a thread titled 'Space (\s) count
problem'.

The fastest way is to substitute a match with itself.

As Abigail showed, there will be a difference between

(1) s/$kw/$kw/g (add \Q and \E where needed)

and

(2) s/\S+/$&/g

and

(3) s/\S+//g


The loops of both (1) and (3) are more 'constant' so will need less
cycles than (2).
(is my guess)
 
J

John Bokma

Dr.Ruud said:
sub2 => '$sub2 = 0; $sub2 += s/\S+/$&/g for @data;'

And how about the o-option (pre-compile), or doesn't that go with g?

o is IIRC only useful in some rare cases when you use a variabele in the
regexp. Since your s/// is constant, I think it's compiled, optimized, etc.
at the compile stage of your script, but again IIRC.
 
D

Dr.Ruud

Abigail schreef:
[benchmark]
And the code:

There is a problem with the benchmark, because perlre(1) says:

WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in
the program, it has to provide them for every pattern match. This may
substantially slow your program. Perl uses the same mechanism to pro-
duce $1, $2, etc, so you also pay a price for each pattern that con-
tains capturing parentheses. (To avoid this cost while retaining the
grouping behaviour, use the extended regular expression "(?: ... )"
instead.) But if you never use $&, $` or $', then patterns without
capturing parentheses will not be penalized. So avoid $&, $', and $`
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price. As of 5.005, $& is not so costly as the other
two.
 
D

Dr.Ruud

Abigail schreef:
{} As Abigail showed, there will be a difference between
{}
{} (1) s/$kw/$kw/g (add \Q and \E where needed)
{}
{} and
{}
{} (2) s/\S+/$&/g
{}
{} and
{}
{} (3) s/\S+//g
{}
{}
{} The loops of both (1) and (3) are more 'constant' so will need
less {} cycles than (2).


I did not show that,

You're right. I first had 2 cases in there, and than added one and
changed another and left the top line as is.

and I do not know what you mean by (1) and (3)
being "more constant" and hence needing less cycles.

How little the regex changes for each iteration.

The "$&" part varies in each iteration, but an optimization for that is
feasible, since the search- and the replace-part can be detected as
equal. If that optimization kicks in with (2), than (1) < (2) < (3) in
"what needs to be done".

Case (3) is very much like (2), only the input gets changed, so that
will be the slowest.

Case (1) deals with a different situation: a keyword-count in stead of a
word-count.
 
D

Dr.Ruud

Abigail schreef:
Dr.Ruud:
There is a problem with the benchmark, because [of what] perlre(1)
says

I'm fully aware of the penalty associated with $& and friends.
But why does that cause a problem with the benchmark?

My mistake again. I wrongly read the text about the "price for each
pattern that contains capturing parentheses" as also penalizing other
pattern matches.

I find it hard to think of a reason why the first use of $& should harm
all other pattern matches. And then why ()/$1 doesn't. Because they can
be handled in about the same way. I haven't looked into the Perl-source
yet, this is as good a reason as any.

I like to see the benchmark without the ()/$1 test, to see if the
remaining cases behave about the same. That might need absolute scores
next to the percentages, so running on a system that doesn't do much
else. I'll try to arrange that later today or tomorrow.
 
T

Tad McClellan

Dr.Ruud said:
Abigail schreef:



How little the regex changes for each iteration.


The regex *never* changes for (2) and (3), so surely they
must be "more constant"?

The "$&" part varies in each iteration,


The "$&" part is not in the regular expression portion of s///,
it is in the replacement (double-quotish) _string_ portion.
 
T

Tad McClellan

Dr.Ruud said:
I find it hard to think of a reason why the first use of $& should harm
all other pattern matches.


Because a whole bunch of characters must be stored for
every (successful) pattern match.

And then why ()/$1 doesn't.


Because a whole bunch of characters must be stored only for
the (successful) pattern matches that explicitly mention them.


One makes a lot of work for every pattern match, the other makes a
lot of work for only some pattern matches.

Optimizing away for "every" has to be a bigger win than opitimizing
away only for "some".
 
D

Dr.Ruud

Tad McClellan:
Dr.Ruud:

Because a whole bunch of characters must be stored for
every (successful) pattern match.

In the circumstances we were discussing, an offset plus length on the
input buffer is all you need there (and those are known values).

I think that the alternatives "s/(\S+)/$1/g" and "s/\S+/$&/g" don't need
behave differently.

Does something like "use caveats" exists, that warns against common
missing rewrites?
;)
 
D

Dr.Ruud

Abigail schreef:
Dr.Ruud:

But it's not easy to determine those special circumstances. In general
an offset won't do as the string might change.
OK.


True, if there are no further references to $&. But that's a special
case, and perl doesn't spend CPU cycles to find out. You're welcome
to provide a patch, although with all the warnings against using $&
already in the documentation, I'm not sure the patch will be accepted.

Maybe as an example, coming with a general addition of optimization
types...
Or a syntax-coloring tool that highlights things that the documentation
warns against.
OK, let me read Apocalypse-5 first.
 
D

Dr.Ruud

Abigail schreef:
if you use $& [...], Perl doesn't know in advance to
which regex they will refer (you need to be able to solve
the halting problem to determine this).

I don't understand this yet. What is the cost of dealing with this per
regex?

[benchmark]
Rate sub list2 list while
sub 2619/s -- -23% -26% -61%
list2 3398/s 30% -- -4% -49%
list 3523/s 34% 4% -- -47%
while 6698/s 156% 97% 90% --

Rate sub list2 list while
sub 2559/s -- -23% -28% -44%
list2 3339/s 30% -- -6% -28%
list 3556/s 39% 6% -- -23%
while 4610/s 80% 38% 30% --

Note the difference in iteration rate of the while - it dropped almost
by a third.

Yep, (6698-4610)/6698 = 31%.

Note also that the 'sub', 'list' and 'list2' regexes run at almost the
same rate - this is because they use parenthesis (the list ones have
implicate parens because the /\S+/g is in list context, and doesn't
have parens itself). But the 'while /\S+/g' doesn't have parens, and
is penalized because of the $&.

Nice show again!

xs6-results (v5.8.6 built for i386-freebsd-64int)

Rate sub list2 list while
sub 2965/s -- -39% -43% -65%
list2 4874/s 64% -- -7% -42%
list 5245/s 77% 8% -- -38%
while 8404/s 183% 72% 60% --

Rate sub list2 list while
sub 2912/s -- -39% -43% -57%
list2 4738/s 63% -- -8% -30%
list 5143/s 77% 9% -- -24%
while 6796/s 133% 43% 32% --

(8404-6796)/8404 = 20%.


xs1-results (idem)

Rate sub list2 list while
sub 5357/s -- -23% -27% -52%
list2 6990/s 30% -- -4% -38%
list 7303/s 36% 4% -- -35%
while 11263/s 110% 61% 54% --

Rate sub list2 list while
sub 5397/s -- -21% -23% -33%
list2 6817/s 26% -- -2% -16%
list 6990/s 30% 3% -- -14%
while 8095/s 50% 19% 16% --

(11263-8095)/11263 = 28%


I tried some
subst-variants like "/()(\S+)/$2/g" and "/((\S+))/$2/g" and
while-variants like "/\S+\s*/" and "/\G\s*\S+/g"
just to see how they behave: no spectacular results.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,438
Messages
2,571,699
Members
48,796
Latest member
Greg L.
Top