regex, number of matches

Discussion in 'Perl Misc' started by Dr.Ruud, Sep 23, 2005.

  1. Dr.Ruud

    Dr.Ruud Guest

    #!/usr/local/bin/perl -wC

    use strict;

    my @text;
    $text[1] = "xxx xx xxx xxx";
    $text[2] = "yyy yyyy yyy yyy yyy";

    my ($chars, $words, $lines) = wc(@text);

    print "Chars: $chars\n";
    print "Words: $words\n";
    print "Lines: $lines\n";

    sub wc {
    my @ret;
    for (@_) {
    if (defined) {
    $ret[0] += length; # chars
    $ret[1] += () = /\S+/g; # words
    $ret[2] += 1; # lines
    }
    }
    return @ret;
    }


    What I found hard to get, is the role of the '()' in the wc-words-line:

    $ret[1] += () = /\S+/g; # words

    After a while, I understood it as an anonymous array that is filled with
    the matches, after which its length is used to increase the words-count.


    The creating and filling of () seemed like a waste of cpu-cycles, so I
    tried to find another way of counting the number of matches.

    Destructive variant:

    $ret[1] += s/\S+//g; # words

    I settled for:

    $ret[1] += 1 while /\S+/g; # words

    Is there a better/nicer/smarter/directer way to return the number of
    matches from a regex?

    See also http://dev.perl.org/perl6/rfc/110.html

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 23, 2005
    #1
    1. Advertising

  2. Dr.Ruud

    Brian Wakem Guest

    Dr.Ruud wrote:


    > $ret[1] += () = /\S+/g; # words
    >
    >
    > Destructive variant:
    >
    > $ret[1] += s/\S+//g; # words
    >
    > I settled for:
    >
    > $ret[1] += 1 while /\S+/g; # words
    >
    > Is there a better/nicer/smarter/directer way to return the number of
    > matches from a regex?



    This was covered a few week ago in a thread titled 'Space (\s) count
    problem'.

    The fastest way is to substitute a match with itself.

    I fired off an email to with this suggestion
    but it bounced.



    --
    Brian Wakem
    Email: http://homepage.ntlworld.com/b.wakem/myemail.png
     
    Brian Wakem, Sep 23, 2005
    #2
    1. Advertising

  3. Dr.Ruud

    John Bokma Guest

    "Dr.Ruud" <> wrote:

    > What I found hard to get, is the role of the '()' in the wc-words-line:
    >
    > $ret[1] += () = /\S+/g; # words


    it forces list context :)

    > After a while, I understood it as an anonymous array that is filled with
    > the matches, after which its length is used to increase the words-count.
    >
    > The creating and filling of () seemed like a waste of cpu-cycles,


    Maybe it's optimized away?

    > so I
    > tried to find another way of counting the number of matches.
    >
    > Destructive variant:
    >
    > $ret[1] += s/\S+//g; # words
    >
    > I settled for:
    >
    > $ret[1] += 1 while /\S+/g; # words



    Did you benchmark those?

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Sep 23, 2005
    #3
  4. Dr.Ruud wrote:
    >
    > What I found hard to get, is the role of the '()' in the wc-words-line:
    >
    > $ret[1] += () = /\S+/g; # words
    >
    > After a while, I understood it as an anonymous array that is filled with
    > the matches, after which its length is used to increase the words-count.


    If it had been an anonymous array it would have been:

    $ret[1] += @{[ /\S+/g ]}; # words



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Sep 23, 2005
    #4
  5. Dr.Ruud

    Dr.Ruud Guest

    John Bokma schreef:
    > Dr.Ruud:


    >> What I found hard to get, is the role of the '()' in
    >> $ret[1] += () = /\S+/g; # words

    >
    > it forces list context :)


    Yes, I'm starting to get that.


    >> The creating and filling of () seemed like a waste of cpu-cycles,

    >
    > Maybe it's optimized away?


    Well, maybe compare it to:

    $ret[1] += (@tmp = /\S+/g); # words

    but if tmp is not used afterwards, that also can be optimized.


    >> Destructive variant:
    >>
    >> $ret[1] += s/\S+//g; # words
    >>
    >> I settled for:
    >>
    >> $ret[1] += 1 while /\S+/g; # words

    >
    >
    > Did you benchmark those?


    No, I'm not in a hurry (yet).


    A new option for scalar mode would be the cleanest:

    $ret[1] += /\S+/t; # words

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 23, 2005
    #5
  6. Dr.Ruud

    Dr.Ruud Guest

    Brian Wakem:


    > The fastest way is to substitute a match with itself.


    OK. I had tested that, but I hated the looks, because it doesn't explain
    itself enough.

    The notion that the 'fake substitution' operates 'at the C level' is
    very convincing.


    Did you also benchmark "s!\Q$kw\E!$&!g" ?

    (You pay a price using &, see man perlre: $& is not so costly.)

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 23, 2005
    #6
  7. Dr.Ruud

    John Bokma Guest

    "Dr.Ruud" <> wrote:

    > John Bokma schreef:


    >> Did you benchmark those?

    >
    > No, I'm not in a hurry (yet).


    :) Problem with benchmarking is what today seems to be a bad choice, can
    be a better choice tomorrow. An example: map in a void context was some
    time ago an expensive operation. IIRC it has been optimized (note that I
    don't say that "we" should all use map in a void context now :) )

    > A new option for scalar mode would be the cleanest:
    >
    > $ret[1] += /\S+/t; # words


    and t means? (Tellen :-D). The questions are: how often is this going to be
    used, and; is a new option really needed, or can we get away with what's
    available now and some documentation?

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Sep 23, 2005
    #7
  8. Dr.Ruud

    Dr.Ruud Guest

    Abigail:

    > I was surprised the while variant was the clear winner -
    > assigning to an empty list is hardly faster than assigning to an
    > array.


    That is what I expected.


    > And the code:


    Thanks for that too.

    How about:

    sub2 => '$sub2 = 0; $sub2 += s/\S+/$&/g for @data;'

    And how about the o-option (pre-compile), or doesn't that go with g?

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 24, 2005
    #8
  9. Dr.Ruud

    Dr.Ruud Guest

    Brian Wakem:
    > Dr.Ruud:
    >
    >
    >> $ret[1] += () = /\S+/g; # words
    >>
    >>
    >> Destructive variant:
    >>
    >> $ret[1] += s/\S+//g; # words
    >>
    >> I settled for:
    >>
    >> $ret[1] += 1 while /\S+/g; # words
    >>
    >> Is there a better/nicer/smarter/directer way to return the number of
    >> matches from a regex?

    >
    >
    > This was covered a few week ago in a thread titled 'Space (\s) count
    > problem'.
    >
    > The fastest way is to substitute a match with itself.


    As Abigail showed, there will be a difference between

    (1) s/$kw/$kw/g (add \Q and \E where needed)

    and

    (2) s/\S+/$&/g

    and

    (3) s/\S+//g


    The loops of both (1) and (3) are more 'constant' so will need less
    cycles than (2).
    (is my guess)

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 24, 2005
    #9
  10. Dr.Ruud

    John Bokma Guest

    "Dr.Ruud" <> wrote:

    > sub2 => '$sub2 = 0; $sub2 += s/\S+/$&/g for @data;'
    >
    > And how about the o-option (pre-compile), or doesn't that go with g?


    o is IIRC only useful in some rare cases when you use a variabele in the
    regexp. Since your s/// is constant, I think it's compiled, optimized, etc.
    at the compile stage of your script, but again IIRC.

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Sep 24, 2005
    #10
  11. Dr.Ruud

    Dr.Ruud Guest

    John Bokma:
    > Dr.Ruud:


    >> A new option for scalar mode would be the cleanest:
    >>
    >> $ret[1] += /\S+/t; # words

    >
    > and t means? (Tellen :-D).


    >>> See also http://dev.perl.org/perl6/rfc/110.html


    s/t/=/ if parseable

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 24, 2005
    #11
  12. Dr.Ruud

    Dr.Ruud Guest

    Abigail schreef:

    > [benchmark]
    > And the code:


    There is a problem with the benchmark, because perlre(1) says:

    WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in
    the program, it has to provide them for every pattern match. This may
    substantially slow your program. Perl uses the same mechanism to pro-
    duce $1, $2, etc, so you also pay a price for each pattern that con-
    tains capturing parentheses. (To avoid this cost while retaining the
    grouping behaviour, use the extended regular expression "(?: ... )"
    instead.) But if you never use $&, $` or $', then patterns without
    capturing parentheses will not be penalized. So avoid $&, $', and $`
    if you can, but if you can't (and some algorithms really appreciate
    them), once you've used them once, use them at will, because you've
    already paid the price. As of 5.005, $& is not so costly as the other
    two.


    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 24, 2005
    #12
  13. Dr.Ruud

    Dr.Ruud Guest

    Abigail schreef:
    > Dr.Ruud:


    > {} As Abigail showed, there will be a difference between
    > {}
    > {} (1) s/$kw/$kw/g (add \Q and \E where needed)
    > {}
    > {} and
    > {}
    > {} (2) s/\S+/$&/g
    > {}
    > {} and
    > {}
    > {} (3) s/\S+//g
    > {}
    > {}
    > {} The loops of both (1) and (3) are more 'constant' so will need
    > less {} cycles than (2).
    >
    >
    > I did not show that,


    You're right. I first had 2 cases in there, and than added one and
    changed another and left the top line as is.


    > and I do not know what you mean by (1) and (3)
    > being "more constant" and hence needing less cycles.


    How little the regex changes for each iteration.

    The "$&" part varies in each iteration, but an optimization for that is
    feasible, since the search- and the replace-part can be detected as
    equal. If that optimization kicks in with (2), than (1) < (2) < (3) in
    "what needs to be done".

    Case (3) is very much like (2), only the input gets changed, so that
    will be the slowest.

    Case (1) deals with a different situation: a keyword-count in stead of a
    word-count.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 24, 2005
    #13
  14. Dr.Ruud

    Dr.Ruud Guest

    Abigail schreef:
    > Dr.Ruud:


    >> There is a problem with the benchmark, because [of what] perlre(1)

    says

    > I'm fully aware of the penalty associated with $& and friends.
    > But why does that cause a problem with the benchmark?


    My mistake again. I wrongly read the text about the "price for each
    pattern that contains capturing parentheses" as also penalizing other
    pattern matches.

    I find it hard to think of a reason why the first use of $& should harm
    all other pattern matches. And then why ()/$1 doesn't. Because they can
    be handled in about the same way. I haven't looked into the Perl-source
    yet, this is as good a reason as any.

    I like to see the benchmark without the ()/$1 test, to see if the
    remaining cases behave about the same. That might need absolute scores
    next to the percentages, so running on a system that doesn't do much
    else. I'll try to arrange that later today or tomorrow.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 24, 2005
    #14
  15. Dr.Ruud <> wrote:
    > Abigail schreef:
    >> Dr.Ruud:

    >
    >> {} As Abigail showed, there will be a difference between
    >> {}
    >> {} (1) s/$kw/$kw/g (add \Q and \E where needed)
    >> {}
    >> {} and
    >> {}
    >> {} (2) s/\S+/$&/g
    >> {}
    >> {} and
    >> {}
    >> {} (3) s/\S+//g



    >> and I do not know what you mean by (1) and (3)
    >> being "more constant" and hence needing less cycles.

    >
    > How little the regex changes for each iteration.



    The regex *never* changes for (2) and (3), so surely they
    must be "more constant"?


    > The "$&" part varies in each iteration,



    The "$&" part is not in the regular expression portion of s///,
    it is in the replacement (double-quotish) _string_ portion.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Sep 24, 2005
    #15
  16. Dr.Ruud <> wrote:

    > I find it hard to think of a reason why the first use of $& should harm
    > all other pattern matches.



    Because a whole bunch of characters must be stored for
    every (successful) pattern match.


    > And then why ()/$1 doesn't.



    Because a whole bunch of characters must be stored only for
    the (successful) pattern matches that explicitly mention them.


    One makes a lot of work for every pattern match, the other makes a
    lot of work for only some pattern matches.

    Optimizing away for "every" has to be a bigger win than opitimizing
    away only for "some".


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Sep 24, 2005
    #16
  17. Dr.Ruud

    Dr.Ruud Guest

    Tad McClellan:
    > Dr.Ruud:


    >> I find it hard to think of a reason why the first use of $& should
    >> harm all other pattern matches.

    >
    > Because a whole bunch of characters must be stored for
    > every (successful) pattern match.


    In the circumstances we were discussing, an offset plus length on the
    input buffer is all you need there (and those are known values).

    I think that the alternatives "s/(\S+)/$1/g" and "s/\S+/$&/g" don't need
    behave differently.

    Does something like "use caveats" exists, that warns against common
    missing rewrites?
    ;)

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 25, 2005
    #17
  18. Dr.Ruud

    Dr.Ruud Guest

    Abigail schreef:
    > Dr.Ruud:


    >> In the circumstances we were discussing, an offset plus length on
    >> the input buffer is all you need there (and those are known
    >> values).

    >
    > But it's not easy to determine those special circumstances. In general
    > an offset won't do as the string might change.


    OK.

    >> I think that the alternatives "s/(\S+)/$1/g" and "s/\S+/$&/g"
    >> don't need to behave differently.

    >
    > True, if there are no further references to $&. But that's a special
    > case, and perl doesn't spend CPU cycles to find out. You're welcome
    > to provide a patch, although with all the warnings against using $&
    > already in the documentation, I'm not sure the patch will be accepted.


    Maybe as an example, coming with a general addition of optimization
    types...
    Or a syntax-coloring tool that highlights things that the documentation
    warns against.
    OK, let me read Apocalypse-5 first.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 25, 2005
    #18
  19. Dr.Ruud

    Dr.Ruud Guest

    Dr.Ruud schreef:

    > OK, let me read Apocalypse-5 first.


    Looks good:
    my $foo = +/.../; # numeric context, return count of matches

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 25, 2005
    #19
  20. Dr.Ruud

    Dr.Ruud Guest

    Abigail schreef:

    > if you use $& [...], Perl doesn't know in advance to
    > which regex they will refer (you need to be able to solve
    > the halting problem to determine this).


    I don't understand this yet. What is the cost of dealing with this per
    regex?


    > [benchmark]
    > Rate sub list2 list while
    > sub 2619/s -- -23% -26% -61%
    > list2 3398/s 30% -- -4% -49%
    > list 3523/s 34% 4% -- -47%
    > while 6698/s 156% 97% 90% --
    >
    > Rate sub list2 list while
    > sub 2559/s -- -23% -28% -44%
    > list2 3339/s 30% -- -6% -28%
    > list 3556/s 39% 6% -- -23%
    > while 4610/s 80% 38% 30% --
    >
    > Note the difference in iteration rate of the while - it dropped almost
    > by a third.


    Yep, (6698-4610)/6698 = 31%.


    > Note also that the 'sub', 'list' and 'list2' regexes run at almost the
    > same rate - this is because they use parenthesis (the list ones have
    > implicate parens because the /\S+/g is in list context, and doesn't
    > have parens itself). But the 'while /\S+/g' doesn't have parens, and
    > is penalized because of the $&.


    Nice show again!

    xs6-results (v5.8.6 built for i386-freebsd-64int)

    Rate sub list2 list while
    sub 2965/s -- -39% -43% -65%
    list2 4874/s 64% -- -7% -42%
    list 5245/s 77% 8% -- -38%
    while 8404/s 183% 72% 60% --

    Rate sub list2 list while
    sub 2912/s -- -39% -43% -57%
    list2 4738/s 63% -- -8% -30%
    list 5143/s 77% 9% -- -24%
    while 6796/s 133% 43% 32% --

    (8404-6796)/8404 = 20%.


    xs1-results (idem)

    Rate sub list2 list while
    sub 5357/s -- -23% -27% -52%
    list2 6990/s 30% -- -4% -38%
    list 7303/s 36% 4% -- -35%
    while 11263/s 110% 61% 54% --

    Rate sub list2 list while
    sub 5397/s -- -21% -23% -33%
    list2 6817/s 26% -- -2% -16%
    list 6990/s 30% 3% -- -14%
    while 8095/s 50% 19% 16% --

    (11263-8095)/11263 = 28%


    I tried some
    subst-variants like "/()(\S+)/$2/g" and "/((\S+))/$2/g" and
    while-variants like "/\S+\s*/" and "/\G\s*\S+/g"
    just to see how they behave: no spectacular results.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 25, 2005
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Stephan Bour

    Extracting matches from Regex.Split

    Stephan Bour, Oct 29, 2003, in forum: ASP .Net
    Replies:
    3
    Views:
    2,565
    Stephan Bour
    Oct 30, 2003
  2. darrel
    Replies:
    1
    Views:
    798
    Blair Bonnett
    Jan 3, 2005
  3. Replies:
    4
    Views:
    1,589
  4. Talin
    Replies:
    3
    Views:
    485
    Talin
    Nov 19, 2005
  5. Ingo Weiss

    Get Number of regex matches

    Ingo Weiss, Dec 6, 2006, in forum: Ruby
    Replies:
    5
    Views:
    207
    Ingo Weiss
    Dec 7, 2006
Loading...

Share This Page