Controlling the value returned in $1

Discussion in 'Perl Misc' started by Andre Majorel, Dec 17, 2006.

  1. Is there a way to override the value returned by a capture so
    that $1 is set not to the characters matched by the parentheses
    but some arbitrary string or number ? I'm thinking of something
    like this :

    $ perl -e '
    sub what ($)
    {
      if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/)  # Fictitious syntax
      {
        print "the string \"$_[0]\" matched a \"$1\"\n";
      }
    }

    what ("123");
    what ("abc123");
    '
    the string "123" matched a "number"
    the string "abc123" matched a "word"
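
    For readers on Perl 5.10 or later (released after this thread), named
    captures come close to the fictitious syntax above. What follows is only
    a minimal sketch under that assumption: the group names "integer" and
    "word" are taken from the pattern above, and the pattern is left
    unanchored as in the question.

```perl
use strict;
use warnings;

# Return the name of the alternative that matched, or undef if none did.
# Assumes Perl 5.10+ (named captures and the %+ hash).
sub what {
    my ($s) = @_;
    if ($s =~ /(?<integer>\d+)|(?<word>\w+)/) {
        # %+ only lists named groups that actually captured something,
        # so exactly one key is present after a successful match.
        my ($name) = keys %+;
        return $name;
    }
    return;
}

for my $s (qw(123 abc123)) {
    print qq(the string "$s" matched a "), what($s), qq("\n);
}
```

    The label comes from the group name rather than from $1, so the
    captured text itself is still available through $+{integer} or
    $+{word} if it is needed.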

    --
    André Majorel <URL:http://www.teaser.fr/~amajorel/>
    (Counterfeit: )
    Religion: a magic device for turning unanswerable questions into
    unquestionable answers. -- Art Gecko
     
    Andre Majorel, Dec 17, 2006
    #1

  2. On Sun, 17 Dec 2006 11:47:21 +0000 (UTC), Andre Majorel wrote:

    >Is there a way to override the value returned by a capture so
    >that $1 is set not to the characters matched by the parentheses
    >but some arbitrary string or number ? I'm thinking of something
    >like this :


    This smells like an XY problem: why do you want to do this?

    >$ perl -e '
    > sub what ($)


    There is hardly any need for prototypes, even less so in a minimal
    example. No need for a sub at all, to tell the truth...

    > {
    > if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
    > {
    > print "the string \"$_[0]\" matched a \"$1\"\n";


    Incidentally, you can use alternate delimiters:

    print qq|the string "$_[0]" matched a "$1"\n|;

    >the string "123" matched a "number"
    >the string "abc123" matched a "word"


    Usual answer: don't use one regex where two (or more) would better fit
    the bill.


    Michele
    --
    {$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
    (($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
    ..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
    256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
     
    Michele Dondi, Dec 17, 2006
    #2

  3. Andre Majorel wrote:
    > Is there a way to override the value returned by a capture so
    > that $1 is set not to the characters matched by the parentheses
    > but some arbitrary string or number ?


    What is it you're actually *trying* to do, that you've decided this is
    the correct solution to?

    > I'm thinking of something like this :
    >
    > $ perl -e '
    > sub what ($)
    > {
    > if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
    > {
    > print "the string \"$_[0]\" matched a \"$1\"\n";
    > }
    > }
    >
    > what ("123");
    > what ("abc123");
    > '
    > the string "123" matched a "number"
    > the string "abc123" matched a "word"


    perl -le'
    my @type_of = ( [qr/\d+/ => q{integer}], [qr/\w+/ => q{word}] );
    for my $s (qw/123 abc123/) {
      for my $t (@type_of) {
        if ($s =~ /^$t->[0]$/) {
          print "$s matches a $t->[1]";
          last;
        }
      }
    }
    '
     
    Paul Lalli, Dec 17, 2006
    #3
  4. On 2006-12-17, Paul Lalli wrote:
    > Andre Majorel wrote:
    >> Is there a way to override the value returned by a capture so
    >> that $1 is set not to the characters matched by the parentheses
    >> but some arbitrary string or number ?

    >
    > What is it you're actually *trying* to do, that you've decided this is
    > the correct solution to?


    The goal is to know whether a relatively long string matches one
    of several regexps (that's the easy part) and, if possible,
    which regexp it matched.

    > perl -le'
    > my @type_of = ( [qr/\d+/ => q{integer}], [qr/\w+/ => q{word}] );
    > for my $s (qw/123 abc123/) {
    > for my $t (@type_of) {
    > if ($s =~ /^$t->[0]$/) {
    > print "$s matches a $t->[1]";
    > last;
    > }
    > }
    > }


    But then, you scan the data once for each regexp. Wouldn't the
    execution time of /r0/ || /r1/ || ... || /rN/ approach N times
    the execution time of /r0|r1|...|rN/ as N gets bigger ?

     
    Andre Majorel, Dec 17, 2006
    #4
  5. Andre Majorel wrote:
    > Is there a way to override the value returned by a capture so
    > that $1 is set not to the characters matched by the parentheses
    > but some arbitrary string or number ? I'm thinking of something
    > like this :
    >
    > $ perl -e '
    > sub what ($)
    > {
    > if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
    > {
    > print "the string \"$_[0]\" matched a \"$1\"\n";
    > }
    > }
    >
    > what ("123");
    > what ("abc123");
    > '


    How about this:

    perl -e '
    sub what ($)
    {
      # (?{ ... }) is a code assertion; its value ends up in $^R
      if ($_[0] =~ /^(?:\d+(?{"number"})|\w+(?{"word"}))$/)
      {
        print qq(the string "$_[0]" matched a "$^R"\n);
      }
    }
    what ("123");
    what ("abc123");
    '
    (Be sure to use both anchors, '^' and '$', in your pattern.)
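
    A small sketch of why the anchors matter with $^R. The helper name
    classify and the test string "12ab" are illustrative assumptions, not
    part of the thread. Without anchors, the first alternative that matches
    anywhere wins; with anchors, Perl backtracks out of \d+ and the value of
    $^R follows the alternative that finally succeeds.

```perl
use strict;
use warnings;

# Hypothetical helper: return $^R (the value of the last code assertion
# executed on the successful path) or undef if the pattern does not match.
sub classify {
    my ($s, $re) = @_;
    return $s =~ $re ? $^R : undef;
}

my $anchored   = qr/^(?:\d+(?{"number"})|\w+(?{"word"}))$/;
my $unanchored = qr/(?:\d+(?{"number"})|\w+(?{"word"}))/;

for my $s (qw(123 abc123 12ab)) {
    printf "%-7s anchored: %-7s unanchored: %s\n",
        $s, classify($s, $anchored), classify($s, $unanchored);
}
```

    With "12ab" the unanchored pattern stops after the leading digits and
    reports "number", whereas the anchored one has to consume the whole
    string and reports "word".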

    Regards,
    Xicheng

    > the string "123" matched a "number"
    > the string "abc123" matched a "word"
     
    Xicheng Jia, Dec 17, 2006
    #5
  6. Andre Majorel wrote:
    > On 2006-12-17, Paul Lalli wrote:
    >>Andre Majorel wrote:
    >>>
    >>> Is there a way to override the value returned by a capture so
    >>> that $1 is set not to the characters matched by the parentheses
    >>> but some arbitrary string or number ?

    >>
    >> What is it you're actually *trying* to do, that you've decided this is
    >> the correct solution to?

    >
    > The goal is to know whether a relatively long string matches one
    > of several regexps (that's the easy part) and, if possible,
    > which regexp it matched.


    If you have a single long string with many matches then you may want to study
    the string first:

    perldoc -f study


    >> perl -le'
    >> my @type_of = ( [qr/\d+/ => q{integer}], [qr/\w+/ => q{word}] );
    >> for my $s (qw/123 abc123/) {
    >> for my $t (@type_of) {
    >> if ($s =~ /^$t->[0]$/) {
    >> print "$s matches a $t->[1]";
    >> last;
    >> }
    >> }
    >> }

    >
    > But then, you scan the data once for each regexp. Wouldn't the
    > execution time of /r0/ || /r1/ || ... || /rN/ approach N times
    > the execution time of /r0|r1|...|rN/ as N gets bigger ?


    It may seem counter-intuitive, but using separate matches is usually
    faster. In fact, this is a frequently asked question:

    perldoc -q "How do I efficiently match many regular expressions at once"



    John
    --
    Perl isn't a toolbox, but a small machine shop where you can special-order
    certain sorts of tools at low cost and in short order. -- Larry Wall
     
    John W. Krahn, Dec 17, 2006
    #6
  7. On 12/17/2006 05:47 AM, Andre Majorel wrote:
    > Is there a way to override the value returned by a capture so
    > that $1 is set not to the characters matched by the parentheses
    > but some arbitrary string or number ? I'm thinking of something
    > like this :
    >
    > $ perl -e '
    > sub what ($)
    > {
    > if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
    > {
    > print "the string \"$_[0]\" matched a \"$1\"\n";
    > }
    > }
    >
    > what ("123");
    > what ("abc123");
    > '
    > the string "123" matched a "number"
    > the string "abc123" matched a "word"
    >


    Take a look at this:

    #!/usr/local/bin/perl5.9.4

    use strict;
    use warnings;
    local $\ = "\n";
    printf "Version: %vd\n", $^V;
    print whatsis2('3234');
    print whatsis2('3234');
    print whatsis2('3234');
    print whatsis2('3234');
    print whatsis2('3234');

    my $what;

    sub whatsis2 {
      local $_ = $_[0];
      $what = 'unknown';
      m/(\d+(?{$what = 'integer'}))|([[:alpha:]]+(?{$what = 'word'}))/;
      $what;
    }

    sub whatsis {
      local $_ = $_[0];
      $what = 'unknown';
      $what = 'integer' if /^\d+$/;
      $what = 'word' if /^[[:alpha:]]+$/;
      $what;
    }

    __END__

    "Whatsis2" attempts to get close to what you are looking for, but I like
    the much less obfuscated "whatsis."

    I decided to print the same thing five times because "whatsis2" had an
    interesting bug when $what was defined within "whatsis2."


    --

    http://home.earthlink.net/~mumia.w.18.spam/
     
    Mumia W. (on aioe), Dec 17, 2006
    #7
  8. On 2006-12-17, John W. Krahn wrote:
    > Andre Majorel wrote:
    >
    >> But then, you scan the data once for each regexp. Wouldn't the
    >> execution time of /r0/ || /r1/ || ... || /rN/ approach N times
    >> the execution time of /r0|r1|...|rN/ as N gets bigger ?

    >
    > It may seem counter-intuitive but using separate matches is usually
    > faster, in fact this has been frequently asked:
    >
    > perldoc -q "How do I efficiently match many regular expressions at once"


    Just checked and you're right ! On my test case, /a|b|c|d/ is
    about twenty times slower than (/a/ || /b/ || /c/ || /d/). Gack !

    A similar program in C gives the expected results, namely
    /a\|b\|c\|d/ being about four times faster than (/a/ || /b/ ||
    /c/ || /d/).

    From comparing the execution times of the Perl and C
    implementations, it would appear that

    - Perl is magically twice as fast as regexec() at evaluating
    (/a/ || /b/ || /c/ || /d/). I suspect Perl gathers information
    on the string while evaluating /a/ and uses that to save time
    on /b/ through /d/.

    - Perl handles the alternation operator ("|") in a remarkably
    inefficient way, about ten times slower than regexec() on my
    system.

    Thanks.

     
    Andre Majorel, Dec 18, 2006
    #8
  9. Andre Majorel wrote:

    > On 2006-12-17, John W. Krahn wrote:
    >> Andre Majorel wrote:
    >>
    >>> But then, you scan the data once for each regexp. Wouldn't the
    >>> execution time of /r0/ || /r1/ || ... || /rN/ approach N times
    >>> the execution time of /r0|r1|...|rN/ as N gets bigger ?

    >>
    >> It may seem counter-intuitive but using separate matches is usually
    >> faster, in fact this has been frequently asked:
    >>
    >> perldoc -q "How do I efficiently match many regular expressions at
    >> once"

    >
    > Just checked and you're right ! On my test case, /a|b|c|d/ is
    > about twenty times slower than (/a/ || /b/ || /c/ || /d/). Gack !


    You know that logical operators short-circuit, right?

    > A similar program in C gives the expected results, namely
    > /a\|b\|c\|d/ being about four times faster than (/a/ || /b/ ||
    > /c/ || /d/).


    I would be very interested in seeing this 'similar' C program given that
    regular expressions are not part of C.

    > - Perl is magically twice as fast as regexec() at evaluating
    > (/a/ || /b/ || /c/ || /d/). I suspect Perl gathers information
    > on the string while evaluating /a/ and uses that to save time
    > on /b/ through /d/.


    It is not magic. If /a/ matches, then the rest of the matches don't have to
    be tried.

    > - Perl handles the alternation operator ("|") in a remarkably
    > inefficient way, about ten times slower than regexec() on my
    > system.


    What is regexec?

    Sinan
     
    A. Sinan Unur, Dec 18, 2006
    #9
  10. On 2006-12-17, Xicheng Jia wrote:
    > Andre Majorel wrote:
    >> Is there a way to override the value returned by a capture so
    >> that $1 is set not to the characters matched by the parentheses
    >> but some arbitrary string or number ?

    >
    > How about this:
    >
    > perl -e '
    > sub what ($)
    > {
    > if ($_[0] =~ /^(?:\d+(?{"number"})|\w+(?{"word"}))$/)
    > {
    > print qq(the string "$_[0]" matched a "$^R" \n);
    > }
    > }
    > what ("123");
    > what ("abc123");
    > '


    Thanks, this looks interesting. Do you have any idea how likely
    it is to be "changed or deleted without notice", as perlre(1)
    puts it ?

    > (Be sure to use two anchors '^', and '$' in your pattern)


    Even if my regexp is not anchored ?

     
    Andre Majorel, Dec 18, 2006
    #10
  11. Quoth "A. Sinan Unur":
    > > A similar program in C gives the expected results, namely
    > > /a\|b\|c\|d/ being about four times faster than (/a/ || /b/ ||
    > > /c/ || /d/).

    >
    > I would be very interested in seeing this 'similar' C program given that
    > regular expressions are not part of C.


    They are part of POSIX, however.

    > > - Perl is magically twice as fast as regexec() at evaluating
    > > (/a/ || /b/ || /c/ || /d/). I suspect Perl gathers information
    > > on the string while evaluating /a/ and uses that to save time
    > > on /b/ through /d/.

    >
    > It is not magic. If /a/ matches, then the rest of the matches don't have to
    > be tried.
    >
    > > - Perl handles the alternation operator ("|") in a remarkably
    > > inefficient way, about ten times slower than regexec() on my
    > > system.


    I think it's highly unlikely that Perl's regular expression engine is
    slower than POSIX' in a case like this. I strongly suspect that what
    your benchmark is showing is that C is faster than Perl: that is, you're
    not actually comparing the speeds of the matching, as the rest of Perl
    is swamping the time taken to perform the match.

    FWIW, a *lot* of work has gone into alternations in the development
    version of Perl, so when 5.10 is released I wouldn't be surprised if
    /a|b|c|d/ is faster than /a/||/b/||/c/||/d/. Note that when benchmarking
    you need to test cases where none of the patterns match: this is where
    the new code is most likely to win, as Perl may be able to determine
    straight off that none of the alternations could possibly match.
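
    A rough way to try this yourself is the core Benchmark module. This is
    only a sketch: the 10,000-character test line containing none of the
    letters a to d is made up for illustration, and the relative timings
    depend heavily on the Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: a long line on which none of the patterns match.
my $line = 'x' x 10_000;

my %tests = (
    alt   => sub { $line =~ /a|b|c|d/ },
    class => sub { $line =~ /[abcd]/ },
    mult  => sub { $line =~ /a/ || $line =~ /b/ || $line =~ /c/ || $line =~ /d/ },
);

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, \%tests);
```

    Because nothing matches, the || chain cannot short-circuit, so all four
    separate matches are attempted on every iteration.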

    > What is regexec?


    regexec(3), in <regex.h>, POSIX regular expressions. What Perl's regexen
    are (very distantly) based on.

    Ben

    --
    'Deserve [death]? I daresay he did. Many live that deserve death. And some die
    that deserve life. Can you give it to them? Then do not be too eager to deal
    out death in judgement. For even the very wise cannot see all ends.'
     
    Ben Morrow, Dec 18, 2006
    #11
  12. Quoth Andre Majorel:
    > On 2006-12-17, Xicheng Jia wrote:
    >

    < a pattern involving (?{}) >
    >
    > Thanks, this looks interesting. Do you have any idea how likely
    > it is to be "changed or deleted without notice", as perlre(1)
    > puts it ?


    Not at all. That notice has been there since the code assertions were
    invented, but enough people use them now that p5p are not going to
    be able to remove them.

    Ben

    --
    The Earth is degenerating these days. Bribery and corruption abound.
    Children no longer mind their parents, every man wants to write a book,
    and it is evident that the end of the world is fast approaching.
    Assyrian stone tablet, c.2800 BC
     
    Ben Morrow, Dec 18, 2006
    #12
  13. Mirco Wahab wrote:

    Oops...

    > $> perl -e '
    > sub what ($) {
    > my %u;
    > $u{$-[0]} = "number" while $_[0] =~ /([0-9]+)/g;
    > $u{$-[0]} = "word" while $_[0] =~ /([A-Za-z]+)/g;
    > map $u{$_}, sort keys %u;


    must be:
    ... sort {$a<=>$b} keys %u;

    (beware of more than 10 occurrences ;-)
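
    To illustrate the correction: the hash keys are match offsets taken
    from $-[0], and a plain sort compares them as strings, which goes wrong
    as soon as an offset has two digits. A minimal sketch with a made-up
    input string whose tokens start at offsets 0, 4 and 12:

```perl
use strict;
use warnings;

# Hypothetical input: "abc" at offset 0, digits at offset 4, "xyz" at offset 12.
my $s = "abc 1234567 xyz";

my %u;
$u{$-[0]} = "number" while $s =~ /[0-9]+/g;
$u{$-[0]} = "word"   while $s =~ /[A-Za-z]+/g;

my @by_string = sort keys %u;                 # string order: 0, 12, 4
my @by_number = sort { $a <=> $b } keys %u;   # numeric order: 0, 4, 12

print "string sort:  @u{@by_string}\n";
print "numeric sort: @u{@by_number}\n";
```

    Only the numeric sort returns the token types in the order in which
    they appear in the string.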


    Regards

    Mirco
     
    Mirco Wahab, Dec 18, 2006
    #13
  14. On 2006-12-18, Ben Morrow wrote:
    > Quoth "A. Sinan Unur":
    >
    >> > A similar program in C gives the expected results, namely
    >> > /a\|b\|c\|d/ being about four times faster than (/a/ || /b/ ||
    >> > /c/ || /d/).

    >>
    >> I would be very interested in seeing this 'similar' C program given
    >> that regular expressions are not part of C.


    http://www.teaser.fr/~amajorel/regexp-alt/

    On my system, the execution times in seconds for 10,000 records are :

    C (s)   Perl (s)   Perl/C   What
    1.21    53.05      43.8     "alt"   (/a\|b\|c\|d/ for C, /a|b|c|d/ for Perl)
    1.21     2.45       2.02    "class" (/[abcd]/)
    4.81     2.39       0.497   "mult"  (/a/ || /b/ || /c/ || /d/)

    This is Glibc 2.3.6 and Perl 5.8.8 on Linux i386.

    >> > - Perl is magically twice as fast as regexec() at evaluating
    >> > (/a/ || /b/ || /c/ || /d/). I suspect Perl gathers information
    >> > on the string while evaluating /a/ and uses that to save time
    >> > on /b/ through /d/.

    >>
    >> It is not magic. If /a/ matches, then the rest of the matches
    >> don't have to be tried.


    But it doesn't, of course.

    >> > - Perl handles the alternation operator ("|") in a remarkably
    >> > inefficient way, about ten times slower than regexec() on my
    >> > system.

    >
    > I think it's highly unlikely that Perl's regular expression engine is
    > slower than POSIX' in a case like this. I strongly suspect that what
    > your benchmark is showing is that C is faster than Perl: that is, you're
    > not actually comparing the speeds of the matching, as the rest of Perl
    > is swamping the time taken to perform the match.


    Perl is about 22 times slower on /a|b|c|d/ than on /[abcd]/ and the
    surrounding code is identical. The Perl regexp engine clearly
    has a problem with "|".

    > FWIW, a *lot* of work has gone into alternations in the development
    > version of Perl, so when 5.10 is released I wouldn't be surprised if
    > /a|b|c|d/ is faster than /a/||/b/||/c/||/d/.


    Thanks, that's good to know. Are there binary snapshots for
    Linux i386 ?

    > Note that when benchmarking you need to test cases where none
    > of the patterns match: this is where the new code is most
    > likely to win, as Perl may be able to determine straight off
    > that none of the alternations could possibly match.


    That is the case.

     
    Andre Majorel, Dec 18, 2006
    #14
  15. Andre Majorel wrote:
    >
    > Perl is about 22 times slower on /a|b|c|d/ than on /[abcd]/ and the
    > surrounding code is identical. The Perl regexp engine clearly
    > has a problem with "|".


    That is a known problem on current versions of Perl. It will be fixed in the
    next release (5.10).


    John
     
    John W. Krahn, Dec 18, 2006
    #15
