Controlling the value returned in $1

Andre Majorel · Dec 17, 2006

Is there a way to override the value returned by a capture so
that $1 is set not to the characters matched by the parentheses
but some arbitrary string or number ? I'm thinking of something
like this :

$ perl -e '
sub what ($)
{
    if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/)  # Fictitious syntax
    {
        print "the string \"$_[0]\" matched a \"$1\"\n";
    }
}

what ("123");
what ("abc123");
'
the string "123" matched a "number"
the string "abc123" matched a "word"

Michele Dondi · Dec 17, 2006

Is there a way to override the value returned by a capture so
that $1 is set not to the characters matched by the parentheses
but some arbitrary string or number ? I'm thinking of something
like this :

This smells like an XY problem: why do you want to do that?

$ perl -e '
sub what ($)

Hardly any need for prototypes, and even less so in a minimal example.
No need for a sub at all, to tell the truth...

{
if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
{
print "the string \"$_[0]\" matched a \"$1\"\n";

Incidentally, you can use alternate delimiters:

print qq|the string "$_[0]" matched a "$1"\n|;

the string "123" matched a "number"
the string "abc123" matched a "word"

Usual answer: don't use one regex where two (or more) would better fit
the bill.

Michele

Paul Lalli · Dec 17, 2006

Andre said:
Is there a way to override the value returned by a capture so
that $1 is set not to the characters matched by the parentheses
but some arbitrary string or number ?

What is it you're actually *trying* to do, that you've decided this is
the correct solution to?

I'm thinking of something like this :

$ perl -e '
sub what ($)
{
if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
{
print "the string \"$_[0]\" matched a \"$1\"\n";
}
}

what ("123");
what ("abc123");
'
the string "123" matched a "number"
the string "abc123" matched a "word"

perl -le'
my @type_of = ( [qr/\d+/ => q{integer}], [qr/\w+/ => q{word}] );
for my $s (qw/123 abc123/) {
    for my $t (@type_of) {
        if ($s =~ /^$t->[0]$/) {
            print "$s matches a $t->[1]";
            last;
        }
    }
}
'

Andre Majorel · Dec 17, 2006

What is it you're actually *trying* to do, that you've decided this is
the correct solution to?

The goal is to know whether a relatively long string matches one
of several regexps (that's the easy part) and, if possible,
which regexp it matched.
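
For the "which regexp it matched" part, one possibility is to give each alternative its own named capture group and ask which name captured. This is only a sketch, and it assumes Perl 5.10 or later (named captures and the %+ hash are not available in 5.8). The helper name which_kind is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the name of the alternative that matched,
# or nothing (undef in scalar context) when the string does not match.
# Assumes Perl 5.10+ for (?<name>...) and %+.
sub which_kind {
    my ($s) = @_;
    return unless $s =~ /^(?:(?<integer>\d+)|(?<word>\w+))$/;
    my ($name) = keys %+;    # %+ lists only the named groups that captured
    return $name;
}

for my $s ('123', 'abc123', '!?') {
    my $kind = which_kind($s);
    print defined $kind
        ? qq(the string "$s" matched a "$kind"\n)
        : qq(the string "$s" matched nothing\n);
}
```

Because exactly one alternative can take part in a successful match here, %+ holds a single key, so taking the first key is enough.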

perl -le'
my @type_of = ( [qr/\d+/ => q{integer}], [qr/\w+/ => q{word}] );
for my $s (qw/123 abc123/) {
for my $t (@type_of) {
if ($s =~ /^$t->[0]$/) {
print "$s matches a $t->[1]";
last;
}
}
}

But then, you scan the data once for each regexp. Wouldn't the
execution time of /r0/ || /r1/ || ... || /rN/ approach N times
the execution time of /r0|r1|...|rN/ as N gets bigger ?

Xicheng Jia · Dec 17, 2006

Andre said:
Is there a way to override the value returned by a capture so
that $1 is set not to the characters matched by the parentheses
but some arbitrary string or number ? I'm thinking of something
like this :

$ perl -e '
sub what ($)
{
if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
{
print "the string \"$_[0]\" matched a \"$1\"\n";
}
}

what ("123");
what ("abc123");
'

How about this:

perl -e '
sub what ($)
{
    if ($_[0] =~ /^(?:\d+(?{"number"})|\w+(?{"word"}))$/)  # (?{...}) sets $^R
    {
        print qq(the string "$_[0]" matched a "$^R" \n);
    }
}
what ("123");
what ("abc123");
'
(Be sure to use both anchors, '^' and '$', in your pattern)

Regards,
Xicheng

John W. Krahn · Dec 17, 2006

Andre said:
The goal is to know whether a relatively long string matches one
of several regexps (that's the easy part) and, if possible,
which regexp it matched.

If you have a single long string with many matches then you may want to study
the string first:

perldoc -f study

perl -le'
my @type_of = ( [qr/\d+/ => q{integer}], [qr/\w+/ => q{word}] );
for my $s (qw/123 abc123/) {
for my $t (@type_of) {
if ($s =~ /^$t->[0]$/) {
print "$s matches a $t->[1]";
last;
}
}
}


But then, you scan the data once for each regexp. Wouldn't the
execution time of /r0/ || /r1/ || ... || /rN/ approach N times
the execution time of /r0|r1|...|rN/ as N gets bigger ?

It may seem counter-intuitive, but using separate matches is usually faster.
In fact, this is a frequently asked question:

perldoc -q "How do I efficiently match many regular expressions at once"

John
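
The claim is easy to check on your own Perl with the core Benchmark module. The following is a minimal sketch, not a rigorous measurement: the test string and the sub names match_alt and match_mult are made up, and the string is chosen so that none of the patterns match, which forces every pattern to scan the whole string.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: a long string in which none of the patterns match.
my $text = 'x' x 100_000;

# One regex with alternation.
sub match_alt {
    return $text =~ /a|b|c|d/ ? 1 : 0;
}

# Four separate regexes joined with the short-circuiting || operator.
sub match_mult {
    return ($text =~ /a/ || $text =~ /b/ || $text =~ /c/ || $text =~ /d/) ? 1 : 0;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, { alt => \&match_alt, mult => \&match_mult });
```

The relative ranking depends heavily on the Perl version and on the data, so the numbers are only meaningful for the system they were measured on.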

Mumia W. (on aioe) · Dec 17, 2006

Is there a way to override the value returned by a capture so
that $1 is set not to the characters matched by the parentheses
but some arbitrary string or number ? I'm thinking of something
like this :

$ perl -e '
sub what ($)
{
if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
{
print "the string \"$_[0]\" matched a \"$1\"\n";
}
}

what ("123");
what ("abc123");
'
the string "123" matched a "number"
the string "abc123" matched a "word"

Take a look at this:

#!/usr/local/bin/perl5.9.4

use strict;
use warnings;
local $\ = "\n";
printf "Version: %vd\n", $^V;
print whatsis2('3234');
print whatsis2('3234');
print whatsis2('3234');
print whatsis2('3234');
print whatsis2('3234');

my $what;

sub whatsis2 {
    local $_ = $_[0];
    $what = 'unknown';
    m/(\d+(?{$what = 'integer'}))|([[:alpha:]]+(?{$what = 'word'}))/;
    $what;
}

sub whatsis {
    local $_ = $_[0];
    $what = 'unknown';
    $what = 'integer' if /^\d+$/;
    $what = 'word' if /^[[:alpha:]]+$/;
    $what;
}

__END__

"Whatsis2" attempts to get close to what you are looking for, but I like
the much less obfuscated "whatsis."

I decided to print the same thing five times because "whatsis2" had an
interesting bug when $what was defined within "whatsis2."

Andre Majorel · Dec 18, 2006

It may seem counter-intuitive but using separate matches is usually
faster, in fact this has been frequently asked:

perldoc -q "How do I efficiently match many regular expressions at once"

Just checked and you're right ! On my test case, /a|b|c|d/ is
about twenty times slower than (/a/ || /b/ || /c/ || /d/). Gack !

A similar program in C gives the expected results, namely
/a\|b\|c\|d/ being about four times faster than (/a/ || /b/ ||
/c/ || /d/).

From comparing the execution times of the Perl and C
implementations, it would appear that

- Perl is magically twice as fast as regexec() at evaluating
(/a/ || /b/ || /c/ || /d/). I suspect Perl gathers information
on the string while evaluating /a/ and uses that to save time
on /b/ through /d/.

- Perl handles the alternation operator ("|") in a remarkably
inefficient way, about ten times slower than regexec() on my
system.

Thanks.

A. Sinan Unur · Dec 18, 2006

Just checked and you're right ! On my test case, /a|b|c|d/ is
about twenty times slower than (/a/ || /b/ || /c/ || /d/). Gack !

You know that logical operators short-circuit, right?

A similar program in C gives the expected results, namely
/a\|b\|c\|d/ being about four times faster than (/a/ || /b/ ||
/c/ || /d/).

I would be very interested in seeing this 'similar' C program given that
regular expressions are not part of C.

- Perl is magically twice as fast as regexec() at evaluating
(/a/ || /b/ || /c/ || /d/). I suspect Perl gathers information
on the string while evaluating /a/ and uses that to save time
on /b/ through /d/.

It is not magic. If /a/ matches, then the rest of the matches don't have to
be tried.

- Perl handles the alternation operator ("|") in a remarkably
inefficient way, about ten times slower than regexec() on my
system.

What is regexec?

Sinan

Andre Majorel · Dec 18, 2006

Andre said:

Is there a way to override the value returned by a capture so
that $1 is set not to the characters matched by the parentheses
but some arbitrary string or number ?


How about this:

perl -e '
sub what ($)
{
if ($_[0] =~ /^(?:\d+(?{"number"})|\w+(?{"word"}))$/)
{
print qq(the string "$_[0]" matched a "$^R" \n);
}
}
what ("123");
what ("abc123");
'

Thanks, this looks interesting. Do you have any idea how likely
it is to be "changed or deleted without notice", as perlre(1)
puts it ?

(Be sure to use two anchors '^', and '$' in your pattern)

Even if my regexp is not anchored ?
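
To see what the anchors change, here is a small sketch that runs the same (?{...}) pattern with and without them. The helper name classify and its $anchored flag are made up for illustration. Without anchors the match can succeed on a prefix, so the first alternative that matches anywhere wins.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a string with an embedded-code regex.
# $anchored selects the anchored or the unanchored variant of the pattern.
# Returns the value left in $^R by the last (?{...}) block on the
# successful match path, or undef when there is no match.
sub classify {
    my ($s, $anchored) = @_;
    my $matched = $anchored
        ? $s =~ /^(?:\d+(?{"number"})|\w+(?{"word"}))$/
        : $s =~ /(?:\d+(?{"number"})|\w+(?{"word"}))/;
    return $matched ? $^R : undef;
}

for my $s ('123', 'abc123', '123abc') {
    printf qq(%-8s anchored: %-8s unanchored: %s\n),
        $s, classify($s, 1), classify($s, 0);
}
```

For "123abc" the anchored pattern has to give up on \d+ and fall back to \w+, so it reports "word", while the unanchored pattern is satisfied by the leading "123" and reports "number". So an unanchored regexp still sets $^R, but the answer describes the first substring that matched, not the whole string.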

Ben Morrow · Dec 18, 2006

A. Sinan Unur said:
I would be very interested in seeing this 'similar' C program given that
regular expressions are not part of C.

They are part of POSIX, however.

It is not magic. If /a/ matches, then the rest of the matches don't have to
be tried.

I think it's highly unlikely that Perl's regular expression engine is
slower than POSIX' in a case like this. I strongly suspect that what
your benchmark is showing is that C is faster than Perl: that is, you're
not actually comparing the speeds of the matching, as the rest of Perl
is swamping the time taken to perform the match.

FWIW, a *lot* of work has gone into alternations in the development
version of Perl, so when 5.10 is released I wouldn't be surprised if
/a|b|c|d/ is faster than /a/||/b/||/c/||/d/. Note that when benchmarking
you need to test cases where none of the patterns match: this is where
the new code is most likely to win, as Perl may be able to determine
straight off that none of the alternations could possibly match.

What is regexec?

regexec(3), in <regex.h>, POSIX regular expressions. What Perl's regexen
are (very distantly) based on.

Ben

Ben Morrow · Dec 18, 2006

Thanks, this looks interesting. Do you have any idea how likely
it is to be "changed or deleted without notice", as perlre(1)
puts it ?

Not at all. That notice has been there since the code assertions were
invented, but enough people use them now that p5p are not going to
be able to remove them.

Ben

Mirco Wahab · Dec 18, 2006

Mirco Wahab wrote:

Oops...

$> perl -e '
sub what ($) {
my %u;
$u{$-[0]} = "number" while $_[0] =~ /([0-9]+)/g;
$u{$-[0]} = "word" while $_[0] =~ /([A-Za-z]+)/g;
map $u{$_}, sort keys %u;

must be:
... sort {$a<=>$b} keys %u;

(the default string sort goes wrong as soon as an offset reaches 10 ;-)

Regards

Mirco
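
A self-contained sketch of the corrected idea follows: record the start offset of every token via $-[0], then list the token kinds in offset order. The sub name what_kinds and the sample string are made up; the numeric comparison in sort is the point of the correction.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the kinds of tokens found in a string,
# in the order in which the tokens occur.
sub what_kinds {
    my ($s) = @_;
    my %kind_at;    # start offset of a token => its kind

    # After each successful match, $-[0] holds the start offset of the match.
    $kind_at{ $-[0] } = 'number' while $s =~ /[0-9]+/g;
    $kind_at{ $-[0] } = 'word'   while $s =~ /[A-Za-z]+/g;

    # Sort the offsets numerically; a plain string sort would put 13 before 3.
    return map { $kind_at{$_} } sort { $a <=> $b } keys %kind_at;
}

my @kinds = what_kinds('12 ab 345678 cd');
print "@kinds\n";    # prints "number word number word"
```

In the sample string the tokens start at offsets 0, 3, 6 and 13. A default string sort would order them 0, 13, 3, 6 and report "number word word number".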

Andre Majorel · Dec 18, 2006

http://www.teaser.fr/~amajorel/regexp-alt/

On my system, the execution times in seconds for 10,000 records are :

    C     Perl   Perl/C   What
 1.21    53.05   43.8     "alt"   (/a\|b\|c\|d/ for C, /a|b|c|d/ for Perl)
 1.21     2.45    2.02    "class" (/[abcd]/)
 4.81     2.39    0.497   "mult"  (/a/ || /b/ || /c/ || /d/)

This is Glibc 2.3.6 and Perl 5.8.8 on Linux i386.

It is not magic. If /a/ matches, then the rest of the matches don't have to
be tried.

But it doesn't, of course.

I think it's highly unlikely that Perl's regular expression engine is
slower than POSIX' in a case like this. I strongly suspect that what
your benchmark is showing is that C is faster than Perl: that is, you're
not actually comparing the speeds of the matching, as the rest of Perl
is swamping the time taken to perform the match.

Perl is about 22 times slower on /a|b|c|d/ than on /[abcd]/ and the
surrounding code is identical. The Perl regexp engine clearly
has a problem with "|".

FWIW, a *lot* of work has gone into alternations in the development
version of Perl, so when 5.10 is released I wouldn't be surprised if
/a|b|c|d/ is faster than /a/||/b/||/c/||/d/.

Thanks, that's good to know. Are there binary snapshots for
Linux i386 ?

Note that when benchmarking you need to test cases where none
of the patterns match: this is where the new code is most
likely to win, as Perl may be able to determine straight off
that none of the alternations could possibly match.

That is the case.

John W. Krahn · Dec 18, 2006

Andre said:
Perl is about 22 times slower on /a|b|c|d/ than on /[abcd]/ and the
surrounding code is identical. The Perl regexp engine clearly
has a problem with "|".

That is a known problem on current versions of Perl. It will be fixed in the
next release (5.10).

John

