Controlling the value returned in $1

Andre Majorel

Is there a way to override the value returned by a capture so
that $1 is set not to the characters matched by the parentheses
but to some arbitrary string or number? I'm thinking of something
like this:

$ perl -e '
sub what ($)
{
    if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
    {
        print "the string \"$_[0]\" matched a \"$1\"\n";
    }
}

what ("123");
what ("abc123");
'
the string "123" matched a "number"
the string "abc123" matched a "word"
 
Michele Dondi

> Is there a way to override the value returned by a capture so
> that $1 is set not to the characters matched by the parentheses
> but some arbitrary string or number? I'm thinking of something
> like this:

This smells like an XY problem. Why do you want to do this?

> $ perl -e '
> sub what ($)

There is hardly any need for prototypes, even less so in a minimal
example. There is no need for a sub at all, to tell the truth...

> {
>     if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
>     {
>         print "the string \"$_[0]\" matched a \"$1\"\n";

Incidentally, you can use alternate delimiters:

print qq|the string "$_[0]" matched a "$1"\n|;

> the string "123" matched a "number"
> the string "abc123" matched a "word"

Usual answer: don't use one regex where two (or more) would better fit
the bill.


Michele
 
Paul Lalli

Andre said:
> Is there a way to override the value returned by a capture so
> that $1 is set not to the characters matched by the parentheses
> but some arbitrary string or number?

What is it you're actually *trying* to do, that you've decided this is
the correct solution to?

> I'm thinking of something like this:
>
> $ perl -e '
> sub what ($)
> {
>     if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
>     {
>         print "the string \"$_[0]\" matched a \"$1\"\n";
>     }
> }
>
> what ("123");
> what ("abc123");
> '
> the string "123" matched a "number"
> the string "abc123" matched a "word"

perl -le'
    my @type_of = ( [qr/\d+/ => q{integer}], [qr/\w+/ => q{word}] );
    for my $s (qw/123 abc123/) {
        for my $t (@type_of) {
            if ($s =~ /^$t->[0]$/) {
                print "$s matches a $t->[1]";
                last;
            }
        }
    }
'
 
Andre Majorel

> What is it you're actually *trying* to do, that you've decided this is
> the correct solution to?

The goal is to know whether a relatively long string matches one
of several regexps (that's the easy part) and, if possible,
which regexp it matched.

> perl -le'
>     my @type_of = ( [qr/\d+/ => q{integer}], [qr/\w+/ => q{word}] );
>     for my $s (qw/123 abc123/) {
>         for my $t (@type_of) {
>             if ($s =~ /^$t->[0]$/) {
>                 print "$s matches a $t->[1]";
>                 last;
>             }
>         }
>     }

But then, you scan the data once for each regexp. Wouldn't the
execution time of /r0/ || /r1/ || ... || /rN/ approach N times
the execution time of /r0|r1|...|rN/ as N gets bigger?
 
Xicheng Jia

Andre said:
> Is there a way to override the value returned by a capture so
> that $1 is set not to the characters matched by the parentheses
> but some arbitrary string or number? I'm thinking of something
> like this:
>
> $ perl -e '
> sub what ($)
> {
>     if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
>     {
>         print "the string \"$_[0]\" matched a \"$1\"\n";
>     }
> }
>
> what ("123");
> what ("abc123");
> '

How about this:

perl -e '
sub what ($)
{
    if ($_[0] =~ /^(?:\d+(?{"number"})|\w+(?{"word"}))$/)
    {
        print qq(the string "$_[0]" matched a "$^R" \n);
    }
}
what ("123");
what ("abc123");
'

(Be sure to anchor your pattern with both '^' and '$'.)

Regards,
Xicheng
 
John W. Krahn

Andre said:
> The goal is to know whether a relatively long string matches one
> of several regexps (that's the easy part) and, if possible,
> which regexp it matched.

If you have a single long string with many matches, then you may want to
study the string first:

perldoc -f study

>> perl -le'
>>     my @type_of = ( [qr/\d+/ => q{integer}], [qr/\w+/ => q{word}] );
>>     for my $s (qw/123 abc123/) {
>>         for my $t (@type_of) {
>>             if ($s =~ /^$t->[0]$/) {
>>                 print "$s matches a $t->[1]";
>>                 last;
>>             }
>>         }
>>     }
>
> But then, you scan the data once for each regexp. Wouldn't the
> execution time of /r0/ || /r1/ || ... || /rN/ approach N times
> the execution time of /r0|r1|...|rN/ as N gets bigger?

It may seem counter-intuitive, but using separate matches is usually
faster. In fact, this is a frequently asked question:

perldoc -q "How do I efficiently match many regular expressions at once"



John
 
Mumia W. (on aioe)

> Is there a way to override the value returned by a capture so
> that $1 is set not to the characters matched by the parentheses
> but some arbitrary string or number? I'm thinking of something
> like this:
>
> $ perl -e '
> sub what ($)
> {
>     if ($_[0] =~ /((?"integer"\d+)|(?"word"\w+))/) # Fictitious syntax
>     {
>         print "the string \"$_[0]\" matched a \"$1\"\n";
>     }
> }
>
> what ("123");
> what ("abc123");
> '
> the string "123" matched a "number"
> the string "abc123" matched a "word"

Take a look at this:

#!/usr/local/bin/perl5.9.4

use strict;
use warnings;
local $\ = "\n";
printf "Version: %vd\n", $^V;
print whatsis2('3234');
print whatsis2('3234');
print whatsis2('3234');
print whatsis2('3234');
print whatsis2('3234');

# Shared by both subs; deliberately declared outside them.
my $what;

# Sets $what from code blocks embedded in the regex.
sub whatsis2 {
    local $_ = $_[0];
    $what = 'unknown';
    m/(\d+(?{$what = 'integer'}))|([[:alpha:]]+(?{$what = 'word'}))/;
    $what;
}

# Same idea with two plain, anchored matches.
sub whatsis {
    local $_ = $_[0];
    $what = 'unknown';
    $what = 'integer' if /^\d+$/;
    $what = 'word' if /^[[:alpha:]]+$/;
    $what;
}

__END__

"Whatsis2" attempts to get close to what you are looking for, but I like
the much less obfuscated "whatsis."

I decided to print the same thing five times because "whatsis2" had an
interesting bug when $what was defined within "whatsis2."
 
Andre Majorel

> It may seem counter-intuitive but using separate matches is usually
> faster, in fact this has been frequently asked:
>
> perldoc -q "How do I efficiently match many regular expressions at once"

Just checked and you're right! On my test case, /a|b|c|d/ is
about twenty times slower than (/a/ || /b/ || /c/ || /d/). Gack!

A similar program in C gives the expected results, namely
/a\|b\|c\|d/ being about four times faster than (/a/ || /b/ ||
/c/ || /d/).

From comparing the execution times of the Perl and C
implementations, it would appear that

- Perl is magically twice as fast as regexec() at evaluating
(/a/ || /b/ || /c/ || /d/). I suspect Perl gathers information
on the string while evaluating /a/ and uses that to save time
on /b/ through /d/.

- Perl handles the alternation operator ("|") in a remarkably
inefficient way, about ten times slower than regexec() on my
system.

Thanks.
 
A. Sinan Unur

> Just checked and you're right! On my test case, /a|b|c|d/ is
> about twenty times slower than (/a/ || /b/ || /c/ || /d/). Gack!

You know that logical operators short-circuit, right?

> A similar program in C gives the expected results, namely
> /a\|b\|c\|d/ being about four times faster than (/a/ || /b/ ||
> /c/ || /d/).

I would be very interested in seeing this 'similar' C program, given that
regular expressions are not part of C.

> - Perl is magically twice as fast as regexec() at evaluating
> (/a/ || /b/ || /c/ || /d/). I suspect Perl gathers information
> on the string while evaluating /a/ and uses that to save time
> on /b/ through /d/.

It is not magic. If /a/ matches, then the rest of the matches don't have to
be tried.

> - Perl handles the alternation operator ("|") in a remarkably
> inefficient way, about ten times slower than regexec() on my
> system.

What is regexec?

Sinan
 
Andre Majorel

>> Is there a way to override the value returned by a capture so
>> that $1 is set not to the characters matched by the parentheses
>> but some arbitrary string or number?
>
> How about this:
>
> perl -e '
> sub what ($)
> {
>     if ($_[0] =~ /^(?:\d+(?{"number"})|\w+(?{"word"}))$/)
>     {
>         print qq(the string "$_[0]" matched a "$^R" \n);
>     }
> }
> what ("123");
> what ("abc123");
> '

Thanks, this looks interesting. Do you have any idea how likely
it is to be "changed or deleted without notice", as perlre(1)
puts it?

> (Be sure to use two anchors '^', and '$' in your pattern)

Even if my regexp is not anchored?
 
Ben Morrow

A. Sinan Unur said:
> I would be very interested in seeing this 'similar' C program given that
> regular expressions are not part of C.

They are part of POSIX, however.

> It is not magic. If /a/ matches, then the rest of the matches don't have to
> be tried.

I think it's highly unlikely that Perl's regular expression engine is
slower than POSIX's in a case like this. I strongly suspect that what
your benchmark is showing is that C is faster than Perl: that is, you're
not actually comparing the speeds of the matching, as the rest of Perl
is swamping the time taken to perform the match.

FWIW, a *lot* of work has gone into alternations in the development
version of Perl, so when 5.10 is released I wouldn't be surprised if
/a|b|c|d/ is faster than /a/||/b/||/c/||/d/. Note that when benchmarking
you need to test cases where none of the patterns match: this is where
the new code is most likely to win, as Perl may be able to determine
straight off that none of the alternations could possibly match.

> What is regexec?

regexec(3), in <regex.h>, POSIX regular expressions. What Perl's regexen
are (very distantly) based on.

Ben
 
Ben Morrow

> Thanks, this looks interesting. Do you have any idea how likely
> it is to be "changed or deleted without notice", as perlre(1)
> puts it?

Not at all. That notice has been there since the code assertions were
invented, but enough people use them now that p5p are not going to
be able to remove them.

Ben
 
Mirco Wahab

Mirco Wahab wrote:

Oops...

> $> perl -e '
> sub what ($) {
>     my %u;
>     $u{$-[0]} = "number" while $_[0] =~ /([0-9]+)/g;
>     $u{$-[0]} = "word" while $_[0] =~ /([A-z]+)/g;
>     map $u{$_}, sort keys %u;

must be:

    ... sort {$a<=>$b} keys %u;

(otherwise the default string sort misorders the offsets as soon as one of them reaches 10 ;-)


Regards

Mirco
 
Andre Majorel

http://www.teaser.fr/~amajorel/regexp-alt/

On my system, the execution times in seconds for 10,000 records are:

     C    Perl   Perl/C   What
  1.21   53.05   43.8     "alt"   (/a\|b\|c\|d/ for C, /a|b|c|d/ for Perl)
  1.21    2.45    2.02    "class" (/[abcd]/)
  4.81    2.39    0.497   "mult"  (/a/ || /b/ || /c/ || /d/)

This is Glibc 2.3.6 and Perl 5.8.8 on Linux i386.

But it doesn't, of course.
> I think it's highly unlikely that Perl's regular expression engine is
> slower than POSIX's in a case like this. I strongly suspect that what
> your benchmark is showing is that C is faster than Perl: that is, you're
> not actually comparing the speeds of the matching, as the rest of Perl
> is swamping the time taken to perform the match.

Perl is 25 times slower on /a|b|c|d/ than on /[abcd]/ and the
surrounding code is identical. The Perl regexp engine clearly
has a problem with "|".

> FWIW, a *lot* of work has gone into alternations in the development
> version of Perl, so when 5.10 is released I wouldn't be surprised if
> /a|b|c|d/ is faster than /a/||/b/||/c/||/d/.

Thanks, that's good to know. Are there binary snapshots for
Linux i386?

> Note that when benchmarking you need to test cases where none
> of the patterns match: this is where the new code is most
> likely to win, as Perl may be able to determine straight off
> that none of the alternations could possibly match.

That is the case.
 
John W. Krahn

Andre said:
> Perl is 25 times slower on /a|b|c|d/ than on /[abcd]/ and the
> surrounding code is identical. The Perl regexp engine clearly
> has a problem with "|".

That is a known problem on current versions of Perl. It will be fixed in the
next release (5.10).


John
 
