Regex: Backreferences do not work inside quantifiers?

  • Thread starter Wolfgang Thomas

Wolfgang Thomas

I have a line of the following format:
string length followed by colon followed by the actual
string.
To extract the string with the correct length I use the
following regular expression:

my $s = "3:abcd";
$s =~ /([\d]+):(.{\1})/;
print "$1\n";
print "$2\n";


However this does not match. Neither $1 nor $2 become
defined. If I replace \1 with 3 it works as expected,
I get 3 in $1 and "abc" in $2.

I have studied the "Perl Programming" book and
the active perl regex documentation, but could not
find a restriction that backreferences must not be
used inside quantifiers.

What am I doing wrong?
 

it_says_BALLS_on_your forehead

Wolfgang said:
I have a line of the following format:
string length followed by colon followed by the actual
string.
To extract the string with the correct length I use the
following regular expression:

my $s = "3:abcd";
$s =~ /([\d]+):(.{\1})/;
print "$1\n";
print "$2\n";


However this does not match. Neither $1 nor $2 become
defined. If I replace \1 with 3 it works as expected,
I get 3 in $1 and "abc" in $2.

I have studied the "Perl Programming" book and
the active perl regex documentation, but could not
find a restriction that backreferences must not be
used inside quantifiers.


I haven't studied this yet, but are you sure regexes are the best tool
for what you're doing?
 

Wolfgang Thomas

I haven't studied this yet, but are you sure regexes are the best tool
for what you're doing?

Maybe not, but still I wonder why it does not work.
 

A. Sinan Unur

I have a line of the following format:
string length followed by colon followed by the actual
string.
To extract the string with the correct length I use the
following regular expression:

my $s = "3:abcd";
$s =~ /([\d]+):(.{\1})/;

Where did you get the notion that backreferences could be used in this
way?

....
What am I doing wrong?

You are using regular expressions to solve a problem to which they are
ill-suited.

Important question: What do you want to do if the string to the right of
the colon is shorter than the length specified?

Your attempted use of .{\1} means you want the match to fail in that
case. I don't know if this matters.

#!/usr/bin/perl

use strict;
use warnings;

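while ( <DATA> ) {
    chomp;
    next unless length;
    my $colon  = index $_, ':';
    # Everything before the first colon is the length; "0 +" forces a number.
    my $length = 0 + substr $_, 0, $colon;
    # Take at most $length characters after the colon; substr silently
    # returns fewer if the line is shorter.
    my $string = substr $_, $colon + 1, $length;
    print "Length = $length\nString = $string\n";
}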


__DATA__
3:abcd
10:012345689
3:abc
5:aaa
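
If a short payload should instead count as a failure (which is what the attempted .{\1} implied), the same index/substr approach needs one extra length check. A minimal sketch, assuming a hypothetical helper called extract_exact that is not part of the code above:

```perl
use strict;
use warnings;

# Hypothetical helper: return the payload only if at least $length
# characters follow the first colon; otherwise return nothing.
sub extract_exact {
    my ($line) = @_;
    my $colon = index $line, ':';
    return if $colon < 1;                    # no colon, or nothing before it
    my $length = substr $line, 0, $colon;
    return unless $length =~ /^\d+\z/;       # prefix must be all digits
    my $rest = substr $line, $colon + 1;
    return if length($rest) < $length;       # payload too short: treat as failure
    return substr $rest, 0, $length;
}

for my $line ('3:abcd', '10:012345689', '3:abc', '5:aaa') {
    my $string = extract_exact($line);
    print "$line => ", ( defined $string ? "'$string'" : 'no match' ), "\n";
}
```

With the sample data above, the second and fourth lines are rejected rather than silently truncated.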
 

Matt Garrish

Wolfgang Thomas said:
I have a line of the following format:
string length followed by colon followed by the actual
string.

So why aren't you using split and substr?
To extract the string with the correct length I use the
following regular expression:

my $s = "3:abcd";
$s =~ /([\d]+):(.{\1})/;

\d is shorthand for a character class; why are you then putting it in one?
print "$1\n";
print "$2\n";


However this does not match. Neither $1 nor $2 become
defined. If I replace \1 with 3 it works as expected,
I get 3 in $1 and "abc" in $2.

That's because you can't dynamically assign the value. To perl it's just
braces and a comma to match. For example:

my $s = "3:a{,}bcd";
$s =~ /(\d+):(.{\1,})/;
print "$1\n";
print "$2\n";

There might be some way to do this using the extended regexes, but off the
top of my head I couldn't say, and would recommend the two functions named
above... : )

Matt
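
For the record, the split/substr route suggested above might look roughly like this. It is only a sketch: the helper name split_prefixed is made up for illustration, and the limit of 2 passed to split is an assumption that the payload may itself contain colons.

```perl
use strict;
use warnings;

# Hypothetical helper: split off the length prefix, then cut the payload.
sub split_prefixed {
    my ($s) = @_;
    # Split on the first colon only, so later colons stay in the payload.
    my ($length, $rest) = split /:/, $s, 2;
    # Take the first $length characters of what follows the colon.
    return ( $length, substr($rest, 0, $length) );
}

my ($length, $string) = split_prefixed("3:abcd");
print "$length\n";
print "$string\n";
```

Note that, like a plain substr, this returns a shorter string without complaint when the data is shorter than the stated length.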
 

Matt Garrish

Matt Garrish said:
Wolfgang Thomas said:
I have a line of the following format:
string length followed by colon followed by the actual
string.

So why aren't you using split and substr?
To extract the string with the correct length I use the
following regular expression:

my $s = "3:abcd";
$s =~ /([\d]+):(.{\1})/;

\d is shorthand for a character class; why are you then putting it in one?
print "$1\n";
print "$2\n";


However this does not match. Neither $1 nor $2 become
defined. If I replace \1 with 3 it works as expected,
I get 3 in $1 and "abc" in $2.

That's because you can't dynamically assign the value. To perl it's just
braces and a comma to match. For example:

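my $s = "3:a{,}bcd";

Correction: that test string should have been the following, because \1
still has to match the captured "3" between the literal braces:

my $s = "3:a{3,}bcd";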

Matt
 

Tad McClellan

Wolfgang Thomas said:
I have a line of the following format:
string length followed by colon followed by the actual
string.
my $s = "3:abcd";
$s =~ /([\d]+):(.{\1})/;


The square brackets serve no purpose there.

You would need the m//s modifier to handle "3:1\n34567".
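
A small sketch of the difference, using a hard-coded {3} in place of the backreference (the variable names are only illustrative):

```perl
use strict;
use warnings;

my $s = "3:1\n34567";

# Without /s, "." refuses to match the newline, so three characters
# cannot be collected after the colon and the match fails.
my $plain  = $s =~ /^\d+:(.{3})/  ? 1 : 0;

# With /s, "." matches any character, newline included.
my $dotall = $s =~ /^\d+:(.{3})/s ? 1 : 0;

print "without /s: $plain\n";
print "with /s:    $dotall\n";
```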

print "$1\n";
print "$2\n";


You should *never* use the dollar-digit variables unless you
have first ensured that the pattern match *succeeded*:

if ( $s =~ /(\d+):(.{\1})/s ) {
print "$1\n";
...
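
To see why the check matters: a failed match leaves $1 holding whatever the last successful match in that scope put there. A minimal sketch with made-up sample strings:

```perl
use strict;
use warnings;

my $first  = "7:xyz";
my $second = "no length here";

$first  =~ /^(\d+):/;     # succeeds, $1 is "7"
$second =~ /^(\d+):/;     # fails, $1 is left untouched
my $stale = $1;           # still the "7" from the first match

# Guarded form: only read $1 when this particular match succeeded.
my $length = $second =~ /^(\d+):/ ? $1 : undef;

print "unguarded: $stale\n";
print "guarded:   ", ( defined $length ? $length : 'no match' ), "\n";
```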

I have studied the "Perl Programming" book and
the active perl regex documentation,


What is the "active perl regex documentation"?

Is that different from the standard documentation for Perl?

but could not
find a restriction that backreferences must not be
used inside quantifiers.


Me either.

What am I doing wrong?


Nothing, other than attempting to use a backreference inside
of a quantifier. :)

Do it a different way, perhaps:


---------------------
#!/usr/bin/perl
use warnings;
use strict;

my($length, $string) = decompose( '3:abcd' );
print "string '$string' of length '$length'\n";

sub decompose {
    my($s) = @_;
    return() unless $s =~ s/^(\d+)://; # data does not match
    my $len = $1;
    my $str = substr $s, 0, $len;
    return($len, $str);
}
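
If the quantifier really has to come from the data, another option is to do it in two steps: capture the number first, then interpolate it into a second pattern. A sketch only, assuming a hypothetical helper named extract_two_step; unlike substr, it fails when fewer characters are available than the prefix claims.

```perl
use strict;
use warnings;

# Hypothetical helper: capture the length first, then build the
# quantifier from it in a second match.
sub extract_two_step {
    my ($s) = @_;
    return unless $s =~ /^(\d+):/g;      # /g in scalar context records pos($s)
    my $n = $1;
    return unless $s =~ /\G(.{$n})/s;    # $n is interpolated before the pattern compiles
    return ($n, $1);
}

my ($length, $string) = extract_two_step('3:abcd');
print "string '$string' of length '$length'\n";
```

Regex quantifiers have an upper bound, so for very large lengths substr is the safer tool.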
 

Wolfgang Thomas

All,

thank you for your replies. You showed me how to better solve the problem.

Nevertheless I think that this restriction (or is it a bug?) should be
documented.
 

A. Sinan Unur

thank you for your replies. You showed me how to better solve the
problem.

What way to solve what problem? Please quote some context when you reply.
Nevertheless I think that this restriction (or is it a bug?) should be
documented.

Feel free to document it.

Sinan
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Wolfgang Thomas], who wrote:
$s =~ /([\d]+):(.{\1})/;

This should match, e.g.,

123:a{123}

"{" is special in REx only in very few contexts. When working over
RExen, I tried to "fix" this misfeature (inherited from the [IMO,
completely broken] HS implementation); however, there was no way to
even insert a warning without a heavy backward-compatibility penalty.

The best one can hope for is what the latest CPerl is doing to
circumvent this misfortune: it highlights "{" differently in the
different meanings...

Hope this helps,
Ilya
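
To make the literal reading concrete, here is a sketch with the braces written as \{ and \}, which is how the pattern was being interpreted. The escaping is my own addition: newer Perl releases may warn about or reject an unescaped "{" in this position, so the explicit form is the safer thing to run.

```perl
use strict;
use warnings;

# The original pattern with its braces escaped: any one character,
# a literal "{", the text captured by group 1, then a literal "}".
my $literal = qr/(\d+):(.\{\1\})/;

for my $s ('123:a{123}', '3:a{3}', '3:abcd') {
    if ( $s =~ $literal ) {
        print "'$s' matches: \$1='$1' \$2='$2'\n";
    }
    else {
        print "'$s' does not match\n";
    }
}
```

The first two strings match, with the braces ending up inside $2; the OP's "3:abcd" does not.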
 

Wolfgang Thomas

Ilya said:
$s =~ /([\d]+):(.{\1})/;

This should match, e.g.,

123:a{123}

"{" is special in REx only in very few contexts. When working over
RExen, I tried to "fix" this misfeature (inherited from the [IMO,
completely broken] HS implementation); however, there was no way to
even insert a warning without a heavy backward-compatibility penalty.

The best one can hope for is what the latest CPerl is doing to
circumvent this misfortune: it highlights "{" differently in the
different meanings...

Hope this helps,

This was in fact very helpful. Thanks a lot.
 

Tad McClellan

Ilya Zakharevich said:
[A complimentary Cc of this posting was sent to
Wolfgang Thomas], who wrote:
$s =~ /([\d]+):(.{\1})/;

This should match, e.g.,

123:a{123}

"{" is special in REx only in very few contexts.


Aha!

So it is only incompletely documented (from perlre.pod):

The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character.)

Looks like the OP's use of curly was in one of those "other" contexts...
 

John W. Krahn

Wolfgang said:
I have a line of the following format:
string length followed by colon followed by the actual
string.
To extract the string with the correct length I use the
following regular expression:

my $s = "3:abcd";
$s =~ /([\d]+):(.{\1})/;
print "$1\n";
print "$2\n";


However this does not match. Neither $1 nor $2 become
defined. If I replace \1 with 3 it works as expected,
I get 3 in $1 and "abc" in $2.

If you didn't have that colon in the way you could use unpack():

$ perl -le'
my $s = "3:abcd";
print unpack "A/A*", $s;
'
:ab



John
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Tad McClellan], who wrote:
So it is only incompletely documented (from perlre.pod):

The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character.)

As usual, when documenting a historical misfeature, it is better to
insert an f-word (well, a c-word in this case ;-):

(CURRENTLY, if a curly bracket occurs in any other context, it is treated
as a regular character.)

Yours,
Ilya
 

Charles DeRykus

Wolfgang said:
I have a line of the following format:
string length followed by colon followed by the actual
string.
To extract the string with the correct length I use the
following regular expression:

my $s = "3:abcd";
$s =~ /([\d]+):(.{\1})/;
print "$1\n";
print "$2\n";


However this does not match. Neither $1 nor $2 become
defined. If I replace \1 with 3 it works as expected,
I get 3 in $1 and "abc" in $2.

I have studied the "Perl Programming" book and
the active perl regex documentation, but could not
find a restriction that backreferences must not be
used inside quantifiers.

What am I doing wrong?

An extended regex possibility:

my $pos;
if ( $s =~ /(\d+):(?{ $pos = pos })/ ) {
    print "count=$1 substring=", substr($s, $pos, $1), "\n";
}
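
Along the same lines, the postponed-subexpression construct (??{ ... }) can build the quantifier during the match itself. This is a sketch under assumptions, not a recommendation: the helper name extract_postponed is hypothetical, and perlre has long described these code constructs with caveats, so check the documentation for your Perl version before relying on it.

```perl
use strict;
use warnings;

# Hypothetical helper using (??{ ... }): the block runs while matching,
# and the string it returns is compiled and matched as a pattern at
# that point. By then group 1 already holds the length.
sub extract_postponed {
    my ($s) = @_;
    return unless $s =~ /^(\d+):((??{ ".{$1}" }))/s;
    return ($1, $2);
}

my ($count, $substring) = extract_postponed('3:abcd');
print "count=$count substring=$substring\n";
```

Unlike the substr version, this fails outright when the string after the colon is shorter than the stated length.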
 

Xicheng

John said:
Wolfgang said:
I have a line of the following format:
string length followed by colon followed by the actual
string.
To extract the string with the correct length I use the
following regular expression:

my $s = "3:abcd";
$s =~ /([\d]+):(.{\1})/;
print "$1\n";
print "$2\n";


However this does not match. Neither $1 nor $2 become
defined. If I replace \1 with 3 it works as expected,
I get 3 in $1 and "abc" in $2.

If you didn't have that colon in the way you could use unpack():

$ perl -le'
my $s = "3:abcd";
print unpack "A/A*", $s;
'
:ab

This behavior of unpack() is really interesting :), but I think he can
skip that colon by adding an 'x', like:

$ perl -le'
my $s = "3:abcd";
print unpack "Ax/A*", $s;
'
===print====
abc
=========

Xicheng
 

Xicheng

Xicheng said:
John said:
Wolfgang said:
I have a line of the following format:
string length followed by colon followed by the actual
string.
To extract the string with the correct length I use the
following regular expression:

my $s = "3:abcd";
$s =~ /([\d]+):(.{\1})/;
print "$1\n";
print "$2\n";


However this does not match. Neither $1 nor $2 become
defined. If I replace \1 with 3 it works as expected,
I get 3 in $1 and "abc" in $2.

If you didn't have that colon in the way you could use unpack():

$ perl -le'
my $s = "3:abcd";
print unpack "A/A*", $s;
'
:ab

This behavior of unpack() is really interesting :), but I think he can
skip that colon by adding an 'x', like:

$ perl -le'
my $s = "3:abcd";
print unpack "Ax/A*", $s;
'
===print====
abc
=========

After checking the "Perl Pocket Reference", I found I don't even need
this '*', and I can use a number to replace 'x' because of the way Perl
handles "numeric+strings" (the "3:" picked up by A2 is read as the number 3):

print unpack "A2/A", $s;

But this is not robust, because it works only on fixed-width records,
which means the number of characters before the colon must be fixed. So
this cannot handle:

$s = "10:abcdefghijk";

which should use:

print unpack "A3/A", $s;

Xicheng
 
