Capturing a Repeated Group

perrog

Hi!

I'm new to Perl and regular expressions but an experienced programmer. I'm
trying to match a formatted number such as 123,456,789 and convert it into an
integer. I thought the following would do it, but it doesn't.

$_ = "1,234,567,890";
my @parts;
(@parts = /(\d{1,3})(?:,(\d{3}))*/g) && do {
my $number = 0;
$number = $number * 1000 + $_ foreach (@parts);
print "$number\n";
};

The problem is that this expression doesn't save repeated groups: it
discards every captured repetition except the last one (so it only works
for numbers like "123,456"). If I instead match with the pattern below it
works, but I can't understand the logic behind it (if there is any logic).

# not really equivalent to the above, but almost
(@parts = /(\d{1,3})(?:,(\d{3})+)/g) && do {
my $number = 0;
$number = $number * 1000 + $_ foreach (@parts);
print "$number\n";
};

My second question is: if I can capture repeated groups, how do I know
how many repeats there were? Is there any built-in/special variable
other than $1, $2, etc., @+, @- or the returned array that I'm not
aware of?

Or can't I do it with REs? Is this a job for Parse::RecDescent? Life would
be more compact with regular expressions. :)

my $number_parser = Parse::RecDescent->new(q(
parse: digits

digits: /\d{1,3}/ <skip:''> digits_part(s?)
{
my $number = $item[1];
$number = $number * 1000 + $_ foreach (@{$item[3]});
$number;
}

digits_part: "," <skip:''> /\d\d\d/
);

$number_parser->parse("1,234,567,890"); # returns 1234567890

I've searched around, including textbooks, but could not find any
details on how to capture repeated groups (if it is possible at all).

Thanks for any hints.
Regards,
Roggan
 
Paul Lalli

perrog said:
I'm new to Perl and regular expressions but an experienced programmer. I'm
trying to match a formatted number such as 123,456,789 and convert it into an
integer.

I have a dumb question. Why aren't you just doing:

$_ = "1,234,567,890";
s/,//g;

or
$_ = "1,234,567,890";
tr/,//d;

?

As for your generic question of "how do I capture individual instances
of repeated captured submatches", I'm afraid I don't know the
answer...

Paul Lalli
 
Xicheng Jia

perrog said:
Hi!

I'm new to Perl and regular expressions but an experienced programmer. I'm
trying to match a formatted number such as 123,456,789 and convert it into an
integer. I thought the following would do it, but it doesn't.

$_ = "1,234,567,890";
my @parts;
(@parts = /(\d{1,3})(?:,(\d{3}))*/g) && do {

you probably want this:

@parts = /(\d{1,3})(?=,(?:\d{3}))*/g .....

Regards,
Xicheng
 
Gunnar Hjalmarsson

Xicheng said:
you probably want this:

@parts = /(\d{1,3})(?=,(?:\d{3}))*/g .....

Let's see:

C:\home>type test.pl
use warnings;
$_ = '1,234,567,890';
@parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
print @parts, "\n";

C:\home>test.pl
(?=,(?:\d{3}))* matches null string many times in regex; marked by
<-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
line 3.
1234567890

Generates a warning; not so good...

C:\home>type test.pl
use warnings;
$_ = '1,234,567,890';
@parts = /(\d{1,3})(?=(?:,\d{3}|$))/g;
print @parts, "\n";

C:\home>test.pl
1234567890

That's better. :)
 
Xicheng Jia

Gunnar said:
Let's see:

C:\home>type test.pl
use warnings;
$_ = '1,234,567,890';
@parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
print @parts, "\n";

C:\home>test.pl
(?=,(?:\d{3}))* matches null string many times in regex; marked by
<-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
line 3.
1234567890

Err, I just copied and pasted the OP's code and didn't test it, but the
point was there: capture only when needed.

Gunnar said:
Generates a warning; not so good...

C:\home>type test.pl
use warnings;
$_ = '1,234,567,890';
@parts = /(\d{1,3})(?=(?:,\d{3}|$))/g;
print @parts, "\n";
C:\home>test.pl
1234567890

That's better. :)

I have not used (?=...) with '*' myself, and had not noticed that
before, but I think (?=...)? should be fine, like:

@parts = /(\d{1,3})(?=,\d\d\d)?/g;

You don't need the extra parentheses inside the (?= ...), right? :)

Regards,
Xicheng
 
Xicheng Jia

The latter won't work for a target string like:

$_ = '123,456,789';

(i.e., anything with an odd number of comma delimited substrings).

You can try global match (//g) in scalar context:

$_ = "1,234,567,890";

my $n = 0;
while (/\G(\d{1,3})(?:,|$)/g)

this should be the same as:

while (/\G(\d{1,3}),?/g)
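For reference, a minimal sketch of the complete loop. The loop body is not shown in the fragment above, so the accumulation step (borrowed from the original post) and the helper name parse_grouped are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: walk the string one digit group at a time and
# accumulate the value the same way the original post does.
sub parse_grouped {
    my ($text) = @_;
    my $n = 0;
    # In scalar context, //g resumes at pos($text); \G anchors the
    # match at exactly that position, so no characters are skipped.
    while ($text =~ /\G(\d{1,3}),?/g) {
        $n = $n * 1000 + $1;
    }
    return $n;
}

print parse_grouped("1,234,567,890"), "\n";   # prints 1234567890
```

The two loop conditions are not quite interchangeable on malformed input: on a string such as "1234" the `,?` form matches "123" and then "4", while the `(?:,|$)` form matches nothing. Neither form checks that the groups after the first have exactly three digits.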

Regards,
Xicheng
 
Gunnar Hjalmarsson

Xicheng said:
I have not used (?=...) with '*' myself, and had not noticed that
before, but I think (?=...)? should be fine, like:

@parts = /(\d{1,3})(?=,\d\d\d)?/g;

Even if that doesn't trigger a warning, I believe it's in fact the same as

@parts = /\d{1,3}/g;

Please consider:

C:\home>perl -e "print q/1,234 and 56,789/ =~ /(\d{1,3})(?=,\d\d\d)?/g"
123456789
C:\home>

Xicheng said:
You don't need the extra parentheses inside the (?= ...), right? :)

Yes, in my variant above, which also does some validation, the purpose
of the inner parentheses is to group the alternations.
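
A small sketch that checks this on the same test string, comparing the optional lookahead with the bare pattern and with the validating variant. The variable names are only illustrative:

```perl
use strict;
use warnings;

my $text = '1,234 and 56,789';

# Bare pattern, no lookahead at all.
my @plain = $text =~ /(\d{1,3})/g;

my @optional;
{
    # A quantifier on a zero-width assertion triggers a 'regexp'
    # warning, so silence that category for this one pattern.
    no warnings 'regexp';
    @optional = $text =~ /(\d{1,3})(?=,\d\d\d)?/g;
}

# Validating variant: a group only counts when it is followed by
# a comma plus three digits, or by the end of the string.
my @checked = $text =~ /(\d{1,3})(?=(?:,\d{3}|$))/g;

print "plain:    @plain\n";      # 1 234 56 789
print "optional: @optional\n";   # 1 234 56 789
print "checked:  @checked\n";    # 1 56 789
```

An optional lookahead can always succeed by matching zero times, which is why it filters nothing. Keeping the lookahead but dropping only the quantifier gives a different pattern, because the lookahead then becomes mandatory.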
 
perrog

Thanks for the hint!

Paul Lalli said:
I have a dumb question. Why aren't you just doing:

$_ = "1,234,567,890";
s/,//g;

or
$_ = "1,234,567,890";
tr/,//d;


The short answer is that I'm "inchworming" my way through the string.
The text may contain sentences with commas, and is not a single number
string. And after the number is matched, I continue with other
matches.

Correct me if I'm wrong, but for my scenario I think a substitution
requires two matches, first a hit, then a substitution, like so:
$_ = "1,234,456,789";
/\d{1,3}(?:,\d\d\d)*/g && do {
my $number= $&;
$number =~ s/,//g;
print "$number\n";
};

But if the number parts could be eaten up in one regexp, it is
unnecessary to use two. :)
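
A rough sketch of what such an inchworm scan could look like with \G and the /gc flags, assuming one regex recognizes the whole number and tr/// (not a second regex match) strips the commas. The token names NUMBER, IDENT and OTHER and the sample text are made up for illustration:

```perl
use strict;
use warnings;

my $text = "total 1,234,567 items, next 42";
my @tokens;

for ($text) {
    while (1) {
        # /c keeps pos() where it was when a match fails, so the next
        # alternative is tried at the same position.
        if (/\G\s+/gc) { next }
        if (/\G(\d{1,3}(?:,\d{3})+|\d+)/gc) {
            # Copy the matched text, then delete the commas in the copy.
            (my $number = $1) =~ tr/,//d;
            push @tokens, [ NUMBER => $number ];
            next;
        }
        if (/\G([A-Za-z_]\w*)/gc) { push @tokens, [ IDENT => $1 ]; next }
        if (/\G(.)/gcs)           { push @tokens, [ OTHER => $1 ]; next }
        last;   # nothing matched: end of input
    }
}

print join(' ', map { "$_->[0]($_->[1])" } @tokens), "\n";
```

With this input the scan yields IDENT(total) NUMBER(1234567) IDENT(items) OTHER(,) IDENT(next) NUMBER(42), so a comma that is part of the sentence stays a separate token.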
 
perrog

Thanks Xicheng, Hjalmarsson, Steven and Anno for your inputs!

I'm really "inchworming" my way through the string, scanning tokens
like a lexical analyzer. If it fails to scan a number like
"1,234,567,890", it continues by scanning for an identifier token
(e.g. /\w+(\d|\w)*/).

Gunnar said:
Let's see:

C:\home>type test.pl
use warnings;
$_ = '1,234,567,890';
@parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
print @parts, "\n";

C:\home>test.pl
(?=,(?:\d{3}))* matches null string many times in regex; marked by
<-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
line 3.
1234567890

Even if perl reports "matches null string many times", isn't this what I
want?
I want to match 123 or 1,234 or 1,234,567 or similar patterns.
Since the regexp starts with /\d{1,3}/ it never matches the null
string.
I can't see which inputs this fails for...?

Thanks for any hints!
 
Paul Lalli

perrog said:
The short answer is that I'm "inchworming" my way through the string.
The text may contain sentences with commas, and is not a single number
string. And after the number is matched, I continue with other
matches.

Regexp::Common is your friend.

perrog said:
Correct me if I'm wrong, but for my scenario I think a substitution
requires two matches, first a hit, then a substitution, like so:
$_ = "1,234,456,789";
/\d{1,3}(?:,\d\d\d)*/g && do {
my $number= $&;
$number =~ s/,//g;
print "$number\n";

}

But if the number parts could be eaten up in one regexp, it is
unnecessary to use two. :)

Unnecessary, maybe, but a heck of a lot more readable.

#!/opt2/perl/bin/perl
use strict;
use warnings;
use Regexp::Common qw/number/;

my @numbers;
while (<DATA>) {
push @numbers, /$RE{num}{int}{-sep=>','}/g;
}
tr/,//d for @numbers;
print join(' - ', @numbers), "\n";

__DATA__
Lorem ipsum dolor sit amet, 1,234,567,890 consectetuer 1,000
lacinia risus. 56,650,231 Duis 432 porta vehicula 8,103 ligula.

$ ./nums.pl
1234567890 - 1000 - 56650231 - 432 - 8103

Paul Lalli
 
anno4000

perrog said:
Thanks Xicheng, Hjalmarsson, Steven and Anno for your inputs!

I'm really "inchworming" my way through the string, scanning tokens
like a lexical analyzer. If it fails to scan a number like
"1,234,567,890", it continues by scanning for an identifier token
(e.g. /\w+(\d|\w)*/).

That sounds like you want a real parser, where number recognition
would be part of the general parsing process.

perrog said:
Even if perl reports "matches null string many times", isn't this what I
want?
I want to match 123 or 1,234 or 1,234,567 or similar patterns.
Since the regexp starts with /\d{1,3}/ it never matches the null
string.
I can't see which inputs this fails for...?

The pattern you're repeating is a *zero width* lookahead. Whatever
the regex engine does internally to determine if it matches, the
width of the match will be zero. That's what it's complaining about.
Repeating it zero or more times always succeeds, so the starred
lookahead does nothing; you can remove it, asterisk and all.

Anno
 
perrog

On Jul 11, 11:09 pm, "(e-mail address removed)"



this should be the same as:

while (/\G(\d{1,3}),?/g)

Ohh, now I'm beginning to see the logic... :) The
/(\d{1,3})(?:,(\d{3}))*/g regexp captured repeated productions, not
repeated groups.

So, to sum up: I can't use /(\d{1,3})(?:,(\d\d\d))*/ because the RE
engine only saves the capture from the last iteration of a repeated
group. The fix is to use the g modifier to capture repeated
productions... the subject of this thread should really have been
"capturing repeated productions", right? :)
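
A minimal sketch of that difference, using the sample number from the original post. The \G-based pattern is the one suggested earlier in the thread; the variable names are only illustrative:

```perl
use strict;
use warnings;

my $text = "1,234,567,890";

# One match with a repeated group: the second capture keeps only
# the last repetition.
my @single = $text =~ /^(\d{1,3})(?:,(\d{3}))*$/;
print "single match:     @single\n";          # 1 890

# Repeated matches with //g in list context: one capture per match,
# and the size of the list is the number of groups found.
my @parts = $text =~ /\G(\d{1,3}),?/g;
print "repeated matches: @parts\n";           # 1 234 567 890
print "count: ", scalar(@parts), "\n";        # 4

my $number = 0;
$number = $number * 1000 + $_ foreach @parts;
print "$number\n";                            # 1234567890
```

The length of the returned list also answers the "how many repeats" question without needing any special variable.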

Ideally, /(\d{1,3})|(?<=\d{1,3}),(\d\d\d)/g would work, but a
variable-length lookbehind such as (?<=\d{1,3}) is not implemented yet,
so I ended up writing:

@parts = ();
(@parts = grep { defined $_ }
m((\d{1,3})
# (?<=\d{1,3}) not implemented, use three cases
| (?<=\d),(\d\d\d)
| (?<=\d\d),(\d\d\d)
| (?<=\d\d\d),(\d\d\d)
)xg) && do {
my $number = 0;
$number = $number * 1000 + $_ foreach (@parts);
print "$number\n";
};

It uses grep to filter out the undef captures, which come from the
alternation branches that did not take part in each match.
 
