Capturing a Repeated Group

perrog · Jul 11, 2007

Hi!

I'm new to Perl and regular expressions, but I'm an experienced
programmer. I'm trying to match a formatted number such as 123,456,789
and convert it into an integer. I thought the following would do it,
but it doesn't.

$_ = "1,234,567,890";
my @parts;
(@parts = /(\d{1,3})(?:,(\d{3}))*/g) && do {
my $number = 0;
$number = $number * 1000 + $_ foreach (@parts);
print "$number\n";
};

The problem is that this expression doesn't save repeated groups: it
discards every captured repetition except the last one (so it only
works for numbers like "123,456"). If I match with the pattern below
instead, it works, but I can't understand the logic behind it (if
there is any logic):

# not really equivalent to the above, but almost
(@parts = /(\d{1,3})(?:,(\d{3})+)/g) && do {
my $number = 0;
$number = $number * 1000 + $_ foreach (@parts);
print "$number\n";
};

My second question: if I can capture repeated groups, how do I know
how many repeats there were? Is there any built-in or special variable,
other than $1, $2, etc., @+, @-, or the returned array, that I'm not
aware of?

Or can't I do it with regular expressions? Is this a job for
Parse::RecDescent? Life would be more compact with regular expressions.

use Parse::RecDescent;

my $number_parser = Parse::RecDescent->new(q(
parse: digits

digits: /\d{1,3}/ <skip:''> digits_part(s?)
{
my $number = $item[1];
$number = $number * 1000 + $_ foreach (@{$item[3]});
$number;
}

digits_part: "," <skip:''> /\d\d\d/
));

$number_parser->parse("1,234,567,890"); # returns 1234567890

I've searched around, including textbooks, but could not find any
details on how to capture repeated groups (if it is possible at all).

Thanks for any hints.
Regards,
Roggan
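
For reference, the behaviour described in this post can be reproduced with a short script. This is only a minimal sketch; the variable names and the anchored form of the pattern are chosen for illustration and are not from the original post. It contrasts a single match with a quantified group against a /g match that takes one piece per iteration.

```perl
use strict;
use warnings;

my $text = "1,234,567,890";

# One match, two capture groups: the starred group is overwritten on each
# repetition, so the second capture holds only the final "890".
my @last_only = $text =~ /^(\d{1,3})(?:,(\d{3}))*$/;
print "quantified group:    @last_only\n";   # 1 890

# With /g in list context every successful match contributes its captures,
# so matching one comma-delimited piece at a time collects all of them.
my @all = $text =~ /(\d{1,3})(?:,|$)/g;
print "one piece per match: @all\n";         # 1 234 567 890
```

The count of pieces is then simply scalar(@all); there is no special variable that records how often a quantified group repeated.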

Paul Lalli · Jul 11, 2007

I'm new to Perl and regular expressions, but I'm an experienced
programmer. I'm trying to match a formatted number such as 123,456,789
and convert it into an integer.

I have a dumb question. Why aren't you just doing:

$_ = "1,234,567,890";
s/,//g;

or
$_ = "1,234,567,890";
tr/,//d;

?

As for your generic question of "how do I capture individual instances
of repeated captured submatches", I'm afraid I don't know the
answer...

Paul Lalli

Xicheng Jia · Jul 11, 2007

Hi!

I'm new to Perl and regular expressions, but I'm an experienced
programmer. I'm trying to match a formatted number such as 123,456,789
and convert it into an integer. I thought the following would do it,
but it doesn't.

$_ = "1,234,567,890";
my @parts;
(@parts = /(\d{1,3})(?:,(\d{3}))*/g) && do {

you probably want this:

@parts = /(\d{1,3})(?=,(?:\d{3}))*/g .....

Regards,
Xicheng

Gunnar Hjalmarsson · Jul 12, 2007

Xicheng said:
you probably want this:

@parts = /(\d{1,3})(?=,(?:\d{3}))*/g .....

Let's see:

C:\home>type test.pl
use warnings;
$_ = '1,234,567,890';
@parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
print @parts, "\n";

C:\home>test.pl
(?=,(?:\d{3}))* matches null string many times in regex; marked by
<-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
line 3.
1234567890

Generates a warning; not so good...

C:\home>type test.pl
use warnings;
$_ = '1,234,567,890';
@parts = /(\d{1,3})(?=(?:,\d{3}|$))/g;
print @parts, "\n";

C:\home>test.pl
1234567890

That's better.

Xicheng Jia · Jul 12, 2007

Let's see:

C:\home>type test.pl
use warnings;
$_ = '1,234,567,890';
@parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
print @parts, "\n";

C:\home>test.pl
(?=,(?:\d{3}))* matches null string many times in regex; marked by
<-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
line 3.
1234567890

Err, I just copied and pasted the OP's code and didn't test it, but
the point stands: capture only what is needed.

Generates a warning; not so good...

C:\home>type test.pl
use warnings;
$_ = '1,234,567,890';
@parts = /(\d{1,3})(?=(?:,\d{3}|$))/g;
print @parts, "\n";
C:\home>test.pl
1234567890

That's better.

I haven't used (?=...) with '*' myself and hadn't noticed that
before, but I think (?=...)? should be fine, like:

@parts = /(\d{1,3})(?=,\d\d\d)?/g;

You don't need the extra parentheses inside the (?= ...), right?

Regards,
Xicheng

Xicheng Jia · Jul 12, 2007

The latter won't work for a target string like:

$_ = '123,456,789';

(i.e., anything with an odd number of comma delimited substrings).

You can try global match (//g) in scalar context:

$_ = "1,234,567,890";

my $n = 0;
while (/\G(\d{1,3})(?:,|$)/g)

this should be the same as:

while (/\G(\d{1,3}),?/g)

Regards,
Xicheng
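
The two loop headers above have no body in the thread as archived. A minimal sketch of how such a scalar-context //g loop might be completed follows; the helper name scan_groups, its return values, and the use of /c to keep pos() after the final failed match are assumptions made for illustration. It also suggests that the two patterns are not quite interchangeable on malformed input.

```perl
use strict;
use warnings;

# Hypothetical helper: walks $text with \G, one digit group per iteration.
# Returns (value, group count), or an empty list unless at least one group
# matched and the loop consumed the whole string. It does not check that
# later groups have exactly three digits.
sub scan_groups {
    my ($text, $piece) = @_;          # $piece: compiled pattern for one group
    my ($n, $groups) = (0, 0);
    pos($text) = 0;
    while ($text =~ /\G$piece/gc) {   # /c keeps pos() when the match fails
        $n = $n * 1000 + $1;
        $groups++;
    }
    return unless $groups && pos($text) == length $text;
    return ($n, $groups);
}

my $strict = qr/(\d{1,3})(?:,|$)/;    # group must be followed by comma or end
my $loose  = qr/(\d{1,3}),?/;         # comma is optional

my ($n, $groups) = scan_groups("1,234,567,890", $strict);
print "$n from $groups groups\n";     # 1234567890 from 4 groups

my ($s) = scan_groups("1234", $strict);
my ($l) = scan_groups("1234", $loose);
print defined $s ? $s : "undef", " vs $l\n";   # undef vs 123004
```

On well-formed input both patterns give the same value; on "1234" the looser one splits the digits into "123" and "4".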

Gunnar Hjalmarsson · Jul 12, 2007

Xicheng said:
I haven't used (?=...) with '*' myself and hadn't noticed that
before, but I think (?=...)? should be fine, like:

@parts = /(\d{1,3})(?=,\d\d\d)?/g;

Even if that doesn't trigger a warning, I believe it's in fact the same as

@parts = /\d{1,3}/g;

Please consider:

C:\home>perl -e "print q/1,234 and 56,789/ =~ /(\d{1,3})(?=,\d\d\d)?/g"
123456789
C:\home>

You don't need the extra parentheses inside the (?= ...), right?

Yes, in my variant above, which also does some validation, the purpose
of the inner parentheses is to group the alternations.

anno4000 · Jul 12, 2007

[...]

print "$n\n";

tr/,//d;

if ($n == $_)
{
print "A string of numbers is converted to a number automagically\n";
}

s/numbers/digits/;

Anno

perrog · Jul 12, 2007

Thanks for the hint!

I have a dumb question. Why aren't you just doing:

$_ = "1,234,567,890";
s/,//g;

or
$_ = "1,234,567,890";
tr/,//d;

The short answer is that I'm "inchworming" my way through the string.
The text may contain sentences with commas and is not a single number
string. After the number is matched, I continue with other
matches.

Correct me if I'm wrong, but for my scenario I think substitution
requires two steps, first a match and then a substitution, like so:
$_ = "1,234,456,789";
/\d{1,3}(?:,\d\d\d)*/g && do {
my $number = $&;
$number =~ s/,//g;
print "$number\n";
};

But if the number parts could be consumed by a single regexp, it would
be unnecessary to use two.
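
One way to picture the "inchworming" described here is a small scanner built on \G and the /gc modifiers, which try each token pattern at the current position without resetting it on failure. The sketch below is an illustration only: the tokenize helper, the NUMBER and IDENT labels, and the exact token patterns are assumptions, not code from the poster's program.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: walks the string once, trying each token pattern
# at the current position. Returns a list of [TYPE => value] pairs.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length $text) {
        if ($text =~ /\G(\d{1,3}(?:,\d{3})+|\d+)/gc) {
            (my $digits = $1) =~ tr/,//d;    # strip separators from the lexeme
            push @tokens, [ NUMBER => $digits ];
        }
        elsif ($text =~ /\G([A-Za-z_]\w*)/gc) {
            push @tokens, [ IDENT => $1 ];
        }
        else {
            # Skip a run of whitespace or any other single character,
            # so the loop always advances.
            $text =~ /\G(?:\s+|.)/gcs;
        }
    }
    return @tokens;
}

my @tokens = tokenize("Total 1,234,567 apples, 42 pears");
print join(' ', map { "$_->[0]=$_->[1]" } @tokens), "\n";
# NUMBER=... entries carry the digits with the commas removed
```

The number lexeme is still matched once and cleaned up with tr, so this is two operations per number, but the string itself is scanned only once.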

perrog · Jul 12, 2007

Thanks Xicheng, Hjalmarsson, Steven and Anno for your inputs!

I'm really "inchworming" my way through the string, scanning tokens
like a lexical analyzer. If it fails to scan a number like
"1,234,567,890", it continues by scanning for an identifier token
(e.g. /\w+(\d|\w)*/).

Let's see:

C:\home>type test.pl
use warnings;
$_ = '1,234,567,890';
@parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
print @parts, "\n";

C:\home>test.pl
(?=,(?:\d{3}))* matches null string many times in regex; marked by
<-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
line 3.
1234567890

Even if perl reports "matches null string many times", isn't this what
I want?
I want to match 123 or 1,234 or 1,234,567 or similar patterns.
Since the regexp starts with /\d{1,3}/, it never matches the null
string.
I can't see which patterns this fails for...?

Thanks for any hints!

Paul Lalli · Jul 12, 2007

The short answer is that I'm "inchworming" my way through the string.
The text may contain sentences with commas and is not a single number
string. After the number is matched, I continue with other
matches.

Regexp::Common is your friend.

Correct me if I'm wrong, but for my scenario I think substitution
requires two steps, first a match and then a substitution, like so:
$_ = "1,234,456,789";
/\d{1,3}(?:,\d\d\d)*/g && do {
my $number = $&;
$number =~ s/,//g;
print "$number\n";
};

But if the number parts could be consumed by a single regexp, it would
be unnecessary to use two.

Unnecessary, maybe, but a heck of a lot more readable.

#!/opt2/perl/bin/perl
use strict;
use warnings;
use Regexp::Common qw/number/;

my @numbers;
while (<DATA>) {
push @numbers, /$RE{num}{int}{-sep=>','}/g;
}
tr/,//d for @numbers;
print join(' - ', @numbers), "\n";

__DATA__
Lorem ipsum dolor sit amet, 1,234,567,890 consectetuer 1,000
lacinia risus. 56,650,231 Duis 432 porta vehicula 8,103 ligula.

$ ./nums.pl
1234567890 - 1000 - 56650231 - 432 - 8103

Paul Lalli

anno4000 · Jul 12, 2007

Thanks Xicheng, Hjalmarsson, Steven and Anno for your inputs!

I'm really "inchworming" my way through the string, scanning tokens
like a lexical analyzer. If it fails to scan a number like
"1,234,567,890", it continues by scanning for an identifier token
(e.g. /\w+(\d|\w)*/).

That sounds like you want a real parser, where number recognition
would be part of the general parsing process.

Even if perl reports "matches null string many times", isn't this what
I want?
I want to match 123 or 1,234 or 1,234,567 or similar patterns.
Since the regexp starts with /\d{1,3}/, it never matches the null
string.
I can't see which patterns this fails for...?

The pattern you're repeating is a *zero width* lookahead. Whatever
the regex engine does internally to determine if it matches, the
width of the match will be zero. That's what it's complaining about.
The asterisk does nothing, you can remove it.

Anno
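
The zero-width point can be checked with a short script. This is a minimal sketch using the test string from Gunnar's one-liner; the variable names are illustrative, and the regexp warning category is switched off around the quantified lookahead on the assumption that some perl versions warn about it, as seen earlier in the thread.

```perl
use strict;
use warnings;

my $text = '1,234 and 56,789';

# No lookahead at all.
my @plain = $text =~ /(\d{1,3})/g;

# A required lookahead actually constrains what may follow the digits.
my @required = $text =~ /(\d{1,3})(?=,\d{3}|$)/g;

# An optional lookahead can always "match" zero times, so it constrains
# nothing; the quantifier on a zero-width assertion is a no-op here.
my @optional;
{
    no warnings 'regexp';
    @optional = $text =~ /(\d{1,3})(?=,\d{3})?/g;
}

print "plain:    @plain\n";      # 1 234 56 789
print "optional: @optional\n";   # 1 234 56 789
print "required: @required\n";   # 1 56 789
```

In the required form "234" is dropped because it is followed by " and" rather than by a comma group or the end of the string.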

perrog · Jul 12, 2007

On Jul 11, 11:09 pm, "(e-mail address removed)" wrote:

this should be the same as:

while (/\G(\d{1,3}),?/g)

Ohh, now I'm beginning to see the logic...

The /(\d{1,3})(?:,(\d{3}))*/g regexp captured repeated productions,
not repeated groups.

So, to sum up: I can't use /(\d{1,3})(?:,(\d\d\d))*/ because the RE
engine only saves the capture from the last iteration of a repeated
group. The fix is to use the g modifier to capture repeated
productions... the subject of this thread should really have been
"capturing repeated productions", right?

Ideally, /(\d{1,3})|(?<=\d{1,3}),(\d\d\d)/g would work, but
variable-length lookbehind such as (?<=\d{1,3}) is not implemented
yet, so I ended up writing:

@parts = ();
(@parts = grep { defined $_ }
m((\d{1,3})
# (?<=\d{1,3}) not implemented, use three cases
| (?<=\d),(\d\d\d)
| (?<=\d\d),(\d\d\d)
| (?<=\d\d\d),(\d\d\d)
)xg) && do {
my $number = 0;
$number = $number * 1000 + $_ foreach (@parts);
print "$number\n";
};

It uses grep to filter out the undef captures, which come from the
alternation branches that did not take part in a given match.
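
As a point of comparison, a simpler route to the same list of parts is to capture the whole number once and split it on the commas, which also answers the earlier question about how many groups there were. This is a minimal sketch under that assumption; the helper name first_grouped_number is hypothetical, and the pattern is the one used for matching earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: finds the first grouped number in $text and returns
# its integer value followed by the individual digit groups.
sub first_grouped_number {
    my ($text) = @_;
    return unless $text =~ /(\d{1,3}(?:,\d\d\d)*)/;
    my @parts = split /,/, $1;        # one element per digit group
    my $number = 0;
    $number = $number * 1000 + $_ for @parts;
    return ($number, @parts);
}

my ($number, @parts) = first_grouped_number("about 1,234,567,890 items");
print "$number in ", scalar(@parts), " groups\n";   # 1234567890 in 4 groups
```

No undef filtering is needed here, because split returns only the pieces that were actually present.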

Dr.Ruud · Jul 12, 2007

(e-mail address removed) wrote:

I want to match 123 or 1,234 or 1,234,567 or similar patterns.

perldoc -f reverse
