Capturing a Repeated Group

Discussion in 'Perl Misc' started by perrog@gmail.com, Jul 11, 2007.

  1. Guest

    Hi!

    I'm new to Perl and regular expressions, but an experienced programmer.
    I'm trying to match a formatted number such as 123,456,789 and convert
    it into an integer. I thought the following would do, but it doesn't.

    $_ = "1,234,567,890";
    my @parts;
    (@parts = /(\d{1,3})(?:,(\d{3}))*/g) && do {
    my $number = 0;
    $number = $number * 1000 + $_ foreach (@parts);
    print "$number\n";
    };

    The problem is that this expression doesn't save repeated groups: it
    discards every captured repeat except the last one (so it only works
    for numbers like "123,456"). If I match with the pattern below instead,
    it works, but I can't understand the logic behind it (if there is any).

    # not really equivalent to the above, but almost
    (@parts = /(\d{1,3})(?:,(\d{3})+)/g) && do {
    my $number = 0;
    $number = $number * 1000 + $_ foreach (@parts);
    print "$number\n";
    };

    My second question: if I can capture repeated groups, how do I know
    how many repeats there were? Is there any built-in or special variable,
    other than $1, $2, etc., @+, @-, or the returned array, that I'm not
    aware of?

    Or can't I do it with regular expressions? Is this a job for
    Parse::RecDescent? Life would be more compact with regular expressions. :)

    use Parse::RecDescent;

    my $number_parser = Parse::RecDescent->new(q(
    parse: digits

    digits: /\d{1,3}/ <skip:''> digits_part(s?)
    {
    my $number = $item[1];
    $number = $number * 1000 + $_ foreach (@{$item[3]});
    $number;
    }

    digits_part: "," <skip:''> /\d\d\d/
    );

    $number_parser->parse("1,234,567,890"); # returns 1234567890

    I've searched around, including textbooks, but could not find any
    details on how to capture repeated groups (if it is possible at all).

    Thanks for any hints.
    Regards,
    Roggan
     
    , Jul 11, 2007
    #1
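For reference, a minimal sketch of what the two patterns in the post above return in list context. The patterns are the poster's own; the helper name `to_number` and the test strings `$even` and `$odd` are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: fold captured parts into one integer, as in the post.
sub to_number {
    my $n = 0;
    $n = $n * 1000 + $_ for @_;
    return $n;
}

my $even = "1,234,567,890";   # four comma-separated parts
my $odd  = "123,456,789";     # three comma-separated parts

# Pattern 1: the whole number is ONE match, and $2 keeps only the last repeat.
my @star = $even =~ /(\d{1,3})(?:,(\d{3}))*/g;

# Pattern 2: on this input each match covers two parts, then /g starts over.
my @plus_even = $even =~ /(\d{1,3})(?:,(\d{3})+)/g;
my @plus_odd  = $odd  =~ /(\d{1,3})(?:,(\d{3})+)/g;

print "star:      @star\n";        # 1 890
print "plus even: @plus_even\n";   # 1 234 567 890
print "plus odd:  @plus_odd\n";    # 123 456
print to_number(@plus_even), "\n"; # 1234567890
```

The second pattern only appears to work because /g finds two separate matches ("1,234" and "567,890"); with an odd number of parts the last part is left without a partner and is dropped.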

  2. Paul Lalli Guest

    On Jul 11, 3:56 pm, "" <> wrote:
    > I'm new to perl/regular expressions but experience programmer. I'm
    > trying to match a formatted number 123,456,789 and convert it into an
    > integer.


    I have a dumb question. Why aren't you just doing:

    $_ = "1,234,567,890";
    s/,//g;

    or
    $_ = "1,234,567,890";
    tr/,//d;

    ?

    As for your generic question of "how do I capture individual instances
    of repeated captured submatches", I'm afraid I don't know the
    answer...

    Paul Lalli
     
    Paul Lalli, Jul 11, 2007
    #2

  3. Xicheng Jia Guest

    On Jul 11, 3:56 pm, "" <> wrote:
    > Hi!
    >
    > I'm new to perl/regular expressions but experience programmer. I'm
    > trying to match a formatted number 123,456,789 and convert it into an
    > integer. Thought the following would do, but it don't.
    >
    > $_ = "1,234,567,890";
    > my @parts;
    > (@parts = /(\d{1,3})(?:,(\d{3}))*/g) && do {


    you probably want this:

    @parts = /(\d{1,3})(?=,(?:\d{3}))*/g .....

    Regards,
    Xicheng


    > my $number = 0;
    > $number = $number * 1000 + $_ foreach (@parts);
    > print "$number\n";
    >
    > };
    >
     
    Xicheng Jia, Jul 11, 2007
    #3
  4. Xicheng Jia wrote:
    > On Jul 11, 3:56 pm, "" <> wrote:
    >> I'm trying to match a formatted number 123,456,789 and convert
    >> it into an integer. Thought the following would do, but it don't.
    >>
    >> $_ = "1,234,567,890";
    >> my @parts;
    >> (@parts = /(\d{1,3})(?:,(\d{3}))*/g) && do {

    >
    > you probably want this:
    >
    > @parts = /(\d{1,3})(?=,(?:\d{3}))*/g .....


    Let's see:

    C:\home>type test.pl
    use warnings;
    $_ = '1,234,567,890';
    @parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
    print @parts, "\n";

    C:\home>test.pl
    (?=,(?:\d{3}))* matches null string many times in regex; marked by
    <-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
    line 3.
    1234567890

    Generates a warning; not so good...

    C:\home>type test.pl
    use warnings;
    $_ = '1,234,567,890';
    @parts = /(\d{1,3})(?=(?:,\d{3}|$))/g;
    print @parts, "\n";

    C:\home>test.pl
    1234567890

    That's better. :)

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jul 12, 2007
    #4
  5. Xicheng Jia Guest

    On Jul 11, 7:55 pm, Gunnar Hjalmarsson <> wrote:
    > Xicheng Jia wrote:
    > > On Jul 11, 3:56 pm, "" <> wrote:
    > >> I'm trying to match a formatted number 123,456,789 and convert
    > >> it into an integer. Thought the following would do, but it don't.

    >
    > >> $_ = "1,234,567,890";
    > >> my @parts;
    > >> (@parts = /(\d{1,3})(?:,(\d{3}))*/g) && do {

    >
    > > you probably want this:

    >
    > > @parts = /(\d{1,3})(?=,(?:\d{3}))*/g .....

    >
    > Let's see:
    >
    > C:\home>type test.pl
    > use warnings;
    > $_ = '1,234,567,890';
    > @parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
    > print @parts, "\n";
    >
    > C:\home>test.pl
    > (?=,(?:\d{3}))* matches null string many times in regex; marked by
    > <-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
    > line 3.
    > 1234567890
    >


    Err, I just copied and pasted the OP's code and didn't test it, but the
    point stands: capture only when needed.

    > Generates a warning; not so good...
    >
    > C:\home>type test.pl
    > use warnings;
    > $_ = '1,234,567,890';
    > @parts = /(\d{1,3})(?=(?:,\d{3}|$))/g;
    > print @parts, "\n";
    > C:\home>test.pl
    > 1234567890
    >
    > That's better. :)


    I haven't used (?=...) with '*' myself, and hadn't noticed that
    before, but I think (?=...)? should be fine, like:

    @parts = /(\d{1,3})(?=,\d\d\d)?/g;

    You don't need the extra parentheses inside the (?= ...), right? :)

    Regards,
    Xicheng
     
    Xicheng Jia, Jul 12, 2007
    #5
  6. Xicheng Jia Guest

    On Jul 11, 11:09 pm, ""
    <> wrote:
    > On Jul 11, 12:56 pm, "" <> wrote:
    >
    >
    >
    >
    >
    > > Hi!

    >
    > > I'm new to perl/regular expressions but experience programmer. I'm
    > > trying to match a formatted number 123,456,789 and convert it into an
    > > integer. Thought the following would do, but it don't.

    >
    > > $_ = "1,234,567,890";
    > > my @parts;
    > > (@parts = /(\d{1,3})(?:,(\d{3}))*/g) && do {
    > > my $number = 0;
    > > $number = $number * 1000 + $_ foreach (@parts);
    > > print "$number\n";

    >
    > > };

    >
    > > The problem is that this expression doesn't save repeated groups, it
    > > discard captured repeated groups except the last repeat (so it works
    > > for "123,456" numbers.) If I instead match the below it works, but I
    > > can't understand the logic behind it (if there is any logic?)

    >
    > > (@parts = /(\d{1,3})(?:,(\d{3})+)/g) && do { # not really equivalent
    > > to the above, but almost
    > > my $number = 0;
    > > $number = $number * 1000 + $_ foreach (@parts);
    > > print "$number\n";

    >
    > > };

    >
    > The latter won't work for a target string like:
    >
    > $_ = '123,456,789';
    >
    > (i.e., anything with an odd number of comma delimited substrings).
    >
    > You can try global match (//g) in scalar context:
    >
    > $_ = "1,234,567,890";
    >
    > my $n = 0;
    > while (/\G(\d{1,3})(?:,|$)/g)


    this should be the same as:

    while (/\G(\d{1,3}),?/g)

    Regards,
    Xicheng


    > {
    > $n = $n * 1000 + $1;
    >
    > }
    >
    > print "$n\n";
    >
    > tr/,//d;
    >
    > if ($n == $_)
    > {
    > print "A string of numbers is converted to a number automagically\n";
    >
    > }
    >
    > --
    > Hope this helps,
    > Steven
     
    Xicheng Jia, Jul 12, 2007
    #6
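A small sketch comparing the two loop conditions discussed above. It suggests they agree on well-formed input but not on malformed input. The wrapper `fold_groups` and the names `$strict` and `$loose` are assumptions introduced for this illustration only.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the quoted \G loop; $sep_re is the part of
# the pattern that must follow each digit group.
sub fold_groups {
    my ($str, $sep_re) = @_;
    my $n = 0;
    while ($str =~ /\G(\d{1,3})$sep_re/g) {
        $n = $n * 1000 + $1;
    }
    return $n;
}

my $strict = qr/(?:,|$)/;   # a comma or the end of the string must follow
my $loose  = qr/,?/;        # the comma is optional

print fold_groups("1,234,567,890", $strict), "\n";   # 1234567890
print fold_groups("1,234,567,890", $loose),  "\n";   # 1234567890
print fold_groups("1234", $strict), "\n";            # 0
print fold_groups("1234", $loose),  "\n";            # 123004
```

With the optional comma, an ungrouped string such as "1234" is silently read as the groups "123" and "4"; the stricter form refuses to match it at all.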
  7. Xicheng Jia wrote:
    > Gunnar Hjalmarsson wrote:
    >>
    >> C:\home>type test.pl
    >> use warnings;
    >> $_ = '1,234,567,890';
    >> @parts = /(\d{1,3})(?=(?:,\d{3}|$))/g;
    >> print @parts, "\n";
    >> C:\home>test.pl
    >> 1234567890
    >>
    >> That's better. :)

    >
    > I havenot used (?=...) with '*' by myself, and did not notice that
    > before, but I think (?=....)? should be fine, like:
    >
    > @parts = /(\d{1,3})(?=,\d\d\d)?/g;


    Even if that doesn't trigger a warning, I believe it's in fact the same as

    @parts = /\d{1,3}/g;

    Please consider:

    C:\home>perl -e "print q/1,234 and 56,789/ =~ /(\d{1,3})(?=,\d\d\d)?/g"
    123456789
    C:\home>

    > you don't need the extra parentheses inside the (?= ...), right. :)


    Yes, in my variant above, which also does some validation, the purpose
    of the inner parentheses is to group the alternations.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jul 12, 2007
    #7
  8. -berlin.de Guest

    <> wrote in comp.lang.perl.misc:

    [...]

    > print "$n\n";
    >
    > tr/,//d;
    >
    > if ($n == $_)
    > {
    > print "A string of numbers is converted to a number automagically\n";
    > }


    s/numbers/digits/;

    Anno
     
    -berlin.de, Jul 12, 2007
    #8
  9. Guest

    Thanks for the hint!

    On 11 July, 21:08, Paul Lalli <> wrote:
    > On Jul 11, 3:56 pm, "" <> wrote:
    >
    > > I'm new to perl/regular expressions but experience programmer. I'm
    > > trying to match a formatted number 123,456,789 and convert it into an
    > > integer.

    >
    > I have a dumb question. Why aren't you just doing:
    >
    > $_ = "1,234,567,890";
    > s/,//g;
    >
    > or
    > $_ = "1,234,567,890";
    > tr/,//d;
    >



    The short answer is that I'm "inchworming" my way through the string.
    The text may contain sentences with commas, and is not a single number
    string. And after the number is matched, I continue with other
    matches.

    Correct me if I'm wrong, but for my scenario I think substitution
    requires two steps, first a match and then a substitution, like so:
    $_ = "1,234,456,789";
    /\d{1,3}(?:,\d\d\d)*/g && do {
    my $number= $&;
    $number =~ s/,//g;
    print "$number\n";
    }

    But if the number parts could be eaten up in one regexp, it is
    unnecessary to use two. :)
     
    , Jul 12, 2007
    #9
  10. Guest

    Thanks Xicheng, Hjalmarsson, Steven and Anno for your inputs!

    I'm really "inchworming" my way through the string, scanning tokens
    like a lexical analyzer. If it fails to scan a number like
    "1,234,567,890", it continues by scanning an identifier token
    (e.g. /\w+(\d|\w)*/).

    On 12 July, 00:55, Gunnar Hjalmarsson <> wrote:
    > Xicheng Jia wrote:
    > > you probably want this:

    >
    > > @parts = /(\d{1,3})(?=,(?:\d{3}))*/g .....

    >
    > Let's see:
    >
    > C:\home>type test.pl
    > use warnings;
    > $_ = '1,234,567,890';
    > @parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
    > print @parts, "\n";
    >
    > C:\home>test.pl
    > (?=,(?:\d{3}))* matches null string many times in regex; marked by
    > <-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
    > line 3.
    > 1234567890
    >


    Even if perl reports "matches null string many times", isn't this what
    I want?
    I want to match 123 or 1,234 or 1,234,567 or similar patterns.
    Since the regexp starts with /\d{1,3}/, it never matches the null
    string.
    I can't see which inputs this fails for...?

    Thanks for any hints!
     
    , Jul 12, 2007
    #10
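The "inchworming" described above is commonly written with \G and the /gc modifiers, so that a failed attempt leaves the scan position where it was. The following is only a sketch under assumptions: the sub name `tokenize`, the token labels NUMBER, IDENT and OTHER, and the exact token patterns are invented here and are not taken from the poster's program.

```perl
use strict;
use warnings;

# Hypothetical tokenizer; sub name, token labels and patterns are made up.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    for ($text) {                       # alias $_ so m//gc tracks pos($_)
        while (1) {
            last if /\G\z/gc;           # end of input
            next if /\G\s+/gc;          # skip whitespace
            if (/\G(\d{1,3}(?:,\d{3})+|\d+)/gc) {
                (my $digits = $1) =~ tr/,//d;   # second step: drop commas
                push @tokens, [ NUMBER => $digits ];
            }
            elsif (/\G([A-Za-z_]\w*)/gc) {
                push @tokens, [ IDENT => $1 ];
            }
            elsif (/\G(.)/gcs) {        # anything else, one character
                push @tokens, [ OTHER => $1 ];
            }
        }
    }
    return @tokens;
}

my @tokens = tokenize("total 1,234,567 units, ok");
print join(' ', map { $_->[0] . '(' . $_->[1] . ')' } @tokens), "\n";
# IDENT(total) NUMBER(1234567) IDENT(units) OTHER(,) IDENT(ok)
```

The number is still handled in two steps (match, then tr), which keeps the token pattern simple; the comma after "units" is not followed by three digits, so it is not absorbed into a number.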
  11. Paul Lalli Guest

    On Jul 12, 8:59 am, "" <> wrote:
    > The short answer is that I'm "inchworming" my way through the string.
    > The text may contain senteces with commas, and is not a single number
    > string. And after the number is matches, I continue with other
    > matches.


    Regexp::Common is your friend.

    > Correct me if I'm wrong, but for my scenario I think substitutions
    > requires two matches, first a hit, then a substitution, like so:
    > $_ = "1,234,456,789";
    > /\d{1,3}(?:,\d\d\d)*/g && do {
    > my $number= $&;
    > $number =~ s/,//g;
    > print "$number\n";
    >
    > }
    >
    > But if the number parts could be eaten up in one regexp, it is
    > unnecessarily to use two. :)


    Unnecessary, maybe, but a heck of a lot more readable.

    #!/opt2/perl/bin/perl
    use strict;
    use warnings;
    use Regexp::Common qw/number/;

    my @numbers;
    while (<DATA>) {
    push @numbers, /$RE{num}{int}{-sep=>','}/g;
    }
    tr/,//d for @numbers;
    print join(' - ', @numbers), "\n";

    __DATA__
    Lorem ipsum dolor sit amet, 1,234,567,890 consectetuer 1,000
    lacinia risus. 56,650,231 Duis 432 porta vehicula 8,103 ligula.

    $ ./nums.pl
    1234567890 - 1000 - 56650231 - 432 - 8103

    Paul Lalli
     
    Paul Lalli, Jul 12, 2007
    #11
  12. -berlin.de Guest

    <> wrote in comp.lang.perl.misc:
    >
    > Thanks Xicheng, Hjalmarsson, Steven and Anno for your inputs!
    >
    > I'm really "inchworming" my way through the string, scanning tokens
    > like a lexical analyzer. If it fails to scan numbers like
    > "1,234,567,890", it continue to scan identifier token (e.g. /\w+(\d|
    > \w)*/)


    That sounds like you want a real parser, where number recognition
    would be part of the general parsing process.

    > On 12 Juli, 00:55, Gunnar Hjalmarsson <> wrote:
    > > Xicheng Jia wrote:
    > > > you probably want this:

    > >
    > > > @parts = /(\d{1,3})(?=,(?:\d{3}))*/g .....

    > >
    > > Let's see:
    > >
    > > C:\home>type test.pl
    > > use warnings;
    > > $_ = '1,234,567,890';
    > > @parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
    > > print @parts, "\n";
    > >
    > > C:\home>test.pl
    > > (?=,(?:\d{3}))* matches null string many times in regex; marked by
    > > <-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
    > > line 3.
    > > 1234567890
    > >

    >
    > Even if perl reports matches null string many times, isn't this what I
    > want?
    > I want to match 123 or 1,234 or 1,234,567 or similar patterns.
    > Since the regexp starts with /\d{1,3}/ it never matches the null
    > string.
    > Can't see for what patterns this fails...?


    The pattern you're repeating is a *zero width* lookahead. Whatever
    the regex engine does internally to determine if it matches, the
    width of the match will be zero. That's what it's complaining about.
    The asterisk does nothing, you can remove it.

    Anno
     
    -berlin.de, Jul 12, 2007
    #12
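To illustrate the point about zero-width lookaheads, here is a short sketch that runs three variants over the mixed string used earlier in the thread. The array names are made up for this example, and the `no warnings 'regexp'` line is only there because some perl versions may warn about a quantified zero-width item.

```perl
use strict;
use warnings;

my $text = "1,234 and 56,789";

# Plain digit groups, no lookahead at all.
my @plain = $text =~ /(\d{1,3})/g;

# Optional lookahead: whether it matches or is skipped, it consumes
# nothing, so it cannot change which groups are found.
my @optional;
{
    no warnings 'regexp';
    @optional = $text =~ /(\d{1,3})(?=,\d\d\d)?/g;
}

# Mandatory lookahead: a group only counts if ",ddd" or the end of the
# string follows it.
my @required = $text =~ /(\d{1,3})(?=,\d{3}|$)/g;

print "plain:    @plain\n";      # 1 234 56 789
print "optional: @optional\n";   # 1 234 56 789
print "required: @required\n";   # 1 56 789
```

Note that the mandatory form drops "234" here, because that group is followed by a space rather than by a comma group or the end of the string.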
  13. Guest

    On 12 July, 04:18, Xicheng Jia <> wrote:
    > On Jul 11, 11:09 pm, ""
    >
    > > The latter won't work for a target string like:
    > > $_ = '123,456,789';
    > > (i.e., anything with an odd number of comma delimited substrings).
    > > You can try global match (//g) in scalar context:
    > > $_ = "1,234,567,890";

    >
    > > my $n = 0;
    > > while (/\G(\d{1,3})(?:,|$)/g)

    >
    > this should be the same as:
    >
    > while (/\G(\d{1,3}),?/g)
    >


    Ohh, now I'm beginning to see the logic... :) The
    /(\d{1,3})(?:,(\d{3}))*/g regexp captured repeated productions, not
    repeated groups.

    So, to sum up: I can't use /(\d{1,3})(?:,(\d\d\d))*/ because the RE
    engine only saves the capture from the last iteration of a repeated
    group. The fix is to use the g modifier to capture repeated
    productions... the subject of this thread should really have been
    "capturing repeated productions", right? :)

    Ideally, /(\d{1,3})|(?<=\d{1,3}),(\d\d\d)/g would work, but the
    variable-length lookbehind (?<=\d{1,3}) is not implemented yet, so I
    ended up writing:

    @parts = ();
    (@parts = grep { defined $_ }
    m((\d{1,3})
    # (?<=\d{1,3}) not implemented, use three cases
    | (?<=\d),(\d\d\d)
    | (?<=\d\d),(\d\d\d)
    | (?<=\d\d\d),(\d\d\d)
    )xg) && do {
    my $number = 0;
    $number = $number * 1000 + $_ foreach (@parts);
    print "$number\n";
    };

    It uses grep { defined $_ } to filter out the undef captures, which
    come from the alternation branches that did not take part in each match.
     
    , Jul 12, 2007
    #13
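Another common way around the "only the last repeat is kept" behaviour is to capture the whole repetition once and split it afterwards, which also answers the question of how many repeats there were. This is a sketch, not something proposed in the thread; the sub name `parse_grouped` and the sample sentence are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (number, repeat count), or an empty list
# if no grouped number is found in $text.
sub parse_grouped {
    my ($text) = @_;
    # One capture around the whole repetition; the inner group does not capture.
    return unless $text =~ /(\d{1,3}(?:,\d{3})*)/;
    my @parts  = split /,/, $1;          # every group, in order
    my $number = 0;
    $number = $number * 1000 + $_ for @parts;
    return ($number, scalar(@parts) - 1);
}

my ($number, $repeats) = parse_grouped("population: 1,234,567,890 (estimate)");
print "number:  $number\n";    # 1234567890
print "repeats: $repeats\n";   # 3
```

The match still happens in a single regex; only the breakdown into parts is done outside it.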
  14. Dr.Ruud Guest

    wrote:

    > I want to match 123 or 1,234 or 1,234,567 or similar patterns.


    perldoc -f reverse

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jul 12, 2007
    #14