Matching multiple subexpressions in a regular expression

Discussion in 'Perl Misc' started by ShaunJ, Mar 11, 2008.

  1. ShaunJ

    ShaunJ Guest

    If more than one subepxression of a regex could have matched the
    string, is it possible to find all the subexpressions that could have
    matched? Example...

    my $re = qr/([AC]CT)|(AC[CT])/;
    'CCT' =~ m/$re/;
    print join ',', @-; print "\n";
    'ACC' =~ m/$re/;
    print join ',', @-; print "\n";
    'ACT' =~ m/$re/;
    print join ',', @-; print "\n";

    Output:
    0,0
    0,,0
    0,0

    This shows that for...
    CCT: the first subexpression matched
    ACC: the second subexpression matched
    ACT: the first subexpression matched

    However, ACT matched both subexpressions! The ideal result for ACT
    would be...
    0,0,0
    showing that both subexpressions matched. Is this possible without
    having to split each subexpression into its own regular expression? My
    understanding -- please correct me if I'm wrong -- is that one big
    regular expression will run faster than 100 little ones, since the one
    big regular expression can be compiled into a single large finite-
    state-machine that is more efficient than running 100 little FSM.

    Thanks!
    Shaun
     
    ShaunJ, Mar 11, 2008
    #1
    1. Advertising

  2. ShaunJ wrote:
    > If more than one subepxression of a regex could have matched the
    > string, is it possible to find all the subexpressions that could have
    > matched? Example...
    >
    > my $re = qr/([AC]CT)|(AC[CT])/;
    > 'CCT' =~ m/$re/;
    > print join ',', @-; print "\n";
    > 'ACC' =~ m/$re/;
    > print join ',', @-; print "\n";
    > 'ACT' =~ m/$re/;
    > print join ',', @-; print "\n";
    >
    > Output:
    > 0,0
    > 0,,0
    > 0,0
    >
    > This shows that for...
    > CCT: the first subexpression matched
    > ACC: the second subexpression matched
    > ACT: the first subexpression matched
    >
    > However, ACT matched both subexpressions! The ideal result for ACT
    > would be...
    > 0,0,0
    > showing that both subexpressions matched.


    Perl's regular expressions can't do that. They always stop after a
    successful match so either ([AC]CT) would match or (AC[CT]) would match
    but never both.

    > Is this possible without
    > having to split each subexpression into its own regular expression? My
    > understanding -- please correct me if I'm wrong -- is that one big
    > regular expression will run faster than 100 little ones,


    Your assumption is incorrect.

    > since the one
    > big regular expression can be compiled into a single large finite-
    > state-machine that is more efficient than running 100 little FSM.


    That question is answered in perlfaq6:

    perldoc -q "How do I efficiently match many regular expressions at once"



    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
     
    John W. Krahn, Mar 12, 2008
    #2
    1. Advertising

  3. ShaunJ

    ShaunJ Guest

    On Mar 11, 6:13 pm, "John W. Krahn" <> wrote:
    ....
    > > since the one
    > > big regular expression can be compiled into a single large finite-
    > > state-machine that is more efficient than running 100 little FSM.

    >
    > That question is answered in perlfaq6:
    >
    > perldoc -q "How do I efficiently match many regular expressions at once"


    Hi John,

    If I structure my program as in the example, using many small regex
    instead of one big regex, Perl 5.8.6 runs out of memory and dies:
    vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
    characters each, and the input string is one line 100 kB long. The
    machine has 2 GB of memory and free disk space, which should be
    enough, so I presume the code is somehow leeking memory. It's only a
    dozen or so lines long, so I've posted my code below. Can you see an
    obvious leak?

    Thanks,
    Shaun

    my @restrings = <REFILE>;
    my @re = map { qr/$_/x } @restrings;
    while (<>) {
    my $i = 0;
    foreach my $re (@re) {
    $i++;
    pos = 0;
    while (m/$re/g) {
    print $i, "\t",
    $LAST_MATCH_START[0] + 1, "\t",
    $&, "\n";
    pos = $LAST_MATCH_START[0] + 1;
    }
    }
    }
     
    ShaunJ, Mar 12, 2008
    #3
  4. [A complimentary Cc of this posting was sent to
    John W. Krahn
    <>], who wrote in article <k7GBj.100243$C61.89884@edtnps89>:
    > > my $re = qr/([AC]CT)|(AC[CT])/;
    > > 'ACT' =~ m/$re/;
    > > print join ',', @-; print "\n";
    > >
    > > Output:
    > > 0,0
    > >
    > > This shows that for...
    > > ACT: the first subexpression matched
    > >
    > > However, ACT matched both subexpressions! The ideal result for ACT
    > > would be...
    > > 0,0,0
    > > showing that both subexpressions matched.

    >
    > Perl's regular expressions can't do that. They always stop after a
    > successful match so either ([AC]CT) would match or (AC[CT]) would match
    > but never both.


    Perl's regular expressions can do it. You just follow the match by
    (?!), and preceed it by (??{code}) which analizes and stores the match
    results.

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Mar 12, 2008
    #4
  5. ShaunJ

    ShaunJ Guest

    On Mar 12, 7:40 pm, Frank Seitz <> wrote:
    > ShaunJ wrote:
    > > If I structure my program as in the example, using many small regex
    > > instead of one big regex, Perl 5.8.6 runs out of memory and dies:
    > > vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
    > > characters each, and the input string is one line 100 kB long.

    > [...]
    > > my @restrings = <REFILE>;
    > > my @re = map { qr/$_/x } @restrings;

    >
    > 400'000 precompiled regexes are quite a lot!
    > Why don't you read and create them in chunks of, say, 1000?


    Hi Frank,

    400'000 * 27 = 12 MB. I wouldn't say it's that large. It seems
    reasonable that the compiled regex would fit in less than one GB of
    memory, which is my only requirement.

    In any case, the following 9-line code snippet burns through 100MB of
    memory a second using Perl 5.8.6! Certainly a memory leak. The only
    explanation I can think of is if the m/$re/g expression were
    recompiling the regex every time and the previously compiled regex
    weren't being discarded.

    Thanks,
    Shaun

    my @restrings = <REFILE>;
    my @re = map { qr/$_/x } @restrings;
    while (<>) {
    foreach my $re (@re) {
    while (m/$re/g) {
    print $LAST_MATCH_START[0], "\n";
    }
    }
    }
     
    ShaunJ, Mar 13, 2008
    #5
  6. ShaunJ

    ShaunJ Guest

    On Mar 13, 1:08 pm, Frank Seitz <> wrote:
    ....
    > > In any case, the following 9-line code snippet burns through 100MB of
    > > memory a second using Perl 5.8.6! Certainly a memory leak. The only
    > > explanation I can think of is if the m/$re/g expression were
    > > recompiling the regex every time and the previously compiled regex
    > > weren't being discarded.

    >
    > No, Perl precompiles the patterns (your 400'000 regexes)
    > into an internal representation at the moment of qr//,
    > that is at the beginning of your program.


    That is my intention. A quick experiment with `top` shows that the
    400'000 regex use 500 MB of memory once they're compiled, which is
    fine by me. It's the following loop that leaks memory like crazy, and
    it shouldn't use any additional memory. Any ideas why?

    Swapping the loops won't have any effect, as there's only one string
    (one line in the input file) for the first while(<>) loop.

    Cheers,
    Shaun
     
    ShaunJ, Mar 13, 2008
    #6
  7. ShaunJ

    Guest

    ShaunJ <> wrote:

    > Hi John,
    >
    > If I structure my program as in the example, using many small regex
    > instead of one big regex, Perl 5.8.6 runs out of memory and dies:
    > vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
    > characters each, and the input string is one line 100 kB long. The
    > machine has 2 GB of memory and free disk space, which should be
    > enough, so I presume the code is somehow leeking memory. It's only a
    > dozen or so lines long, so I've posted my code below. Can you see an
    > obvious leak?
    >
    > Thanks,
    > Shaun
    >
    > my @restrings = <REFILE>;
    > my @re = map { qr/$_/x } @restrings;
    > while (<>) {

    ....

    Can you produce a version that we can run? (I.e. that doesn't
    depend on REFILE or STDIN, which we don't have access to?)

    The below doesn't leak on v5.8.3, 5.8.7, or 5.8.8.

    use strict;
    use warnings;
    use English;

    my @re = map { qr/$_/x } '000000'..'400000';
    push @re, qr/\d/; #just to make sure something matches

    foreach
    ('000000000000000000000000000000'..'000000000000000000000001000000') {
    print "$_\n";
    my $i = 0;
    foreach my $re (@re) {
    $i++;
    pos = 0;
    while (m/$re/g) {
    print $i, "\t",
    $LAST_MATCH_START[0] + 1, "\t",
    $&, "\n";
    pos = $LAST_MATCH_START[0] + 1;
    }
    }
    }
    __END__

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Mar 13, 2008
    #7
  8. ShaunJ

    ShaunJ Guest

    On Mar 13, 2:28 pm, wrote:
    > ShaunJ <> wrote:
    > > Hi John,

    >
    > > If I structure my program as in the example, using many small regex
    > > instead of one big regex, Perl 5.8.6 runs out of memory and dies:
    > > vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
    > > characters each, and the input string is one line 100 kB long. The
    > > machine has 2 GB of memory and free disk space, which should be
    > > enough, so I presume the code is somehow leeking memory. It's only a
    > > dozen or so lines long, so I've posted my code below. Can you see an
    > > obvious leak?

    >
    > > Thanks,
    > > Shaun

    >
    > > my @restrings = <REFILE>;
    > > my @re = map { qr/$_/x } @restrings;
    > > while (<>) {

    >
    > ...
    >
    > Can you produce a version that we can run? (I.e. that doesn't
    > depend on REFILE or STDIN, which we don't have access to?)


    Yes, see the recent thread 'm// on very long lines leaks memory',
    where I gave a small test case. As it turns out, there is a bug in
    Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
    either English or $& causes the memory leak. This bug is fixed in
    5.10.0.

    Cheers,
    Shaun
     
    ShaunJ, Mar 13, 2008
    #8
  9. ShaunJ

    Uri Guttman Guest

    >>>>> "S" == ShaunJ <> writes:

    S> On Mar 13, 2:28 pm, wrote:
    >> ShaunJ <> wrote:
    >> > Hi John,

    >>
    >> > If I structure my program as in the example, using many small regex
    >> > instead of one big regex, Perl 5.8.6 runs out of memory and dies:
    >> > vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
    >> > characters each, and the input string is one line 100 kB long. The
    >> > machine has 2 GB of memory and free disk space, which should be
    >> > enough, so I presume the code is somehow leeking memory. It's only a
    >> > dozen or so lines long, so I've posted my code below. Can you see an
    >> > obvious leak?

    >>
    >> > Thanks,
    >> > Shaun

    >>
    >> > my @restrings = <REFILE>;
    >> > my @re = map { qr/$_/x } @restrings;
    >> > while (<>) {

    >>
    >> ...
    >>
    >> Can you produce a version that we can run? (I.e. that doesn't
    >> depend on REFILE or STDIN, which we don't have access to?)


    S> Yes, see the recent thread 'm// on very long lines leaks memory',
    S> where I gave a small test case. As it turns out, there is a bug in
    S> Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
    S> either English or $& causes the memory leak. This bug is fixed in
    S> 5.10.0.

    it is not a leak (as someone else proved in another post). so don't go
    blabbing that it is a leak. it is a well known ram suck but it doesn't
    lose the ram like a true leak does.

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Architecture, Development, Training, Support, Code Review ------
    ----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
     
    Uri Guttman, Mar 13, 2008
    #9
  10. ShaunJ

    Dr.Ruud Guest

    ShaunJ schreef:

    > As it turns out, there is a bug in
    > Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
    > either English or $& causes the memory leak. This bug is fixed in
    > 5.10.0.


    It is neither a leak nor a bug. Read perldoc perlre.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Mar 16, 2008
    #10
  11. [A complimentary Cc of this posting was NOT [per weedlist] sent to
    Dr.Ruud
    <>], who wrote in article <>:
    > ShaunJ schreef:
    >
    > > As it turns out, there is a bug in
    > > Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
    > > either English or $& causes the memory leak. This bug is fixed in
    > > 5.10.0.

    >
    > It is neither a leak nor a bug. Read perldoc perlre.


    If you think it is not a bug, please explain what is the purpose of
    the stored information.

    Thanks,
    Ilya
     
    Ilya Zakharevich, Mar 16, 2008
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. VSK
    Replies:
    2
    Views:
    2,389
  2. Replies:
    4
    Views:
    586
  3. =?iso-8859-1?B?bW9vcJk=?=

    Matching abitrary expression in a regular expression

    =?iso-8859-1?B?bW9vcJk=?=, Dec 1, 2005, in forum: Java
    Replies:
    8
    Views:
    884
    Alan Moore
    Dec 2, 2005
  4. pinkisntwell
    Replies:
    5
    Views:
    747
    Gabriel Genellina
    Nov 10, 2009
  5. pinkisntwell
    Replies:
    4
    Views:
    133
    J├╝rgen Exner
    Nov 9, 2009
Loading...

Share This Page