Problem with memory usage in pattern match

Discussion in 'Perl Misc' started by Niall Macpherson, Dec 5, 2005.

  1. I need to read a line which contains a [ and a ] and then capture the
    data between these symbols. I then need to split the captured data
    using the open parenthesis as the separator.

    The following code does exactly what I want

    use strict;
    use warnings;

    while(<DATA>)
    {
        if(/.*\[(.*)\]/)
        {
            my $second_bit = $1;
            while ( $second_bit =~ /
                        \(          # match open paren
                        ([^\(]*)    # match anything up to next opening paren
                    /gx )
            {
                print $1, "\n";
            }
        }
        else
        {
            print "\n", 'No match';
        }
    }
    __END__
    (lose this stuff[(process(this(stuff(bit(by(bit]

    And produces the following output

    H:\Perl Scripts>splitnoarray.pl
    process
    this
    stuff
    bit
    by
    bit

    H:\Perl Scripts>

    Which is fine for a short line of input.
    However the real data is very large - each line may be up to 1Gb and
    the memory usage and performance is terrible.

    I had a look at getting rid of the '$second_bit' variable to avoid
    doing a very large assignment but if I use $1 directly in the while
    loop I get the following

    H:\Perl Scripts>splitnoarray.pl
    process

    H:\Perl Scripts>

    Can anyone suggest a better way of doing this which would minimise the
    memory usage ?

    Thanks in advance

    Niall
    Niall Macpherson, Dec 5, 2005
    #1

  2. Niall Macpherson

    Guest

    "Niall Macpherson" <> wrote:
    > I need to read a line which contains a [ and a ] and then capture the
    > data between these symbols. I then need to split the captured data
    > using the open parenthesis as the separator.


    If you need to split it, why not "split" it? How many "(" do you expect?

    >
    > The following code does exactly what I want
    >
    > use strict;
    > use warnings;
    >
    > while(<DATA>)
    > {
    > if(/.*\[(.*)\]/)


    The captured data will span from the last [ which has a ] after it to the
    last ]. Is that necessary? What if you just spanned from the first [ to
    the first ] after it? Would that still be correct? (I.e. can your data
    have nested [])?
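
To make the difference concrete, here is a small sketch. The input string is made up for illustration and has two separate bracket pairs; the second pattern is one possible way to span from the first [ to the first ] after it, not something taken from the original script.

```perl
use strict;
use warnings;

# Made-up input with two bracket pairs (an assumption, not the real data).
my $line = 'a[one]b[two]c';

# Original pattern: the leading greedy .* pushes the match to the last '['
# that still has a ']' after it.
my ($greedy) = $line =~ /.*\[(.*)\]/;

# Alternative: first '[' up to the first ']' that follows it.
my ($first) = $line =~ /\[([^\]]*)\]/;

print "original pattern captures: $greedy\n";   # two
print "first-to-first captures:   $first\n";    # one
```

If the data can never contain more than one pair of brackets, both patterns capture the same text, and the second one does far less backtracking on a long line.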

    > {
    > my $second_bit = $1;
    > while ( $second_bit =~ /
    > \( # match open paren
    > ([^\(]*) # match anything up to next opening paren
    > /gx ) {
    > print $1, "\n";
    > }



    ##Maybe this would be better (or maybe not)

    foreach ( split /\(/, $1) {


    > }
    > else
    > {
    > print "\n", 'No match';
    > }
    > }
    > __END__
    > (lose this stuff[(process(this(stuff(bit(by(bit]
    >
    > And produces the following output
    >
    > H:\Perl Scripts>splitnoarray.pl
    > process
    > this
    > stuff
    > bit
    > by
    > bit
    >
    > H:\Perl Scripts>
    >
    > Which is fine for a short line of input.
    > However the real data is very large - each line may be up to 1Gb and
    > the memory usage and performance is terrible.


    I certainly wouldn't expect otherwise, operating on 1Gb strings.

    >
    > I had a look at getting rid of the '$second_bit' variable to avoid
    > doing a very large assignment but if I use $1 directly in the while
    > loop I get the following
    >
    > H:\Perl Scripts>splitnoarray.pl
    > process


    That is because both of your regexes are fighting over the same $1: the
    first successful match of the inner regex overwrites it. Trying to split
    $1 rather than operating on it with a capturing regex might be better.
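
A minimal sketch of the split idea, assuming the same sample line as in the original post. The helper name fields_via_split is hypothetical. Note that split still builds the whole list in memory, so this avoids the $1 clash but probably does not help much with the 1Gb lines.

```perl
use strict;
use warnings;

# Hypothetical helper: return the '('-separated fields of the bracketed part.
sub fields_via_split {
    my ($line) = @_;
    return unless $line =~ /.*\[(.*)\]/;
    # split reads $1 once as its argument and the pattern has no captures,
    # so nothing overwrites $1 while we are still using it.
    # grep drops the empty field that precedes the first '('.
    return grep { length } split /\(/, $1;
}

print "$_\n"
    for fields_via_split('(lose this stuff[(process(this(stuff(bit(by(bit]');
```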

    >
    > H:\Perl Scripts>
    >
    > Can anyone suggest a better way of doing this which would minimise the
    > memory usage ?


    I don't know if it would work in your exact situation, but I might try
    something like this:

    $/ = '[';
    <$fh>;                  # skip to start of what I want
    $/ = '(';
    while (<$fh>) {
        chomp;              # strip the trailing '(' record separator
        my ($stuff,$rest) = split /(\])/;
        print "$stuff\n" if defined $stuff and length $stuff;
        last if defined $rest;
    }
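
For anyone who wants to try the record-separator approach without a real file, here is a self-contained sketch. It assumes an in-memory filehandle holding the sample line from the first post, and the helper name each_bracket_field and its callback argument are hypothetical. Only one '('-delimited record is held at a time.

```perl
use strict;
use warnings;

# Hypothetical helper: call $cb for each '('-separated field between the
# first '[' and the next ']' read from $fh.
sub each_bracket_field {
    my ($fh, $cb) = @_;
    {
        local $/ = '[';
        my $skipped = <$fh>;    # discard everything up to and including '['
    }
    local $/ = '(';             # from here on, one record per '(' field
    while ( my $chunk = <$fh> ) {
        chomp $chunk;           # remove the trailing '(' separator
        my ($stuff, $rest) = split /(\])/, $chunk, 2;
        $cb->($stuff) if defined $stuff and length $stuff;
        last if defined $rest;  # the closing ']' was in this record
    }
    return;
}

my $line = "(lose this stuff[(process(this(stuff(bit(by(bit]\n";
open my $fh, '<', \$line or die "Cannot open in-memory handle: $!";
each_bracket_field( $fh, sub { print "$_[0]\n" } );
```

One caveat: if a single field between two '(' characters is itself huge, that record is still read in one piece.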


    (Although ideally I would reconsider whatever process it is that is making
    this abominable file in the first place)

    Xho

    , Dec 5, 2005
    #2

  3. Niall Macpherson

    Anno Siegel Guest

    Niall Macpherson <> wrote in comp.lang.perl.misc:
    > I need to read a line which contains a [ and a ] and then capture the
    > data between these symbols....


    [solution snipped]

    > Which is fine for a short line of input.
    > However the real data is very large - each line may be up to 1Gb and
    > the memory usage and performance is terrible.


    [...]

    > Can anyone suggest a better way of doing this which would minimise the
    > memory usage ?


    It's been a while, but I find this problem intriguing.

    If the whole string can't be read into memory, we must use a buffer. The
    problem then is that buffer boundaries can fall anywhere relative to the
    bracketed portions of the string.

    Here is my take:

    use constant SIZE => 5; # buffer size
    my ( $open, $close) = map quotemeta, qw( [ ]); # delimiters

    my (
        @res,    # result array to hold the bracketed portions
        $carry,  # bracketed portions whose end hasn't been reached
        $inside, # flag, set if the last buffer ended inside a pair of []
    );

    while ( read( DATA, $_, SIZE) ) {

        # deal with initial part if inside brackets
        if ( $inside ) {
            if ( /(.*?)$close/g ) {
                # finished with this pair of []
                push @res, $carry . $1; # put result away
                $inside = 0; # no longer inside
            }
            else {
                # whole buffer goes to $carry
                $carry .= $_;
                pos = length; # done with this buffer, still inside
            }
        }

        # collect contents of complete pairs of [...] in this buffer
        my $pos = pos;
        while ( /$open(.*?)$close/g ) {
            push @res, $1;
            $pos = pos;
        }
        pos = $pos;

        # does another (unfinished) pair of [] start in this buffer?
        if ( /$open(.*)/g ) {
            $carry = $1;
            $inside = 1;
        }
    }

    print "$_\n" for @res;


    __DATA__
    [j]outside[inside_1]out again[inside 2]out once more[k][l]x[m]boo


    This program never holds more than a buffer full of raw data and
    works down to a buffer size of 1. It uses $_ as the buffer and
    the pos() function to keep track of where in the buffer we stand.
    The latter takes a bit of fiddling; there may be better ways.
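
A possible variation on the buffered approach, offered as a sketch rather than as the code from this post: it assumes there is a single pair of brackets, as in the original question, and it also does the split on '(' while reading, so that only one buffer and one partial field are held at a time. The helper name stream_fields, the callback, and the in-memory filehandle are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read $size bytes at a time from $fh and pass each
# '('-separated field between the first '[' and the next ']' to $cb.
sub stream_fields {
    my ($fh, $size, $cb) = @_;
    my $inside = 0;     # true once the opening '[' has been seen
    my $field  = '';    # partial field carried across buffer boundaries
    while ( read( $fh, my $buf, $size ) ) {
        if ( !$inside ) {
            my $i = index $buf, '[';
            next if $i < 0;                 # still before the '['
            $inside = 1;
            $buf = substr $buf, $i + 1;     # keep only what follows '['
        }
        my $done = $buf =~ s/\].*//s;       # cut at the closing ']', if present
        my @parts = split /\(/, $buf, -1;   # -1 keeps trailing empty fields
        @parts = ('') unless @parts;        # empty buffer: nothing to append
        $parts[0] = $field . $parts[0];     # complete the carried-over field
        $field = pop @parts;                # last part may continue in next buffer
        $cb->($_) for grep { length } @parts;
        if ($done) {
            $cb->($field) if length $field;
            return;
        }
    }
    return;
}

my $line = "(lose this stuff[(process(this(stuff(bit(by(bit]\n";
open my $fh, '<', \$line or die "Cannot open in-memory handle: $!";
stream_fields( $fh, 5, sub { print "$_[0]\n" } );
```

Like the program above, this should work down to a buffer size of 1, since a field is only emitted once the '(' or ']' that ends it has been seen.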

    Anno
    Anno Siegel, Dec 9, 2005
    #3