regexp s// too greedy

Discussion in 'Perl Misc' started by bettyann, Nov 11, 2004.

  1. bettyann

    bettyann Guest

    hi all,

    can anyone help me limit the greediness of my substitution pattern? i
    have a CSV file and i want to insert a new column of values after the
    6th column. but the new data to be inserted is dependent upon the
    value of the 6th column.

    example original data:
    2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

    i want to put "0" after the 6th column if the 6th column contains
    "hold.bmp".
    i want to put "-1" after the 6th column if the 6th column contains
    "NaN".

    i thought i could do this with two substitutions commands:

    s/^((.*?,){5}?(hold.bmp))/$1,0/
    s/^((.*?,){5}?(NaN))/$1,-1/

    i cannot limit the matching of "hold.bmp" or "NaN". i want this
    pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
    5th column.

    my test code:
    #!/usr/local/bin/perl

    use strict;
    use warnings;

    my $input = <<EOF;
    2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
    EOF

    my @oData = split( '\n', $input );
    my $line;
    my $cnt = 0;
    foreach $line ( @oData ) {
        printf( "$cnt) $line \n" );
        $cnt++;
    }

    my $prevCol = 5;
    my @txtList = ( "hold.bmp", "NaN" );
    my @valList = ( "0", "-1" );
    my ( $txt, $cmd, $i );
    $i = 0;
    foreach $txt ( @txtList ) {
        $cmd = sprintf( '$line =~ s/^((.*?,){%d}?(%s))/$1,%s/;',
                        $prevCol, $txt, $valList[$i] );
        printf( "\ncmd >>$cmd<< \n" );
        foreach $line ( @oData ) {
            printf( "orig line |$line| \n" );
            eval $cmd;
            printf( " new line |$line| \n---------------------\n" );
        }
        $i++;
    }

    exit;

    output:
    % test2.pl
    0) 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    1) 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    2) 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    3) 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

    cmd >>$line =~ s/^((.*?,){5}?(hold.bmp))/$1,0/;<<
    orig line |2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
    new line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
    ---------------------
    orig line |1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1|
    new line |1,NaN,NaN,NaN,32,hold.bmp,0,1607488,NaN,NaN,NaN,hold.bmp,3,1|
    ---------------------
    orig line |5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1|
    new line |5,NaN,NaN,4,32,hold.bmp,0,1607503,NaN,NaN,8,go.bmp,NaN,1|
    ---------------------
    orig line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1|
    new line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,0,NaN,1|
    ---------------------

    cmd >>$line =~ s/^((.*?,){5}?(NaN))/$1,-1/;<<
    orig line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
    new line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,-1,NaN,NaN,hold.bmp,NaN,1|
    ---------------------
    orig line |1,NaN,NaN,NaN,32,hold.bmp,0,1607488,NaN,NaN,NaN,hold.bmp,3,1|
    new line |1,NaN,NaN,NaN,32,hold.bmp,0,1607488,NaN,-1,NaN,NaN,hold.bmp,3,1|
    ---------------------
    orig line |5,NaN,NaN,4,32,hold.bmp,0,1607503,NaN,NaN,8,go.bmp,NaN,1|
    new line |5,NaN,NaN,4,32,hold.bmp,0,1607503,NaN,-1,NaN,8,go.bmp,NaN,1|
    ---------------------
    orig line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,0,NaN,1|
    new line |8,NaN,NaN,4,32,NaN,-1,1607564,NaN,NaN,8,hold.bmp,0,NaN,1|
    ---------------------

    thanks,
    - bettyann
    bettyann, Nov 11, 2004
    #1

  2. Stuart Moore

    Stuart Moore Guest

    bettyann wrote:

    > hi all,
    >
    > can anyone help me limit the greediness of my substitution pattern? i
    > have a CSV file and i want to insert a new column of values after the
    > 6th column. but the new data to be inserted is dependent upon the
    > value of the 6th column.
    >
    > example original data:
    > 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    > 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    > 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    > 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
    >
    > i want to put "0" after the 6th column if the 6th column contains
    > "hold.bmp".
    > i want to put "-1" after the 6th column if the 6th column contains
    > "NaN".
    >
    > i thought i could do this with two substitutions commands:
    >
    > s/^((.*?,){5}?(hold.bmp))/$1,0/
    > s/^((.*?,){5}?(NaN))/$1,-1/


    ^ Not sure that you want that ?

    I suggest replacing (.*?,) with ([^,]*) assuming there isn't some way of
    commas appearing escaped within the data.
    Stuart Moore, Nov 11, 2004
    #2

  3. Stuart Moore

    Stuart Moore Guest

    Stuart Moore wrote:

    > I suggest replacing (.*?,) with ([^,]*) assuming there isn't some way of
    > commas appearing escaped within the data.


    That should have been ([^,]*,) of course
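
    A minimal sketch of why the character class matters, using the fourth
    sample line, where column 6 is NaN and hold.bmp first appears in column 11.
    The variable names are illustrative only. The ? after {5} is left out here
    because it has no effect on a fixed repeat count.

```perl
use strict;
use warnings;

my $line = '8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1';

# With '.', one "field" may swallow commas once the engine backtracks,
# so the five groups can stretch until hold.bmp is found further right.
(my $dot = $line) =~ s/^((.*?,){5}(hold\.bmp))/$1,0/;

# With [^,], each group is exactly one field, so hold.bmp must be column 6.
(my $class = $line) =~ s/^(([^,]*,){5}(hold\.bmp))/$1,0/;

print "dot:   $dot\n";     # 0 inserted after column 11 (the unwanted match)
print "class: $class\n";   # line unchanged
```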
    Stuart Moore, Nov 11, 2004
    #3
  4. bettyann wrote:
    > can anyone help me limit the greediness of my substitution pattern? i
    > have a CSV file and i want to insert a new column of values after the
    > 6th column. but the new data to be inserted is dependent upon the
    > value of the 6th column.
    >
    > example original data:
    > 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    > 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    > 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    > 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
    >
    > i want to put "0" after the 6th column if the 6th column contains
    > "hold.bmp".
    > i want to put "-1" after the 6th column if the 6th column contains
    > "NaN".
    >
    > i thought i could do this with two substitutions commands:
    >
    > s/^((.*?,){5}?(hold.bmp))/$1,0/
    > s/^((.*?,){5}?(NaN))/$1,-1/
    >
    > i cannot limit the matching of "hold.bmp" or "NaN". i want this
    > pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
    > 5th column.


    Limiting to a fixed number of occurrences while using '.*' is
    contradictory, irrespective of greediness. Besides a few other things, I
    believe that the most important change you should make is to get rid of
    that problem by replacing the '.' meta character with the character
    class '[^,]'. This might do it, using only one substitution:

    s/^((?:[^,]*,){5}(?:(hold\.bmp)|NaN))/"$1,".($2 ? '0' : '-1')/e;
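
    As a quick check, the substitution above can be run against a few of the
    sample lines. This is only a sketch; the @lines array is an assumption
    made for the demonstration.

```perl
use strict;
use warnings;

my @lines = (
    '2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1',
    '5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1',
    '8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1',
);

for my $line (@lines) {
    # $2 is set only when the hold.bmp branch of the alternation matched.
    $line =~ s/^((?:[^,]*,){5}(?:(hold\.bmp)|NaN))/"$1,".($2 ? '0' : '-1')/e;
    print "$line\n";
}
```

    Note that the pattern does not check what follows column 6, so a value
    that merely starts with NaN or hold.bmp would also match.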

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Nov 11, 2004
    #4
  5. Anno Siegel

    Anno Siegel Guest

    bettyann <> wrote in comp.lang.perl.misc:
    > hi all,
    >
    > can anyone help me limit the greediness of my substitution pattern? i
    > have a CSV file and i want to insert a new column of values after the
    > 6th column. but the new data to be inserted is dependent upon the
    > value of the 6th column.
    >
    > example original data:
    > 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    > 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    > 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    > 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
    >
    > i want to put "0" after the 6th column if the 6th column contains
    > "hold.bmp".
    > i want to put "-1" after the 6th column if the 6th column contains
    > "NaN".
    >
    > i thought i could do this with two substitutions commands:
    >
    > s/^((.*?,){5}?(hold.bmp))/$1,0/
    > s/^((.*?,){5}?(NaN))/$1,-1/
    >
    > i cannot limit the matching of "hold.bmp" or "NaN". i want this
    > pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
    > 5th column.


    [code appreciated, but snipped]

    I'd use split and splice for that, not a regex (except that split also
    uses a regex). Then you can comfortably look at the preceding field
    and decide what goes after it. For instance:

    while ( <DATA> ) {
        my @l = split /,/;
        splice @l, 6, 0, $l[ 5] eq 'hold.bmp' ? 0 : -1;
        print join ',', @l;
    }
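
    The same idea as a self-contained sketch that works on a string instead
    of the DATA handle. The add_flag name is hypothetical, and the code
    assumes every line has at least six fields. As in the loop above, anything
    other than hold.bmp in column 6 gets -1.

```perl
use strict;
use warnings;

# Insert a flag after column 6: 0 for hold.bmp, -1 for anything else.
sub add_flag {
    my ($line) = @_;
    my @f = split /,/, $line, -1;   # limit -1 keeps trailing empty fields
    splice @f, 6, 0, $f[5] eq 'hold.bmp' ? 0 : -1;
    return join ',', @f;
}

print add_flag('1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1'), "\n";
print add_flag('8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1'), "\n";
```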

    Anno
    Anno Siegel, Nov 11, 2004
    #5
  6. On Wed, 10 Nov 2004 19:38:35 -0800, bettyann wrote:

    > can anyone help me limit the greediness of my substitution pattern? i
    > have a CSV file and i want to insert a new column of values after the
    > 6th column. but the new data to be inserted is dependent upon the
    > value of the 6th column.


    Well, when talking about handling CSV files, why not use one of the
    numerous modules on CPAN (http://search.cpan.org?query=CSV)?
    E.g. with Text::CSV_XS the following snippet works without your having
    to worry about parsing the CSV yourself:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new();
    while (<DATA>) {
        chomp;
        $csv->parse($_) or die "Couldn't parse '$_' as CSV";
        my @col = $csv->fields;
        $csv->combine(@col[0..5],($col[5] eq 'hold.bmp' ? 0 : -1),@col[6..$#col]);
        print $csv->string,"\n";
    }

    __DATA__
    2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
    1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
    5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
    8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1


    Greetings,
    Janek
    Janek Schleicher, Nov 11, 2004
    #6
  7. bettyann

    bettyann Guest

    thanks to everyone who replied -- all suggestions are good.

    stuart and gunnar -- using pattern ([^,]*,) rather than (.*?,) works
    as i need. i understand now that i need to use a pattern that
    describes the negative of what i want rather than a pattern that
    describes what i *do* want. thanks for the suggestion and the new way
    of thinking.

    len and anno -- i did consider using split/join but since the CSV file
    has thousands of lines, i thought maybe regexp might be faster. i'm
    not sure, tho, as i haven't done a benchmark.

    janek -- Text::CSV_XS looks really nice. i'll certainly investigate
    this package more in the future.

    one last clarification, i actually have more than two different cases,
    ie:

    s/^(([^,]*,){5}hold.bmp)/$1,0/;
    s/^(([^,]*,){5}go.bmp)/$1,1/;
    s/^(([^,]*,){5}slow.bmp)/$1,2/;
    s/^(([^,]*,){5}speed.bmp)/$1,3/;
    s/^(([^,]*,){5}NaN)/$1,-1/;

    so i don't think the "?:" combination would be as straightforward.

    thanks for all the help. greatly appreciated.
    - bettyann
    bettyann, Nov 11, 2004
    #7
  8. Anno Siegel

    Anno Siegel Guest

    bettyann <> wrote in comp.lang.perl.misc:
    > thanks to everyone who replied -- all suggestions are good.


    [...]

    > len and anno -- i did consider using split/join but since the CSV file
    > has thousands of lines, i thought maybe regexp might be faster. i'm
    > not sure, tho, as i haven't done a benchmark.


    I don't think split will be significantly slower than a regex solution.
    While split *implies* the use of a regex for the delimiter, that is
    usually a very simple one which will predictably perform well enough.
    The rest split does is (in principle, not in detail) what a capturing
    regex does too. The performance of a pure-regex solution is much
    harder to predict.

    If anything, splice may slow it down a bit, but no more than the actual
    substitution slows down the "regex" solution. I wouldn't expect a
    significant difference between split and regex, but if there is, I'd
    expect the regex to be slower.
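
    Rather than guessing, the two approaches can be timed with the core
    Benchmark module. The following is only a sketch: the sub names and the
    single test line are assumptions, and real numbers depend on the data and
    the Perl version, so no result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = '5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1';

# Regex approach: one anchored substitution on a copy of the line.
sub via_regex {
    my $copy = $line;
    $copy =~ s/^((?:[^,]*,){5}hold\.bmp)/$1,0/;
    return $copy;
}

# split/splice approach: break into fields, insert, join again.
sub via_split {
    my @f = split /,/, $line, -1;
    splice @f, 6, 0, 0 if $f[5] eq 'hold.bmp';
    return join ',', @f;
}

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, { regex => \&via_regex, split => \&via_split });
```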

    > janek -- Text::CSV_XS looks really nice. i'll certainly investigate
    > this package more in the future.
    >
    > one last clarification, i actually have more than two different cases,
    > ie:
    >
    > s/^(([^,]*,){5}hold.bmp)/$1,0/;
    > s/^(([^,]*,){5}go.bmp)/$1,1/;
    > s/^(([^,]*,){5}slow.bmp)/$1,2/;
    > s/^(([^,]*,){5}speed.bmp)/$1,3/;
    > s/^(([^,]*,){5}NaN)/$1,-1/;
    >
    > so i don't think the "?:" combination would be as straightforward.


    Now this is something that's going to slow it down a bit, matching n times
    for n possibilities. A hash lets you do them all in one go. Quite simple:

    my %replace = (
        'hold.bmp' => 0,
        'go.bmp'   => 1,
        # ...
        NaN        => -1,
    );

    Then the five substitutions could become (untested, probably more
    the spirit than the real thing)

    s/^(([^,]*,){5}([^,]*))/$1,$replace{ $2}/;

    But I can't say I like the regex you're using. Only a short regex
    is a good regex, that one is much too long. I still favor the
    split solution, if only because it works on the actual data, not
    their messy representation. The hash can be used with that too,
    in the obvious way.
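
    One way the hash might be combined with split, as a sketch. The add_code
    name is hypothetical, and leaving lines with an unknown column 6 untouched
    is an assumption; the thread does not say what should happen to them.

```perl
use strict;
use warnings;

my %replace = (
    'hold.bmp'  => 0,
    'go.bmp'    => 1,
    'slow.bmp'  => 2,
    'speed.bmp' => 3,
    NaN         => -1,
);

sub add_code {
    my ($line) = @_;
    my @f = split /,/, $line, -1;   # limit -1 keeps trailing empty fields
    # Insert only when column 6 exists and is a known key.
    splice @f, 6, 0, $replace{ $f[5] } if @f > 5 && exists $replace{ $f[5] };
    return join ',', @f;
}

print add_code('5,NaN,NaN,4,32,go.bmp,1607503'), "\n";
print add_code('5,NaN,NaN,4,32,other.bmp,1607503'), "\n";
```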

    Anno
    Anno Siegel, Nov 11, 2004
    #8
  9. bettyann wrote:
    > one last clarification, i actually have more than two different cases,
    > ie:
    >
    > s/^(([^,]*,){5}hold.bmp)/$1,0/;
    > s/^(([^,]*,){5}go.bmp)/$1,1/;
    > s/^(([^,]*,){5}slow.bmp)/$1,2/;
    > s/^(([^,]*,){5}speed.bmp)/$1,3/;
    > s/^(([^,]*,){5}NaN)/$1,-1/;
    >
    > so i don't think the "?:" combination would be as straightforward.


    No, but in that case you can use a hash instead. Something like:

    my %hash = (
        'hold.bmp'  => ',0',
        'go.bmp'    => ',1',
        'slow.bmp'  => ',2',
        'speed.bmp' => ',3',
        NaN         => ',-1',
    );

    s/^((?:[^,]*,){5}([^,]+))/$1.($hash{$2} or '')/e;

    After all, parsing thousands of lines once should reasonably be faster
    than doing it five times.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Nov 11, 2004
    #9
  10. bettyann

    bettyann Guest

    > Now this is something that's going to slow it down a bit, matching n times
    > for n possibilities.


    indeed.

    > A hash lets you do them all in one go. Quite simple:
    >
    > my %replace = (
    > 'hold.bmp' => 0,
    > 'go.bmp' => 1,
    > # ...
    > NaN => -1,
    > );
    >
    > Then the five substitutions could become (untested, probably more
    > the spirit than the real thing)
    >
    > s/^(([^,]*,){5}([^,]*))/$1,$replace{ $2}/;


    thanks! this works well. altho i needed to use the $3 capture as
    a key to the hash, ie,

    s/^(([^,]*,){5}([^,]*))/$1,$replace{$3}/;

    as the key is captured with the 3rd open-parenthesis.
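
    A small sketch of how the three capture groups are numbered in that
    pattern. The sample line and variable names are made up for illustration.
    Groups are numbered by their opening parenthesis, and a repeated group
    keeps only its last repetition.

```perl
use strict;
use warnings;

my $line = '5,NaN,NaN,4,32,go.bmp,1607503';

my ($all, $last, $key) = $line =~ /^(([^,]*,){5}([^,]*))/;

print "1: $all\n";    # 5,NaN,NaN,4,32,go.bmp  (everything through column 6)
print "2: $last\n";   # 32,                    (last repetition: column 5 plus comma)
print "3: $key\n";    # go.bmp                 (column 6, the hash key)
```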

    gunnar, thanks, too. altho i found the "e" option in the command
    "s//e" gave me this error so i simply removed the "e":

    Scalar found where operator expected at (eval 4571) line 1, near
    "}${4}"
    (Missing operator before ${4}?)

    thanks for all the help and ideas. i've incorporated hash tables in a
    few other places in my code where they really make the logic cleaner.

    thanks!
    - bettyann
    bettyann, Nov 14, 2004
    #10
  11. bettyann wrote:
    > gunnar, thanks, too. altho i found the "e" option in the command
    > "s//e" gave me this error so i simply removed the "e":
    >
    > Scalar found where operator expected at (eval 4571) line 1, near
    > "}${4}"
    > (Missing operator before ${4}?)


    Well, Anno's and my suggestions weren't identical. The /e modifier makes
    the right side of the s/// operator expect an expression rather than a
    string, and I made use of that to prevent changes (and warnings) for
    possible lines whose sixth column doesn't match any of the hash keys. Only
    you can tell what exactly you need.
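
    To illustrate the difference, here is a sketch that runs both variants on
    a line whose sixth column is not a hash key. The sample line and the
    shortened hashes are assumptions made for the demonstration.

```perl
use strict;
use warnings;

my $line = '5,NaN,NaN,4,32,unknown.bmp,1607503';

# With /e the replacement is an expression, so a missing key falls back to ''.
my %hash = ( 'hold.bmp' => ',0', NaN => ',-1' );
(my $with_e = $line) =~ s/^((?:[^,]*,){5}([^,]+))/$1.($hash{$2} or '')/e;

# Without /e the replacement is an interpolated string: a missing key
# interpolates undef (a warning under "use warnings") and a comma is still added.
my %replace = ( 'hold.bmp' => 0, NaN => -1 );
my $plain = $line;
{
    no warnings 'uninitialized';   # silenced only for this demonstration
    $plain =~ s/^(([^,]*,){5}([^,]*))/$1,$replace{$3}/;
}

print "$with_e\n";   # unchanged
print "$plain\n";    # an empty column has been inserted
```

    The "or ''" fallback works here because the hash values are strings such
    as ',0', which are true; a bare 0 value would be treated as false.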

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Nov 14, 2004
    #11
