regexp s// too greedy

bettyann

hi all,

can anyone help me limit the greediness of my substitution pattern? i
have a CSV file and i want to insert a new column of values after the
6th column. but the new data to be inserted is dependent upon the
value of the 6th column.

example original data:
2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

i want to put "0" after the 6th column if the 6th column contains
"hold.bmp".
i want to put "-1" after the 6th column if the 6th column contains
"NaN".

i thought i could do this with two substitution commands:

s/^((.*?,){5}?(hold.bmp))/$1,0/
s/^((.*?,){5}?(NaN))/$1,-1/

i cannot limit the matching of "hold.bmp" or "NaN". i want this
pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
5th column.

my test code:
#!/usr/local/bin/perl

use strict;
use warnings;

my $input = <<EOF;
2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1
EOF

my @oData = split( /\n/, $input );
my $line;
my $cnt = 0;
foreach $line ( @oData ) {
    print "$cnt) $line \n";
    $cnt++;
}

my $prevCol = 5;
my @txtList = ( "hold.bmp", "NaN" );
my @valList = ( "0", "-1" );
my ( $txt, $cmd, $i );
$i = 0;
foreach $txt ( @txtList ) {
    # build the substitution as a string, then eval it for each line
    $cmd = sprintf( '$line =~ s/^((.*?,){%d}?(%s))/$1,%s/;',
                    $prevCol, $txt, $valList[$i] );
    print "\ncmd >>$cmd<< \n";
    foreach $line ( @oData ) {
        print "orig line |$line| \n";
        eval $cmd;
        print " new line |$line| \n---------------------\n";
    }
    $i++;
}

exit;

output:
% test2.pl
0) 2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1) 1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
2) 5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
3) 8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

cmd >>$line =~ s/^((.*?,){5}?(hold.bmp))/$1,0/;<<
orig line |2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
new line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
---------------------
orig line |1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1|
new line |1,NaN,NaN,NaN,32,hold.bmp,0,1607488,NaN,NaN,NaN,hold.bmp,3,1|
---------------------
orig line |5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1|
new line |5,NaN,NaN,4,32,hold.bmp,0,1607503,NaN,NaN,8,go.bmp,NaN,1|
---------------------
orig line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1|
new line |8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,0,NaN,1|
---------------------

cmd >>$line =~ s/^((.*?,){5}?(NaN))/$1,-1/;<<
orig line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,NaN,NaN,hold.bmp,NaN,1|
new line |2,NaN,NaN,NaN,64,hold.bmp,0,1607444,NaN,-1,NaN,NaN,hold.bmp,NaN,1|
 
Stuart Moore

bettyann said:
hi all,

can anyone help me limit the greediness of my substitution pattern? i
have a CSV file and i want to insert a new column of values after the
6th column. but the new data to be inserted is dependent upon the
value of the 6th column.

example original data:
2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

i want to put "0" after the 6th column if the 6th column contains
"hold.bmp".
i want to put "-1" after the 6th column if the 6th column contains
"NaN".

i thought i could do this with two substitutions commands:

s/^((.*?,){5}?(hold.bmp))/$1,0/
s/^((.*?,){5}?(NaN))/$1,-1/

^ Not sure that you want that ?

I suggest replacing (.*?,) with ([^,]*) assuming there isn't some way of
commas appearing escaped within the data.
 
Stuart Moore

Stuart said:
I suggest replacing (.*?,) with ([^,]*) assuming there isn't some way of
commas appearing escaped within the data.

That should have been ([^,]*,) of course
 
Gunnar Hjalmarsson

bettyann said:
can anyone help me limit the greediness of my substitution pattern? i
have a CSV file and i want to insert a new column of values after the
6th column. but the new data to be inserted is dependent upon the
value of the 6th column.

example original data:
2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

i want to put "0" after the 6th column if the 6th column contains
"hold.bmp".
i want to put "-1" after the 6th column if the 6th column contains
"NaN".

i thought i could do this with two substitutions commands:

s/^((.*?,){5}?(hold.bmp))/$1,0/
s/^((.*?,){5}?(NaN))/$1,-1/

i cannot limit the matching of "hold.bmp" or "NaN". i want this
pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
5th column.

Limiting to a fixed number of occurrences while using '.*' is
contradictory, irrespective of greediness. Besides a few other things, I
believe that the most important change you should make is to get rid of
that problem by replacing the '.' meta character with the character
class '[^,]'. This might do it, using only one substitution:

s/^((?:[^,]*,){5}(?:(hold\.bmp)|NaN))/"$1,".($2 ? '0' : '-1')/e;
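
To see why the character class matters, here is a minimal sketch using the fourth sample row, where column 6 is NaN and hold.bmp only appears in column 11. The variable names are made up for illustration, and the dot in hold.bmp is escaped in both patterns so that the only difference is '.*?' versus '[^,]*'.

```perl
use strict;
use warnings;

# Row 4 of the sample data: column 6 is NaN, column 11 is hold.bmp.
my $row = '8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1';

# '.' also matches commas, so when hold.bmp is not found in column 6 the
# engine backtracks and lets one "field" stretch across several columns.
(my $loose = $row) =~ s/^((.*?,){5}?(hold\.bmp))/$1,0/;

# '[^,]*,' can only ever match one field, so the match is pinned to column 6.
(my $strict = $row) =~ s/^((?:[^,]*,){5}hold\.bmp)/$1,0/;

print "loose:  $loose\n";   # 0 is inserted after column 11
print "strict: $strict\n";  # row is unchanged
```

The loose version reproduces the unwanted result from the original output; the strict version leaves the row alone because column 6 is not hold.bmp.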
 
Anno Siegel

bettyann said:
hi all,

can anyone help me limit the greediness of my substitution pattern? i
have a CSV file and i want to insert a new column of values after the
6th column. but the new data to be inserted is dependent upon the
value of the 6th column.

example original data:
2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1

i want to put "0" after the 6th column if the 6th column contains
"hold.bmp".
i want to put "-1" after the 6th column if the 6th column contains
"NaN".

i thought i could do this with two substitutions commands:

s/^((.*?,){5}?(hold.bmp))/$1,0/
s/^((.*?,){5}?(NaN))/$1,-1/

i cannot limit the matching of "hold.bmp" or "NaN". i want this
pattern to match *only* if "hold.bmp" or "NaN" immediately follows the
5th column.

[code appreciated, but snipped]

I'd use split and splice for that, not a regex (except that split also
uses a regex). Then you can comfortably look at the preceding field
and decide what goes after it. For instance:

while ( <DATA> ) {
    my @l = split /,/;
    splice @l, 6, 0, $l[5] eq 'hold.bmp' ? 0 : -1;
    print join ',', @l;
}

Anno
 
Janek Schleicher

can anyone help me limit the greediness of my substitution pattern? i
have a CSV file and i want to insert a new column of values after the
6th column. but the new data to be inserted is dependent upon the
value of the 6th column.

Well, when talking about handling CSV files, why not use one of the
numerous modules on CPAN (http://search.cpan.org?query=CSV)?
E.g. with Text::CSV_XS the following snippet works without your having to
worry about parsing CSV:

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS;

my $csv = Text::CSV_XS->new();
while (<DATA>) {
    chomp;
    $csv->parse($_) or die "Couldn't parse '$_' as CSV";
    my @col = $csv->fields;
    $csv->combine(@col[0..5],($col[5] eq 'hold.bmp' ? 0 : -1),@col[6..$#col]);
    print $csv->string,"\n";
}

__DATA__
2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1
1,NaN,NaN,NaN,32,hold.bmp,1607488,NaN,NaN,NaN,hold.bmp,3,1
5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1
8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1


Greetings,
Janek
 
bettyann

thanks to everyone who replied -- all suggestions are good.

stuart and gunnar -- using pattern ([^,]*,) rather than (.*?,) works
as i need. i understand now that i need to use a pattern that
describes the negative of what i want rather than a pattern that
describes what i *do* want. thanks for the suggestion and the new way
of thinking.

len and anno -- i did consider using split/join but since the CSV file
has thousands of lines, i thought maybe regexp might be faster. i'm
not sure, tho, as i haven't done a benchmark.

janek -- Text::CSV_XS looks really nice. i'll certainly investigate
this package more in the future.

one last clarification, i actually have more than two different cases,
ie:

s/^(([^,]*,){5}hold.bmp)/$1,0/;
s/^(([^,]*,){5}go.bmp)/$1,1/;
s/^(([^,]*,){5}slow.bmp)/$1,2/;
s/^(([^,]*,){5}speed.bmp)/$1,3/;
s/^(([^,]*,){5}NaN)/$1,-1/;

so i don't think the "?:" combination would be as straightforward.

thanks for all the help. greatly appreciated.
- bettyann
 
Anno Siegel

bettyann said:
thanks to everyone who replied -- all suggestions are good.
[...]

len and anno -- i did consider using split/join but since the CSV file
has thousands of lines, i thought maybe regexp might be faster. i'm
not sure, tho, as i haven't done a benchmark.

I don't think split will be significantly slower than a regex solution.
While split *implies* the use of a regex for the delimiter, that is
usually a very simple one which will predictably perform well enough.
The rest split does is (in principle, not in detail) what a capturing
regex does too. The performance of a pure-regex solution is much
harder to predict.

If anything, splice may slow it down a bit, but no more than the actual
substitution slows down the "regex" solution. I wouldn't expect a
significant difference between split and regex, but if there is, I'd
expect the regex to be slower.
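
For anyone who wants to measure rather than guess, here is a rough sketch using the core Benchmark module. The sub names via_split and via_regex are made up for this example, and both use the simplified two-case rule (0 for hold.bmp, -1 otherwise). Timings depend on the machine and the perl version, so no numbers are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $row = "2,NaN,NaN,NaN,64,hold.bmp,1607444,NaN,NaN,NaN,hold.bmp,NaN,1\n";

# split/splice approach: insert 0 or -1 as a new 7th field.
sub via_split {
    my @f = split /,/, shift;
    splice @f, 6, 0, $f[5] eq 'hold.bmp' ? 0 : -1;
    return join ',', @f;
}

# Single substitution applying the same rule.
sub via_regex {
    my $line = shift;
    $line =~ s/^((?:[^,]*,){5}([^,]*))/"$1," . ($2 eq 'hold.bmp' ? 0 : -1)/e;
    return $line;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    'split' => sub { via_split($row) },
    'regex' => sub { via_regex($row) },
} );
```

Both subs return the same string for the sample row, so the comparison is between equivalent pieces of work.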

janek -- Text::CSV_XS looks really nice. i'll certainly investigate
this package more in the future.

one last clarification, i actually have more than two different cases,
ie:

s/^(([^,]*,){5}hold.bmp)/$1,0/;
s/^(([^,]*,){5}go.bmp)/$1,1/;
s/^(([^,]*,){5}slow.bmp)/$1,2/;
s/^(([^,]*,){5}speed.bmp)/$1,3/;
s/^(([^,]*,){5}NaN)/$1,-1/;

so i don't think the "?:" combination would be as straight forward.

Now this is something that's going to slow it down a bit, matching n times
for n possibilities. A hash lets you do them all in one go. Quite simple:

my %replace = (
    'hold.bmp' => 0,
    'go.bmp' => 1,
    # ...
    NaN => -1,
);

Then the five substitutions could become (untested, probably more
the spirit than the real thing)

s/^(([^,]*,){5}([^,]*))/$1,$replace{ $2}/;

But I can't say I like the regex you're using. Only a short regex
is a good regex, that one is much too long. I still favor the
split solution, if only because it works on the actual data, not
their messy representation. The hash can be used with that too,
in the obvious way.
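
One way the split solution and the hash might be combined, as a sketch only. The sub name add_code_column is made up, the hash values are the five cases listed earlier in the thread, and leaving rows with an unknown sixth column unchanged is an assumption about what is wanted.

```perl
use strict;
use warnings;

my %replace = (
    'hold.bmp'  => 0,
    'go.bmp'    => 1,
    'slow.bmp'  => 2,
    'speed.bmp' => 3,
    'NaN'       => -1,
);

# Insert the code for column 6 as a new column 7.
# Assumption: rows whose column 6 is not a known key are returned unchanged.
sub add_code_column {
    my ($line) = @_;
    chomp $line;
    my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
    return $line unless @fields >= 6 && exists $replace{ $fields[5] };
    splice @fields, 6, 0, $replace{ $fields[5] };
    return join ',', @fields;
}

print add_code_column($_), "\n" for (
    '5,NaN,NaN,4,32,hold.bmp,1607503,NaN,NaN,8,go.bmp,NaN,1',
    '8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1',
);
```

The lookup happens once per line, however many cases the hash holds.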

Anno
 
Gunnar Hjalmarsson

bettyann said:
one last clarification, i actually have more than two different cases,
ie:

s/^(([^,]*,){5}hold.bmp)/$1,0/;
s/^(([^,]*,){5}go.bmp)/$1,1/;
s/^(([^,]*,){5}slow.bmp)/$1,2/;
s/^(([^,]*,){5}speed.bmp)/$1,3/;
s/^(([^,]*,){5}NaN)/$1,-1/;

so i don't think the "?:" combination would be as straight forward.

No, but in that case you can use a hash instead. Something like:

my %hash = (
    'hold.bmp' => ',0',
    'go.bmp' => ',1',
    'slow.bmp' => ',2',
    'speed.bmp' => ',3',
    NaN => ',-1',
);

s/^((?:[^,]*,){5}([^,]+))/$1.($hash{$2} or '')/e;

After all, parsing thousands of lines once should reasonably be faster
than doing it five times.
 
bettyann

Now this is something that's going to slow it down a bit, matching n times
for n possibilities.

indeed.

A hash lets you do them all in one go. Quite simple:

my %replace = (
    'hold.bmp' => 0,
    'go.bmp' => 1,
    # ...
    NaN => -1,
);

Then the five substitutions could become (untested, probably more
the spirit than the real thing)

s/^(([^,]*,){5}([^,]*))/$1,$replace{ $2}/;

thanks! this works well. altho i needed to use the $3 capture as
a key to the hash, ie,

s/^(([^,]*,){5}([^,]*))/$1,$replace{$3}/;

as the key is captured with the 3rd open-parenthesis.
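
A small sketch of the numbering rule, for reference: capture groups are numbered by the position of their opening parenthesis, and a repeated group keeps only its last repetition. Switching the inner group to the non-capturing form (?:...) shifts the sixth field down to $2.

```perl
use strict;
use warnings;

my $row = '8,NaN,NaN,4,32,NaN,1607564,NaN,NaN,8,hold.bmp,NaN,1';

# Three capturing groups: the whole prefix, the repeated field, the 6th field.
if ( $row =~ /^(([^,]*,){5}([^,]*))/ ) {
    print "\$2 = '$2'\n";    # last repetition of the inner group: '32,'
    print "\$3 = '$3'\n";    # the 6th field: 'NaN'
}

# With a non-capturing (?:...) group, the 6th field becomes $2.
if ( $row =~ /^((?:[^,]*,){5}([^,]*))/ ) {
    print "\$2 = '$2'\n";    # 'NaN'
}
```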

gunnar, thanks, too. altho i found the "e" option in the command
"s//e" gave me this error so i simply removed the "e":

Scalar found where operator expected at (eval 4571) line 1, near
"}${4}"
(Missing operator before ${4}?)

thanks for all the help and ideas. i've incorporated hash tables in a
few other places in my code where they really make the logic cleaner.

thanks!
- bettyann
 
Gunnar Hjalmarsson

bettyann said:
gunnar, thanks, too. altho i found the "e" option in the command
"s//e" gave me this error so i simply removed the "e":

Scalar found where operator expected at (eval 4571) line 1, near
"}${4}"
(Missing operator before ${4}?)

Well, Anno's and my suggestions weren't identical. The /e modifier makes
the right side of the s/// operator expect an expression rather than a
string, and I made use of that to prevent changes (and warnings) for
possible lines whose sixth column doesn't match any of the hash keys. Only
you can tell what exactly you need.
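
The difference can be sketched like this. The value 'other.bmp' is made up to stand for a sixth column that is not a hash key, and %code is a cut-down hash with numeric values, so the /e version here tests with exists rather than "or" (a value of 0 would otherwise count as false).

```perl
use strict;
use warnings;

my %code = ( 'hold.bmp' => 0, 'NaN' => -1 );

# 'other.bmp' is a made-up value that is not a key in %code.
my $row = '5,NaN,NaN,4,32,other.bmp,1607503';

# Collect warnings instead of printing them, so they can be reported below.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

# Plain replacement string: an unknown key interpolates as empty (with a
# warning), so the line gains an empty column.
(my $plain = $row) =~ s/^((?:[^,]*,){5}([^,]*))/$1,$code{$2}/;

# /e replacement expression: an unknown key adds nothing, so the line is unchanged.
(my $expr = $row) =~ s/^((?:[^,]*,){5}([^,]+))/$1 . (exists $code{$2} ? ",$code{$2}" : '')/e;

print "plain: $plain\n";
print "expr:  $expr\n";
print "plain version warned: ", ( @warnings ? "yes" : "no" ), "\n";
```

If every line is known to have one of the listed values in its sixth column, the two forms behave the same and the plain one is enough.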
 
