expression specific search and replace

qanda

Hi all

I've just started with Perl again and would like some help with the following.
I have files that contain records like the following (I've used comma as the
delimiter but in real life it is octal 177)...

field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6

I want to find a pattern such as /C\/w+/ (I believe?) and then replace it
with string_patternNumber. Each different pattern that is found would be
assigned an incremental number and each pattern would then be replaced by
a text string plus the pattern number. The pattern can appear any number
of times in a record.

So we could end up with something like ...

field1,ABC/string_1 ef34,field3,field4,EFC/string_1 ef56,field6
field1,XBC/string_1 ef34,field3,field4,EFC/string_2 ef56,field6
field1,YBC/string_1 ef34,field3,field4,EFC/string_3 ef56,field6


My other problem is with modifying ARGV after doing a readdir with grep.
I want to match a subset of several similar file patterns.

aa_b_aba_kdkgh.ext
aa_b_bcb_kdkgh.ext
aab_b_def_kdkgh_ueyd.ext
aa_b_abc_kdkgh.ext
aab_b_abc_kdkgh_kdkdk.ext
aab_b_gag_kdkgh.ext
aab_b_abc_kdkgh.ext
aab_b_abc_kdkgh.ext

so the aa.?_ part is common at the beginning and the _.+\.ext is common at the
end, but I only want aba, def and gag in the middle.

Any help is greatly appreciated.

Thanks.
 
Vlad Tepes

qanda said:
Hi all

I've just started with Perl again and would like some help with the
following. I have files that contain records like the following (I've
used comma as the delimiter but in real life it is octal 177)...

field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6

I want to find a pattern such as /C\/w+/ (I believe?) and then replace
it with string_patternNumber. Each different pattern that is found
would be assigned an incremental number and each pattern would then be
replaced by a text string plus the pattern number. The pattern can
appear any number of times in a record.

So we could end up with something like ...

field1,ABC/string_1 ef34,field3,field4,EFC/string_1 ef56,field6
field1,XBC/string_1 ef34,field3,field4,EFC/string_2 ef56,field6
field1,YBC/string_1 ef34,field3,field4,EFC/string_3 ef56,field6

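I'll assume the first string_N in each record is to be incremented also.

How about:

#!/usr/bin/perl
use strict;
use warnings;

my $count = 0;
while ( <DATA> ) {
    $count++;                        # one number per record
    s#(?<=C/)\w+#string_$count#g;    # replace the word after every "C/"
    print;
}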

__DATA__
field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6

( Output:

field1,ABC/string_1 ef34,field3,field4,EFC/string_1 ef56,field6
field1,XBC/string_2 ef34,field3,field4,EFC/string_2 ef56,field6
field1,YBC/string_3 ef34,field3,field4,EFC/string_3 ef56,field6
)

My other problem is with modifying ARGV after doing a readdir with grep.
I want to match a subset of several similar file patterns.

aa_b_aba_kdkgh.ext
aa_b_bcb_kdkgh.ext
aab_b_def_kdkgh_ueyd.ext
aa_b_abc_kdkgh.ext
aab_b_abc_kdkgh_kdkdk.ext
aab_b_gag_kdkgh.ext
aab_b_abc_kdkgh.ext
aab_b_abc_kdkgh.ext

so the aa.?_ part is common at the beginning and the _.+\.ext is
common at the end, but I only want aba, def and gag in the middle.

Any help is greatly appreciated.

Thanks.

This loops over filenames with suffix '.ext' in the current directory:

foreach ( <*.ext> ) {
    next unless /^aa.?_/;                # skip unless wanted beginning
    ## next unless /_.+\.ext$/;          # .. end (unneeded)
    print "$_\n" if /_(aba|def|gag)_/;   # print if it contains _aba_, ...
}
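Since the question mentions readdir with grep and modifying ARGV, here is a minimal sketch of that route as well. It assumes the files live in the current directory, and the helper name wanted_files is made up for illustration; the three tests are the same ones used in the loop above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep names that start with aa.?_, contain
# _aba_, _def_ or _gag_, and end in .ext.
sub wanted_files {
    my @names = @_;
    return grep { /^aa.?_/ && /_(?:aba|def|gag)_/ && /\.ext$/ } @names;
}

# Read the directory once, filter the names, and assign them to @ARGV
# so that a later while (<>) loop reads exactly those files.
opendir(my $dh, '.') or die "Cannot open current directory: $!";
my @names = readdir $dh;
closedir $dh;

@ARGV = wanted_files(sort @names);
print "$_\n" for @ARGV;
```

After this, a plain while (<>) { ... } loop should read only the selected files, one after another.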



Hope this helps,
 
Anno Siegel

qanda said:
Hi all

I've just started with Perl again and would like some help with the following.
I have files that contain records like the following (I've used comma as the
delimiter but in real life it is octal 177)...

field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6

I want to find a pattern such as /C\/w+/ (I believe?) and then replace it

"It"? Your example data show the pattern unchanged. You seem to be
replacing what is between "/" and the following blank.

with string_patternNumber. Each different pattern that is found would be
assigned an incremental number and each pattern would then be replaced by
a text string plus the pattern number. The pattern can appear any number
of times in a record.

Since both the patterns you want to match and the strings you want to
replace vary in your data, it is hard to determine which count should
go up when. I am ignoring your imprecise description and going
with the example.

So we could end up with something like ...

field1,ABC/string_1 ef34,field3,field4,EFC/string_1 ef56,field6
field1,XBC/string_1 ef34,field3,field4,EFC/string_2 ef56,field6
field1,YBC/string_1 ef34,field3,field4,EFC/string_3 ef56,field6

my %count;
while ( <DATA> ) {
    s{(..C/)\w+}{ $count{$1}++; "${1}string_$count{$1}" }eg;
    print;
}

My other problem is with modifying ARGV after doing a readdir with grep.

If you have two independent problems, it's better to start two independent
threads.

[snip]

Anno
 
Tad McClellan

qanda said:
I want to find a pattern such as /C\/w+/ (I believe?) and then replace it
with string_patternNumber. Each different pattern that is found would be
assigned an incremental number and each pattern would then be replaced by
a text string plus the pattern number. The pattern can appear any number
of times in a record.


----------------------------------
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
while ( <DATA> ) {
    s#([^,]*C/)\S+# $1 . 'string_' . ++$seen{$1} #ge;
    print;
}


__DATA__
field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6
 
qanda

Sorry for not being precise.

The expression I gave was an (obviously wrong) guess.

The pattern I want to look for is an uppercase letter C, followed by a
forward slash, followed by alphanumeric characters. The pattern can
start at the beginning of a field or in the middle, and it ends at
whitespace following the alphanumeric characters or at the end-of-field
character. The pattern can be in any field position such as field 1,
field 3, field n, etc. and can be in 0 or more fields in one record.

The data itself is spread over 50,000 to 500,000 files, each
containing several hundred thousand records. These could contain say
100,000 unique strings that match this pattern, for example
C/abc
C/Abc
C/1DE

Every occurrence of C/abc should be replaced by string_1, every
occurrence of C/Abc should be replaced by string_2, etc.

I think that I want the following (in pseudocode) but would appreciate
an example of this or something better considering the performance
running against millions of records. I would assume it makes sense to
use a hash to store each pattern found and then a search and replace
with a counter ...

for all files matching file specification
    open a file
    read a record
    for each PATTERN in record
        if PATTERN exists in the pattern hash
            replace the part that matched with string_patternNumber
        else
            add PATTERN to hash
        endif
    endfor
endfor

This may be nonsense so feel free to beat me up for it! However I
hope it explains the problem a bit better.

Thanks.
 
John Bokma

qanda said:
Sorry for not being precise.

The expression I gave was an (obviously wrong) guess.

The pattern I want to look for is an uppercase letter C, followed by a
forward slash, followed by alphanumeric characters. The pattern can
start at the beginning of a field or in the middle, and it ends at
whitespace following the alphanumeric characters or at the end-of-field
character. The pattern can be in any field position such as field 1,
field 3, field n, etc. and can be in 0 or more fields in one record.


s|(C/[a-z0-9]+)| $hash{$1} |gie;

The data itself is spread over 50,000 to 500,000 files, each
containing several hundred thousand records. These could contain say
100,000 unique strings that match this pattern, for example
C/abc
C/Abc
C/1DE

Every occurrence of C/abc should be replaced by string_1, every
occurrence of C/Abc should be replaced by string_2, etc.

I assume you mean a lookup table?

I think that I want the following (in pseudocode) but would appreciate
an example of this or something better considering the performance
running against millions of records. I would assume it makes sense to
use a hash to store each pattern found and then a search and replace
with a counter ...

for all files matching file specification
    open a file
    read a record
    for each PATTERN in record
        if PATTERN exists in the pattern hash
            replace the part that matched with string_patternNumber
        else
            add PATTERN to hash

and then??

Ah, ok, something like:


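open(FILE, ...) or die ...

my %hash;
my $number = 1;
while (defined(my $line = <FILE>)) {

    # The assignment needs parentheses: ?: binds tighter than =, so
    # without them a new number is assigned on every match.
    $line =~ s{(C/[a-z0-9]+)}{
        defined $hash{$1} ? $hash{$1} : ($hash{$1} = $number++)
    }gie;

    print $line; # guess
}
close(FILE) or die ....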

The g means global (every match on the current line)
The i means ignore case
The e means the "replace" part is evaluated as a Perl expression

The string_1 is still not clear to me.
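
As a small illustration of those three modifiers, here is a sketch on a made-up sample string (the variable names are only for this example):

```perl
use strict;
use warnings;

my $text = 'C/abc c/ABC C/abc';    # made-up sample string

# g alone: every case-sensitive match on the line is replaced
(my $with_g = $text) =~ s{C/[a-z]+}{X}g;

# g plus i: "c/ABC" now matches as well
(my $with_gi = $text) =~ s{C/[a-z]+}{X}gi;

# g, i and e: the replacement is Perl code, evaluated once per match
my $n = 0;
(my $with_gie = $text) =~ s{C/[a-z]+}{ ++$n }gie;

print "$with_g\n";      # X c/ABC X
print "$with_gi\n";     # X X X
print "$with_gie\n";    # 1 2 3
```

Note that with i an "uppercase C" requirement is no longer enforced, so dropping the i may be closer to the description.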
 
qanda

Thanks Tad, as always you make me look at things in a different way.

If we extend the data ...
__DATA__
field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab13cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6

The result is ...

field1,ABC/string_1 ef34,field3,field4,EFC/string_1 ef56,field6
field1,XBC/string_1 ef34,field3,field4,EFC/string_2 ef56,field6
field1,YBC/string_1 ef34,field3,field4,EFC/string_3 ef56,field6
field1,YBC/string_2 ef34,field3,field4,EFC/string_4 ef56,field6
field1,YBC/string_3 ef34,field3,field4,EFC/string_5 ef56,field6
field1,YBC/string_4 ef34,field3,field4,EFC/string_6 ef56,field6
field1,YBC/string_5 ef34,field3,field4,EFC/string_7 ef56,field6
field1,YBC/string_6 ef34,field3,field4,EFC/string_8 ef56,field6

However the unique parts and their replacements should be ...

all C/ab12cd replaced by string_1
all C/ab13cd replaced by string_2
all C/ab13ce replaced by string_3
all C/ab14cd replaced by string_4

Thanks again.
 
Tad McClellan

qanda said:
Thanks Tad, as always you make me look at things in a different way.

If we extend the data ...
__DATA__
field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab13cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6

However the unique parts and their replacements should be ...

all C/ab12cd replaced by string_1
all C/ab13cd replaced by string_2
all C/ab13ce replaced by string_3
all C/ab14cd replaced by string_4


my %seen;
my $cnt;
while ( <DATA> ) {
    s#C/(\S+)# $seen{$1} = ++$cnt unless $seen{$1}; "C/string_$seen{$1}" #ge;
    print;
}
 
John Bokma

qanda said:
However the unique parts and their replacements should be ...

all C/ab12cd replaced by string_1
all C/ab13cd replaced by string_2
all C/ab13ce replaced by string_3
all C/ab14cd replaced by string_4

#!/usr/bin/perl -w

use strict;

my %hash;
my $cnt = 1;

while (my $line = <DATA>) {

    $line =~ s{(C/\S+)}{
        defined $hash{$1} ? $hash{$1} :
        ($hash{$1} = "string_" . $cnt++);
    }ge;
    print $line;

}


__DATA__
field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab13cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6


Gives:

field1,ABstring_1 ef34,field3,field4,EFstring_1 ef56,field6
field1,XBstring_1 ef34,field3,field4,EFstring_2 ef56,field6
field1,YBstring_1 ef34,field3,field4,EFstring_3 ef56,field6
field1,YBstring_2 ef34,field3,field4,EFstring_3 ef56,field6
field1,YBstring_4 ef34,field3,field4,EFstring_3 ef56,field6
field1,YBstring_4 ef34,field3,field4,EFstring_3 ef56,field6
field1,YBstring_4 ef34,field3,field4,EFstring_3 ef56,field6
field1,YBstring_4 ef34,field3,field4,EFstring_3 ef56,field6
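
For the real data, where the delimiter is octal 177 rather than a comma, \S+ may run past the end of a field when no blank follows the match. A hedged sketch of one way around that, assuming Perl 5.10 or later (for //=), that the "C/" prefix should be kept, and with the names renumber, %id_for and $next_id made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %id_for;         # matched text => number, shared by all records and files
my $next_id = 1;

# Replace the alphanumeric run after each "C/" with string_N, keeping "C/".
# [A-Za-z0-9]+ is used instead of \S+ so a match stops at the \177 delimiter.
sub renumber {
    my ($line) = @_;
    $line =~ s{C/([A-Za-z0-9]+)}{ 'C/string_' . ($id_for{$1} //= $next_id++) }ge;
    return $line;
}

# Demo on two made-up records that use octal 177 as the delimiter.
my @records = (
    "field1\177ABC/ab12cd ef34\177EFC/ab13cd\177field6\n",
    "field1\177XBC/ab13cd ef34\177EFC/ab12cd\177field6\n",
);
print renumber($_) for @records;
```

Because the hash lives outside the loop, the numbering stays consistent across files; with the file names in @ARGV, something like print renumber($_) while <>; would process them all in one run.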
 
