expression specific search and replace

Discussion in 'Perl Misc' started by qanda, Sep 4, 2003.

  1. qanda

    qanda Guest

    Hi all

    I've just started with Perl again and would like some help with the following.
    I have files that contain records like the following (I've used comma as the
    delimiter but in real life it is octal 177)...

    field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
    field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
    field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6

    I want to find a pattern such as /C\/w+/ (I believe?) and then replace it
    with string_patternNumber. Each different pattern that is found would be
    assigned an incremental number and each pattern would then be replaced by
    a text string plus the pattern number. The pattern can appear any number
    of times in a record.

    So we could end up with something like ...

    field1,ABC/string_1 ef34,field3,field4,EFC/string_1 ef56,field6
    field1,XBC/string_1 ef34,field3,field4,EFC/string_2 ef56,field6
    field1,YBC/string_1 ef34,field3,field4,EFC/string_3 ef56,field6


    My other problem is with modifying ARGV after doing a readdir with grep.
    I want to match a subset of several similar file patterns.

    aa_b_aba_kdkgh.ext
    aa_b_bcb_kdkgh.ext
    aab_b_def_kdkgh_ueyd.ext
    aa_b_abc_kdkgh.ext
    aab_b_abc_kdkgh_kdkdk.ext
    aab_b_gag_kdkgh.ext
    aab_b_abc_kdkgh.ext
    aab_b_abc_kdkgh.ext

    so the aa.?_ part is common at the beginning and the _.+\.ext is common at the
    end, but I only want aba, def and gag in the middle.

    Any help is greatly appreciated.

    Thanks.
    qanda, Sep 4, 2003
    #1

  2. qanda

    Vlad Tepes Guest

    qanda <> wrote:

    > Hi all
    >
    > I've just started with Perl again and would like some help with the
    > following. I have files that contain records like the following (I've
    > used comma as the delimiter but in real life it is octal 177)...
    >
    > field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
    > field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
    > field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6
    >
    > I want to find a pattern such as /C\/w+/ (I believe?) and then replace
    > it with string_patternNumber. Each different pattern that is found
    > would be assigned an incremental number and each pattern would then be
    > replaced by a text string plus the pattern number. The pattern can
    > appear any number of times in a record.
    >
    > So we could end up with something like ...
    >
    > field1,ABC/string_1 ef34,field3,field4,EFC/string_1 ef56,field6
    > field1,XBC/string_1 ef34,field3,field4,EFC/string_2 ef56,field6
    > field1,YBC/string_1 ef34,field3,field4,EFC/string_3 ef56,field6


    ^^^
    I'll assume these are to be incremented also

    How about:

    #!/usr/bin/perl

    my $count = 0;
    while ( <DATA> ) {
        $count++;
        s#(?<=C/)\w+#string_$count#g;
        print;
    }

    __DATA__
    field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
    field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
    field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6

    ( Output:

    field1,ABC/string_1 ef34,field3,field4,EFC/string_1 ef56,field6
    field1,XBC/string_2 ef34,field3,field4,EFC/string_2 ef56,field6
    field1,YBC/string_3 ef34,field3,field4,EFC/string_3 ef56,field6
    )


    > My other problem is with modifying ARGV after doing a readdir with grep.
    > I want to match a subset of several similar file patterns.
    >
    > aa_b_aba_kdkgh.ext
    > aa_b_bcb_kdkgh.ext
    > aab_b_def_kdkgh_ueyd.ext
    > aa_b_abc_kdkgh.ext
    > aab_b_abc_kdkgh_kdkdk.ext
    > aab_b_gag_kdkgh.ext
    > aab_b_abc_kdkgh.ext
    > aab_b_abc_kdkgh.ext
    >
    > so the aa.?_ part is common at the beginning and the _.+\.ext is
    > common at the end, but I only want aba, def and gag in the middle.
    >
    > Any help is greatly appreciated.
    >
    > Thanks.


    This loops over filenames with suffix '.ext' in current directory:

    foreach ( <*.ext> ) {
        next unless /^aa.?_/;                # skip unless wanted beginning
        ## next unless /_.+\.ext$/;          # .. end (unneeded)
        print "$_\n" if /_(aba|def|gag)_/;   # print if it contains _aba_, ...
    }
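
Since the question mentions readdir with grep and ARGV, here is a minimal sketch of that route, assuming the files sit in the current directory. The helper name wanted_files is made up for illustration; the three tests mirror the loop above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep names with the wanted beginning, one of the
# wanted middle parts, and the .ext suffix.
sub wanted_files {
    my @names = @_;
    return grep { /^aa.?_/ && /_(?:aba|def|gag)_/ && /\.ext$/ } @names;
}

my $dir = '.';    # assumption: the files live in the current directory
opendir(my $dh, $dir) or die "Cannot open $dir: $!";
my @found = wanted_files(readdir $dh);
closedir $dh;

# Assigning to @ARGV makes a later while (<>) loop read exactly these files.
@ARGV = map { "$dir/$_" } sort @found;
print "$_\n" for @ARGV;
```

readdir returns bare names without the directory part, which is why the sketch prepends $dir before storing them in @ARGV.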



    Hope this helps,
    --
    Vlad
    Vlad Tepes, Sep 4, 2003
    #2

  3. qanda

    Anno Siegel Guest

    qanda <> wrote in comp.lang.perl.misc:
    > Hi all
    >
    > I've just started with Perl again and would like some help with the following.
    > I have files that contain records like the following (I've used comma as the
    > delimiter but in real life it is octal 177)...
    >
    > field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
    > field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
    > field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6
    >
    > I want to find a pattern such as /C\/w+/ (I believe?) and then replace it

    ^^
    It? Your example data show the pattern unchanged. You seem to be
    replacing what is between "/" and the following blank.

    > with string_patternNumber. Each different pattern that is found would be
    > assigned an incremental number and each pattern would then be replaced by
    > a text string plus the pattern number. The pattern can appear any number
    > of times in a record.


    Since both the patterns you want to match and the strings you want to
    replace vary in your data, it is hard to determine when the count for
    what should go up. I am ignoring your imprecise description and going
    with the example.

    > So we could end up with something like ...
    >
    > field1,ABC/string_1 ef34,field3,field4,EFC/string_1 ef56,field6
    > field1,XBC/string_1 ef34,field3,field4,EFC/string_2 ef56,field6
    > field1,YBC/string_1 ef34,field3,field4,EFC/string_3 ef56,field6


    my %count;
    while ( <DATA> ) {
        s{(..C/)\w+}{ $count{ $1}++; "$1string_$count{ $1}"}eg;
        print;
    }


    > My other problem is with modifying ARGV after doing a readdir with grep.


    If you have two independent problems, it's better to start two independent
    threads.

    [snip]

    Anno
    Anno Siegel, Sep 4, 2003
    #3
  4. qanda <> wrote:

    > I want to find a pattern such as /C\/w+/ (I believe?) and then replace it
    > with string_patternNumber. Each different pattern that is found would be
    > assigned an incremental number and each pattern would then be replaced by
    > a text string plus the pattern number. The pattern can appear any number
    > of times in a record.



    ----------------------------------
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;
    while ( <DATA> ) {
        s#([^,]*C/)\S+# $1 . 'string_' . ++$seen{$1} #ge;
        print;
    }


    __DATA__
    field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
    field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
    field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6
    ----------------------------------


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Sep 4, 2003
    #4
  5. qanda

    qanda Guest

    Sorry for not being precise.

    The expression I gave was an (obviously wrong) guess.

    The pattern I want to look for is an uppercase letter C, followed by a
    forward slash, followed by alphanumeric characters. The pattern can
    start at the beginning of a field or in the middle, and it ends at
    whitespace following the alphanumeric characters or at the end-of-field
    character. It can be in any field position, such as field 1,
    field 3, field n, etc., and can be in 0 or more fields in one record.

    The data itself is spread over 50,000 to 500,000 files, each
    containing several hundred thousand records. These could contain say
    100,000 unique strings that match this pattern, for example
    C/abc
    C/Abc
    C/1DE

    Every occurrence of C/abc should be replaced by string_1, every
    occurrence of C/Abc should be replaced by string_2, etc.

    I think that I want the following (in pseudocode) but would appreciate
    an example of this or something better considering the performance
    running against millions of records. I would assume it makes sense to
    use a hash to store each pattern found and then a search and replace
    with a counter ...

    for all files matching file specification
        open a file
        read a record
        for each PATTERN in record
            if PATTERN exists in the pattern hash
                replace the part that matched with string_patternNumber
            else
                add PATTERN to hash
            endif
        endfor
    endfor

    This may be nonsense so feel free to beat me up for it! However I
    hope it explains the problem a bit better.

    Thanks.
    qanda, Sep 4, 2003
    #5
  6. qanda

    John Bokma Guest

    qanda wrote:

    > Sorry for not being precise.
    >
    > The expression I gave was an (obviously wrong) guess.
    >
    > The pattern I want to look for is an uppercase letter C, followed by a
    > forward slash, followed by alphanumeric characters. The pattern can
    > start at the beginning of a field or in the middle, and it ends at
    > whitespace following the alphanumeric characters or at the end-of-field
    > character. It can be in any field position, such as field 1,
    > field 3, field n, etc., and can be in 0 or more fields in one record.



    s|(C/[a-z0-9]+)| $hash{$1} |gie;

    > The data itself is spread over 50,000 to 500,000 files, each
    > containing several hundred thousand records. These could contain say
    > 100,000 unique strings that match this pattern, for example
    > C/abc
    > C/Abc
    > C/1DE
    >
    > Every occurrence of C/abc should be replaced by string_1, every
    > occurrence of C/Abc should be replaced by string_2, etc.


    I assume you mean a look up table?

    > I think that I want the following (in pseudocode) but would appreciate
    > an example of this or something better considering the performance
    > running against millions of records. I would assume it makes sense to
    > use a hash to store each pattern found and then a search and replace
    > with a counter ...
    >
    > for all files matching file specification
    >     open a file
    >     read a record
    >     for each PATTERN in record
    >         if PATTERN exists in the pattern hash
    >             replace the part that matched with string_patternNumber
    >         else
    >             add PATTERN to hash


    and then??

    Ah, ok, something like:


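    open(FILE, ...) or die ...

    my %hash;
    my $number = 1;
    while (defined(my $line = <FILE>)) {

        # Reuse the number already stored for this match, or store the next one.
        # The parentheses matter: ?: binds tighter than =, so without them
        # the assignment would run on every match.
        $line =~ s{(C/[a-z0-9]+)}{
            defined $hash{$1} ? $hash{$1} : ($hash{$1} = $number++)
        }gie;

        print $line; # guess
    }
    close(FILE) or die ....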

    The g means global (for each on the current line)
    The i means ignore case
    The e means the "replace" part can be an expression
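
A small sketch of what each modifier changes, using a made-up sample string (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $text = 'C/abc c/ABC C/abc';

(my $once   = $text) =~ s{C/\w+}{X};                    # no flags: first match only
(my $global = $text) =~ s{C/\w+}{X}g;                   # g: every match on the line
(my $nocase = $text) =~ s{C/\w+}{X}gi;                  # i: the lowercase c/ matches too
(my $expr   = $text) =~ s{C/(\w+)}{ 'C/' . uc $1 }ge;   # e: replacement is evaluated as code

print "$_\n" for $once, $global, $nocase, $expr;
```

Note that with i the leading C also matches a lowercase c, which may or may not be wanted here, since the later examples treat C/abc and C/Abc as different strings.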

    The string_1 is still not clear to me.


    --
    Kind regards, feel free to mail: mail(at)johnbokma.com (or reply)
    virtual home: http://johnbokma.com/ ICQ: 218175426
    John web site hints: http://johnbokma.com/websitedesign/
    John Bokma, Sep 4, 2003
    #6
  7. qanda

    qanda Guest

    Thanks Tad, as always you make me look at things in a different way.

    If we extend the data ...
    __DATA__
    field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
    field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
    field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6
    field1,YBC/ab13cd ef34,field3,field4,EFC/ab13ce ef56,field6
    field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
    field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
    field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
    field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6

    The result is ...

    field1,ABC/string_1 ef34,field3,field4,EFC/string_1 ef56,field6
    field1,XBC/string_1 ef34,field3,field4,EFC/string_2 ef56,field6
    field1,YBC/string_1 ef34,field3,field4,EFC/string_3 ef56,field6
    field1,YBC/string_2 ef34,field3,field4,EFC/string_4 ef56,field6
    field1,YBC/string_3 ef34,field3,field4,EFC/string_5 ef56,field6
    field1,YBC/string_4 ef34,field3,field4,EFC/string_6 ef56,field6
    field1,YBC/string_5 ef34,field3,field4,EFC/string_7 ef56,field6
    field1,YBC/string_6 ef34,field3,field4,EFC/string_8 ef56,field6

    However the unique parts and their replacements should be ...

    all C/ab12cd replaced by string_1
    all C/ab13cd replaced by string_2
    all C/ab13ce replaced by string_3
    all C/ab14cd replaced by string_4

    Thanks again.
    qanda, Sep 5, 2003
    #7
  8. qanda <> wrote:
    > Thanks Tad, as always you make me look at things in a different way.
    >
    > If we extend the data ...
    > __DATA__
    > field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
    > field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
    > field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6
    > field1,YBC/ab13cd ef34,field3,field4,EFC/ab13ce ef56,field6
    > field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
    > field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
    > field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
    > field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6



    > However the unique parts and their replacements should be ...
    >
    > all C/ab12cd replaced by string_1
    > all C/ab13cd replaced by string_2
    > all C/ab13ce replaced by string_3
    > all C/ab14cd replaced by string_4



    my %seen;
    my $cnt;
    while ( <DATA> ) {
        s#C/(\S+)# $seen{$1} = ++$cnt unless $seen{$1}; "C/string_$seen{$1}" #ge;
        print;
    }
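
One caveat, assuming the real delimiter is octal 177 as the original post says: \177 is not whitespace, so \S+ would run straight through it into the next field. A sketch that matches only alphanumerics instead (the helper name renumber and the sample record are made up for illustration):

```perl
use strict;
use warnings;

my (%seen, $cnt);

# Hypothetical helper: give each distinct C/<alphanumerics> token its own number.
sub renumber {
    my ($record) = @_;
    # ||= is safe here because the stored numbers start at 1, never 0.
    $record =~ s{C/([A-Za-z0-9]+)}{ 'C/string_' . ($seen{$1} ||= ++$cnt) }ge;
    return $record;
}

# Sample record joined with the octal 177 delimiter.
my $record = join "\177", 'field1', 'ABC/ab12cd', 'EFC/ab13cd ef56', 'XBC/ab12cd';
print renumber($record), "\n";
```

The character class stops at the first non-alphanumeric character, so both a following blank and a following \177 end the match.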



    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Sep 5, 2003
    #8
  9. qanda

    John Bokma Guest

    qanda wrote:

    > However the unique parts and their replacements should be ...
    >
    > all C/ab12cd replaced by string_1
    > all C/ab13cd replaced by string_2
    > all C/ab13ce replaced by string_3
    > all C/ab14cd replaced by string_4


    #!/usr/bin/perl -w

    use strict;

    my %hash;
    my $cnt = 1;

    while (my $line = <DATA>) {

        $line =~ s{(C/\S+)}{
            defined $hash{$1} ? $hash{$1} :
                ($hash{$1} = "string_" . $cnt++);
        }ge;
        print $line;

    }


    __DATA__
    field1,ABC/ab12cd ef34,field3,field4,EFC/ab12cd ef56,field6
    field1,XBC/ab12cd ef34,field3,field4,EFC/ab13cd ef56,field6
    field1,YBC/ab12cd ef34,field3,field4,EFC/ab13ce ef56,field6
    field1,YBC/ab13cd ef34,field3,field4,EFC/ab13ce ef56,field6
    field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
    field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
    field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6
    field1,YBC/ab14cd ef34,field3,field4,EFC/ab13ce ef56,field6


    Gives:

    field1,ABstring_1 ef34,field3,field4,EFstring_1 ef56,field6
    field1,XBstring_1 ef34,field3,field4,EFstring_2 ef56,field6
    field1,YBstring_1 ef34,field3,field4,EFstring_3 ef56,field6
    field1,YBstring_2 ef34,field3,field4,EFstring_3 ef56,field6
    field1,YBstring_4 ef34,field3,field4,EFstring_3 ef56,field6
    field1,YBstring_4 ef34,field3,field4,EFstring_3 ef56,field6
    field1,YBstring_4 ef34,field3,field4,EFstring_3 ef56,field6
    field1,YBstring_4 ef34,field3,field4,EFstring_3 ef56,field6
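
To carry one lookup table across many files, as the earlier pseudocode asks, the hash only needs to live outside the per-file loop. A rough sketch, assuming the file names arrive on the command line and that keeping the C/ prefix in the output is wanted; rewrite_records and the variable names are made up for illustration:

```perl
use strict;
use warnings;

my %id_for;         # token => number, shared across every file
my $next_id = 1;

# Hypothetical helper: copy records from $in to $out, replacing tokens.
sub rewrite_records {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        $line =~ s{C/([A-Za-z0-9]+)}{ 'C/string_' . ($id_for{$1} ||= $next_id++) }ge;
        print {$out} $line;
    }
    return;
}

# Each file named on the command line is read in turn; output goes to STDOUT.
for my $file (@ARGV) {
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    rewrite_records($in, \*STDOUT);
    close($in) or die "Cannot close $file: $!";
}
```

Reading line by line keeps memory use tied to the number of distinct tokens rather than to file size.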



    --
    Kind regards, feel free to mail: mail(at)johnbokma.com (or reply)
    virtual home: http://johnbokma.com/ ICQ: 218175426
    John web site hints: http://johnbokma.com/websitedesign/
    John Bokma, Sep 5, 2003
    #9