speed issues with pattern matching and substitution

Discussion in 'Perl Misc' started by Arturi, Apr 4, 2005.

  1. Arturi

    Arturi Guest

    Hello there,

    I wrote some perl lines that scans a file looking for some words and
    substitute them for their corresponding pairs, which are previously
    saved in a HASH.
    The code is working fine but I'm wondering what could I do to speed it
    up.
    Does anybody have a hint?

    cheers

    here the code extract:
    ----------------------------------------------------------------------
    %HASH #is being defined previously

    @keys= keys(%HASH);
    my $line = undef;
    my $tmp = undef;

    open(FILE,"< in.txt") || die "Can't open ";
    open(OUTPUT2,"< out.txt") || die "Can't open ";

    while (<FILE>) {
    $line = $_;
    foreach $k (@keys) {
    if (/(\W)$k(\W)/){
    #print "$_\n";
    $tmp = $HASH{$k};
    while ($line =~/(\W)($k)(\W)/){$line =~
    s/(\W)($k)(\W)/$1$tmp$3/;}
    }
    }
    print OUTPUT2 $line;
    }

    ---------------------------------------------------------------------------
     
    Arturi, Apr 4, 2005
    #1
    1. Advertising

  2. (Arturi) wrote in
    news::

    > The code is working fine but I'm wondering what could I do to speed it
    > up.
    > Does anybody have a hint?


    > %HASH #is being defined previously


    use strict;
    use warnings;

    missing;

    > @keys= keys(%HASH);
    > my $line = undef;
    > my $tmp = undef;


    my ($line, $tmp);

    You should declare variables in the smallest applicable scope.

    > open(FILE,"< in.txt") || die "Can't open ";


    open my $file, '<', 'in.txt' or die "Can't open in.txt: $!";

    > open(OUTPUT2,"< out.txt") || die "Can't open ";


    Why are you opening the output file for reading only?

    Also, see above.

    > while (<FILE>) {
    > $line = $_;


    while(my $line = <$file>) {

    > foreach $k (@keys) {
    > if (/(\W)$k(\W)/){
    > #print "$_\n";
    > $tmp = $HASH{$k};
    > while ($line =~/(\W)($k)(\W)/){$line =~
    > s/(\W)($k)(\W)/$1$tmp$3/;}


    What is the role of the while there?

    > }
    > }
    > print OUTPUT2 $line;
    > }


    I have a sneaking suspicion you have not posted real code.

    Please read the posting guidelines for this group for help on how to
    help others help you.

    Please post real code.

    Sinan.

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Apr 4, 2005
    #2
    1. Advertising

  3. Arturi

    Guest

    (Arturi) wrote:
    > Hello there,
    >
    > I wrote some perl lines that scans a file looking for some words and
    > substitute them for their corresponding pairs, which are previously
    > saved in a HASH.
    > The code is working fine but I'm wondering what could I do to speed it
    > up.
    > Does anybody have a hint?


    Yes. Post real code. For example, opening an output file for reading only
    is probably not real code. Describe the input or better give some short
    code which generates acceptable fake input for testing. Tell us how long
    it took to run. Tell us how much improvement you are looking for.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 4, 2005
    #3
  4. Arturi

    Greg K Guest

    I am more of a lurker here than a Perl guru(someday I hope), but my
    guess would be the three comparisons for a match of your key are what
    is slowing you down:

    >if (/(\W)$k(\W)/){
    >$line =~ /(\W)($k)(\W)/
    >$line =~ s/(\W)($k)(\W)/$1$tmp$3/;


    I use a similar program for updating a bunch of files in a directory. I
    wrote it so you could use two arrays which contain old and new values,
    but I just modified it to use a hash instead. I know my code probably
    isn't as concise as some people here can write, but you're welcome to
    use if you like. It actually scans a directory listing for files which
    match a regexp, then processes those.

    [CodeA - Arrays]
    #!/usr/local/bin/perl
    use strict;
    use warnings;

    #directory info and file reg exp
    my $dir = opendir DIR,"../old";
    my @files = readdir(DIR);
    my $fileregexp = "txt";

    # number of files processed
    my $count=0;

    #arrays of stuff you want to change
    my @old;
    my @new;

    ## old will get interpreted as a regexp so make sure you escape chars
    as needed
    $old[0] = 'C';
    $new[0] = 'Perl';

    foreach my $filename (@files)
    {
    if( $filename =~ /$fileregexp/ )
    {
    $count++;
    open IN, "<../old/$filename" || die "Unable to open files for
    reading";
    open OUT, ">../new/$filename"|| die "Unable to open files for
    writing";
    while ( <IN> )
    {
    for( my $t = 0; $t < ($#old + 1); $t++ )
    {
    s/$old[$t]/$new[$t]/g;
    }
    print OUT $_;
    }
    close IN;
    close OUT;
    print "../new/$filename created\n";
    }
    }
    print "$count files modified\n";
    [/CodeA]


    [CodeB - Hash]
    #!/usr/local/bin/perl
    use strict;
    use warnings;

    #directory info and file reg exp
    my $dir = opendir DIR,"../old";
    my @files = readdir(DIR);
    my $fileregexp = "txt";

    # number of files processed
    my $count=0;

    ## the key will be interpreted as a regexp so escape anything you need
    to here, such as parens
    my %changes = (
    'C' => 'Perl'
    );

    my @keys = keys( %changes );

    foreach my $filename ( @files )
    {
    if( $filename =~ /$fileregexp/ )
    {
    $count++;
    open IN, "<../old/$filename" || die "Unable to open files for
    reading";
    open OUT, ">../new/$filename" || die "Unable to open files for
    writing";
    while( <IN> )
    {
    foreach my $k ( @keys )
    {
    s/$k/$changes{$k}/g;
    }
    print OUT $_;
    }
    close IN;
    close OUT;
    print "../new/$filename created\n";
    }
    }
    print "$count files modified\n";
    [/CodeB]

    Data input file(essay.txt):
    C is the best programming language.
    There is more than one way to do it in C.
    I like all the premade data structures in C.

    Data output file(essay.txt):
    Perl is the best programming language.
    There is more than one way to do it in Perl.
    I like all the premade data structures in Perl.

    -Greg
     
    Greg K, Apr 4, 2005
    #4
  5. Arturi wrote:
    >
    > I wrote some perl lines that scans a file looking for some words and
    > substitute them for their corresponding pairs, which are previously
    > saved in a HASH.
    > The code is working fine but I'm wondering what could I do to speed it
    > up.


    Use a profiling module or the Benchmark module to determine which parts of the
    code are too slow.


    > here the code extract:
    > ----------------------------------------------------------------------
    > %HASH #is being defined previously
    >
    > @keys= keys(%HASH);
    > my $line = undef;
    > my $tmp = undef;
    >
    > open(FILE,"< in.txt") || die "Can't open ";
    > open(OUTPUT2,"< out.txt") || die "Can't open ";
    >
    > while (<FILE>) {
    > $line = $_;
    > foreach $k (@keys) {
    > if (/(\W)$k(\W)/){


    Why use \W which requires a character at either end of $k instead of \b which
    doesn't? Why use capturing parentheses?


    > #print "$_\n";
    > $tmp = $HASH{$k};
    > while ($line =~/(\W)($k)(\W)/){$line =~
    > s/(\W)($k)(\W)/$1$tmp$3/;}
    > }


    More simply written as:

    1 while $line =~ s/(\W)($k)(\W)/$1$tmp$3/;

    But the only reason to do it like that is if the keys/values are overlapping
    or recursive. Is there any reason that you can't just use the /g global option?


    > }
    > print OUTPUT2 $line;
    > }
    >
    > ---------------------------------------------------------------------------



    open FILE, '<', 'in.txt' or die "Can't open 'in.txt' $!";
    open OUTPUT2, '>', 'out.txt' or die "Can't open 'out.txt' $!";

    while ( my $line = <FILE> ) {
    for my $k ( keys %HASH ) {
    $line =~ s/\b($k)\b/$HASH{$k}/g;

    # if you REALLY need it
    # 1 while $line =~ s/\b($k)\b/$HASH{$k}/g;
    }
    print OUTPUT2 $line;
    }



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Apr 4, 2005
    #5
  6. Arturi wrote:

    > I wrote some perl lines that scans a file looking for some words and
    > substitute them for their corresponding pairs, which are previously
    > saved in a HASH.
    > The code is working fine but I'm wondering what could I do to speed it
    > up.
    > Does anybody have a hint?


    You are writing something of the form:

    if( some_condition() ) {
    do_something();
    }

    Where do_something() will not do anything anyhow if some_condition() is
    not met. As such it would be simpler just to write:

    do_something();

    Actually you've gone one step further and written:

    if( some_condition() ) {
    while( some_condition() ) {
    do_something();
    }
    }

    Where do_something() will not only not do anything if some_condition()
    is not met but will also return false. As such it would be simpler just
    to write:

    while( do_something() ) {}

    Now if we actually look a the something, in quesion it's a s///. Do you
    really want to to be able to seach and replace again within the previous
    replacements? Looking at your code I'd guess that wouldn't make sense.
    And if that's the case then you could simply use s///g.

    > here the code extract:
    > ----------------------------------------------------------------------
    > %HASH #is being defined previously
    >
    > @keys= keys(%HASH);
    > my $line = undef;
    > my $tmp = undef;


    Nasty case of premature declaration you've got there.

    > open(FILE,"< in.txt") || die "Can't open ";
    > open(OUTPUT2,"< out.txt") || die "Can't open ";
    >
    > while (<FILE>) {
    > $line = $_;


    If you are worried abot speed why are you doing that?

    > foreach $k (@keys) {
    > if (/(\W)$k(\W)/){
    > #print "$_\n";
    > $tmp = $HASH{$k};


    Awful name for a variable, $tmp.

    > while ($line =~/(\W)($k)(\W)/){$line =~
    > s/(\W)($k)(\W)/$1$tmp$3/;}
    > }
    > }


    If you are concerned about speed you should take the comoposition and
    compilatiion of the regex outside the loop.

    my @patterns = map { qr/(\W)($_)(\W)/ } keys %HASH;

    local *_; # Wise not to stomp on someone else's $_

    while (<FILE>) {
    my @values = values %HASH;
    for my $pattern ( @patterns ) {
    my $value = shift @values;
    s/$pattern/$1$value$3/g;
    }
    # print or whatever
    }

    But my spider-sense is telling me that perhaps the keys of %HASH really
    arbirary regexes but are just words and all you want is word
    substitution. This is a classic bit of Perl.

    while (<FILE>) {
    s/(\w+)/$HASH{$1}||$1/eg;
    # print or whatever
    }
     
    Brian McCauley, Apr 4, 2005
    #6
  7. Greg K <> wrote:

    > I am more of a lurker here than a Perl guru(someday I hope),



    Use the FAQ Luke!

    perldoc -q match

    How do I efficiently match many regular expressions at once?


    > but my
    > guess



    There is no need to guess:

    perldoc -q profile

    How do I profile my Perl programs?


    > I use a similar program



    I sure hope that does not mean the the code below is what
    you have actually used.

    If it is, then you just haven't found the bugs that are in it yet...


    > #!/usr/local/bin/perl
    > use strict;
    > use warnings;



    A most excellent start, BTW.


    > my $dir = opendir DIR,"../old";



    You should check the return value to see if you actually got
    what you asked for, just like with open() below.


    > $old[0] = 'C';
    > $new[0] = 'Perl';



    > open IN, "<../old/$filename" || die "Unable to open files for
    > reading";



    You should include the $! variable in your die message, it
    contains the reason for the dying.


    > for( my $t = 0; $t < ($#old + 1); $t++ )



    Gak!

    foreach my $t ( 0 .. $#old )

    or at least:

    for( my $t = 0; $t < @old; $t++ )

    or even:

    for( my $t = 0; $t <= $#old; $t++ )


    > s/$old[$t]/$new[$t]/g;



    If

    $_ = 'Chuck likes C';

    Then you _want_ it to be changed to:

    Perlhuck likes Perl

    ??


    Probably not, so maybe this instead:

    s/\b$old[$t]\b/$new[$t]/g;


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 5, 2005
    #7
  8. Arturi

    Greg K Guest

    >Use the FAQ Luke!
    > perldoc -q match
    > How do I efficiently match many regular expressions at once?


    Nice, didn't know about that. Thanks for the tip!

    >I sure hope that does not mean the the code below is what
    >you have actually used.
    >If it is, then you just haven't found the bugs that are in it yet...


    Actually, yes, I do use the following.

    >for( my $t = 0; $t < @old; $t++ )
    >for( my $t = 0; $t <= $#old; $t++ )


    I would use either of these this one. Unfortunately I still have to
    switch back and forth between languages frequently so I probably won't
    switch to the ( 0 .. $#old ) notation any time soon.

    >> s/$old[$t]/$new[$t]/g;

    >If
    > $_ = 'Chuck likes C';
    >Then you _want_ it to be changed to:
    > Perlhuck likes Perl


    Actually, for my case it does what I want. I am matching long unique
    strings so I don't have to worry about accidental matches(strings such
    as "../group_name/participants.html", etc). I figured the OP can write
    the regexp for their specific need.

    Anyway, thx for the constructive criticism. I am going to keep reading
    through the PerlFAQ.
     
    Greg K, Apr 5, 2005
    #8
  9. Arturi

    Anno Siegel Guest

    Brian McCauley <> wrote in comp.lang.perl.misc:
    > Arturi wrote:
    >
    > > I wrote some perl lines that scans a file looking for some words and
    > > substitute them for their corresponding pairs, which are previously
    > > saved in a HASH.


    [...]

    > If you are concerned about speed you should take the comoposition and
    > compilatiion of the regex outside the loop.
    >
    > my @patterns = map { qr/(\W)($_)(\W)/ } keys %HASH;
    >
    > local *_; # Wise not to stomp on someone else's $_
    >
    > while (<FILE>) {
    > my @values = values %HASH;
    > for my $pattern ( @patterns ) {
    > my $value = shift @values;
    > s/$pattern/$1$value$3/g;
    > }
    > # print or whatever
    > }
    >
    > But my spider-sense is telling me that perhaps the keys of %HASH really
    > arbirary regexes but are just words and all you want is word
    > substitution. This is a classic bit of Perl.
    >
    > while (<FILE>) {
    > s/(\w+)/$HASH{$1}||$1/eg;
    > # print or whatever
    > }


    At first sight that looks rather inefficient because *every* word
    is captured and replaced (often by itself). Out of interest I
    benchmarked a few variants, and it turns out that it is just as
    fast as more selective solutions.

    I'm appending the benchmark script, in case anyone cares. The four
    variants tested are (least efficient first, the difference between
    the last two is insignificant):

    direct -- every regex is compiled from the string at run time
    regex -- all alternatives compiled into a big regex
    brian -- replace every word, like the second solution above
    precomp -- regular expressions precompiled via qr//, like the
    first solution above

    Anno

    --

    #!/usr/local/bin/perl
    use strict; use warnings; $| = 1;
    use Vi::QuickFix;

    use Benchmark qw( cmpthese);

    our @lines = <DATA>;

    my %repla; # string replacements (master)
    @repla{
    qw( one two three four five six seven eight nine ten eleven twelve)
    } = qw( ein zwei drei vier fuenf sechs sieben acht neun zehn elf zwoelf);

    my @repla = map [ qr/\b$_\b/ => $repla{ $_}], keys %repla; # precompiled

    my $repla = do { # single regex
    my $re = join '|', map "\\b$_\\b", keys %repla;
    qr/($re)/;
    };

    goto bench;

    check:
    print brian( @lines);
    print "\n", @lines;
    exit;

    bench:
    @lines = ( @lines) x 10; # vary to check linearity
    cmpthese -1, {
    direct => 'direct( @lines)',
    precomp => 'precomp( @lines)',
    brian => 'brian( @lines)',
    regex => 'regex( @lines)',

    };
    #######################################################################

    sub direct {
    my @new = @_;
    for ( @new ) {
    my ( $str, $rpl);
    s/\b$str\b/$rpl/g while ( $str, $rpl) = each %repla;
    }
    @new;
    }

    sub precomp {
    my @new = @_;
    for ( @new ) {
    for my $r ( @repla ) {
    s/$r->[ 0]/$r->[ 1]/g;
    }
    }
    @new ;
    }

    sub brian {
    my @new = @_;
    s/(\w+)/$repla{ $1} || $1/eg for @new;
    @new;
    }

    sub regex {
    my @new = @_;
    s/$repla/$repla{ $1}/g for @new;
    @new;
    }

    __DATA__
    one spot was detected on each of the two blades
    three jolly coachmen, three jolly coachmen
    twelve to a dozen
    a display of oneupmanship
    three o'clock four o'clock five o'clock ROCK
     
    Anno Siegel, Apr 5, 2005
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Weng Lei-QCH1840
    Replies:
    1
    Views:
    192
    Thomas
    Aug 15, 2003
  2. Marc Bissonnette

    Pattern matching : not matching problem

    Marc Bissonnette, Jan 8, 2004, in forum: Perl Misc
    Replies:
    9
    Views:
    246
    Marc Bissonnette
    Jan 13, 2004
  3. Hemant Shah
    Replies:
    6
    Views:
    111
    Hemant Shah
    Sep 28, 2006
  4. Bobby Chamness
    Replies:
    2
    Views:
    242
    Xicheng Jia
    May 3, 2007
  5. Adam Kellas

    how to speed up a string-substitution loop?

    Adam Kellas, Mar 2, 2010, in forum: Perl Misc
    Replies:
    26
    Views:
    236
    RedGrittyBrick
    Mar 5, 2010
Loading...

Share This Page