Pattern match over multiple files is slow - Help Needed !

Discussion in 'Perl Misc' started by RV, Oct 22, 2003.

  1. RV

    RV Guest

    < Note: Posted this message earlier today to comp.lang.perl - then
    realized that it was defunct. Hence posting it to this group >

    Hi:

I'm having a huge performance problem in one of my scripts.

I have an array containing some reference keys ( about 1000 entries
or so ).
I also have a list of files ( about 100 or so ), and I need to locate
occurrences of these keys in all of the files and replace them with some
value ( let's say the key-value hash is also given ).

    My code looks something like this:

#Note: %keyval --> holds the key-value mapping
# @keylist - is the array with the 1000 keys ( like keys %keyval )
# @files - holds the list of files ( about 100 or so ).

foreach $f ( @files )
{
    # open file - validate etc - assume it is opened as <FH>
    while (<FH>)    # each line
    {
        $line = $_;
        foreach $k (@keylist)
        {
            $line =~ s/$k/$keyval{$k}/ig;    # replace key with value
        } # key loop
    }
    close(FH);
} # foreach

This code works - but it's too slow ! -- Obviously I run the inner loop
1000 times for each line in the file.
Constraints being that multiple keys may occur on the same line ( and
even the same key may occur multiple times on the same line ).

I tried slurping the file into a scalar ( unsetting $/ ) - no big
difference in timing.

Can someone help me here ? If you can give me some ideas to
look into, I'd greatly appreciate it.
Pseudocode is fine as well.

    If you can include a courtesy CC: that would be great !

Thanks - I hope I've conveyed my problem accurately ( this is among my
first posts - I am a frequent "reader" though ! ).

    -RV.
     
    RV, Oct 22, 2003
    #1

  2. RV <> wrote:
    > #Note: %keyval --> holds the key-value mapping
    > # @keylist - is the array with the 1000 keys ( like keys %keyval )
    > # @files - holds the list of files ( about 100 or so ).


    How big are the files?

> foreach $f ( @files )
> {
>     # open file - validate etc - assume it is opened as <FH>
>     while (<FH>)    # each line
>     {
>         $line = $_;
>         foreach $k (@keylist)
>         {
>             $line =~ s/$k/$keyval{$k}/ig;    # replace key with value
>         } # key loop
>     }
>     close(FH);
> } # foreach


    > This code works - but its too slow ! -- Obviously I run the inner loop
    > 1000 times for each line in the file.


    If the files are small enough to hold in memory, then run the
    substitution on the entire file rather than line by line.

undef $/;
my $info = <FILE>;
foreach $k (@keylist)
{
    $info =~ s/$k/$keyval{$k}/img;    # replace key with value
} # key loop

    That might be a big improvement already.
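A more complete sketch of that approach, wrapped in a function with a lexical filehandle and a localized $/ so slurp mode doesn't leak out of scope. The filename handling, the \Q...\E escaping ( treating keys as literal strings ), and the sample mapping are all assumptions beyond the original snippet:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Slurp one file and run every key's substitution once over the whole
# string, instead of once per line.
sub substitute_file {
    my ($path, $keyval) = @_;
    my $info = do {
        local $/;    # slurp mode; restored automatically on scope exit
        open my $fh, '<', $path or die "Cannot open $path: $!";
        <$fh>;
    };
    for my $k (keys %$keyval) {
        $info =~ s/\Q$k\E/$keyval->{$k}/ig;    # one pass per key
    }
    return $info;
}
```

Called as, say, `substitute_file($f, \%keyval)` inside the file loop, with the result written back out however suits.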



    Another question. How often do the patterns actually occur in the file?
    If they're pretty rare (many lines that don't require any
    substitutions), then you could do the original "match by line", but do a
    check on *all* the patterns for each line (fast) before attempting the
    substitutions (slower).

# compile regex with all keys in it.
my $re_str = join "|", @keylist;

# file loop
while (<FILE>)
{
    if (/$re_str/)
    {
        # substitute loop
    }
}
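Fleshed out a little, that pre-check might look like this. The quotemeta escaping ( so regex metacharacters in keys are taken literally ) and the sample %keyval mapping are assumptions not in the original:

```perl
use strict;
use warnings;

my %keyval  = ( alpha => 'A', beta => 'B' );    # hypothetical mapping
my @keylist = keys %keyval;

# One alternation of all (escaped) keys, compiled once.
my $re_str = join '|', map { quotemeta } @keylist;
my $re     = qr/$re_str/i;

sub process_line {
    my ($line) = @_;
    # Fast reject: most lines never enter the per-key loop.
    return $line unless $line =~ $re;
    for my $k (@keylist) {
        $line =~ s/\Q$k\E/$keyval{$k}/ig;
    }
    return $line;
}
```

The single combined match is one regex-engine pass per line; the expensive per-key loop only runs on lines known to contain at least one key.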


    --
    Darren Dunham
    Unix System Administrator Taos - The SysAdmin Company
    Got some Dr Pepper? San Francisco, CA bay area
    < This line left intentionally blank to confuse you. >
     
    Darren Dunham, Oct 22, 2003
    #2

  3. RV

    RV Guest

    Darren Dunham <> wrote in message news:<WmClb.3253$>...
    > RV <> wrote:
    > > #Note: %keyval --> holds the key-value mapping
    > > # @keylist - is the array with the 1000 keys ( like keys %keyval )
    > > # @files - holds the list of files ( about 100 or so ).

    >
    > How big are the files?


    It varies - I would say the average is about 150 Kbytes - usually not
    more than 250 Kbytes.
    >
    >
    > If the files are small enough to hold in memory, then run the
    > substitution on the entire file rather than line by line.
    >
> undef $/;
> my $info = <FILE>;
> foreach $k (@keylist)
> {
>     $info =~ s/$k/$keyval{$k}/img;    # replace key with value
> } # key loop
    >
    > That might be a big improvement already.
    >


I did try slurping the complete file - it did not seem to make things any
faster.
It takes about 10 mins on a large ( 120 Kbytes ) file for that loop :(

    >
    >
    > Another question. How often do the patterns actually occur in the file?
    > If they're pretty rare (many lines that don't require any
    > substitutions), then you could do the original "match by line", but do a
    > check on *all* the patterns for each line (fast) before attempting the
    > substitutions (slower).
    >
> # compile regex with all keys in it.
> my $re_str = join "|", @keylist;
>
> # file loop
> while (<FILE>)
> {
>     if (/$re_str/)
>     {
>         # substitute loop
>     }
> }


Yes - I expect the patterns to occur sparsely ( relatively speaking, of
course ! ).
Interesting - I will try your idea.
BTW, is there a limit on the size of the pattern joined by "|" ? As
I said, I expect about 1000 entries in the keylist ( well, other than
physical memory limits :) ).

FWIW, it's now running on a SPARC Solaris 9 machine with about 1024 MB RAM
- single US-II CPU.

Thanks for the suggestions - I will try them and post back here...
     
    RV, Oct 23, 2003
    #3
  4. Darren Dunham <> wrote:

    > # compile regex with all keys in it.
    > my $re_str = join "|", @keylist;



This one bit me once, so I'll mention it to save readers
from experiencing similar pain.


    Better make that:

    my $re_str = join "|", sort {$b cmp $a} @keylist;

(or sort @keylist in descending order when it is populated.)

    /one|onerous/ # onerous _never_ matches

    /onerous|one/ # much better
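A quick demonstration of the ordering effect. Sorting by descending length expresses the same intent more directly ( longest alternative first ) and also handles the prefix case:

```perl
use strict;
use warnings;

my @keys = ('one', 'onerous');

# Reverse lexicographic sort puts any string before its own prefixes.
my $re_str = join '|', sort { $b cmp $a } @keys;

# Equivalent intent, sorted by descending length instead.
my $by_len = join '|', sort { length($b) <=> length($a) } @keys;

my $text = 'an onerous task';
(my $copy = $text) =~ s/($re_str)/<$1>/;
# With the unsorted /one|onerous/, the match would stop at "one",
# yielding "an <one>rous task" instead.
```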


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Oct 23, 2003
    #4
  5. RV

    Anno Siegel Guest

    RV <> wrote in comp.lang.perl.misc:
    > < Note: Posted this message earlier today to comp.lang.perl - then
    > realized that it was defunct. Hence posting it to this group >
    >
    > Hi:
    >
> I'm having a huge performance problem in one of my scripts.
    >
> I have an array containing some reference keys. ( about 1000 entries
> or so ).
> I also have a list of files ( about 100 or so ), and I need to locate
> occurrences of these keys in all of the files and replace them with some
> value ( let's say the key-value hash is also given ).
    >
    > My code looks something like this:
    >
    > #Note: %keyval --> holds the key-value mapping
    > # @keylist - is the array with the 1000 keys ( like keys %keyval )
    > # @files - holds the list of files ( about 100 or so ).
    >
> foreach $f ( @files )
> {
>     # open file - validate etc - assume it is opened as <FH>
>     while (<FH>)    # each line
>     {
>         $line = $_;
>         foreach $k (@keylist)
>         {
>             $line =~ s/$k/$keyval{$k}/ig;    # replace key with value
>         } # key loop
>     }
>     close(FH);
> } # foreach
    >
> This code works - but it's too slow ! -- Obviously I run the inner loop
> 1000 times for each line in the file.
> Constraints being that multiple keys may occur on the same line ( and
> even the same key may occur multiple times on the same line ).
>
> I tried slurping the file into a scalar ( unsetting $/ ) - no big
> difference in timing.
    >
    > Can someone help me here ? - If you can give some ideas that I can
    > look into, I'll greatly appreciate it.
    > Pseudocode is fine as well.


    There's a FAQ about this, "perldoc -q 'many regular'" finds it.
    In short, it will point you to the qr// operator, which should
    indeed be a time saver here.

    Combining all patterns into a big regex of alternatives may help too.
    As Tad has pointed out in another post, make sure that long strings
    come before shorter ones (sort them), or you may get spurious matches.

Finally, consider the rarely used study() function. You could try
slurping the entire file and calling study() on the string before applying
the regex. Whether that speeds things up remains to be seen.
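Putting the FAQ's advice together, one possible sketch: escape the keys, sort longest-first, compile a single alternation with qr//, and replace everything in one pass using a hash lookup on the capture. The sample mapping is made up, and the `lc` on the capture assumes the keys are stored lowercase, which the original post doesn't specify:

```perl
use strict;
use warnings;

my %keyval = ( cat => 'feline', dog => 'canine' );    # hypothetical mapping

# Longest-first, metacharacters escaped, compiled once.
my $alt = join '|',
          map  { quotemeta }
          sort { length($b) <=> length($a) } keys %keyval;
my $re = qr/($alt)/i;

sub replace_all {
    my ($text) = @_;
    # One pass over the string; each match is resolved by hash lookup.
    $text =~ s/$re/$keyval{ lc $1 }/g;
    return $text;
}
```

This turns ~1000 substitution passes per file into a single pass, which is usually where the big win is.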

    Anno
     
    Anno Siegel, Oct 23, 2003
    #5
  6. RV <> wrote:

> I did try slurping the complete file - it did not seem to make things any
> faster.
> It takes about 10 mins on a large ( 120 Kbytes ) file for that loop :(


Okay. I just wanted to make sure you got that the idea wasn't so much
"slurping the file" as it was "putting the file into a single scalar and
doing only @keylist substitutions instead of @keylist * #lines
substitutions".

    >> Another question. How often do the patterns actually occur in the file?
    >> If they're pretty rare (many lines that don't require any
    >> substitutions), then you could do the original "match by line", but do a
    >> check on *all* the patterns for each line (fast) before attempting the
    >> substitutions (slower).
    >>
>> # compile regex with all keys in it.
>> my $re_str = join "|", @keylist;
>>
>> # file loop
>> while (<FILE>)
>> {
>>     if (/$re_str/)
>>     {
>>         # substitute loop
>>     }
>> }


> Yes - I expect the patterns to occur sparsely ( relatively speaking, of
> course ! ).
> Interesting - I will try your idea.
> BTW, is there a limit on the size of the pattern joined by "|" ? As
> I said, I expect about 1000 entries in the keylist ( well, other than
> physical memory limits :) ).


It's pretty big. I once took a squid filter which attempted to match a URL
against a list of advertising sites. The original looped through each of the
sites looking for a match. I combined them into a single regex that was
orders of magnitude faster. I recall the total file of host and URL
patterns being several KB in size. It had no problems.

    --
    Darren Dunham
    Unix System Administrator Taos - The SysAdmin Company
    Got some Dr Pepper? San Francisco, CA bay area
    < This line left intentionally blank to confuse you. >
     
    Darren Dunham, Oct 23, 2003
    #6
  7. Tad McClellan <> wrote:
    > Darren Dunham <> wrote:


    >> # compile regex with all keys in it.
    >> my $re_str = join "|", @keylist;



> This one bit me once, so I'll mention it to save readers
> from experiencing similar pain.


    > Better make that:


    > my $re_str = join "|", sort {$b cmp $a} @keylist;


> (or sort @keylist in descending order when it is populated.)


    > /one|onerous/ # onerous _never_ matches
    > /onerous|one/ # much better


Ahh, good point. The previous time I did this I was matching whole items
or nothing. So to really get it up to speed I did this:

/^($re_str)$/

There, order didn't matter at all. I had overlooked that sorting would be
necessary for the general case.
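For the whole-item case, the anchors really do make the alternation order irrelevant, since the regex engine backtracks past a short alternative that fails the `$` anchor:

```perl
use strict;
use warnings;

# Deliberately put the shorter key first to show order doesn't matter here.
my $re_str = join '|', map { quotemeta } ('one', 'onerous');
my $re = qr/^(?:$re_str)$/;

# "onerous" still matches: "one" is tried first, fails the $ anchor,
# and the engine backtracks to the "onerous" alternative.
my @hits = grep { /$re/ } ('one', 'onerous', 'onero');
```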

    --
    Darren Dunham
    Unix System Administrator Taos - The SysAdmin Company
    Got some Dr Pepper? San Francisco, CA bay area
    < This line left intentionally blank to confuse you. >
     
    Darren Dunham, Oct 23, 2003
    #7
