Optimizing text file searches

Discussion in 'Perl Misc' started by bivity, Aug 20, 2007.

  1. bivity

    bivity Guest

    My work requires a lot of index lookups across large numbers of files
    daily. Recently, we have been receiving files with one document per
    line along with all of its attributes. This file has around 400,000
    entries. I then receive another file, with just the file names, and I
    am told to look for each of those files in the 400,000-entry list.
    There are about 5,000 names in that file.

    I wrote a quick script to meet my needs: I read in both files and
    grep(search_item, content_file). It works pretty well, except that it
    takes about 25 minutes for 5,000 entries. I can't use a hash
    implementation here; is there a way I can make this search faster?

    Below is a sample search term, the index line it should match, and
    the entire script. I know it's rough, but I wrote it in a hurry and I
    would like to refine it now and make it faster.

    Search File Term:
    885_Addm Un Lse 0867.pdf

    Large File to be searched, its matching index:
    "885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse
    0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse
    0867"

    script:

    #!/usr/local/bin/perl
    open(FILE, $ARGV[0]);
    my @text_rep = <FILE>;          # file to search
    close FILE;

    open(FILE, $ARGV[1]);
    my @text_search = <FILE>;       # file with entries to use in search
    close FILE;

    print "Size of content: " . @text_rep . "\n";
    print "Searching for " . @text_search . " instances\n";

    my @final;
    open(OUT, "+>did_not_find.txt");    # in case grep can't find the value, log it
    foreach my $query (@text_search) {
        chomp($query);
        $query =~ s/\(/\./;             # replace regex metachars with '.'
        $query =~ s/\)/\./;
        $query =~ s/\$/\./;
        my @qu = grep(/$query/, @text_rep);
        if ($qu[0] eq '') {
            print OUT $query . "\n";    # error logging
        }
        else {
            push(@final, $qu[0]);       # found, so place in array
        }
    }
    close OUT;

    open(OUT, "+>lines_that_pulled.txt");
    print OUT @final;
    close OUT;
     
    bivity, Aug 20, 2007
    #1

  2. bivity

    Mirco Wahab Guest

    bivity wrote:
    > Search File Term:
    > 885_Addm Un Lse 0867.pdf


    That means you run each of those
    4-5,000 searches (lines exactly like that one)
    against:

    > Large File to be searched, its matching index:
    > "885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse
    > 0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse
    > 0867"


    Is this (above) *one line*, e.g.:

    "885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse 0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse 0867"

    delimited by '\n' from the following line?

    Regards

    M.
     
    Mirco Wahab, Aug 20, 2007
    #2

  3. bivity

    bivity Guest

    On Aug 20, 1:36 pm, Mirco Wahab <-halle.de> wrote:
    > bivity wrote:
    > > Search File Term:
    > > 885_Addm Un Lse 0867.pdf

    >
    > That means, you do each search on
    > 4-5000 times (exactly such lines)
    > against:
    >
    > > Large File to be searched, its matching index:
    > > "885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse
    > > 0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse
    > > 0867"

    >
    > Is this (above) *one line*, eg.:
    >
    > "885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse 0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse 0867"
    >
    > delimited by '\n' from the following line?
    >
    > Regards
    >
    > M.


    Yes, I know, I am doing a linear search, O(N*n),
    N being the number of lines in the input file,
    n being the number of lines in the file to search.
    That isn't efficient when we are talking about 400k and 5k items. It
    was just the quick and dirty way I knew, and I'd like to see if there
    is a better way to do this.

    Each line in both files contains only one instance, and yes, it's
    delimited by \n. The environment is Solaris SunOS 5.8 with Perl 5.8,
    and I can't use any extended libraries either.

    Thanks for the help in advance.
     
    bivity, Aug 20, 2007
    #3
  4. bivity

    Brian Helterline Guest

    bivity wrote:
    > My work requires a lot of index lookups for large amounts of files
    > daily. Recently, we have been receiving files with one document per
    > line along with all its attributes. This file will have around 400,000
    > entries. I then receive another file, with just the file name, and I
    > am told, look for each one of these files in this 400,000 entry list.
    > There are about 5000 in the file.
    >
    > I just wrote a quick script to meet my needs, where I read in both
    > files, and grep (search_item, content_file). It works pretty well.
    > Except, it takes about 25 minutes for 5000 entries. I can't use a hash
    > implementation here, is there a way I can make this search faster?


    Why not a hash? It is the right tool for the job.
    >
    > Below is a sample search term and what the index line looks like,
    > along with the entire script. I know its rough, but i wrote it in a
    > hurry and I would like to refine it now, make it faster.
    >


    With grep, you search through the entire array of entries and
    create an array of matches, but you only use the first entry found.
    This seems very wasteful, especially if the first match is on line 1.
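
    As a minimal sketch of that point (not code from the original post),
    a plain loop can stop at the first hit instead of scanning all
    400,000 lines with grep:

    # Stop at the first matching line instead of collecting every match.
    my $hit;
    for my $line (@text_rep) {
        if ( index( $line, qq{"$query"} ) == 0 ) {   # filename is the first quoted field
            $hit = $line;
            last;                                    # skip the rest of the array
        }
    }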


    > Search File Term:
    > 885_Addm Un Lse 0867.pdf
    >
    > Large File to be searched, its matching index:
    > "885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse
    > 0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse
    > 0867"
    >
    > script:
    >
    > #!/usr/local/bin/perl
    > open(FILE, $ARGV[0]);
    > my @text_rep=<FILE>; #file to search
    > close FILE;
    > open(FILE, $ARGV[1]);
    > my @text_search=<FILE>; #file with entries to use in search
    > close FILE;
    > print "Size of content:". @text_rep ."\n";
    > print "Searching for".@text_search. "instances\n";
    > open (OUT, "+>did_not_find.txt"); # in case grep can't find the value,
    > log it
    > foreach my $query (@text_search)
    > {
    > chomp($query);
    > $query =~s/\(/\./;
    > $query =~s/\)/\./;
    > $query =~s/\$/\./;


    You are substituting one metacharacter for another in your regular
    expression. Read up on the \Q and \E operators in perlre so
    you won't need those three lines at all. You can also speed up the
    search by anchoring it to the beginning of the line, since you are
    just looking for filenames preceded by a quotation mark.

    > my @qu = grep(/$query/,@text_rep);


    my @qu = grep( /^"\Q$query/, @text_rep );
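
    For illustration (a sketch, not from the original post), \Q applies
    the same escaping as quotemeta(), so metacharacters in a filename are
    matched literally:

    # Hypothetical filename with metacharacters, to show what \Q / quotemeta does.
    my $query   = '885_Addm (Un) Lse $0867.pdf';
    my $escaped = quotemeta $query;            # same escaping that \Q...\E applies
    print "$escaped\n";   # 885_Addm\ \(Un\)\ Lse\ \$0867\.pdf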

    > if($qu[0] eq '') {
    > print OUT $query."\n";}#Error Logging
    > else{
    > push(@final,$qu[0]);}#Found, so place in array
    > }
    > close OUT;
    > open (OUT, "+>lines_that_pulled.txt");
    > print OUT @final;
    > close OUT;
    >


    --
    brian
     
    Brian Helterline, Aug 20, 2007
    #4
  5. bivity

    Mirco Wahab Guest

    bivity wrote:
    > On Aug 20, 1:36 pm, Mirco Wahab <-halle.de> wrote:
    >> Is this (above) *one line*, eg.:
    >> "885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse 0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse 0867"
    >> delimited by '\n' from the following line?

    >
    > Each line in both files only contains one instance and yes it's
    > delimited by \n, Solaris - SunOs 5.8 environment, Perl version 5.8, I
    > can't use any extended libraries as well.


    OK, I created a 500,000-line "full input file" according to your
    "long line" template (with 2 random numbers [0-9999] in each record)
    and made a 5,000-line search file according to the same rule. This
    ran on my old Athlon 2500+ (non-X64) in 1 min 52 sec (user time) and,
    surprisingly, led to ~30 random hits.

    I used the good old long regex of or-ed alternatives (the search
    terms) and didn't consider printing non-matches:

    ---

    use strict;
    use warnings;

    my $full_fn = shift;
    my $srch_fn = shift;
    my $pull_fn = 'lines_that_pulled.txt';
    my $err_fn  = 'did_not_find.txt';

    open my $fh, '<', $srch_fn or die $!;   # file with entries to use in search
    my $ts = '^"(';                         # start big regex here
    while ( <$fh> ) {
        if (length) {
            tr/)($\n/..../;                 # turn metachars and newline into '.'
            $ts .= "(?:$_)|";
        }
    }
    close $fh;
    chop $ts;                               # drop the trailing '|'

    my $text_q = qr/$ts)/;                  # build regex

    print "Size of content: (unknown) \n";
    print "Searching for: (unknown) instances\n";

    my @final;
    open $fh, '<', $full_fn or die $!;      # the 400+K lines
    while ( <$fh> ) {
        push @final, ( split /,/, $_ )[0]
            if /$text_q/;
    }
    close $fh;

    open $fh, '>', $pull_fn or die $!;
    print $fh join "\n", @final;
    close $fh;

    ---


    (FWIW)

    I don't really have an idea how to make this faster;
    maybe someone has *the big flash* ;-)

    Regards

    M.
     
    Mirco Wahab, Aug 20, 2007
    #5
  6. bivity

    Mirco Wahab Guest

    Mirco Wahab wrote:
    > OK, I created a 500_000 line "full input file" according to your "long line"
    > template (with 2 random numbers [0-9999] in each record) and made a
    > 5_000 line search file according to the same rule.


    FWIW, this is the script used to generate these files:

    use strict;
    use warnings;

    open my $fh, '>', 'bigfile.txt' or die $!;
    for (1 .. 500_000) {
        my $num1 = sprintf "%04d", int( rand(10000) );
        my $num2 = sprintf "%d",   int( rand(10000) );
        print $fh
            qq{"${num2}_Addm Un Lse ${num1}.pdf","${num2}","ELM 111 N BOBBY AVE","Addm Un Lse ${num1}","Addm Un Lse ${num1}.pdf","Elmhurst","651","${num2}","${num2}_Addm Un Lse ${num1}"}, "\n";
    }
    close $fh;

    open $fh, '>', 'searchfile.txt' or die $!;
    for (1 .. 5_000) {
        my $num1 = sprintf "%04d", int( rand(10000) );
        my $num2 = sprintf "%d",   int( rand(10000) );
        print $fh qq{${num2}_Addm Un Lse ${num1}.pdf}, "\n";
    }
    close $fh;


    Regards

    M.
     
    Mirco Wahab, Aug 20, 2007
    #6
  7. bivity

    bivity Guest

    On Aug 20, 2:34 pm, Brian Helterline <> wrote:
    > bivity wrote:
    > > My work requires a lot of index lookups for large amounts of files
    > > daily. Recently, we have been receiving files with one document per
    > > line along with all its attributes. This file will have around 400,000
    > > entries. I then receive another file, with just the file name, and I
    > > am told, look for each one of these files in this 400,000 entry list.
    > > There are about 5000 in the file.

    >
    > > I just wrote a quick script to meet my needs, where I read in both
    > > files, and grep (search_item, content_file). It works pretty well.
    > > Except, it takes about 25 minutes for 5000 entries. I can't use a hash
    > > implementation here, is there a way I can make this search faster?

    >
    > why not a hash? It is the right tool for the job.
    >
    >


    You're right. I took another quick look at it, and I was able to
    create a hash out of the content file. Thanks for the reminder about
    \Q for literal strings; it always escapes my memory.

    Well, here is what I got in the end; night and day results.

    #!/usr/local/bin/perl
    my %vals;
    open(FILE, $ARGV[0]);
    my @text_rep = <FILE>;
    close FILE;

    foreach my $t (@text_rep) {
        $t =~ /^\"(.*?)\",/;            # first quoted field is the file name
        $vals{$1} = $t;
    }

    open(FILE, $ARGV[1]);
    my @text_search = <FILE>;
    close FILE;

    my @final;
    open(OUT, "+>did_not_find_beta.txt");
    foreach my $query (@text_search) {
        chomp($query);
        if ( exists $vals{$query} ) {
            push(@final, $vals{$query});
        }
        else {
            print OUT $query . "\n";
        }
    }
    close OUT;

    open(OUT, "+>lines_that_pulled_beta.txt");
    print OUT @final;
    close OUT;

    Thanks for all the help.
     
    bivity, Aug 20, 2007
    #7
  8. bivity

    bivity Guest

    On Aug 20, 3:52 pm, Jim Gibson <> wrote:
    > In article <>,
    >
    > bivity <> wrote:
    > > My work requires a lot of index lookups for large amounts of files
    > > daily. Recently, we have been receiving files with one document per
    > > line along with all its attributes. This file will have around 400,000
    > > entries. I then receive another file, with just the file name, and I
    > > am told, look for each one of these files in this 400,000 entry list.
    > > There are about 5000 in the file.

    >
    > > I just wrote a quick script to meet my needs, where I read in both
    > > files, and grep (search_item, content_file). It works pretty well.
    > > Except, it takes about 25 minutes for 5000 entries. I can't use a hash
    > > implementation here, is there a way I can make this search faster?

    >
    > A hash implementation would be faster. Why can you not use a hash?
    >
    >
    >
    > > Below is a sample search term and what the index line looks like,
    > > along with the entire script. I know its rough, but i wrote it in a
    > > hurry and I would like to refine it now, make it faster.

    >
    > > Search File Term:
    > > 885_Addm Un Lse 0867.pdf

    >
    > > Large File to be searched, its matching index:
    > > "885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse
    > > 0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse
    > > 0867"

    >
    > > script:

    >
    > [snipped]
    >
    > Suggestions:
    >
    > 1. If all of your data looks like this example, then you are looking
    > for a fixed string in the first field of each record. Extract only the
    > first field into your @text_rep array and use the eq operator to
    > compare with your search strings. Check out the Text::CSV module.
    >
    > 2. Do not use grep. Instead, loop over your text_rep array and stop
    > when you get a match. Consider using the first() function from the
    > List::Util module (perldoc -q first).
    >
    > 3. For faster searching, sort your text_rep array and do a binary
    > search. See, for example, Search::Binary (I have not used it).
    >
    > 4. For even faster searching, use a hash.
    >
    > --
    > Jim Gibson
    >
    > Posted Via Usenet.com Premium Usenet Newsgroup Services
    > ----------------------------------------------------------
    > ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
    > ----------------------------------------------------------
    > http://www.usenet.com


    2. I tried the first() function last week, but my company doesn't
    have many modules installed.

    4. I went with this; my 25-minute process dropped to a mere 7
    seconds... Got to love it.
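
    For reference, a minimal sketch of the binary-search idea (suggestion
    3), assuming the first field of every index line has been extracted
    into an array sorted with sort(); binary_find is a hypothetical
    helper, not the Search::Binary API:

    # Classic binary search over a sorted array of strings.
    sub binary_find {
        my ( $keys, $want ) = @_;               # $keys: reference to a sorted array
        my ( $lo, $hi ) = ( 0, $#{$keys} );
        while ( $lo <= $hi ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            my $cmp = $keys->[$mid] cmp $want;
            return $mid if $cmp == 0;           # found: return the position
            if ( $cmp < 0 ) { $lo = $mid + 1 }
            else            { $hi = $mid - 1 }
        }
        return -1;                              # not found
    }

    # Usage sketch:
    # my @keys = sort map { /^"([^"]+)"/ ? $1 : () } @text_rep;
    # my $pos  = binary_find( \@keys, $query );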
     
    bivity, Aug 20, 2007
    #8
  9. bivity

    Mirco Wahab Guest

    bivity wrote:
    > You're right, I took another quick look at it, and I was able to
    > create a hash out of the content file. Thanks for reminding of the \Q
    > literal strings, always escapes my memory.


    From what I understood, you deliberately converted ')', '(' and '$'
    to a dot '.', meaning "any char" in a regular expression.

    > Well here is what I got in the end, night and day results.


    Nice!

    > !/usr/local/bin/perl
    > my %vals;
    > open(FILE, $ARGV[0]);
    > my @text_rep=<FILE>;
    > close FILE;
    > foreach my $t (@text_rep){
    > $t=~/^\"(.*?)\",/;
    > $vals {$1} = $t;
    > }
    > open(FILE, $ARGV[1]);
    > my @text_search=<FILE>;
    > close FILE;
    > open (OUT, "+>did_not_find_beta.txt");
    > foreach my $query (@text_search){
    > chomp ($query);
    > if(my $tmp=$vals{$query} ne ''){
    > push(@final, $vals{$tmp});
    > }
    > else { print OUT $query."\n";
    > }
    > }
    > close OUT;
    > open (OUT, "+>lines_that_pulled_beta.txt");
    > print OUT @final;
    > close OUT;


    You don't need these arrays at all. By
    reducing the complexity, you might gain
    another 50% in run time. In particular, the hash
    lookup can be optimized:

    use strict;
    use warnings;

    my $full_fn = shift;
    my $srch_fn = shift;
    my $pull_fn = 'lines_that_pulled_beta.txt';
    my $err_fn  = 'did_not_find.txt';
    my (%vals, @final);

    open my $fh, '<', $full_fn or die $!;
    while ( <$fh> ) {
        chomp;                          # no need for a temp array
        $vals{$1} = $_ if /"([^"]+)/;
    }
    close $fh;

    open my $fh_err, '>', $err_fn or die $!;
    open $fh, '<', $srch_fn or die $!;
    while ( <$fh> ) {
        chomp;                          # hash lookup for existence of the key term
        exists $vals{$_} ? push @final, $vals{$_} : print $fh_err "$_\n";
    }
    close $fh;
    close $fh_err;

    open $fh, '>', $pull_fn or die $!;
    print $fh join "\n", @final;
    close $fh;



    But why did you, in your original post (or a later one),
    stress that you *can't* use a hash? Did you assume it
    would be too large?

    Regards

    M.
     
    Mirco Wahab, Aug 20, 2007
    #9
  10. bivity

    bivity Guest

    On Aug 20, 4:10 pm, Mirco Wahab <-halle.de> wrote:
    > bivity wrote:
    > > You're right, I took another quick look at it, and I was able to
    > > create a hash out of the content file. Thanks for reminding of the \Q
    > > literal strings, always escapes my memory.

    >
    > From what I understood, you tried deliberately to
    > convert ')($' to dot '.', meaning "any char"
    > in a regular expression.
    >
    > > Well here is what I got in the end, night and day results.

    >
    > Nice!
    >
    >
    >
    > > !/usr/local/bin/perl
    > > my %vals;
    > > open(FILE, $ARGV[0]);
    > > my @text_rep=<FILE>;
    > > close FILE;
    > > foreach my $t (@text_rep){
    > > $t=~/^\"(.*?)\",/;
    > > $vals {$1} = $t;
    > > }
    > > open(FILE, $ARGV[1]);
    > > my @text_search=<FILE>;
    > > close FILE;
    > > open (OUT, "+>did_not_find_beta.txt");
    > > foreach my $query (@text_search){
    > > chomp ($query);
    > > if(my $tmp=$vals{$query} ne ''){
    > > push(@final, $vals{$tmp});
    > > }
    > > else { print OUT $query."\n";
    > > }
    > > }
    > > close OUT;
    > > open (OUT, "+>lines_that_pulled_beta.txt");
    > > print OUT @final;
    > > close OUT;

    >
    > You don't need these arrays at all. From
    > reducing the complexity, you might gain
    > another 50% run time. Especially the hash
    > lookup can be optimized:
    >
    > use strict;
    > use warnings;
    >
    > my $full_fn = shift;
    > my $srch_fn = shift;
    > my $pull_fn = 'lines_that_pulled_beta.txt';
    > my $err_fn = 'did_not_find.txt';
    > my (%vals, @final);
    >
    > open my $fh, '<', $full_fn or die $!;
    > while ( <$fh> ) {
    > chomp; # no need to use a temp array
    > $vals{$1} = $_ if /"([^"]+)/
    > }
    > close $fh;
    >
    > open my $fh_err, '>', $err_fn or die $!;
    > open $fh, '<', $srch_fn or die $!;
    > while( <$fh> ) {
    > chomp; # now do a hash lookup for existance of the key term
    > exists $vals{$_} ? push @final, $vals{$_} : print $fh_err "$_\n"
    > }
    > close $fh, close $fh_err;
    >
    > open $fh, '>', $pull_fn or die $!;
    > print $fh join "\n", @final;
    > close $fh;
    >
    > But why did you, in your original post (or one later),
    > stress that you *can't* use hash? Did you assume it
    > would be to large?
    >
    > Regards
    >
    > M.


    That's great stuff, thanks!
    Originally, the index file was not going to match the file name:
    the "Title Index" file was being provided, i.e. "Document.pdf".
    However, the Title Index was not found in the large file, and we had
    to do a lookup based on the storage location, i.e.
    "21392_2394993Document.pdf".
    It got hairy quickly. Now that I think about it, a hash could still
    have been applied; I just needed to clean up the storage location
    with a regex. But that is why, at first, I just wanted to make sure
    it worked before getting into a hash implementation. (It can be
    tricky to explain to non-coders what the script is doing on the fly
    when your superiors want to see the code and a demo, but that is a
    different story.)
    I can't imagine life without Perl :p
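
    For illustration, a hypothetical cleanup along those lines (the exact
    storage-name format is assumed from the example above): strip the
    numeric prefix so the title can be used as the hash key:

    # Hypothetical: assume the storage name is "<digits>_<digits>" plus the title.
    my $storage = '21392_2394993Document.pdf';
    ( my $title = $storage ) =~ s/^\d+_\d+//;    # -> "Document.pdf"
    print "$title\n";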
     
    bivity, Aug 20, 2007
    #10
