How to quickly search over a large number of files using python?



dwivedi.devika

Hi all,

I am a newbie to python.

I have about 500 search queries, and about 52000 files in which I have to find all matches for each of the 500 queries.

How should I approach this? Seems like the straightforward way to do it would be to loop through each of the files and go line by line comparing all the terms to the query, but this seems like it would take too long.

Can someone give me a suggestion as to how to minimize the search time?

Thanks!
 


Oscar Benjamin

> I am a newbie to python.
>
> I have about 500 search queries, and about 52000 files in which I have to find all matches for each of the 500 queries.
>
> How should I approach this? Seems like the straightforward way to do it would be to loop through each of the files and go line by line comparing all the terms to the query, but this seems like it would take too long.

That would be the obvious way to do this.
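A rough sketch of that obvious line-by-line approach, assuming the queries are plain substrings (the directory path and query list here are placeholders, not anything from your setup):

```python
import os

def brute_force_search(root_dir, queries):
    """Scan every file under root_dir line by line, recording each
    (path, line_number, query) where a query string appears."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # errors="ignore" skips undecodable bytes so one odd
            # file doesn't abort the whole scan
            with open(path, errors="ignore") as f:
                for lineno, line in enumerate(f, 1):
                    for query in queries:
                        if query in line:
                            matches.append((path, lineno, query))
    return matches
```

With 500 queries this does 500 substring tests per line, which is where most of the time goes.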
> Can someone give me a suggestion as to how to minimize the search time?

What do you mean by a "query"? (Code indicating how a query would
match would be helpful here.)


Oscar
 

Dave Angel

> Hi all,
>
> I am a newbie to python.
>
> I have about 500 search queries, and about 52000 files in which I have to find all matches for each of the 500 queries.
>
> How should I approach this? Seems like the straightforward way to do it would be to loop through each of the files and go line by line comparing all the terms to the query, but this seems like it would take too long.
>
> Can someone give me a suggestion as to how to minimize the search time?

Are these files text or binary? Are they an 8bit character set or
Unicode?

Without more information about what these "queries" are, it's not even
possible to say whether the above approach could work at all.

Please specify the nature of these queries, and whether all the queries
are of the same form. For example, it may be that each of the queries
is a simple search string, not containing newline or wildcard.

Or it may be that the queries are arbitrary regular expressions, with
some of them potentially matching a multi-line block of text.

Have you implemented the brute-force approach you describe, and is it
indeed too slow? By what factor? Does it take 1000 times as long as
desired, or 5 times? How about if you do one query for those 52000
files, is it still too slow? And by what factor?
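That single-query baseline is easy to measure before optimizing anything (a sketch; the file list and query are placeholders):

```python
import time

def time_single_query(paths, query):
    """Time one plain-substring query over a list of files,
    returning (number_of_matching_lines, elapsed_seconds)."""
    start = time.perf_counter()
    hits = 0
    for path in paths:
        with open(path, errors="ignore") as f:
            for line in f:
                if query in line:
                    hits += 1
    return hits, time.perf_counter() - start
```

Multiplying the elapsed time by 500 gives a ceiling for the naive approach, which tells you how much speedup you actually need.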

Assuming each of the queries is independent, and that none of them need
more than one line to process, it might be possible to combine some or
all of those queries into a simpler filter or filters. Then one could
speed up the process by applying the filter to each line, and only if
it triggers, to check the line with the individual queries.
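If the queries really are plain substrings, one way to build such a filter is a single compiled regex alternation used as a cheap pre-check (a sketch under that assumption; whether it is actually faster than 500 plain substring tests depends on the regex engine, so measure it):

```python
import re

def make_prefilter(queries):
    """Combine plain-substring queries into one compiled regex that
    matches a line iff at least one query would match it."""
    return re.compile("|".join(re.escape(q) for q in queries))

def search_line(line, queries, prefilter):
    """Return the queries matching this line, using the combined
    filter to reject most lines with a single scan."""
    if prefilter.search(line) is None:
        return []
    # Only lines that pass the filter pay for the per-query checks.
    return [q for q in queries if q in line]
```

The same shape works with other filters, e.g. a set-membership test on words, as long as the filter never rejects a line that some query would match.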

You also don't indicate whether this is a one-time query, or whether
the same files might need to be searched later for a different set of
queries, or whether the queries might need to be applied to a
different set of files. Or whether the same search may need to be
repeated on a very similar set of files, or ...

Even under the most ideal set of constraints, some queries may
produce filters that are either harder to apply than the original
queries, or filters that produce so many candidates that this process
takes longer than just applying the queries brute-force.

Many times, optimization efforts focus on the wrong problem, or ignore
the relative costs of programmer time and machine time. Other times,
the problem being optimized is simply intractable with current
technology.
 


Roy Smith

> Hi all,
>
> I am a newbie to python.
>
> I have about 500 search queries, and about 52000 files in which I have to find all matches for each of the 500 queries.

Before anybody can even begin to answer this question, we need to know
what you mean by "search query". Are you talking pattern matching,
keyword matching, fuzzy hits OK, etc? Give us a couple of examples of
the kind of searches you'd like to execute.

Also, is this a one-off thing, or are you planning to do many searches
over the same collection of files? If so, you will want to do some sort
of pre-processing or indexing to speed up the search execution. It's
extremely unlikely you want to reinvent the wheel here. There are tons
of search packages out there that do this sort of thing. Just a few to
check out include Apache Lucene, Apache Solr, and Xapian.
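The core idea behind all of those packages can be illustrated with a tiny inverted index, assuming the queries are single keywords (this is a simplified sketch of the concept, not how Lucene or Xapian actually store anything):

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids containing it.
    `docs` is a dict of {doc_id: text}. Built once, up front."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup(index, keyword):
    """Answer a keyword query without rescanning any file."""
    return index.get(keyword.lower(), set())
```

Indexing pays the scanning cost once; afterwards each of the 500 queries is a dictionary lookup rather than a pass over 52000 files.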
 
