Looking for lots of words in lots of files

Discussion in 'Python' started by brad, Jun 18, 2008.

  1. brad

    brad Guest

    Just wondering if anyone has ever solved this efficiently... not looking
    for specific solutions tho... just ideas.

    I have one thousand words and one thousand files. I need to read the
    files to see if some of the words are in the files. I can stop reading a
    file once I find 10 of the words in it. It's easy for me to do this with
    a few dozen words, but a thousand words is too large for an RE and too
    inefficient to loop, etc. Any suggestions?

    Thanks
    brad, Jun 18, 2008
    #1

  2. brad wrote:

    > Just wondering if anyone has ever solved this efficiently... not looking
    > for specific solutions tho... just ideas.
    >
    > I have one thousand words and one thousand files. I need to read the
    > files to see if some of the words are in the files. I can stop reading a
    > file once I find 10 of the words in it. It's easy for me to do this with
    > a few dozen words, but a thousand words is too large for an RE and too
    > inefficient to loop, etc. Any suggestions?


    Use an indexer, like lucene (available as pylucene) or a database that
    offers word-indices.

    Diez
    Diez B. Roggisch, Jun 18, 2008
    #2

  3. Upload, wait, and google them.

    Seriously tho, aside from using a real indexer, I would build a set
    of the words I'm looking for, then loop over each file, looping over
    the words in it and checking each for membership in the set. On a
    hit, add the word to a dict mapping file names to lists of words
    found, until a file's list hits length 10 (roughly the sketch further
    down). I don't think that would be a complicated solution, and
    performance shouldn't be terrible.

    If you need to run this more than once, use an indexer.

    If you only need to use it once, use an indexer, so you learn how for
    next time.
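
    A minimal sketch of what I mean (the paths are made up; adjust to
    taste):

        import glob

        # Assumed layout (hypothetical): wordlist.txt holds one word per
        # line, and the files to scan live under docs/.
        words = set(w.strip() for w in open("wordlist.txt"))

        found = {}  # file name -> words from the list seen in that file
        for name in glob.glob("docs/*.txt"):
            hits = set()
            for line in open(name):
                hits.update(w for w in line.split() if w in words)
                if len(hits) >= 10:
                    break  # ten distinct words found; stop reading this file
            found[name] = sorted(hits)

    Splitting on whitespace is crude (punctuation sticks to the words),
    but it shows the shape of the thing.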

    On Jun 18, 2008, at 10:28 AM, brad wrote:

    > Just wondering if anyone has ever solved this efficiently... not
    > looking for specific solutions tho... just ideas.
    >
    > I have one thousand words and one thousand files. I need to read
    > the files to see if some of the words are in the files. I can stop
    > reading a file once I find 10 of the words in it. It's easy for me
    > to do this with a few dozen words, but a thousand words is too
    > large for an RE and too inefficient to loop, etc. Any suggestions?
    >
    > Thanks
    Calvin Spealman, Jun 18, 2008
    #3
  4. Calvin Spealman wrote:
    > Upload, wait, and google them.
    >
    > Seriously tho, aside from using a real indexer, I would build a set of
    > the words I'm looking for, and then loop over each file, looping over
    > the words and doing quick checks for containment in the set. If so, add
    > to a dict of file names to list of words found until the list hits 10
    > length. I don't think that would be a complicated solution and it
    > shouldn't be terrible at performance.
    >
    > If you need to run this more than once, use an indexer.
    >
    > If you only need to use it once, use an indexer, so you learn how for
    > next time.


    If you can't use an indexer, and performance matters, evaluate using
    grep and a shell script. Seriously.

    grep is a couple of orders of magnitude faster at pattern matching
    strings in files (and especially regexps) than python is. Even if you
    are invoking grep multiple times it is still likely to be faster than a
    "maximally efficient" single pass over the file in python. This
    realization was disappointing to me :)

    Kris
    Kris Kennaway, Jun 18, 2008
    #4
  5. Robert Bossy

    Robert Bossy Guest

    brad wrote:
    > Just wondering if anyone has ever solved this efficiently... not
    > looking for specific solutions tho... just ideas.
    >
    > I have one thousand words and one thousand files. I need to read the
    > files to see if some of the words are in the files. I can stop reading
    > a file once I find 10 of the words in it. It's easy for me to do this
    > with a few dozen words, but a thousand words is too large for an RE
    > and too inefficient to loop, etc. Any suggestions?

    The quick answer would be:
    grep -F -f WORDLIST FILE1 FILE2 ... FILE1000
    where WORDLIST is a file containing the thousand words, one per line.

    The more interesting answers would be to use either a suffix tree or an
    Aho-Corasick graph.

    - The suffix tree is a representation of the target string (your files)
    that lets you search quickly for a word. Your problem would then be
    solved by 1) building a suffix tree for your files, and 2) searching
    for each word sequentially in the suffix tree.

    - The Aho-Corasick graph is a representation of the query word list
    that allows fast scanning for the words in a target string. Your
    problem would then be solved by 1) building an Aho-Corasick graph for
    the list of words, and 2) scanning each file sequentially.

    The preference for one or the other depends on some details of your
    problem: the expected size of the target files, the rate of overlap
    between words in your list (are there common prefixes?), whether you
    will repeat the operation with another word list or another set of
    files, etc. Personally, I'd lean towards Aho-Corasick; it's a matter of
    taste, and the kinds of applications that come to mind make it more
    practical.

    Btw, the `grep -F -f` combo builds an Aho-Corasick graph. You can also
    find modules for building both data structures in the Python Package
    Index.
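
    For illustration, a bare-bones Aho-Corasick in pure Python might look
    like the sketch below. It is deliberately simplified: it matches
    substrings (like `grep -F`, no word boundaries) and is not tuned for
    speed; prefer one of those packaged modules for real use.

        from collections import deque

        def build(words):
            # trie[n]: char -> child node; out[n]: words ending at node n;
            # fail[n]: node for the longest proper suffix also in the trie.
            trie, out, fail = [{}], [set()], [0]
            for word in words:
                node = 0
                for ch in word:
                    if ch not in trie[node]:
                        trie.append({})
                        out.append(set())
                        fail.append(0)
                        trie[node][ch] = len(trie) - 1
                    node = trie[node][ch]
                out[node].add(word)
            queue = deque(trie[0].values())  # depth-1 nodes keep fail = 0
            while queue:
                node = queue.popleft()
                for ch, child in trie[node].items():
                    queue.append(child)
                    f = fail[node]
                    while f and ch not in trie[f]:
                        f = fail[f]
                    fail[child] = trie[f].get(ch, 0)
                    out[child] |= out[fail[child]]
            return trie, out, fail

        def scan(text, trie, out, fail):
            # Walk the automaton over text, collecting every word it matches.
            node, found = 0, set()
            for ch in text:
                while node and ch not in trie[node]:
                    node = fail[node]
                node = trie[node].get(ch, 0)
                found |= out[node]
            return found

        trie, out, fail = build(["he", "she", "hers"])
        print(scan("ushers", trie, out, fail))  # he, she, hers (some order)

    For your case you would feed scan() each file (line by line works) and
    stop as soon as ten distinct words have turned up.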

    Cheers,
    RB
    Robert Bossy, Jun 18, 2008
    #5
  6. Robert Bossy

    Robert Bossy Guest

    I forgot to mention another way: put one thousand monkeys to work on it. ;)

    RB

    Robert Bossy wrote:
    <cut>
    Robert Bossy, Jun 18, 2008
    #6
  7. Kris Kennaway wrote:
    <cut>
    >
    > If you can't use an indexer, and performance matters, evaluate using
    > grep and a shell script. Seriously.
    >
    > grep is a couple of orders of magnitude faster at pattern matching
    > strings in files (and especially regexps) than python is. Even if you
    > are invoking grep multiple times it is still likely to be faster than a
    > "maximally efficient" single pass over the file in python. This
    > realization was disappointing to me :)
    >
    > Kris


    Adding to this:
    Then again, there is nothing wrong with wrapping grep from Python and
    reverting to a pure Python 'solution' if the system has no grep.
    Reinventing the wheel is usually only practical if the existing ones
    aren't round :)
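
    Something along these lines, say (an untested sketch; it assumes a
    grep on the PATH and a wordlist.txt with one word per line):

        import subprocess

        def files_containing_words(wordlist, filenames):
            # -F fixed strings, -f patterns from a file, -l print only
            # the names of files with at least one match.
            cmd = ["grep", "-l", "-F", "-f", wordlist] + list(filenames)
            proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
            output = proc.communicate()[0]
            return output.splitlines()

    Catch the OSError from Popen and drop back to a pure Python loop (like
    the set-based one earlier in the thread) when grep isn't there.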

    --
    mph
    Martin P. Hellwig, Jun 18, 2008
    #7
  8. Jeff McNeil

    Jeff McNeil Guest

    On Jun 18, 10:29 am, "Diez B. Roggisch" <> wrote:
    > brad wrote:
    > > Just wondering if anyone has ever solved this efficiently... not looking
    > > for specific solutions tho... just ideas.
    > >
    > > I have one thousand words and one thousand files. I need to read the
    > > files to see if some of the words are in the files. I can stop reading a
    > > file once I find 10 of the words in it. It's easy for me to do this with
    > > a few dozen words, but a thousand words is too large for an RE and too
    > > inefficient to loop, etc. Any suggestions?
    >
    > Use an indexer, like lucene (available as pylucene) or a database that
    > offers word-indices.
    >
    > Diez


    I've been toying around with Nucular (http://nucular.sourceforge.net/)
    a bit recently for some side projects. It's pure Python and seems to
    work fairly well for my needs. I haven't pumped all that much data
    into it, though.
    Jeff McNeil, Jun 18, 2008
    #8
  9. Cong

    Cong Guest

    On Jun 18, 11:01 pm, Kris Kennaway <> wrote:
    > Calvin Spealman wrote:
    <cut>
    >
    > If you can't use an indexer, and performance matters, evaluate using
    > grep and a shell script. Seriously.
    >
    > grep is a couple of orders of magnitude faster at pattern matching
    > strings in files (and especially regexps) than python is. Even if you
    > are invoking grep multiple times it is still likely to be faster than a
    > "maximally efficient" single pass over the file in python. This
    > realization was disappointing to me :)
    >
    > Kris


    Alternatively, if you don't feel like writing shell scripts by hand,
    you can write a Python program which auto-generates the desired shell
    script utilizing grep, e.g. use Python to build the file list that is
    passed to grep as arguments. ;-P
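
    For instance (a hypothetical layout; pipes.quote guards against odd
    file names):

        import glob, pipes

        # Python builds the quoted file list; grep -l then reports which
        # files contain at least one of the listed words.
        names = [pipes.quote(n) for n in glob.glob("docs/*.txt")]
        script = open("run_grep.sh", "w")
        script.write("#!/bin/sh\n")
        script.write("grep -l -F -f wordlist.txt %s\n" % " ".join(names))
        script.close()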
    Cong, Jun 19, 2008
    #9
  10. brad wrote:
    > Just wondering if anyone has ever solved this efficiently... not looking
    > for specific solutions tho... just ideas.
    >
    > I have one thousand words and one thousand files. I need to read the
    > files to see if some of the words are in the files. I can stop reading a
    > file once I find 10 of the words in it. It's easy for me to do this with
    > a few dozen words, but a thousand words is too large for an RE and too
    > inefficient to loop, etc. Any suggestions?


    Full text indexing.
    Bruno Desthuilliers, Jun 19, 2008
    #10
