Fast String search algorithm

Discussion in 'Java' started by Roedy Green, Dec 16, 2011.

  1. Roedy Green

    Roedy Green Guest

    Let's say I had a million records each with a text field. I wanted to
    find all records that contained a given substring. Are there fast
    algorithms to do that or do you have to scan the whole thing linearly?

    I imagine you could build an index of words. Then maybe you could
    figure out which words contain the string, and narrow your search to
    records with those words.

    Similarly, I was wondering how you might build an index to all the
    files on a computer so that you could find one just by giving a wild
    card or partial filename.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    For me, the appeal of computer programming is that
    even though I am quite a klutz,
    I can still produce something, in a sense
    perfect, because the computer gives me as many
    chances as I please to get it right.
    Roedy Green, Dec 16, 2011
    #1
    1. Advertising

  2. Roedy Green

    Tom Anderson Guest

    On Fri, 16 Dec 2011, Roedy Green wrote:

    > Let's say I had a million records each with a text field. I wanted to
    > find all records that contained a given substring. Are there fast
    > algorithms to do that or do you have to scan the whole thing linearly?


    http://en.wikipedia.org/wiki/Inverted_index

    > I imagine you could build an index of words. Then maybe you could
    > figure out which words contain the string, and narrow your search to
    > records with those words.


    Exactly that!

    > Similarly, I was wondering how you might build an index to all the files
    > on a computer so that you could find one just by giving a wild card or
    > partial filename.


    There is a standard utility on unix that does just that:

    http://en.wikipedia.org/wiki/Locate_(Unix)

    The source is in C, but might be informative (this is for the 'mlocate'
    version):

    https://fedorahosted.org/mlocate/browser/src

    tom

    --
    Get a fucking hobby that isn't breathing, browsing 4chan, or fapping. --
    The Well Cultured Anonymous, on Manners
    Tom Anderson, Dec 16, 2011
    #2
    1. Advertising

  3. Roedy Green

    Tassilo Horn Guest

    Roedy Green <> writes:

    > Let's say I had a million records each with a text field. I wanted to
    > find all records that contained a given substring. Are there fast
    > algorithms to do that or do you have to scan the whole thing linearly?


    Sure. See http://en.wikipedia.org/wiki/Substring_index

    > Similarly, I was wondering how you might build an index to all the
    > files on a computer so that you could find one just by giving a wild
    > card or partial filename.


    On UNIXes, that's exactly what the locate tool does. Then you can
    search with globbing wildcards

    % # all files containing the substring foo
    % locate *foo*

    or even with regular expressions

    % # all files that end with "r.c" or "r.cpp"
    % locate -r '.*r\.\(c\|cpp\)$'

    Bye,
    Tassilo
    Tassilo Horn, Dec 16, 2011
    #3
  4. On Fri, 16 Dec 2011 10:08:02 -0800, Roedy Green
    <> wrote:

    >Let's say I had a million records each with a text field. I wanted to
    >find all records that contained a given substring. Are there fast
    >algorithms to do that or do you have to scan the whole thing linearly?


    Look up the Boyer-Moore string search algorithm.

    [snip]

    Sincerely,

    Gene Wirchenko
    Gene Wirchenko, Dec 16, 2011
    #4
  5. Gene Wirchenko <> wrote:
    > On Fri, 16 Dec 2011 10:08:02 -0800, Roedy Green
    > <> wrote:


    >>Let's say I had a million records each with a text field. I wanted to
    >>find all records that contained a given substring. Are there fast
    >>algorithms to do that or do you have to scan the whole thing linearly?


    > Look up the Boyer-Moore string search algorithm.


    Well, Boyer-Moore still scans linearly, but just does it faster.

    Still, you could use something Boyer-Moore related if you know which
    characters were in each record. If a record didn't have a character
    in your substring, then it couldn't have the substring.

    Still, understand Boyer-Moore before you get much farther
    with string searching.

    -- glen
    glen herrmannsfeldt, Dec 16, 2011
    #5
  6. Roedy Green

    Roedy Green Guest

    On Fri, 16 Dec 2011 20:07:25 +0100, Tassilo Horn
    <> wrote, quoted or indirectly quoted someone
    who said :

    >On UNIXes, that's exactly what the locate tool does. Then you can
    >search with globbing wildcards


    i am trying to talk the JPSoft people into adding a locate feature
    into Take Command. I wanted to be able to tell them roughly how to
    implement it.

    They file descriptions, and fast change directory give partial
    directory names. I hoped they could come up with an integrated system
    that was thread safe. (The current ones locks everybody out during
    refresh).
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    For me, the appeal of computer programming is that
    even though I am quite a klutz,
    I can still produce something, in a sense
    perfect, because the computer gives me as many
    chances as I please to get it right.
    Roedy Green, Dec 16, 2011
    #6
  7. Roedy Green

    Arne Vajhøj Guest

    On 12/16/2011 1:08 PM, Roedy Green wrote:
    > Let's say I had a million records each with a text field. I wanted to
    > find all records that contained a given substring. Are there fast
    > algorithms to do that or do you have to scan the whole thing linearly?
    >
    > I imagine you could build an index of words. Then maybe you could
    > figure out which words contain the string, and narrow your search to
    > records with those words.
    >
    > Similarly, I was wondering how you might build an index to all the
    > files on a computer so that you could find one just by giving a wild
    > card or partial filename.


    Use a database that support full text search.

    Arne
    Arne Vajhøj, Dec 17, 2011
    #7
  8. Roedy Green

    Lew Guest

    Arne Vajhøj wrote:
    > Roedy Green wrote:
    >> Let's say I had a million records each with a text field. I wanted to
    >> find all records that contained a given substring. Are there fast
    >> algorithms to do that or do you have to scan the whole thing linearly?
    >>
    >> I imagine you could build an index of words. Then maybe you could
    >> figure out which words contain the string, and narrow your search to
    >> records with those words.
    >>
    >> Similarly, I was wondering how you might build an index to all the
    >> files on a computer so that you could find one just by giving a wild
    >> card or partial filename.

    >
    > Use a database that support full text search.


    Perhaps
    <http://www.google.com/search?q=latent+semantic+indexing>

    Latent semantic indexing, or LSI, is an algorithm that maps word associations
    culled from document bases into cartographic clusters in n-space, then projects
    those clusters into m-space, m < n. The Wikipedia article (found in the
    referenced search) describes it in terms of the underlying matrix operations,
    which is great fun for those who like such things but loses the forest for the
    trees. I think of it as dusty blobs floating in infospace (like cyberspacebut
    doesn't need computers) with bright spots where terms (words, phrases, other
    semantic molecules) live near each other, like "driver" not far from "wedge".
    With lots of documents and terms, you have a lot of points of light, m times n
    dimensions. The reduction brings it down to about one to three hundred
    dimensions, but it's still a hyperspatial map (in the cartographic sense).

    It is compute-intensive to map out all those correlations but makes for not
    only a fast search, but one that seems almost magical in its ability to pull
    meaningful answers out of the morass. The term "latent semantic indexing"
    refers to how all that dusty matrix math ("indexing") pulls out the meaning
    ("semantic") lurking in the way words coalesce in meaningful documents
    ("latent"). The correlations capture the correlations in the content that
    embody the linguistic meaning, so search results correlate quite well to
    semantics in the query's meaning.

    To a person fond of maths, information science, library science, linguistics,
    semiotics, cartography, AI or document management LSI is a delight.

    --
    Lew






    --
    Lew
    Lew, Dec 17, 2011
    #8
  9. Roedy Green

    markspace Guest

    On 12/16/2011 10:58 AM, Tom Anderson wrote:
    > On Fri, 16 Dec 2011, Roedy Green wrote:
    >
    >> Let's say I had a million records each with a text field. I wanted to
    >> find all records that contained a given substring. Are there fast
    >> algorithms to do that or do you have to scan the whole thing linearly?

    >
    > http://en.wikipedia.org/wiki/Inverted_index
    >


    An implementation of an inverted index is often called a search engine.

    <http://en.wikipedia.org/wiki/Search_engine_%28computing%29>

    One of the best known implementations is a web search engine. Alt
    Vista, Yahoo, and Google are well known examples.

    There's a Java based search engine I happen to know about, Lucene.

    <http://lucene.apache.org/java/docs/index.html>

    This is a roll-your-own library, but it has a lot of features. Beyond a
    cursory reading of the documentation I can't really say much about it
    unfortunately.
    markspace, Dec 17, 2011
    #9
  10. Roedy Green

    Daniel Pitts Guest

    On 12/17/11 9:26 AM, markspace wrote:
    > On 12/16/2011 10:58 AM, Tom Anderson wrote:
    >> On Fri, 16 Dec 2011, Roedy Green wrote:
    >>
    >>> Let's say I had a million records each with a text field. I wanted to
    >>> find all records that contained a given substring. Are there fast
    >>> algorithms to do that or do you have to scan the whole thing linearly?

    >>
    >> http://en.wikipedia.org/wiki/Inverted_index
    >>

    >
    > An implementation of an inverted index is often called a search engine.
    >
    > <http://en.wikipedia.org/wiki/Search_engine_%28computing%29>
    >
    > One of the best known implementations is a web search engine. Alt Vista,
    > Yahoo, and Google are well known examples.
    >
    > There's a Java based search engine I happen to know about, Lucene.
    >
    > <http://lucene.apache.org/java/docs/index.html>
    >
    > This is a roll-your-own library, but it has a lot of features. Beyond a
    > cursory reading of the documentation I can't really say much about it
    > unfortunately.
    >

    Solr is a webservice(ish) program which is based around Lucene.

    <http://lucene.apache.org/solr/>
    Daniel Pitts, Dec 18, 2011
    #10
  11. On 16/12/2011 10:08 AM, Roedy Green wrote:
    > Let's say I had a million records each with a text field. I wanted to
    > find all records that contained a given substring. Are there fast
    > algorithms to do that or do you have to scan the whole thing linearly?
    >
    > I imagine you could build an index of words. Then maybe you could
    > figure out which words contain the string, and narrow your search to
    > records with those words.
    >
    > Similarly, I was wondering how you might build an index to all the
    > files on a computer so that you could find one just by giving a wild
    > card or partial filename.


    In addition to the other links mentioned, here are some other links I
    found helpful for the same problem:

    http://en.wikipedia.org/wiki/Bloom_filter
    http://en.wikipedia.org/wiki/Bitap_algorithm
    http://johannburkard.de/software/stringsearch/

    The Bloom Filter can be used to help reduce the set of files to search.
    Travers Naran, Dec 18, 2011
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Replies:
    0
    Views:
    661
  2. Michele Simionato

    Python is darn fast (was: How fast is Python)

    Michele Simionato, Aug 23, 2003, in forum: Python
    Replies:
    13
    Views:
    558
  3. Replies:
    1
    Views:
    438
    Joe Seigh
    Dec 15, 2006
  4. fish
    Replies:
    1
    Views:
    377
  5. czechboy

    Fast search for all positions in a string

    czechboy, Jun 4, 2008, in forum: Javascript
    Replies:
    7
    Views:
    111
    Jorge
    Jun 5, 2008
Loading...

Share This Page