Most efficient algorithm for matching

Discussion in 'Java' started by Timasmith, Feb 13, 2007.

  1. Timasmith

    Timasmith Guest

    Suppose I have a list of 10,000 items that you can order, each order
    item has an ID and a Name.

    If you type in a string and I want to find every item that CONTAINS
    that string, what is the most efficient way?

    I happen to be using java but I am open to general solutions.

    Obviously I could just loop through the list and do a match on each
    string but I cant help thinking one could be more efficient thatn
    O(n).

    A traditional database btree index doesnt help, but I think search
    engines must use something fast.

    thanks

    Tim
    Timasmith, Feb 13, 2007
    #1
    1. Advertising

  2. Timasmith wrote:
    > Suppose I have a list of 10,000 items that you can order, each order
    > item has an ID and a Name.
    >
    > If you type in a string and I want to find every item that CONTAINS
    > that string, what is the most efficient way?
    >
    > I happen to be using java but I am open to general solutions.
    >
    > Obviously I could just loop through the list and do a match on each
    > string but I cant help thinking one could be more efficient thatn
    > O(n).
    >
    > A traditional database btree index doesnt help, but I think search
    > engines must use something fast.
    >
    > thanks
    >
    > Tim
    >


    If your data isn't sorted just looping through it is the fastest way.

    --

    Knute Johnson
    email s/nospam/knute/
    Knute Johnson, Feb 13, 2007
    #2
    1. Advertising

  3. Timasmith

    Timasmith Guest

    On Feb 12, 9:49 pm, Knute Johnson <>
    wrote:
    > Timasmithwrote:
    > > Suppose I have a list of 10,000 items that you can order, each order
    > > item has an ID and a Name.

    >
    > > If you type in a string and I want to find every item that CONTAINS
    > > that string, what is the most efficient way?

    >
    > > I happen to be using java but I am open to general solutions.

    >
    > > Obviously I could just loop through the list and do a match on each
    > > string but I cant help thinking one could be more efficient thatn
    > > O(n).

    >
    > > A traditional database btree index doesnt help, but I think search
    > > engines must use something fast.

    >
    > > thanks

    >
    > > Tim

    >
    > If your data isn't sorted just looping through it is the fastest way.
    >
    > --
    >
    > Knute Johnson
    > email s/nospam/knute/- Hide quoted text -
    >
    > - Show quoted text -


    Maybe. What if I constrained the search to say that the string must
    have a least 2 characters. So I guess you could take all 10,000
    entries, determine all of the sequences and put those into a
    hashtable. So if I search on chair it would find all the indexed
    entries with ch in the them, then look through the results. If <5% of
    the entries have ch in them, I would imagine that would help.
    Timasmith, Feb 13, 2007
    #3
  4. Timasmith

    Lew Guest

    Timasmith wrote:
    > On Feb 12, 9:49 pm, Knute Johnson <>
    > wrote:
    >> Timasmithwrote:
    >>> Suppose I have a list of 10,000 items that you can order, each order
    >>> item has an ID and a Name.


    Maybe some more outré structure like a trie will help:
    <http://en.wikipedia.org/wiki/Trie>
    in particular a suffix tree:
    <http://en.wikipedia.org/wiki/Suffix_tree>

    other structures:
    <http://en.wikipedia.org/wiki/List_of_data_structures>

    - Lew
    Lew, Feb 13, 2007
    #4
  5. Timasmith

    Eric Sosman Guest

    Timasmith wrote:
    > On Feb 12, 9:49 pm, Knute Johnson <>
    > wrote:
    >> Timasmithwrote:
    >>> Suppose I have a list of 10,000 items that you can order, each order
    >>> item has an ID and a Name.
    >>> If you type in a string and I want to find every item that CONTAINS
    >>> that string, what is the most efficient way?
    >>> I happen to be using java but I am open to general solutions.
    >>> Obviously I could just loop through the list and do a match on each
    >>> string but I cant help thinking one could be more efficient thatn
    >>> O(n).
    >>> A traditional database btree index doesnt help, but I think search
    >>> engines must use something fast.
    >>> thanks
    >>> Tim

    >> If your data isn't sorted just looping through it is the fastest way.
    >>
    >> --
    >>
    >> Knute Johnson
    >> email s/nospam/knute/- Hide quoted text -
    >>
    >> - Show quoted text -

    >
    > Maybe. What if I constrained the search to say that the string must
    > have a least 2 characters. So I guess you could take all 10,000
    > entries, determine all of the sequences and put those into a
    > hashtable. So if I search on chair it would find all the indexed
    > entries with ch in the them, then look through the results. If <5% of
    > the entries have ch in them, I would imagine that would help.


    If you're doing just a few searches, a straight-line pass
    over all the data is probably best.

    If you're doing many searches, it may make sense to invest
    a fairly large amount of work in initialization to make the
    searches themselves run faster. Here's one way:

    Map<String,Set<Item>> map = new HashMap<String,Set<Item>>();
    for (Item item : itemCollection) {
    for (String key : item.setOfAllKeys()) {
    Set<Item> set = map.get(key);
    if (set == null) {
    set = new Set<Item>();
    map.put(key, set);
    }
    set.add(item);
    }
    }

    That is, you have a map that associates each substring with a
    set of all the Items having that substring. Each Item appears
    in as many sets as it has distinct substrings.

    It may be desirable to endow setOfAllKeys() with a little
    more intelligence than a simple brute-force decomposition would
    exhibit. If one of your Items is "LEFT HANDED MONKEY WRENCH"
    there's probably not much to be gained by indexing it under "FT"
    and "D M" and "AND" -- yet indexing under "WRENCHES" might be
    a good idea, even though it's not a substring of the name.

    See also "The Art of Computer Programming, Volume III:
    Sorting and Searching" by D.E. Knuth. IIRC, one approach he
    describes for this kind of search is to start with "LEFT HANDED
    MONKEY WRENCH" and enter it in all of the lists LE, EF, FT, ...
    DE, ED. Then given the search key "DED MONK" you would form
    the intersection of the lists for DE, ED, ..., NK (keeping the
    lists sorted would be helpful, or you could use bit vectors)
    and finish by inspecting only those Items in the intersection.

    --
    Eric Sosman
    lid
    Eric Sosman, Feb 13, 2007
    #5
  6. Timasmith wrote:
    > On Feb 12, 9:49 pm, Knute Johnson <>
    > wrote:
    >> Timasmithwrote:
    >>> Suppose I have a list of 10,000 items that you can order, each order
    >>> item has an ID and a Name.
    >>> If you type in a string and I want to find every item that CONTAINS
    >>> that string, what is the most efficient way?
    >>> I happen to be using java but I am open to general solutions.
    >>> Obviously I could just loop through the list and do a match on each
    >>> string but I cant help thinking one could be more efficient thatn
    >>> O(n).
    >>> A traditional database btree index doesnt help, but I think search
    >>> engines must use something fast.
    >>> thanks
    >>> Tim

    >> If your data isn't sorted just looping through it is the fastest way.
    >>
    >> --
    >>
    >> Knute Johnson
    >> email s/nospam/knute/- Hide quoted text -
    >>
    >> - Show quoted text -

    >
    > Maybe. What if I constrained the search to say that the string must
    > have a least 2 characters. So I guess you could take all 10,000
    > entries, determine all of the sequences and put those into a
    > hashtable. So if I search on chair it would find all the indexed
    > entries with ch in the them, then look through the results. If <5% of
    > the entries have ch in them, I would imagine that would help.
    >
    >
    >


    With 10,000 items, I doubt very much that you can beat a linear search
    by very much with a complicated data structure. With 10,000,00 I can
    see where it would be a different story. A linear search is pretty easy
    to code. 10,000 classes of 1000 bytes only takes up 10mb. Easy to
    store in memory. 10 million items on the other hand would probably have
    to be kept on some kind of permanent storage. Or you might want to use
    some real database manager instead and use its searching capabilities.

    --

    Knute Johnson
    email s/nospam/knute/
    Knute Johnson, Feb 13, 2007
    #6
  7. Timasmith

    bugbear Guest

    Timasmith wrote:
    > Suppose I have a list of 10,000 items that you can order, each order
    > item has an ID and a Name.
    >
    > If you type in a string and I want to find every item that CONTAINS
    > that string, what is the most efficient way?


    Just to clarify; you wish to detect your target string (e.g. "har")
    within the Name of your item object?

    e.g. "harold" is a hit, "harry morse" is a hit, "mr haroldson" is a hit,
    "nigel smith" is a miss?

    BugBear
    bugbear, Feb 13, 2007
    #7
  8. Timasmith

    Timasmith Guest

    On Feb 13, 4:57 am, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > Timasmithwrote:
    > > Suppose I have a list of 10,000 items that you can order, each order
    > > item has an ID and a Name.

    >
    > > If you type in a string and I want to find every item that CONTAINS
    > > that string, what is the most efficient way?

    >
    > Just to clarify; you wish to detect your target string (e.g. "har")
    > within the Name of your item object?
    >
    > e.g. "harold" is a hit, "harry morse" is a hit, "mr haroldson" is a hit,
    > "nigel smith" is a miss?
    >
    > BugBear



    Correct.
    Timasmith, Feb 13, 2007
    #8
  9. Timasmith

    bugbear Guest

    Timasmith wrote:
    > On Feb 13, 4:57 am, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    >> Timasmithwrote:
    >>> Suppose I have a list of 10,000 items that you can order, each order
    >>> item has an ID and a Name.
    >>> If you type in a string and I want to find every item that CONTAINS
    >>> that string, what is the most efficient way?

    >> Just to clarify; you wish to detect your target string (e.g. "har")
    >> within the Name of your item object?
    >>
    >> e.g. "harold" is a hit, "harry morse" is a hit, "mr haroldson" is a hit,
    >> "nigel smith" is a miss?
    >>

    >
    > Correct.


    OK. Any other constraints on alphabet (i.e. number of distinct characters
    allowed for use in "Name"), or search targets (whole words?),
    number of insertions to the item list compared with number of retrievals (searches)?

    BugBear
    bugbear, Feb 13, 2007
    #9
  10. Timasmith

    Jeff Higgins Guest

    Timasmith wrote:
    > Suppose I have a list of 10,000 items that you can order, each order
    > item has an ID and a Name.
    >
    > If you type in a string and I want to find every item that CONTAINS
    > that string, what is the most efficient way?
    >
    > I happen to be using java but I am open to general solutions.


    http://lucene.apache.org/java/docs/

    >
    > Obviously I could just loop through the list and do a match on each
    > string but I cant help thinking one could be more efficient thatn
    > O(n).
    >
    > A traditional database btree index doesnt help, but I think search
    > engines must use something fast.
    >
    > thanks
    >
    > Tim
    >
    Jeff Higgins, Feb 13, 2007
    #10
  11. Timasmith

    Chris Uppal Guest

    Timasmith wrote:

    > Suppose I have a list of 10,000 items that you can order, each order
    > item has an ID and a Name.
    >
    > If you type in a string and I want to find every item that CONTAINS
    > that string, what is the most efficient way?


    Efficient text searching is a /big/ subject. No, let me be more precise: it's
    /HUGE/ subject.

    If your list of items is reasonably short, and the items themselves are
    reasonably short, then you can do yourself a big favour by ignoring efficiency
    and just searching linearly. A lot depends on the application, but I'm not
    convinced that 10K items is enough to justify heavier-weight solutions.

    Next up, there are efficient text searching algorithms, such as Boyer-Moore,
    and the interesting newish breed of "bit-parallel" or "bitmap" searching
    algorithms. If you have enough text, and if the overhead (in time, space, or
    convenience) of maintaining an index is unacceptable, then looking into those
    algorithms would be sensible. Note that there variants on these algorithms
    which do approximate matching. You might look for "agrep" for instance.

    Also Wikipeadia has quite a few pages on text searching.
    http://en.wikipedia.org/wiki/Category:Search_algorithms

    Lastly, if you need greater speed than such methods can provide (and search
    engines /'definitely/ do ;-) then you'd need to use some sort of indexing.

    The simplest way to approach that is to ensure that your database has something
    like "full text search and indexing" in its feature set -- and use that.

    Alternatively, there are off-the-shelf packages for bulk test searching. Jeff
    Higgins has already pointed you towards Apache Lucene, which is one such
    package.

    -- chris
    Chris Uppal, Feb 13, 2007
    #11
  12. Timasmith

    H. S. Lahman Guest

    Responding to Timasmith...

    > Suppose I have a list of 10,000 items that you can order, each order
    > item has an ID and a Name.


    The key will be ordering, but for just 10K items one probably doesn't
    want to get too elaborate.

    > If you type in a string and I want to find every item that CONTAINS
    > that string, what is the most efficient way?


    I assume you mean in the Name rather than the ID. I also assume that the
    data gets stored once but is searched often.

    Is there a limit on the lengths of the Names? What is the average
    length? The answers to these questions will drive tailoring the
    solutions below to trade-off speed vs. effort. Can you define the ID
    (e.g., an index for the order the data items are stored)?

    >
    > I happen to be using java but I am open to general solutions.
    >
    > Obviously I could just loop through the list and do a match on each
    > string but I cant help thinking one could be more efficient thatn
    > O(n).


    I would be inclined to make use of a couple of practical
    characteristics. In searching ASCII keys most rejections will occur in
    the first few characters. (It has been many, many moons since I've done
    searches so I forget the actual numbers for the distribution.) Also,
    string matching algorithms for a given starting point in each string are
    quite efficient.

    I think one "cheap" approach would be to keep an ordered list of
    digraphs or trigraphs that appear anywhere in the Name. That list gets
    updated whenever an item is added/removed from the data. The list would
    contain multiple entries consisting of the item ID and the start
    position of the digraph/trigraph in the Name. (Maybe the remaining
    length if the search keys are likely to be long so that you can
    eliminate embedded substrings that would extend beyond the Name length.)

    The data itself would be ordered in an array and the "ID" above would be
    an index into the storage position (either the original ID or one you
    create when storing the data). Your algorithm would find the starting
    digraph/trigraph in the list above and "walk" the entries for that
    digraph/trigraph. For each one it would do a lookup to get the data and
    use a conventional substring matching algorithm for the full search
    string at the indicated start point.

    There are even less elaborate versions. You can simply keep a list of
    data items that contain the starting letter of the search string
    somewhere and search just them. Then all you need is to store the data
    so you can access items by an index referenced in the starting letter list.

    Bottom line: you have a trade-off between how fast you want to be and
    how much work you want to put into storing the data items. As a fairly
    general rule the search will be in direct proportional to how much space
    you have for storing supporting infrastructure like index tables.

    For example, you could create 26 trees whose nodes were letters. The
    root node in each tree would be one of the possible first letters of the
    Name. Each child node would be a character that immediately followed the
    parent as the next letter of the Name (pretty much like a B-Tree). Each
    node would have a list of items that contained the substring of letters
    leading to that node. Then you could create a similar set of trees where
    the root nodes were the second letter of the name and so on. The search
    algorithm would "walk" the trees in each set from the node with the
    proper starting letter. This would be very fast but it would require
    substantially more storage space than the original data. [One could get
    even more carried away if one uses a digraph or trigraph as the root
    node in each tree. B-)]


    *************
    There is nothing wrong with me that could
    not be cured by a capful of Drano.

    H. S. Lahman

    Pathfinder Solutions
    http://www.pathfindermda.com
    blog: http://pathfinderpeople.blogs.com/hslahman
    "Model-Based Translation: The Next Step in Agile Development". Email
    for your copy.
    Pathfinder is hiring:
    http://www.pathfindermda.com/about_us/careers_pos3.php.
    (888)OOA-PATH
    H. S. Lahman, Feb 13, 2007
    #12
  13. Timasmith

    Timasmith Guest

    On Feb 13, 7:27 am, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > Timasmithwrote:
    > > On Feb 13, 4:57 am, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > >> Timasmithwrote:
    > >>> Suppose I have a list of 10,000 items that you can order, each order
    > >>> item has an ID and a Name.
    > >>> If you type in a string and I want to find every item that CONTAINS
    > >>> that string, what is the most efficient way?
    > >> Just to clarify; you wish to detect your target string (e.g. "har")
    > >> within the Name of your item object?

    >
    > >> e.g. "harold" is a hit, "harry morse" is a hit, "mr haroldson" is a hit,
    > >> "nigel smith" is a miss?

    >
    > > Correct.

    >
    > OK. Any other constraints on alphabet (i.e. number of distinct characters
    > allowed for use in "Name"), or search targets (whole words?),
    > number of insertions to the item list compared with number of retrievals (searches)?
    >
    > BugBear


    No constraints on alphabet, most of the search targets are single
    words from 3-30 characters.

    Insertions is rare, the list does not change from day to day,
    retreivals is measured in thousands or even tens of thousands per hour.
    Timasmith, Feb 13, 2007
    #13
  14. Timasmith

    Daniel Pitts Guest

    On Feb 12, 7:25 pm, "Timasmith" <> wrote:
    > On Feb 12, 9:49 pm, Knute Johnson <>
    > wrote:
    >
    >
    >
    > > Timasmithwrote:
    > > > Suppose I have a list of 10,000 items that you can order, each order
    > > > item has an ID and a Name.

    >
    > > > If you type in a string and I want to find every item that CONTAINS
    > > > that string, what is the most efficient way?

    >
    > > > I happen to be using java but I am open to general solutions.

    >
    > > > Obviously I could just loop through the list and do a match on each
    > > > string but I cant help thinking one could be more efficient thatn
    > > > O(n).

    >
    > > > A traditional database btree index doesnt help, but I think search
    > > > engines must use something fast.

    >
    > > > thanks

    >
    > > > Tim

    >
    > > If your data isn't sorted just looping through it is the fastest way.

    >
    > > --

    >
    > > Knute Johnson
    > > email s/nospam/knute/- Hide quoted text -

    >
    > > - Show quoted text -

    >
    > Maybe. What if I constrained the search to say that the string must
    > have a least 2 characters. So I guess you could take all 10,000
    > entries, determine all of the sequences and put those into a
    > hashtable. So if I search on chair it would find all the indexed
    > entries with ch in the them, then look through the results. If <5% of
    > the entries have ch in them, I would imagine that would help.



    You could also look into a third-party indexing library.

    Check out: Lucene.
    or for a web service: Solr

    Lucene was developed by an (ex?) Google employee.
    Solr was developed at, and is used extensively at, CNET Networks.
    Daniel Pitts, Feb 13, 2007
    #14
  15. Timasmith

    Timasmith Guest

    On Feb 13, 12:07 pm, "H. S. Lahman" <> wrote:
    > Responding toTimasmith...
    >
    > > Suppose I have a list of 10,000 items that you can order, each order
    > > item has an ID and a Name.

    >
    > The key will be ordering, but for just 10K items one probably doesn't
    > want to get too elaborate.
    >
    > > If you type in a string and I want to find every item that CONTAINS
    > > that string, what is the most efficient way?

    >
    > I assume you mean in the Name rather than the ID. I also assume that the
    > data gets stored once but is searched often.
    >
    > Is there a limit on the lengths of the Names? What is the average
    > length? The answers to these questions will drive tailoring the
    > solutions below to trade-off speed vs. effort. Can you define the ID
    > (e.g., an index for the order the data items are stored)?
    >
    >
    >
    > > I happen to be using java but I am open to general solutions.

    >
    > > Obviously I could just loop through the list and do a match on each
    > > string but I cant help thinking one could be more efficient thatn
    > > O(n).

    >
    > I would be inclined to make use of a couple of practical
    > characteristics. In searching ASCII keys most rejections will occur in
    > the first few characters. (It has been many, many moons since I've done
    > searches so I forget the actual numbers for the distribution.) Also,
    > string matching algorithms for a given starting point in each string are
    > quite efficient.
    >
    > I think one "cheap" approach would be to keep an ordered list of
    > digraphs or trigraphs that appear anywhere in the Name. That list gets
    > updated whenever an item is added/removed from the data. The list would
    > contain multiple entries consisting of the item ID and the start
    > position of the digraph/trigraph in the Name. (Maybe the remaining
    > length if the search keys are likely to be long so that you can
    > eliminate embedded substrings that would extend beyond the Name length.)
    >
    > The data itself would be ordered in an array and the "ID" above would be
    > an index into the storage position (either the original ID or one you
    > create when storing the data). Your algorithm would find the starting
    > digraph/trigraph in the list above and "walk" the entries for that
    > digraph/trigraph. For each one it would do a lookup to get the data and
    > use a conventional substring matching algorithm for the full search
    > string at the indicated start point.
    >
    > There are even less elaborate versions. You can simply keep a list of
    > data items that contain the starting letter of the search string
    > somewhere and search just them. Then all you need is to store the data
    > so you can access items by an index referenced in the starting letter list.
    >
    > Bottom line: you have a trade-off between how fast you want to be and
    > how much work you want to put into storing the data items. As a fairly
    > general rule the search will be in direct proportional to how much space
    > you have for storing supporting infrastructure like index tables.
    >
    > For example, you could create 26 trees whose nodes were letters. The
    > root node in each tree would be one of the possible first letters of the
    > Name. Each child node would be a character that immediately followed the
    > parent as the next letter of the Name (pretty much like a B-Tree). Each
    > node would have a list of items that contained the substring of letters
    > leading to that node. Then you could create a similar set of trees where
    > the root nodes were the second letter of the name and so on. The search
    > algorithm would "walk" the trees in each set from the node with the
    > proper starting letter. This would be very fast but it would require
    > substantially more storage space than the original data. [One could get
    > even more carried away if one uses a digraph or trigraph as the root
    > node in each tree. B-)]
    >
    > *************
    > There is nothing wrong with me that could
    > not be cured by a capful of Drano.
    >
    > H. S. Lahman
    >
    > Pathfinder Solutionshttp://www.pathfindermda.com
    > blog:http://pathfinderpeople.blogs.com/hslahman
    > "Model-Based Translation: The Next Step in Agile Development". Email
    > for your copy.
    > Pathfinder is hiring:http://www.pathfindermda.com/about_us/careers_pos3.php.
    > (888)OOA-PATH



    So I can point you to a file which is near exactly what I would be
    searching.

    http://www.fda.gov/cder/ndc/ziptext.zip

    It is a list of medications. There are 50,000 which is a reasonable
    representation. Look at column G in listings.txt. If you take that
    it is 1.5MB in size.

    If I load up the items into an ArrayList and then search linearly for
    items matching WELLBUTRIN

    then do to 1000 searches takes 9-11 seconds. So 100 searches a second
    seems rather fast, maybe I am worrying about nothing.

    <pre>
    // build file
    String contents = FileUtil.getTextContents("C:\\temp\\a.txt");
    java.util.StringTokenizer st = new
    java.util.StringTokenizer(contents,"\n\r");
    ArrayList<String> list = new ArrayList<String>(50000);
    while (st.hasMoreTokens()) {
    list.add(st.nextToken());
    }

    // start timer
    System.out.println(Timer.getNow());
    String match = "WELLBUTRIN";
    ArrayList<Integer> matches = new ArrayList<Integer>();
    for (int i=0; i<1000; i++) {
    matches.clear();
    for(int j=0; j<list.size(); j++) {
    if (list.get(j).indexOf(match) > -1) {
    matches.add(j);
    }
    }
    }
    //stop timer
    System.out.println(Timer.getNow());
    //System.out.println(matches);
    </pre>
    Timasmith, Feb 13, 2007
    #15
  16. Timasmith

    Christian Guest

    Timasmith schrieb:
    > On Feb 13, 4:57 am, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    >> Timasmithwrote:
    >>> Suppose I have a list of 10,000 items that you can order, each order
    >>> item has an ID and a Name.
    >>> If you type in a string and I want to find every item that CONTAINS
    >>> that string, what is the most efficient way?

    >> Just to clarify; you wish to detect your target string (e.g. "har")
    >> within the Name of your item object?
    >>
    >> e.g. "harold" is a hit, "harry morse" is a hit, "mr haroldson" is a hit,
    >> "nigel smith" is a miss?
    >>
    >> BugBear

    >
    >
    > Correct.
    >


    search for ukkonen's algorithm ...
    its an algorithm for substring trees

    building is in O(n)
    storage needed in O(n)
    and searching in O(1)
    that may fit your needs ...
    Christian, Feb 13, 2007
    #16
  17. Timasmith

    Ron Jeffries Guest

    On 12 Feb 2007 18:27:21 -0800, "Timasmith" <>
    wrote:

    >Suppose I have a list of 10,000 items that you can order, each order
    >item has an ID and a Name.
    >
    >If you type in a string and I want to find every item that CONTAINS
    >that string, what is the most efficient way?
    >
    >I happen to be using java but I am open to general solutions.
    >
    >Obviously I could just loop through the list and do a match on each
    >string but I cant help thinking one could be more efficient thatn
    >O(n).
    >
    >A traditional database btree index doesnt help, but I think search
    >engines must use something fast.


    Before doing anything else, I'd suggest that you code up the simple
    search and time it. My guess is it'll fly.

    Ron Jeffries
    www.XProgramming.com
    Ron Jeffries, Feb 13, 2007
    #17
  18. Timasmith wrote:
    > On Feb 13, 12:07 pm, "H. S. Lahman" <> wrote:
    >> Responding toTimasmith...
    >>
    >>> Suppose I have a list of 10,000 items that you can order, each order
    >>> item has an ID and a Name.

    >> The key will be ordering, but for just 10K items one probably doesn't
    >> want to get too elaborate.
    >>
    >>> If you type in a string and I want to find every item that CONTAINS
    >>> that string, what is the most efficient way?

    >> I assume you mean in the Name rather than the ID. I also assume that the
    >> data gets stored once but is searched often.
    >>
    >> Is there a limit on the lengths of the Names? What is the average
    >> length? The answers to these questions will drive tailoring the
    >> solutions below to trade-off speed vs. effort. Can you define the ID
    >> (e.g., an index for the order the data items are stored)?
    >>
    >>
    >>
    >>> I happen to be using java but I am open to general solutions.
    >>> Obviously I could just loop through the list and do a match on each
    >>> string but I cant help thinking one could be more efficient thatn
    >>> O(n).

    >> I would be inclined to make use of a couple of practical
    >> characteristics. In searching ASCII keys most rejections will occur in
    >> the first few characters. (It has been many, many moons since I've done
    >> searches so I forget the actual numbers for the distribution.) Also,
    >> string matching algorithms for a given starting point in each string are
    >> quite efficient.
    >>
    >> I think one "cheap" approach would be to keep an ordered list of
    >> digraphs or trigraphs that appear anywhere in the Name. That list gets
    >> updated whenever an item is added/removed from the data. The list would
    >> contain multiple entries consisting of the item ID and the start
    >> position of the digraph/trigraph in the Name. (Maybe the remaining
    >> length if the search keys are likely to be long so that you can
    >> eliminate embedded substrings that would extend beyond the Name length.)
    >>
    >> The data itself would be ordered in an array and the "ID" above would be
    >> an index into the storage position (either the original ID or one you
    >> create when storing the data). Your algorithm would find the starting
    >> digraph/trigraph in the list above and "walk" the entries for that
    >> digraph/trigraph. For each one it would do a lookup to get the data and
    >> use a conventional substring matching algorithm for the full search
    >> string at the indicated start point.
    >>
    >> There are even less elaborate versions. You can simply keep a list of
    >> data items that contain the starting letter of the search string
    >> somewhere and search just them. Then all you need is to store the data
    >> so you can access items by an index referenced in the starting letter list.
    >>
    >> Bottom line: you have a trade-off between how fast you want to be and
    >> how much work you want to put into storing the data items. As a fairly
    >> general rule the search will be in direct proportional to how much space
    >> you have for storing supporting infrastructure like index tables.
    >>
    >> For example, you could create 26 trees whose nodes were letters. The
    >> root node in each tree would be one of the possible first letters of the
    >> Name. Each child node would be a character that immediately followed the
    >> parent as the next letter of the Name (pretty much like a B-Tree). Each
    >> node would have a list of items that contained the substring of letters
    >> leading to that node. Then you could create a similar set of trees where
    >> the root nodes were the second letter of the name and so on. The search
    >> algorithm would "walk" the trees in each set from the node with the
    >> proper starting letter. This would be very fast but it would require
    >> substantially more storage space than the original data. [One could get
    >> even more carried away if one uses a digraph or trigraph as the root
    >> node in each tree. B-)]
    >>
    >> *************
    >> There is nothing wrong with me that could
    >> not be cured by a capful of Drano.
    >>
    >> H. S. Lahman
    >>
    >> Pathfinder Solutionshttp://www.pathfindermda.com
    >> blog:http://pathfinderpeople.blogs.com/hslahman
    >> "Model-Based Translation: The Next Step in Agile Development". Email
    >> for your copy.
    >> Pathfinder is hiring:http://www.pathfindermda.com/about_us/careers_pos3.php.
    >> (888)OOA-PATH

    >
    >
    > So I can point you to a file which is near exactly what I would be
    > searching.
    >
    > http://www.fda.gov/cder/ndc/ziptext.zip
    >
    > It is a list of medications. There are 50,000 which is a reasonable
    > representation. Look at column G in listings.txt. If you take that
    > it is 1.5MB in size.
    >
    > If I load up the items into an ArrayList and then search linearly for
    > items matching WELLBUTRIN
    >
    > then do to 1000 searches takes 9-11 seconds. So 100 searches a second
    > seems rather fast, maybe I am worrying about nothing.
    >
    > <pre>
    > // build file
    > String contents = FileUtil.getTextContents("C:\\temp\\a.txt");
    > java.util.StringTokenizer st = new
    > java.util.StringTokenizer(contents,"\n\r");
    > ArrayList<String> list = new ArrayList<String>(50000);
    > while (st.hasMoreTokens()) {
    > list.add(st.nextToken());
    > }
    >
    > // start timer
    > System.out.println(Timer.getNow());
    > String match = "WELLBUTRIN";
    > ArrayList<Integer> matches = new ArrayList<Integer>();
    > for (int i=0; i<1000; i++) {
    > matches.clear();
    > for(int j=0; j<list.size(); j++) {
    > if (list.get(j).indexOf(match) > -1) {
    > matches.add(j);
    > }
    > }
    > }
    > //stop timer
    > System.out.println(Timer.getNow());
    > //System.out.println(matches);
    > </pre>
    >


    I'll bet if you do that in a String[] it will be even faster.

    --

    Knute Johnson
    email s/nospam/knute/
    Knute Johnson, Feb 14, 2007
    #18
  19. Timasmith

    David Segall Guest

    "Jeff Higgins" <> wrote:

    >
    >Timasmith wrote:
    >> Suppose I have a list of 10,000 items that you can order, each order
    >> item has an ID and a Name.
    >>
    >> If you type in a string and I want to find every item that CONTAINS
    >> that string, what is the most efficient way?
    >>
    >> I happen to be using java but I am open to general solutions.

    >
    >http://lucene.apache.org/java/docs/

    I don't think so. From the docs "In order to prevent extremely slow
    WildcardQueries, a Wildcard term should not start with one of the
    wildcards * or ?".
    David Segall, Feb 14, 2007
    #19
  20. Timasmith

    bugbear Guest

    Ron Jeffries wrote:
    >
    > Before doing anything else, I'd suggest that you code up the simple
    > search and time it. My guess is it'll fly.


    Agreed in principle, but it depends
    what "fast enough" means.

    For human interaction 1/100 of a second is good.
    For human interaction with multiple users (e.g. webserver)
    you might want more 1/1000.

    For the evaluation loop inside a genetic algorithm
    you might be aiming for micro seconds.

    BugBear
    bugbear, Feb 14, 2007
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Brent Minder
    Replies:
    3
    Views:
    390
    Brent
    Dec 28, 2003
  2. Anders
    Replies:
    4
    Views:
    494
    Greg Burns
    Jul 19, 2004
  3. Peter
    Replies:
    1
    Views:
    362
    Steve C. Orr [MVP, MCSD]
    Nov 9, 2004
  4. H.MuthuKumaraRajan
    Replies:
    3
    Views:
    426
    H.MuthuKumaraRajan
    Feb 4, 2004
  5. xkenneth
    Replies:
    8
    Views:
    329
    Bruno Desthuilliers
    Feb 6, 2008
Loading...

Share This Page