Most efficient algorithm for matching


Timasmith

Suppose I have a list of 10,000 items that you can order, each order
item has an ID and a Name.

If you type in a string and I want to find every item that CONTAINS
that string, what is the most efficient way?

I happen to be using java but I am open to general solutions.

Obviously I could just loop through the list and do a match on each
string, but I can't help thinking one could be more efficient than
O(n).

A traditional database B-tree index doesn't help, but I think search
engines must use something fast.

thanks

Tim
 

Knute Johnson

Timasmith said:
Suppose I have a list of 10,000 items that you can order, each order
item has an ID and a Name.

If you type in a string and I want to find every item that CONTAINS
that string, what is the most efficient way?

I happen to be using java but I am open to general solutions.

Obviously I could just loop through the list and do a match on each
string, but I can't help thinking one could be more efficient than
O(n).

A traditional database B-tree index doesn't help, but I think search
engines must use something fast.

thanks

Tim

If your data isn't sorted just looping through it is the fastest way.
 

Timasmith

Knute Johnson wrote:

If your data isn't sorted just looping through it is the fastest way.

--

Knute Johnson
email s/nospam/knute/

Maybe. What if I constrained the search to say that the string must
have at least 2 characters? Then I guess you could take all 10,000
entries, determine all of the two-character sequences and put those
into a hashtable. So if I search on "chair" it would find all the
indexed entries with "ch" in them, then look through the results. If
<5% of the entries have "ch" in them, I would imagine that would help.
 

Eric Sosman

Timasmith said:
Maybe. What if I constrained the search to say that the string must
have at least 2 characters? Then I guess you could take all 10,000
entries, determine all of the two-character sequences and put those
into a hashtable. So if I search on "chair" it would find all the
indexed entries with "ch" in them, then look through the results. If
<5% of the entries have "ch" in them, I would imagine that would help.

If you're doing just a few searches, a straight-line pass
over all the data is probably best.

If you're doing many searches, it may make sense to invest
a fairly large amount of work in initialization to make the
searches themselves run faster. Here's one way:

Map<String,Set<Item>> map = new HashMap<String,Set<Item>>();
for (Item item : itemCollection) {
    for (String key : item.setOfAllKeys()) {
        Set<Item> set = map.get(key);
        if (set == null) {
            set = new HashSet<Item>();   // Set is an interface; need a concrete class here
            map.put(key, set);
        }
        set.add(item);
    }
}

That is, you have a map that associates each substring with a
set of all the Items having that substring. Each Item appears
in as many sets as it has distinct substrings.
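At query time the lookup is then a single map access (continuing the
snippet above, and assuming the search string is normalised the same
way setOfAllKeys() normalises the names):

Set<Item> hits = map.get(searchString);
if (hits == null) {
    hits = Collections.<Item>emptySet();   // no item contains that substring
}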

It may be desirable to endow setOfAllKeys() with a little
more intelligence than a simple brute-force decomposition would
exhibit. If one of your Items is "LEFT HANDED MONKEY WRENCH"
there's probably not much to be gained by indexing it under "FT"
and "D M" and "AND" -- yet indexing under "WRENCHES" might be
a good idea, even though it's not a substring of the name.

See also "The Art of Computer Programming, Volume III:
Sorting and Searching" by D.E. Knuth. IIRC, one approach he
describes for this kind of search is to start with "LEFT HANDED
MONKEY WRENCH" and enter it in all of the lists LE, EF, FT, ...
DE, ED. Then given the search key "DED MONK" you would form
the intersection of the lists for DE, ED, ..., NK (keeping the
lists sorted would be helpful, or you could use bit vectors)
and finish by inspecting only those Items in the intersection.
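A rough sketch of that digraph-index-and-intersect idea in Java (the
class and method names and the normalisation below are illustrative
assumptions, not a reference implementation):

<pre>
import java.util.*;

class DigraphIndex {

    // digraph ("LE", "EF", ...) -> ids of every item whose name contains it
    private final Map<String, Set<Integer>> postings =
            new HashMap<String, Set<Integer>>();
    private final List<String> names;

    DigraphIndex(List<String> names) {
        this.names = names;
        for (int id = 0; id < names.size(); id++) {
            String n = names.get(id).toUpperCase();
            for (int i = 0; i + 2 <= n.length(); i++) {
                String dg = n.substring(i, i + 2);
                Set<Integer> set = postings.get(dg);
                if (set == null) {
                    set = new HashSet<Integer>();
                    postings.put(dg, set);
                }
                set.add(id);
            }
        }
    }

    /** Names containing the query; the query must be at least 2 characters. */
    List<String> search(String query) {
        String q = query.toUpperCase();
        if (q.length() < 2) {
            throw new IllegalArgumentException("need at least 2 characters");
        }
        Set<Integer> candidates = null;
        for (int i = 0; i + 2 <= q.length(); i++) {
            Set<Integer> ids = postings.get(q.substring(i, i + 2));
            if (ids == null) {                       // some digraph never occurs at all
                return new ArrayList<String>();
            }
            if (candidates == null) {
                candidates = new HashSet<Integer>(ids);
            } else {
                candidates.retainAll(ids);           // intersect the posting lists
            }
        }
        List<String> hits = new ArrayList<String>();
        for (int id : candidates) {
            if (names.get(id).toUpperCase().contains(q)) {   // weed out false positives
                hits.add(names.get(id));
            }
        }
        return hits;
    }
}
</pre>

With 10,000 short names the index builds almost instantly, and a query
only has to verify the (usually small) intersection of posting lists.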
 

Knute Johnson

Timasmith said:
Maybe. What if I constrained the search to say that the string must
have at least 2 characters? Then I guess you could take all 10,000
entries, determine all of the two-character sequences and put those
into a hashtable. So if I search on "chair" it would find all the
indexed entries with "ch" in them, then look through the results. If
<5% of the entries have "ch" in them, I would imagine that would help.

With 10,000 items, I doubt very much that you can beat a linear search
by very much with a complicated data structure. With 10,000,000 I can
see where it would be a different story. A linear search is pretty easy
to code. 10,000 objects of 1,000 bytes each only takes up 10 MB. Easy to
store in memory. 10 million items on the other hand would probably have
to be kept on some kind of permanent storage. Or you might want to use
some real database manager instead and use its searching capabilities.
 

bugbear

Timasmith said:
Suppose I have a list of 10,000 items that you can order, each order
item has an ID and a Name.

If you type in a string and I want to find every item that CONTAINS
that string, what is the most efficient way?

Just to clarify; you wish to detect your target string (e.g. "har")
within the Name of your item object?

e.g. "harold" is a hit, "harry morse" is a hit, "mr haroldson" is a hit,
"nigel smith" is a miss?

BugBear
 

Timasmith

bugbear wrote:

Just to clarify; you wish to detect your target string (e.g. "har")
within the Name of your item object?

e.g. "harold" is a hit, "harry morse" is a hit, "mr haroldson" is a hit,
"nigel smith" is a miss?

BugBear


Correct.
 

bugbear

Timasmith said:

OK. Any other constraints on alphabet (i.e. number of distinct characters
allowed for use in "Name"), or search targets (whole words?),
number of insertions to the item list compared with number of retrievals (searches)?

BugBear
 

Jeff Higgins

Timasmith said:
Suppose I have a list of 10,000 items that you can order, each order
item has an ID and a Name.

If you type in a string and I want to find every item that CONTAINS
that string, what is the most efficient way?

I happen to be using java but I am open to general solutions.
http://lucene.apache.org/java/docs/


Obviously I could just loop through the list and do a match on each
string, but I can't help thinking one could be more efficient than
O(n).

A traditional database B-tree index doesn't help, but I think search
engines must use something fast.

thanks

Tim
 

Chris Uppal

Timasmith said:
Suppose I have a list of 10,000 items that you can order, each order
item has an ID and a Name.

If you type in a string and I want to find every item that CONTAINS
that string, what is the most efficient way?

Efficient text searching is a /big/ subject. No, let me be more precise: it's
a /HUGE/ subject.

If your list of items is reasonably short, and the items themselves are
reasonably short, then you can do yourself a big favour by ignoring efficiency
and just searching linearly. A lot depends on the application, but I'm not
convinced that 10K items is enough to justify heavier-weight solutions.

Next up, there are efficient text searching algorithms, such as Boyer-Moore,
and the interesting newish breed of "bit-parallel" or "bitmap" searching
algorithms. If you have enough text, and if the overhead (in time, space, or
convenience) of maintaining an index is unacceptable, then looking into those
algorithms would be sensible. Note that there are variants of these
algorithms which do approximate matching. You might look at "agrep",
for instance.
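For what it's worth, here is a minimal sketch of the bit-parallel
("Shift-And") idea for a single pattern, assuming the pattern is at
most 64 characters long and uses only 8-bit characters:

<pre>
/** Bit-parallel ("Shift-And") search: index of the first occurrence of
    pattern in text, or -1.  Bit i of 'state' is set when pattern[0..i]
    matches the text ending at the current position. */
static int shiftAndSearch(String text, String pattern) {
    int m = pattern.length();
    if (m == 0 || m > 64) {
        throw new IllegalArgumentException("pattern length must be 1..64");
    }
    long[] charMask = new long[256];            // per-character bit masks
    for (int i = 0; i < m; i++) {
        char c = pattern.charAt(i);
        if (c >= 256) {
            throw new IllegalArgumentException("sketch only handles 8-bit pattern characters");
        }
        charMask[c] |= 1L << i;
    }
    long state = 0L;
    long accept = 1L << (m - 1);
    for (int pos = 0; pos < text.length(); pos++) {
        char c = text.charAt(pos);
        state = ((state << 1) | 1L) & (c < 256 ? charMask[c] : 0L);
        if ((state & accept) != 0) {
            return pos - m + 1;                 // full pattern matched, ending at pos
        }
    }
    return -1;
}
</pre>

The inner loop is a couple of word operations per text character
regardless of pattern length (up to the machine word size), which is
where these algorithms get their speed; the agrep family extends the
same bit trick to approximate matching.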

Also Wikipedia has quite a few pages on text searching.
http://en.wikipedia.org/wiki/Category:Search_algorithms

Lastly, if you need greater speed than such methods can provide (and search
engines /definitely/ do ;-) then you'd need to use some sort of indexing.

The simplest way to approach that is to ensure that your database has something
like "full text search and indexing" in its feature set -- and use that.

Alternatively, there are off-the-shelf packages for bulk text searching. Jeff
Higgins has already pointed you towards Apache Lucene, which is one such
package.

-- chris
 

H. S. Lahman

Responding to Timasmith...
Suppose I have a list of 10,000 items that you can order, each order
item has an ID and a Name.

The key will be ordering, but for just 10K items one probably doesn't
want to get too elaborate.
If you type in a string and I want to find every item that CONTAINS
that string, what is the most efficient way?

I assume you mean in the Name rather than the ID. I also assume that the
data gets stored once but is searched often.

Is there a limit on the lengths of the Names? What is the average
length? The answers to these questions will drive tailoring the
solutions below to trade-off speed vs. effort. Can you define the ID
(e.g., an index for the order the data items are stored)?
I happen to be using java but I am open to general solutions.

Obviously I could just loop through the list and do a match on each
string, but I can't help thinking one could be more efficient than
O(n).

I would be inclined to make use of a couple of practical
characteristics. In searching ASCII keys most rejections will occur in
the first few characters. (It has been many, many moons since I've done
searches so I forget the actual numbers for the distribution.) Also,
string matching algorithms for a given starting point in each string are
quite efficient.

I think one "cheap" approach would be to keep an ordered list of
digraphs or trigraphs that appear anywhere in the Name. That list gets
updated whenever an item is added/removed from the data. The list would
contain multiple entries consisting of the item ID and the start
position of the digraph/trigraph in the Name. (Maybe the remaining
length if the search keys are likely to be long so that you can
eliminate embedded substrings that would extend beyond the Name length.)

The data itself would be ordered in an array and the "ID" above would be
an index into the storage position (either the original ID or one you
create when storing the data). Your algorithm would find the starting
digraph/trigraph in the list above and "walk" the entries for that
digraph/trigraph. For each one it would do a lookup to get the data and
use a conventional substring matching algorithm for the full search
string at the indicated start point.

There are even less elaborate versions. You can simply keep a list of
data items that contain the starting letter of the search string
somewhere and search just them. Then all you need is to store the data
so you can access items by an index referenced in the starting letter list.

Bottom line: you have a trade-off between how fast you want to be and
how much work you want to put into storing the data items. As a fairly
general rule, the search speed will be in direct proportion to how much
space you allow for supporting infrastructure like index tables.

For example, you could create 26 trees whose nodes were letters. The
root node in each tree would be one of the possible first letters of the
Name. Each child node would be a character that immediately followed the
parent as the next letter of the Name (pretty much like a B-Tree). Each
node would have a list of items that contained the substring of letters
leading to that node. Then you could create a similar set of trees where
the root nodes were the second letter of the name and so on. The search
algorithm would "walk" the trees in each set from the node with the
proper starting letter. This would be very fast but it would require
substantially more storage space than the original data. [One could get
even more carried away if one uses a digraph or trigraph as the root
node in each tree. B-)]
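A rough sketch of that letter-tree idea, collapsed into a single trie
built from every starting position of every Name (the depth cut-off and
the class and method names are assumptions made purely for illustration):

<pre>
import java.util.*;

class SubstringTrie {

    private static class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        final Set<Integer> items = new HashSet<Integer>();  // ids of Names containing this path
    }

    private static final int MAX_DEPTH = 8;  // bounds memory; longer queries need a follow-up scan
    private final Node root = new Node();

    void add(int itemId, String name) {
        String n = name.toUpperCase();
        for (int start = 0; start < n.length(); start++) {   // one path per starting position
            Node node = root;
            for (int i = start; i < n.length() && i - start < MAX_DEPTH; i++) {
                Character c = Character.valueOf(n.charAt(i));
                Node child = node.children.get(c);
                if (child == null) {
                    child = new Node();
                    node.children.put(c, child);
                }
                child.items.add(itemId);
                node = child;
            }
        }
    }

    /** Ids of every item whose Name contains the query (query length <= MAX_DEPTH). */
    Set<Integer> search(String query) {
        Node node = root;
        String q = query.toUpperCase();
        for (int i = 0; i < q.length(); i++) {
            node = node.children.get(Character.valueOf(q.charAt(i)));
            if (node == null) {
                return Collections.<Integer>emptySet();
            }
        }
        return node.items;
    }
}
</pre>

As noted above, memory is the cost: every Name contributes one path per
starting position, so the structure is substantially larger than the data.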


*************
There is nothing wrong with me that could
not be cured by a capful of Drano.

H. S. Lahman
(e-mail address removed)
Pathfinder Solutions
http://www.pathfindermda.com
blog: http://pathfinderpeople.blogs.com/hslahman
"Model-Based Translation: The Next Step in Agile Development". Email
(e-mail address removed) for your copy.
Pathfinder is hiring:
http://www.pathfindermda.com/about_us/careers_pos3.php.
(888)OOA-PATH
 

Timasmith

bugbear wrote:

OK. Any other constraints on alphabet (i.e. number of distinct characters
allowed for use in "Name"), or search targets (whole words?),
number of insertions to the item list compared with number of retrievals (searches)?

BugBear

No constraints on alphabet; most of the search targets are single
words of 3-30 characters.

Insertions are rare, the list does not change from day to day;
retrievals are measured in thousands or even tens of thousands per hour.
 

Daniel Pitts

Maybe. What if I constrained the search to say that the string must
have at least 2 characters? Then I guess you could take all 10,000
entries, determine all of the two-character sequences and put those
into a hashtable. So if I search on "chair" it would find all the
indexed entries with "ch" in them, then look through the results. If
<5% of the entries have "ch" in them, I would imagine that would help.


You could also look into a third-party indexing library.

Check out Lucene, or for a web service, Solr.

Lucene was developed by an (ex?) Google employee.
Solr was developed at, and is used extensively at, CNET Networks.
 

Timasmith

Responding to Timasmith...
Suppose I have a list of 10,000 items that you can order, each order
item has an ID and a Name.

The key will be ordering, but for just 10K items one probably doesn't
want to get too elaborate.
If you type in a string and I want to find every item that CONTAINS
that string, what is the most efficient way?

I assume you mean in the Name rather than the ID. I also assume that the
data gets stored once but is searched often.

Is there a limit on the lengths of the Names? What is the average
length? The answers to these questions will drive tailoring the
solutions below to trade-off speed vs. effort. Can you define the ID
(e.g., an index for the order the data items are stored)?


I happen to be using java but I am open to general solutions.
Obviously I could just loop through the list and do a match on each
string, but I can't help thinking one could be more efficient than
O(n).

I would be inclined to make use of a couple of practical
characteristics. In searching ASCII keys most rejections will occur in
the first few characters. (It has been many, many moons since I've done
searches so I forget the actual numbers for the distribution.) Also,
string matching algorithms for a given starting point in each string are
quite efficient.

I think one "cheap" approach would be to keep an ordered list of
digraphs or trigraphs that appear anywhere in the Name. That list gets
updated whenever an item is added/removed from the data. The list would
contain multiple entries consisting of the item ID and the start
position of the digraph/trigraph in the Name. (Maybe the remaining
length if the search keys are likely to be long so that you can
eliminate embedded substrings that would extend beyond the Name length.)

The data itself would be ordered in an array and the "ID" above would be
an index into the storage position (either the original ID or one you
create when storing the data). Your algorithm would find the starting
digraph/trigraph in the list above and "walk" the entries for that
digraph/trigraph. For each one it would do a lookup to get the data and
use a conventional substring matching algorithm for the full search
string at the indicated start point.

There are even less elaborate versions. You can simply keep a list of
data items that contain the starting letter of the search string
somewhere and search just them. Then all you need is to store the data
so you can access items by an index referenced in the starting letter list.

Bottom line: you have a trade-off between how fast you want to be and
how much work you want to put into storing the data items. As a fairly
general rule, the search speed will be in direct proportion to how much
space you allow for supporting infrastructure like index tables.

For example, you could create 26 trees whose nodes were letters. The
root node in each tree would be one of the possible first letters of the
Name. Each child node would be a character that immediately followed the
parent as the next letter of the Name (pretty much like a B-Tree). Each
node would have a list of items that contained the substring of letters
leading to that node. Then you could create a similar set of trees where
the root nodes were the second letter of the name and so on. The search
algorithm would "walk" the trees in each set from the node with the
proper starting letter. This would be very fast but it would require
substantially more storage space than the original data. [One could get
even more carried away if one uses a digraph or trigraph as the root
node in each tree. B-)]

*************
There is nothing wrong with me that could
not be cured by a capful of Drano.

H. S. Lahman
(e-mail address removed)
Pathfinder Solutions http://www.pathfindermda.com
blog: http://pathfinderpeople.blogs.com/hslahman
"Model-Based Translation: The Next Step in Agile Development". Email
(e-mail address removed) for your copy.
Pathfinder is hiring: http://www.pathfindermda.com/about_us/careers_pos3.php.
(888)OOA-PATH


So I can point you to a file which is very close to what I would be
searching:

http://www.fda.gov/cder/ndc/ziptext.zip

It is a list of medications. There are 50,000 entries, which is a
reasonable representation. Look at column G in listings.txt; if you
take just that column it is 1.5 MB in size.

If I load up the items into an ArrayList and then search linearly for
items matching WELLBUTRIN, 1000 searches take 9-11 seconds. So 100
searches a second seems rather fast; maybe I am worrying about nothing.

<pre>
// build the list from the file
String contents = FileUtil.getTextContents("C:\\temp\\a.txt");
java.util.StringTokenizer st =
        new java.util.StringTokenizer(contents, "\n\r");
ArrayList<String> list = new ArrayList<String>(50000);
while (st.hasMoreTokens()) {
    list.add(st.nextToken());
}

// start timer
System.out.println(Timer.getNow());
String match = "WELLBUTRIN";
ArrayList<Integer> matches = new ArrayList<Integer>();
for (int i = 0; i < 1000; i++) {
    matches.clear();
    for (int j = 0; j < list.size(); j++) {
        if (list.get(j).indexOf(match) > -1) {
            matches.add(j);
        }
    }
}
// stop timer
System.out.println(Timer.getNow());
//System.out.println(matches);
</pre>
 

Christian

Timasmith said:

search for Ukkonen's algorithm ...
it's an algorithm for building suffix trees

building is O(n)
storage needed is O(n)
and searching is O(m) for a pattern of length m (plus reporting the matches)
that may fit your needs ...
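A suffix tree gives those bounds, but Ukkonen's construction is a fair
amount of code; for comparison, here is a hedged sketch of the closely
related suffix-array idea (O(m log n) per lookup, and the class and
method names below are illustrative assumptions only):

<pre>
import java.util.*;

class SuffixArrayIndex {
    private final String text;     // all names concatenated, separated by '\u0001'
    private final int[] owner;     // owner[i] = item id owning character i, or -1 for a separator
    private final Integer[] sa;    // suffix start positions, sorted lexicographically

    SuffixArrayIndex(List<String> names) {
        StringBuilder sb = new StringBuilder();
        List<Integer> own = new ArrayList<Integer>();
        for (int id = 0; id < names.size(); id++) {
            String n = names.get(id).toUpperCase();
            for (int i = 0; i < n.length(); i++) {
                sb.append(n.charAt(i));
                own.add(id);
            }
            sb.append('\u0001');                 // separator stops matches crossing name boundaries
            own.add(-1);
        }
        text = sb.toString();
        owner = new int[own.size()];
        for (int i = 0; i < owner.length; i++) {
            owner[i] = own.get(i);
        }
        sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) {
            sa[i] = i;
        }
        // Plain comparison sort of the suffixes: not the O(n) construction of a
        // suffix tree, but simple and fast enough for ~10,000 short names.
        Arrays.sort(sa, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                int i = a, j = b;
                while (i < text.length() && j < text.length()) {
                    int d = text.charAt(i) - text.charAt(j);
                    if (d != 0) return d;
                    i++; j++;
                }
                return (text.length() - a) - (text.length() - b);  // shorter suffix first
            }
        });
    }

    /** Ids of all items whose name contains the query as a substring. */
    Set<Integer> search(String query) {
        String q = query.toUpperCase();
        int lo = 0, hi = sa.length;
        while (lo < hi) {                        // lower bound of the block of suffixes starting with q
            int mid = (lo + hi) >>> 1;
            if (compareSuffixToQuery(sa[mid], q) < 0) lo = mid + 1;
            else hi = mid;
        }
        Set<Integer> hits = new TreeSet<Integer>();
        for (int i = lo; i < sa.length && text.startsWith(q, sa[i]); i++) {
            if (owner[sa[i]] >= 0) hits.add(owner[sa[i]]);
        }
        return hits;
    }

    /** Negative if the suffix sorts before q, zero if it starts with q, positive otherwise. */
    private int compareSuffixToQuery(int start, String q) {
        for (int i = 0; i < q.length(); i++) {
            if (start + i >= text.length()) return -1;
            int d = text.charAt(start + i) - q.charAt(i);
            if (d != 0) return d;
        }
        return 0;
    }
}
</pre>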
 

Ron Jeffries

Suppose I have a list of 10,000 items that you can order, each order
item has an ID and a Name.

If you type in a string and I want to find every item that CONTAINS
that string, what is the most efficient way?

I happen to be using java but I am open to general solutions.

Obviously I could just loop through the list and do a match on each
string, but I can't help thinking one could be more efficient than
O(n).

A traditional database B-tree index doesn't help, but I think search
engines must use something fast.

Before doing anything else, I'd suggest that you code up the simple
search and time it. My guess is it'll fly.

Ron Jeffries
www.XProgramming.com
 

Knute Johnson

Timasmith said:
Responding to Timasmith...
Suppose I have a list of 10,000 items that you can order, each order
item has an ID and a Name.
The key will be ordering, but for just 10K items one probably doesn't
want to get too elaborate.
If you type in a string and I want to find every item that CONTAINS
that string, what is the most efficient way?
I assume you mean in the Name rather than the ID. I also assume that the
data gets stored once but is searched often.

Is there a limit on the lengths of the Names? What is the average
length? The answers to these questions will drive tailoring the
solutions below to trade-off speed vs. effort. Can you define the ID
(e.g., an index for the order the data items are stored)?


I happen to be using java but I am open to general solutions.
Obviously I could just loop through the list and do a match on each
string, but I can't help thinking one could be more efficient than
O(n).
I would be inclined to make use of a couple of practical
characteristics. In searching ASCII keys most rejections will occur in
the first few characters. (It has been many, many moons since I've done
searches so I forget the actual numbers for the distribution.) Also,
string matching algorithms for a given starting point in each string are
quite efficient.

I think one "cheap" approach would be to keep an ordered list of
digraphs or trigraphs that appear anywhere in the Name. That list gets
updated whenever an item is added/removed from the data. The list would
contain multiple entries consisting of the item ID and the start
position of the digraph/trigraph in the Name. (Maybe the remaining
length if the search keys are likely to be long so that you can
eliminate embedded substrings that would extend beyond the Name length.)

The data itself would be ordered in an array and the "ID" above would be
an index into the storage position (either the original ID or one you
create when storing the data). Your algorithm would find the starting
digraph/trigraph in the list above and "walk" the entries for that
digraph/trigraph. For each one it would do a lookup to get the data and
use a conventional substring matching algorithm for the full search
string at the indicated start point.

There are even less elaborate versions. You can simply keep a list of
data items that contain the starting letter of the search string
somewhere and search just them. Then all you need is to store the data
so you can access items by an index referenced in the starting letter list.

Bottom line: you have a trade-off between how fast you want to be and
how much work you want to put into storing the data items. As a fairly
general rule, the search speed will be in direct proportion to how much
space you allow for supporting infrastructure like index tables.

For example, you could create 26 trees whose nodes were letters. The
root node in each tree would be one of the possible first letters of the
Name. Each child node would be a character that immediately followed the
parent as the next letter of the Name (pretty much like a B-Tree). Each
node would have a list of items that contained the substring of letters
leading to that node. Then you could create a similar set of trees where
the root nodes were the second letter of the name and so on. The search
algorithm would "walk" the trees in each set from the node with the
proper starting letter. This would be very fast but it would require
substantially more storage space than the original data. [One could get
even more carried away if one uses a digraph or trigraph as the root
node in each tree. B-)]

*************
There is nothing wrong with me that could
not be cured by a capful of Drano.

H. S. Lahman
(e-mail address removed)
Pathfinder Solutions http://www.pathfindermda.com
blog: http://pathfinderpeople.blogs.com/hslahman
"Model-Based Translation: The Next Step in Agile Development". Email
(e-mail address removed) for your copy.
Pathfinder is hiring: http://www.pathfindermda.com/about_us/careers_pos3.php.
(888)OOA-PATH


So I can point you to a file which is very close to what I would be
searching:

http://www.fda.gov/cder/ndc/ziptext.zip

It is a list of medications. There are 50,000 entries, which is a
reasonable representation. Look at column G in listings.txt; if you
take just that column it is 1.5 MB in size.

If I load up the items into an ArrayList and then search linearly for
items matching WELLBUTRIN, 1000 searches take 9-11 seconds. So 100
searches a second seems rather fast; maybe I am worrying about nothing.

<pre>
// build the list from the file
String contents = FileUtil.getTextContents("C:\\temp\\a.txt");
java.util.StringTokenizer st =
        new java.util.StringTokenizer(contents, "\n\r");
ArrayList<String> list = new ArrayList<String>(50000);
while (st.hasMoreTokens()) {
    list.add(st.nextToken());
}

// start timer
System.out.println(Timer.getNow());
String match = "WELLBUTRIN";
ArrayList<Integer> matches = new ArrayList<Integer>();
for (int i = 0; i < 1000; i++) {
    matches.clear();
    for (int j = 0; j < list.size(); j++) {
        if (list.get(j).indexOf(match) > -1) {
            matches.add(j);
        }
    }
}
// stop timer
System.out.println(Timer.getNow());
//System.out.println(matches);
</pre>

I'll bet if you do that in a String[] it will be even faster.
 

bugbear

Ron said:
Before doing anything else, I'd suggest that you code up the simple
search and time it. My guess is it'll fly.

Agreed in principle, but it depends
what "fast enough" means.

For human interaction 1/100 of a second is good.
For human interaction with multiple users (e.g. a webserver)
you might want more like 1/1000.

For the evaluation loop inside a genetic algorithm
you might be aiming for microseconds.

BugBear
 
