string matching algorithm

David Weldon · Nov 24, 2007

Problem: 1 million+ strings (Set A) need to be matched with 1 million+
substrings (Set B). For example:

Set A =
iliketotraveltohawaii
travelmagazine

Set B =
travel
hawaii

A(1,2) match "travel"
A(1) matches "hawaii"

What is the best approach to take with this problem? I have tried using
ferret:

http://www.ruby-forum.com/topic/132772

Indexing is fast, but search is very slow. I think ferret could be a
good choice if I had the substrings split out so iliketotraveltohawaii
-> "i like to travel to hawaii". I have a solution to this problem but
it can't be trusted with misspellings (that's an entirely different
forum topic). Is there something obvious that I'm missing here?

Axel Etzold · Nov 24, 2007

-------- Original-Nachricht --------

Datum: Sun, 25 Nov 2007 07:40:50 +0900
Von: David Weldon <[email protected]>
An: (e-mail address removed)
Betreff: string matching algorithm

Problem: 1 million+ strings (Set A) need to be matched with 1 million+
substrings (Set B). For example:

Set A =
iliketotraveltohawaii
travelmagazine

Set B =
travel
hawaii

A(1,2) match "travel"
A(1) matches "hawaii"

What is the best approach to take with this problem? I have tried using
ferret:

http://www.ruby-forum.com/topic/132772

Indexing is fast, but search is very slow. I think ferret could be a
good choice if I had the substrings split out so iliketotraveltohawaii
-> "i like to travel to hawaii". I have a solution to this problem but
it can't be trusted with misspellings (that's an entirely different
forum topic). Is there something obvious that I'm missing here?

Dear David,

just curious: how fast is Array#grep ?

a=%w[liketotraveltohawaii travelmagazine]
b=%w[travel hawaii]

b.each_with_index{|x,i|
if a.grep(x)
puts 'entry ' + i.to_s + ' in Array a matches ' + x
end
}

Best regards,

Axel

Axel Etzold · Nov 24, 2007

-------- Original-Nachricht --------

Datum: Sun, 25 Nov 2007 07:40:50 +0900
Von: David Weldon <[email protected]>
An: (e-mail address removed)
Betreff: string matching algorithm

Problem: 1 million+ strings (Set A) need to be matched with 1 million+
substrings (Set B). For example:

Set A =
iliketotraveltohawaii
travelmagazine

Set B =
travel
hawaii

A(1,2) match "travel"
A(1) matches "hawaii"

What is the best approach to take with this problem? I have tried using
ferret:

http://www.ruby-forum.com/topic/132772

Indexing is fast, but search is very slow. I think ferret could be a
good choice if I had the substrings split out so iliketotraveltohawaii
-> "i like to travel to hawaii". I have a solution to this problem but
it can't be trusted with misspellings (that's an entirely different
forum topic). Is there something obvious that I'm missing here?

Dear David,

sorry, the last post wasn't correct. I mean:

a=%w[liketotraveltohawaii travelmagazine]
b=%w[travel hawaii]

b.each_with_index{|x,i|
a.each{|y|
if y.grep(x)
puts 'entry ' + y + ' in Array a contains ' + x
end
}
}

Best regards,

Axel

Alex Young · Nov 25, 2007

David said:
Problem: 1 million+ strings (Set A) need to be matched with 1 million+
substrings (Set B). For example:

Set A =
iliketotraveltohawaii
travelmagazine

Set B =
travel
hawaii

A(1,2) match "travel"
A(1) matches "hawaii"

What is the best approach to take with this problem?

Bloom filters? It's the fastest thing I can think of.

David Weldon · Nov 25, 2007

I think I'm going to end up answering my own question here. I tried a
bloom filter kind of approach and various grep schemes. None of which
scaled well to large data sets. So far my best solution has been to use
Ferret and index on trigrams instead of unigrams like I was doing
before. That sped up my search by ~100x. I'm open to other ideas if
anyone has them, but for now this should be fast enough.

Algorithm	1	Dec 15, 2019
Algorithm for fuzzy string matching	1	Mar 23, 2011
[newbie] Recursive algorithm - review	5	Jan 4, 2014
Idk need help in editing this source code	0	Nov 5, 2022
[newbie] Recursive algorithm - review	5	Jan 4, 2014
Regexp, matching only the content within parentheses	11	Mar 24, 2011
String matching/comparing, statistical similarity	3	Jul 8, 2008
Count substrings in string, scan too slow	17	Jun 24, 2010

string matching algorithm

David Weldon

Axel Etzold

Axel Etzold

Alex Young

David Weldon

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads