string matching algorithm

Discussion in 'Ruby' started by David Weldon, Nov 24, 2007.

  1. David Weldon

    David Weldon Guest

    Problem: 1 million+ strings (Set A) need to be matched with 1 million+
    substrings (Set B). For example:

    Set A =
    iliketotraveltohawaii
    travelmagazine

    Set B =
    travel
    hawaii

    A(1,2) match "travel"
    A(1) matches "hawaii"

    What is the best approach to take with this problem? I have tried using
    ferret:

    http://www.ruby-forum.com/topic/132772

    Indexing is fast, but search is very slow. I think ferret could be a
    good choice if I had the substrings split out so iliketotraveltohawaii
    -> "i like to travel to hawaii". I have a solution to this problem but
    it can't be trusted with misspellings (that's an entirely different
    forum topic). Is there something obvious that I'm missing here?
    --
    Posted via http://www.ruby-forum.com/.
     
    David Weldon, Nov 24, 2007
    #1
    1. Advertising

  2. David Weldon

    Axel Etzold Guest

    -------- Original-Nachricht --------
    > Datum: Sun, 25 Nov 2007 07:40:50 +0900
    > Von: David Weldon <>
    > An:
    > Betreff: string matching algorithm


    > Problem: 1 million+ strings (Set A) need to be matched with 1 million+
    > substrings (Set B). For example:
    >
    > Set A =
    > iliketotraveltohawaii
    > travelmagazine
    >
    > Set B =
    > travel
    > hawaii
    >
    > A(1,2) match "travel"
    > A(1) matches "hawaii"
    >
    > What is the best approach to take with this problem? I have tried using
    > ferret:
    >
    > http://www.ruby-forum.com/topic/132772
    >
    > Indexing is fast, but search is very slow. I think ferret could be a
    > good choice if I had the substrings split out so iliketotraveltohawaii
    > -> "i like to travel to hawaii". I have a solution to this problem but
    > it can't be trusted with misspellings (that's an entirely different
    > forum topic). Is there something obvious that I'm missing here?
    > --
    > Posted via http://www.ruby-forum.com/.


    Dear David,

    just curious: how fast is Array#grep ?

    a=%w[liketotraveltohawaii travelmagazine]
    b=%w[travel hawaii]

    b.each_with_index{|x,i|
    if a.grep(x)
    puts 'entry ' + i.to_s + ' in Array a matches ' + x
    end
    }

    Best regards,

    Axel
    --
    GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS.
    Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail
     
    Axel Etzold, Nov 24, 2007
    #2
    1. Advertising

  3. David Weldon

    Axel Etzold Guest

    -------- Original-Nachricht --------
    > Datum: Sun, 25 Nov 2007 07:40:50 +0900
    > Von: David Weldon <>
    > An:
    > Betreff: string matching algorithm


    > Problem: 1 million+ strings (Set A) need to be matched with 1 million+
    > substrings (Set B). For example:
    >
    > Set A =
    > iliketotraveltohawaii
    > travelmagazine
    >
    > Set B =
    > travel
    > hawaii
    >
    > A(1,2) match "travel"
    > A(1) matches "hawaii"
    >
    > What is the best approach to take with this problem? I have tried using
    > ferret:
    >
    > http://www.ruby-forum.com/topic/132772
    >
    > Indexing is fast, but search is very slow. I think ferret could be a
    > good choice if I had the substrings split out so iliketotraveltohawaii
    > -> "i like to travel to hawaii". I have a solution to this problem but
    > it can't be trusted with misspellings (that's an entirely different
    > forum topic). Is there something obvious that I'm missing here?
    > --
    > Posted via http://www.ruby-forum.com/.


    Dear David,

    sorry, the last post wasn't correct. I mean:

    a=%w[liketotraveltohawaii travelmagazine]
    b=%w[travel hawaii]

    b.each_with_index{|x,i|
    a.each{|y|
    if y.grep(x)
    puts 'entry ' + y + ' in Array a contains ' + x
    end
    }
    }

    Best regards,

    Axel

    --
    Psssst! Schon vom neuen GMX MultiMessenger gehört?
    Der kann`s mit allen: http://www.gmx.net/de/go/multimessenger
     
    Axel Etzold, Nov 24, 2007
    #3
  4. David Weldon

    Alex Young Guest

    David Weldon wrote:
    > Problem: 1 million+ strings (Set A) need to be matched with 1 million+
    > substrings (Set B). For example:
    >
    > Set A =
    > iliketotraveltohawaii
    > travelmagazine
    >
    > Set B =
    > travel
    > hawaii
    >
    > A(1,2) match "travel"
    > A(1) matches "hawaii"
    >
    > What is the best approach to take with this problem?


    Bloom filters? It's the fastest thing I can think of.

    --
    Alex
     
    Alex Young, Nov 25, 2007
    #4
  5. David Weldon

    David Weldon Guest

    I think I'm going to end up answering my own question here. I tried a
    bloom filter kind of approach and various grep schemes. None of which
    scaled well to large data sets. So far my best solution has been to use
    Ferret and index on trigrams instead of unigrams like I was doing
    before. That sped up my search by ~100x. I'm open to other ideas if
    anyone has them, but for now this should be fast enough.
    --
    Posted via http://www.ruby-forum.com/.
     
    David Weldon, Nov 25, 2007
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Dan Peder Eriksen

    Maximum Weighted Matching Algorithm

    Dan Peder Eriksen, Aug 17, 2003, in forum: Java
    Replies:
    3
    Views:
    1,140
    Dan Peder Eriksen
    Aug 17, 2003
  2. =?ISO-8859-1?Q?Martin_J=F8rgensen?=
    Replies:
    5
    Views:
    1,302
    =?ISO-8859-1?Q?Martin_J=F8rgensen?=
    May 6, 2006
  3. Martin Hansen

    Algorithm for fuzzy string matching

    Martin Hansen, Mar 23, 2011, in forum: Ruby
    Replies:
    1
    Views:
    237
    Ryan Davis
    Mar 23, 2011
  4. Marc Bissonnette

    Pattern matching : not matching problem

    Marc Bissonnette, Jan 8, 2004, in forum: Perl Misc
    Replies:
    9
    Views:
    237
    Marc Bissonnette
    Jan 13, 2004
  5. Bobby Chamness
    Replies:
    2
    Views:
    231
    Xicheng Jia
    May 3, 2007
Loading...

Share This Page