Algorithm for fuzzy string matching

Discussion in 'Ruby' started by Martin Hansen, Mar 23, 2011.

  1. Hello all,


    Here is an attempt at a naive fuzzy string matching algorithm. I would
    appreciate comments on the code - which is not as simple as I would
    like.

    I have a couple of speed improvements in mind:
    1) replace EQUAL hash with array (possibly NArray).
    2) a path can be terminated before fully explored if it can be
    calculated that it will never complete (somewhere inside in_bounds?).


    Cheers,


    Martin



    #!/usr/bin/env ruby

    require 'pp'

    VERBOSE = false
    TEST = true

    # IUPAC nucleotide pair ambiguity equivalents
    EQUAL = {
    :AA => true, :BU => true, :TH => true, :UY => true,
    :TT => true, :CB => true, :UH => true, :SC => true,
    :CC => true, :GB => true, :VA => true, :SG => true,
    :GG => true, :TB => true, :VC => true, :CS => true,
    :UU => true, :UB => true, :VG => true, :GS => true,
    :NA => true, :DA => true, :AV => true, :WA => true,
    :NT => true, :DG => true, :CV => true, :WT => true,
    :NC => true, :DT => true, :GV => true, :WU => true,
    :NG => true, :DU => true, :KG => true, :AW => true,
    :NU => true, :AD => true, :KT => true, :TW => true,
    :AN => true, :GD => true, :KU => true, :UW => true,
    :TN => true, :TD => true, :GK => true, :RA => true,
    :CN => true, :UD => true, :TK => true, :RG => true,
    :GN => true, :HA => true, :UK => true, :AR => true,
    :UN => true, :HC => true, :YC => true, :GR => true,
    :NN => true, :HT => true, :YT => true, :MA => true,
    :BC => true, :HU => true, :YU => true, :MC => true,
    :BG => true, :AH => true, :CY => true, :AM => true,
    :BT => true, :CH => true, :TY => true, :CM => true,
    }

    # Class to match two nucleotide sequences allowing for ambiguity codes
    # and a given maximum number of matches, insertions, and deletions.
    class Matcher
    attr_accessor :seq1_index, :seq2_index, :matches, :mismatches,
    :insertions, :deletions

    def initialize(seq1, seq2, max_mismatches = 0, max_insertions = 0,
    max_deletions = 0, seq1_index = 0, seq2_index = 0, matches = 0,
    mismatches = 0, insertions = 0, deletions = 0)
    @seq1 = seq1
    @seq2 = seq2
    @max_mismatches = max_mismatches
    @max_insertions = max_insertions
    @max_deletions = max_deletions
    @seq1_index = seq1_index
    @seq2_index = seq2_index
    @matches = matches
    @mismatches = mismatches
    @insertions = insertions
    @deletions = deletions
    end

    # Method to explore all paths for matching two sequences allowing for
    # a given maximum number of mismatches, insertions and deletions.
    def match
    paths = []
    paths << self

    while not paths.empty?
    pp paths if VERBOSE

    new_paths = []

    paths.each do |path|
    if path.match? # ---- Match ----
    path_matches = path.dup
    path_matches.matches += 1

    return path_matches if path_matches.ok?

    path_matches.seq1_index += 1
    path_matches.seq2_index += 1

    new_paths << path_matches if path_matches.in_bounds?
    else # ---- Mismatch
    ----
    path_mismatches = path.dup
    path_mismatches.mismatches += 1

    return path_mismatches if path_mismatches.ok?

    path_mismatches.seq1_index += 1
    path_mismatches.seq2_index += 1

    new_paths << path_mismatches if path_mismatches.in_bounds?
    end

    if path.insertions < @max_insertions # ---- Insertion
    ----
    path_insertions = path.dup
    path_insertions.insertions += 1

    return path_insertions if path_insertions.ok?

    path_insertions.seq1_index += 1

    new_paths << path_insertions if path_insertions.in_bounds?
    end

    if path.deletions < @max_deletions # ---- Deletion
    ----
    path_deletions = path.dup
    path_deletions.deletions += 1

    return path_deletions if path_deletions.ok?

    path_deletions.seq2_index += 1

    new_paths << path_deletions if path_deletions.in_bounds?
    end
    end

    paths = new_paths
    end
    end

    # Method to check if residues match.
    def match?
    if self.seq1_index < @seq1.length and self.seq2_index < @seq2.length
    EQUAL[(@seq1[self.seq1_index].upcase +
    @seq2[self.seq2_index].upcase).to_sym]
    end
    end

    # Method to check if a path is complete.
    def ok?
    if self.mismatches <= @max_mismatches and self.insertions <=
    @max_insertions and self.deletions <= @max_deletions
    if @seq1.length == self.matches + self.mismatches +
    self.insertions and
    @seq2.length == self.matches + self.mismatches + self.deletions
    pp self if VERBOSE
    true
    end
    end
    end

    # Method to check if the path is within the search space.
    def in_bounds?
    if self.seq1_index <= @seq1.length and self.seq2_index <=
    @seq2.length
    true
    else
    false
    end
    end
    end

    if VERBOSE
    m = Matcher.new("atcg", "atcgX", mismatches=0, insertions=0,
    deletions=1)
    m.match
    end

    if TEST and __FILE__ == $PROGRAM_NAME
    require "test/unit"

    class TestMatcher < Test::Unit::TestCase
    def test_perfect_match_returns_ok
    m = Matcher.new("atcg", "atcg", mismatches=0, insertions=0,
    deletions=0)
    assert_not_nil(m.match)
    end

    def test_perfect_match_with_ambiguity_returns_ok
    m = Matcher.new("atcg", "NNNN", mismatches=0, insertions=0,
    deletions=0)
    assert_not_nil(m.match)
    end

    def test_one_mismatch_with_zero_allowed_returns_nil
    m = Matcher.new("atcg", "atGg", mismatches=0, insertions=0,
    deletions=0)
    assert_equal(nil, m.match)
    end

    def test_one_mismatch_with_one_allowed_returns_ok
    m = Matcher.new("atcg", "atGg", mismatches=1, insertions=0,
    deletions=0)
    assert_not_nil(m.match)
    end

    def test_two_mismatch_with_one_allowed_returns_nil
    m = Matcher.new("atcg", "GtGg", mismatches=1, insertions=0,
    deletions=0)
    assert_equal(nil, m.match)
    end

    def test_two_mismatch_with_two_allowed_returns_ok
    m = Matcher.new("atcg", "GtGg", mismatches=2, insertions=0,
    deletions=0)
    assert_not_nil(m.match)
    end

    def test_one_insertion_with_zero_allowed_returns_nil
    m = Matcher.new("atGcg", "atcg", mismatches=0, insertions=0,
    deletions=0)
    assert_equal(nil, m.match)
    end

    def test_one_insertion_with_one_allowed_returns_ok
    m = Matcher.new("atGcg", "atcg", mismatches=0, insertions=1,
    deletions=0)
    assert_not_nil(m.match)
    end

    def test_two_insertion_with_one_allowed_returns_nil
    m = Matcher.new("atGcgC", "atcg", mismatches=0, insertions=1,
    deletions=0)
    assert_equal(nil, m.match)
    end

    def test_two_insertion_with_two_allowed_returns_ok
    m = Matcher.new("atGcgC", "atcg", mismatches=0, insertions=2,
    deletions=0)
    assert_not_nil(m.match)
    end

    def test_one_deletion_with_zero_allowed_returns_nil
    m = Matcher.new("atcg", "atcAg", mismatches=0, insertions=0,
    deletions=0)
    assert_equal(nil, m.match)
    end

    def test_one_deletion_with_one_allowed_returns_ok
    m = Matcher.new("atcg", "atcAg", mismatches=0, insertions=0,
    deletions=1)
    assert_not_nil(m.match)
    end

    def test_two_deletion_with_one_allowed_returns_nil
    m = Matcher.new("atcg", "TatcAg", mismatches=0, insertions=0,
    deletions=1)
    assert_equal(nil, m.match)
    end

    def test_two_deletion_with_two_allowed_returns_ok
    m = Matcher.new("atcg", "AatcTg", mismatches=0, insertions=0,
    deletions=2)
    assert_not_nil(m.match)
    end

    def test_one_mismatch_one_insertions_one_deletion_returns_ok
    m = Matcher.new("TaGcg", "atcgA", mismatches=1, insertions=1,
    deletions=1)
    assert_not_nil(m.match)
    end
    end
    end

    --
    Posted via http://www.ruby-forum.com/.
     
    Martin Hansen, Mar 23, 2011
    #1
    1. Advertising

  2. Martin Hansen

    Ryan Davis Guest

    On Mar 23, 2011, at 06:45 , Martin Hansen wrote:

    > Hello all,
    >=20
    >=20
    > Here is an attempt at a naive fuzzy string matching algorithm. I would
    > appreciate comments on the code - which is not as simple as I would
    > like.
    >=20
    > I have a couple of speed improvements in mind:
    > 1) replace EQUAL hash with array (possibly NArray).


    That would generally be slower, depending on use. One thing you might =
    want to do is clean up the code is match:

    EQUAL[(@seq1[self.seq1_index].upcase +
    @seq2[self.seq2_index].upcase).to_sym]

    could be:

    EQUAL[:"#{@seq1[self.seq1_index]}#{@seq2[self.seq2_index]}"]

    You'll have to change your initializer to upcase the input:

    @seq1 =3D seq1.upcase
    @seq2 =3D seq2.upcase

    Looks like you'd get a better speed up using strings as your hash keys:

    # of iterations =3D 1000000
    user system total real
    null_time 0.130000 0.000000 0.130000 ( 0.131828)
    upcase + to_sym 21.320000 0.020000 21.340000 ( 21.366015)
    interpolated sym 14.760000 0.010000 14.770000 ( 14.796624)
    interpolated str 11.290000 0.020000 11.310000 ( 11.449326)

    > 2) a path can be terminated before fully explored if it can be
    > calculated that it will never complete (somewhere inside in_bounds?).


    Dunno... but you have a good test suite, so it is probably pretty easy =
    to write a test and fix it.

    Minor cleanup:

    EQUAL =3D Hash[%w(AA BU TH UY TT CB UH SC CC GB VA SG GG TB VC CS UU UB
    VG GS NA DA AV WA NT DG CV WT NC DT GV WU NG DU KG AW
    NU AD KT TW AN GD KU UW TN TD GK RA CN UD TK RG GN HA
    UK AR UN HC YC GR NN HT YT MA BC HU YU MC BG AH CY AM
    BT CH TY CM).map { |pair| [pair.to_sym, true] }]
     
    Ryan Davis, Mar 23, 2011
    #2
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Giovanni Azua
    Replies:
    3
    Views:
    636
    Ingo R. Homann
    Aug 9, 2005
  2. Andrew McLean

    Fuzzy matching of postal addresses

    Andrew McLean, Jan 18, 2005, in forum: Python
    Replies:
    18
    Views:
    715
    Joseph Turian
    Jan 24, 2005
  3. Tim Churches

    Re: Fuzzy matching of postal addresses

    Tim Churches, Jan 18, 2005, in forum: Python
    Replies:
    0
    Views:
    481
    Tim Churches
    Jan 18, 2005
  4. Tim Churches
    Replies:
    4
    Views:
    518
    Tim Churches
    Feb 20, 2005
  5. Phil Rhoades
    Replies:
    1
    Views:
    123
    Robert Klemme
    Jan 29, 2008
Loading...

Share This Page