"Fuzzy" matching of lines between CSV mailing lists

Discussion in 'Ruby' started by Phil Rhoades, Jan 29, 2008.

  1. Phil Rhoades

    Phil Rhoades Guest

    People,

    I am not sure if this is an appropriate place to ask this sort of
    question - there are probably dozens of different solutions with greatly
    varying amounts of time, effort and efficiency involved but since I like
    doing things in Ruby, I thought I would ask the gurus here:

    I periodically receive new mailing lists in CSV format and I have to
    check for duplications of individual mailing addresses in the new list
    and the current list. The problem is that there is no common format in
    the new lists because they come from different organisations - one list
    might have all data in capital letters, another might have last name and
    only initials, another might have last name and first name, another
    might have full state names and others a two character field - there are
    lots of variations. About the only thing that can be relied on (ignoring
    case) is that the last name would be the same in both lists if there is
    a duplication. If I want to pattern match from the new list to the
    existing list, I have to be fairly flexible, i.e. it is better to get false
    positives (because they can be quickly ignored by eyeballing) than false
    negatives (someone is mailed twice in the new merged list).

    Suggestions? ideas? Should I just use the regular shell tools?

    Thanks,

    Phil.
    --
    Philip Rhoades

    Pricom Pty Limited (ACN 003 252 275 ABN 91 003 252 275)
    GPO Box 3411
    Sydney NSW 2001
    Australia
    Fax: +61:(0)2-8221-9599
    E-mail:
     
    Phil Rhoades, Jan 29, 2008
    #1

  2. On 29.01.2008 19:03, Phil Rhoades wrote:
    > People,
    >
    > I am not sure if this is an appropriate place to ask this sort of
    > question - there are probably dozens of different solutions with greatly
    > varying amounts of time, effort and efficiency involved but since I like
    > doing things in Ruby, I thought I would ask the gurus here:
    >
    > I periodically receive new mailing lists in CSV format and I have to
    > check for duplications of individual mailing addresses in the new list
    > and the current list. The problem is that there is no common format in
    > the new lists because they come from different organisations - one list
    > might have all data in capital letters, another might have last name and
    > only initials, another might have last name and first name, another
    > might have full state names and others a two character field - there are
    > lots of variations. About the only thing that can be relied on (ignoring
    > case) is that the last name would be the same in both lists if there is
    > a duplication. If I want to pattern match from the new list to the
    > existing list, I have to be fairly flexible, i.e. it is better to get false
    > positives (because they can be quickly ignored by eyeballing) than false
    > negatives (someone is mailed twice in the new merged list).
    >
    > Suggestions? ideas? Should I just use the regular shell tools?


    <brainstorming>
    Maybe a two step approach:

    1. normalize data (e.g. strip all whitespace and punctuation, or keep
    only letters and digits)

    2. calculate something like the hamming distance between every two
    entries and flag those entries which have a distance less than a certain
    threshold

    Downside is that step 2 takes O(n*n)...
    </brainstorming>

    Kind regards

    robert
     
    Robert Klemme, Jan 29, 2008
    #2
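
The two-step idea above could be sketched in Ruby roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the column names `:last_name` and `:street` are hypothetical, the threshold of 3 is arbitrary, and Levenshtein (edit) distance is swapped in for the Hamming distance because Hamming distance is only defined for strings of equal length. Since the last name is said to match (ignoring case) whenever there is a duplication, the sketch only compares rows that share a normalized last name, which avoids comparing every entry with every other entry.

```ruby
# Step 1: normalize a field by downcasing and keeping only letters and digits.
def normalize(text)
  text.to_s.downcase.gsub(/[^a-z0-9]/, "")
end

# Levenshtein (edit) distance between two strings, computed row by row.
# Used here instead of Hamming distance, which needs equal-length strings.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Step 2: flag [new_row, old_row] pairs that look like the same address.
# Rows are hashes with the hypothetical keys :last_name and :street.
# Only rows with the same normalized last name are compared.
def candidate_duplicates(current_rows, new_rows, threshold: 3)
  by_last_name = current_rows.group_by { |row| normalize(row[:last_name]) }
  new_rows.flat_map do |new_row|
    bucket = by_last_name.fetch(normalize(new_row[:last_name]), [])
    new_street = normalize(new_row[:street])
    bucket
      .select { |old_row| edit_distance(normalize(old_row[:street]), new_street) <= threshold }
      .map { |old_row| [new_row, old_row] }
  end
end
```

Because false positives are cheaper than false negatives here, the threshold can be raised until the flagged list is still short enough to check by eye. Rows in this shape could probably be loaded with the standard `csv` library, for example `CSV.read(path, headers: true, header_converters: :symbol).map(&:to_h)`, provided each list's columns are first mapped to common names.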