"Fuzzy" matching of lines between CSV mailing lists

Phil Rhoades · Jan 29, 2008

People,

I am not sure if this is an appropriate place to ask this sort of
question - there are probably dozens of different solutions with greatly
varying amounts of time, effort and efficiency involved but since I like
doing things in Ruby, I thought I would ask the gurus here:

I periodically receive new mailing lists in CSV format and I have to
check for duplications of individual mailing addresses between the new
list and the current list. The problem is that there is no common format
in the new lists because they come from different organisations - one
list might have all data in capital letters, another might have last name
and only initials, another might have last name and first name, another
might have full state names and others a two-character field - there are
lots of variations. About the only thing that can be relied on (ignoring
case) is that the last name would be the same in both lists if there is
a duplication. If I want to pattern match from the new list to the
existing list, I have to be fairly flexible, i.e. it is better to get
false positives (because they can be quickly ignored by eyeballing) than
false negatives (someone is mailed twice in the new merged list).

Suggestions? Ideas? Should I just use the regular shell tools?

Thanks,

Phil.
--
Philip Rhoades

Pricom Pty Limited (ACN 003 252 275 ABN 91 003 252 275)
GPO Box 3411
Sydney NSW 2001
Australia
Fax: +61 (0)2-8221-9599
E-mail: (e-mail address removed)

Robert Klemme · Jan 29, 2008

<brainstorming>
Maybe a two step approach:

1. normalize data (e.g. strip all whitespace and punctuation, or simply
keep only letters and digits)

2. calculate something like the Hamming distance between every two
entries and flag those entries which have a distance less than a certain
threshold

Downside is that step 2 takes O(n*n)...
</brainstorming>
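
A minimal Ruby sketch of the two steps above, assuming each list has already been read into an array of rows (for example with CSV.read from Ruby's standard csv library). It swaps the Hamming distance for a Levenshtein edit distance, since normalised entries rarely have equal length. It also compares only rows that share a normalised last name, which leans on the original poster's statement that last names match and avoids most of the O(n*n) work. The helper names, last_name_col and max_ratio are hypothetical, and the threshold is a guess that would need tuning against real data:

```ruby
# Step 1: normalise text to lower-case letters and digits only.
def normalize(text)
  text.to_s.downcase.gsub(/[^a-z0-9]/, "")
end

# Levenshtein edit distance (swapped in for Hamming distance, which is
# only defined for strings of equal length).
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Step 2, restricted to rows sharing a last name.
# last_name_col and max_ratio are hypothetical parameters: the column that
# holds the last name, and the largest distance/length ratio still flagged.
def candidate_duplicates(current_rows, new_rows, last_name_col: 0, max_ratio: 0.5)
  by_last_name = current_rows.group_by { |row| normalize(row[last_name_col]) }
  new_rows.flat_map do |new_row|
    new_norm = normalize(new_row.join)
    matches = by_last_name[normalize(new_row[last_name_col])] || []
    matches.map do |old_row|
      old_norm = normalize(old_row.join)
      longest = [new_norm.length, old_norm.length, 1].max
      ratio = edit_distance(new_norm, old_norm).to_f / longest
      [new_row, old_row, ratio.round(2)] if ratio <= max_ratio
    end.compact
  end
end

# Made-up sample rows, for illustration only.
current  = [["Smith", "John", "12 High St", "New South Wales"],
            ["Jones", "Ann", "3 Low Rd", "VIC"]]
incoming = [["SMITH", "J.", "12 HIGH ST", "NSW"],
            ["Brown", "Bob", "9 Mid Ave", "QLD"]]

candidate_duplicates(current, incoming).each do |new_row, old_row, ratio|
  puts "#{new_row.join(', ')}  <->  #{old_row.join(', ')}  (#{ratio})"
end
```

Raising max_ratio to 1.0 flags every pair with the same last name, which errs on the side of false positives; the ratio can then be used just to sort the candidates for eyeballing.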

Kind regards

robert
