Can regular expressions be used to choose among several imperfect matches?

Ted Byers

What I mean is this:

Imagine I have two hashes, one with a name as the key and an integer
ID as the value. The names are guaranteed to be unique and correct.
The second hash also has names as the key (and the value in this one
doesn't matter), but being manually typed they are not guaranteed to
be correct. Several keys in the second hash may even correspond to
a given key in the first hash. This many-to-one potential arises
because typos (and different abbreviations) can alter a given string
in different ways.

A major complication is that, because the data in the second hash
comes from a different feed using a different protocol, it is
guaranteed that there will never be a perfect match between any key in
the first hash and any key in the second hash. The only guarantee
regarding the data in the second hash is that each of its keys
corresponds to exactly one key in the first. One pattern
we see a lot is that in some cases, the name string includes
whitespace between the names provided, while in others there is no
whitespace: so FredEdwardSmith would need to be recognized as the same
as Fred Edward Smith. Another pattern includes an arbitrary number of
digits before the name, after it, or both. And then there are issues with
different spelling conventions (e.g. color vs colour) and regular
typos (e.g. FredEdwardSmyth).

The problem is to create a hash that maps all keys in the second hash
to the ID used as the value in the first hash. This is in a context
where nothing is known until run time: at run time, both sets of data
have been loaded into a DB, and our script retrieves the data from
there. This data is dynamic so there is little chance of seeing the
same data twice (but the second data feed changes much more frequently
than the first).

Now, when we actually look at the data ourselves, it is obvious which
correct name applies to the names from the second feed. Our problem
is how to make a script that is as good at seeing correct matches
between the first and second sets of data as the human eye is.

My first thought was to use regular expressions for this, but nothing
I have read so far sheds light on how to use them on imperfect data.
Are regular expressions able to deal with this, or is there a Perl
module that is better suited to this problem?

Thanks

Ted
 
xhoster

Ted Byers said:
> One pattern we see a lot is that in some cases, the name string
> includes whitespace between the names provided, while in others there
> is no whitespace: so FredEdwardSmith would need to be recognized as
> the same as Fred Edward Smith. Another pattern includes an arbitrary
> number of digits before the name, after or both.

Canonicalize the data by eliminating all whitespace and leading/trailing
digits.
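
A minimal sketch of that canonicalization, using only core Perl. The helper name canon, the sample names and IDs, and the decision to fold case are all assumptions made for illustration; nothing here comes from the real feeds.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a name to a canonical comparison key.
sub canon {
    my ($name) = @_;
    my $key = lc $name;    # fold case (an assumption, not stated in the thread)
    $key =~ s/\s+//g;      # drop all whitespace
    $key =~ s/^\d+//;      # drop leading digits
    $key =~ s/\d+$//;      # drop trailing digits
    return $key;
}

# Made-up sample data standing in for the two hashes.
my %id_for = ('Fred Edward Smith' => 17, 'Mary Ann Jones' => 42);
my @feed   = ('FredEdwardSmith', '0042 Mary Ann Jones 7', '12fred edward smith');

# Index the trusted names by their canonical key.
my %id_by_canon;
$id_by_canon{ canon($_) } = $id_for{$_} for keys %id_for;

# Map each feed name to an ID when its canonical key is known.
my %feed_to_id;
for my $name (@feed) {
    my $key = canon($name);
    $feed_to_id{$name} = $id_by_canon{$key} if exists $id_by_canon{$key};
}

printf "%-24s => %s\n", $_, $feed_to_id{$_} for sort keys %feed_to_id;
```

Names that still have no entry in %feed_to_id after this pass are the ones that need approximate matching.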

> And then there are issues with different spelling conventions (e.g.
> color vs colour) and regular typos (e.g. FredEdwardSmyth).

For this, maybe String::Approx or the other modules discussed in the
perldoc for String::Approx.
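
To show the general idea without depending on any CPAN module, here is a rough sketch that swaps in a hand-rolled Levenshtein edit distance in place of String::Approx. The helper names edit_distance and best_match, the sample names, and the 0.2 edits-per-character cutoff are assumptions chosen for illustration; a real threshold would need tuning against actual data.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein edit distance between two strings (hand-rolled stand-in
# for what an approximate-matching module would provide).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,            # deletion
                           $cur[$j - 1] + 1,         # insertion
                           $prev[$j - 1] + $cost);   # substitution or match
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: return the closest candidate, or nothing when the
# best score is tied or exceeds $max_ratio edits per typed character.
sub best_match {
    my ($typed, $candidates, $max_ratio) = @_;
    my ($best, $best_d, $tied);
    for my $cand (@$candidates) {
        my $d = edit_distance($typed, $cand);
        if (!defined $best_d || $d < $best_d) {
            ($best, $best_d, $tied) = ($cand, $d, 0);
        }
        elsif ($d == $best_d) {
            $tied = 1;
        }
    }
    return unless defined $best_d;
    return if $tied || $best_d > $max_ratio * length($typed);
    return $best;
}

# Made-up, already canonicalized names from the trusted hash.
my @correct = ('frededwardsmith', 'maryannjones', 'colourcharts');

for my $typed ('frededwardsmyth', 'colorcharts', 'zzzzzz') {
    my $hit = best_match($typed, \@correct, 0.2);
    print "$typed => ", (defined $hit ? $hit : '(no confident match)'), "\n";
}
```

Refusing to answer on a tie or a distant best score leaves the doubtful names for a human to check, rather than silently mapping them to the wrong ID.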

Xho

 
