Can regular expressions be used to choose among several imperfect matches?

Ted Byers · Nov 18, 2008

What I mean is this:

Imagine I have two hashes. The first has a name as the key and an
integer ID as the value; its names are guaranteed to be unique and
correct. The second hash also has names as keys (the values in this
one don't matter), but because they were typed manually they are not
guaranteed to be correct. Several keys in the second hash may even
correspond to a single key in the first. This many-to-one potential
arises from the different ways typos (and different abbreviations) can
alter a given string.

A major complication is that, because the data in the second hash
comes from a different feed using a different protocol, it is
guaranteed that there will never be a perfect match between any key in
the first hash and any key in the second hash. The only guarantee
regarding the data in the second hash is that each of its keys
corresponds to exactly one key in the first. One pattern
we see a lot is that in some cases the name string includes
whitespace between the names provided, while in others there is no
whitespace: so FredEdwardSmith would need to be recognized as the same
as Fred Edward Smith. Another pattern is an arbitrary number of
digits before the name, after it, or both. And then there are issues
with different spelling conventions (e.g. color vs colour) and ordinary
typos (e.g. FredEdwardSmyth).

The problem is to create a hash that maps all keys in the second hash
to the ID used as the value in the first hash. This is in a context
where nothing is known until run time: at run time, both sets of data
have been loaded into a DB, and our script retrieves the data from
there. This data is dynamic so there is little chance of seeing the
same data twice (but the second data feed changes much more frequently
than the first).

Now, when we actually look at the data ourselves, it is obvious which
correct name applies to each name from the second feed. Our problem
is how to write a script that is as good as the human eye at spotting
correct matches between the first and second sets of data.

My first thought was to use regular expressions for this, but nothing
I have read so far sheds light on how to use them on imperfect data.
Are regular expressions able to deal with this, or is there a Perl
package that is better suited to this problem?

Thanks

Ted

xhoster · Nov 18, 2008

Ted Byers said:
One pattern
we see a lot is that in some cases, the name string includes
whitespace between the names provided, while in others there is no
whitespace: so FredEdwardSmith would need to be recognized as the same
as Fred Edward Smith. Another pattern includes an arbitrary number of
digits before the name, after or both.

Canonicalize the data by eliminating all whitespace and any leading or
trailing digits.
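
A minimal sketch of that canonicalization, using only core Perl. The
helper name canonicalize, the sample names and the ID 17 are made up
for illustration, and folding case with lc is an extra assumption that
goes beyond stripping whitespace and digits:

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a name to a canonical form for comparison.
sub canonicalize {
    my ($name) = @_;
    $name =~ s/\s+//g;    # drop all whitespace first
    $name =~ s/^\d+//;    # then drop leading digits
    $name =~ s/\d+$//;    # and trailing digits
    return lc $name;      # case folding is an assumption, not a requirement
}

# Made-up sample data standing in for the two hashes.
my %id_for = ('Fred Edward Smith' => 17);
my %typed  = ('12FredEdwardSmith' => 1, 'Fred Edward Smith 99' => 1);

# Index the trusted names by canonical form.
my %id_by_canon;
$id_by_canon{ canonicalize($_) } = $id_for{$_} for keys %id_for;

# Map every typed name whose canonical form matches to the trusted ID.
my %id_for_typed;
for my $typed_name (keys %typed) {
    my $id = $id_by_canon{ canonicalize($typed_name) };
    $id_for_typed{$typed_name} = $id if defined $id;
}
```

Names that still fail to match after this step (spelling variants,
typos) are left out of %id_for_typed and need approximate matching.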

Ted Byers said:
And then there are issues with
different spelling conventions (e.g. color vs colour) and regular
typos (e.g. FredEdwardSmyth).

For this, maybe try String::Approx, or one of the other modules
discussed in the perldoc for String::Approx.
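
To show the general idea of approximate matching without relying on any
CPAN module, here is a rough sketch that swaps in a hand-written
Levenshtein edit distance rather than String::Approx itself. The names
edit_distance and closest_name and the $max_dist threshold are
hypothetical, and the inputs are assumed to be canonicalized already:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance: minimum number of single-character insertions,
# deletions and substitutions needed to turn $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,           # deletion
                           $cur[$j - 1] + 1,        # insertion
                           $prev[$j - 1] + $cost);  # substitution or match
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: return the candidate nearest to $name, or undef
# when even the best one is more than $max_dist edits away.
sub closest_name {
    my ($name, $max_dist, @candidates) = @_;
    my ($best, $best_dist);
    for my $cand (@candidates) {
        my $d = edit_distance($name, $cand);
        ($best, $best_dist) = ($cand, $d)
            if !defined($best_dist) || $d < $best_dist;
    }
    return (defined($best_dist) && $best_dist <= $max_dist) ? $best : undef;
}
```

For example, closest_name('frededwardsmyth', 2, 'frededwardsmith',
'johnsmith') picks 'frededwardsmith', which is one substitution away.
The threshold is a judgment call: set it too high and unrelated names
start to match, too low and real typos are missed.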

Xho

