Python's re module and genealogy problem

BrJohan · Jun 11, 2014

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in
order to match names even if they are spelled differently, I will build
regular expressions, each of which is supposed to match a number of
similar names.

I guess that there will be a few hundred such regular expressions
covering most popular names.

Now, my problem: Is there a way to decide whether any two - or more - of
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least
one string matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

Robert Kern · Jun 11, 2014

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in order to
match names even if they are spelled differently, I will build regular
expressions, each of which is supposed to match a number of similar names.

I guess that there will be a few hundred such regular expressions covering most
popular names.

Now, my problem: Is there a way to decide whether any two - or more - of those
regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least one string
matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

And if that isn't the best straight line for the old saying, I don't know what is.

http://en.wikiquote.org/wiki/Jamie_Zawinski

Anyways, to your new problem, yes it's possible. Search for "regular expression
intersection" for possible approaches. You will probably have to translate the
regular expression to a different formalism or at least a different library to
implement this.

Consider just listing out the different possibilities. All of your regexes
should be "well-behaved" given the constraints of the domain (tightly bounded,
at least). There are tools that help generate matching strings from a Python
regex. This will help you QA your regexes, too, to be sure that they match what
you expect them to and not match non-names.

https://github.com/asciimoo/exrex

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Thomas Rachel · Jun 11, 2014

Am 11.06.2014 14:23 schrieb BrJohan:

Can it, for a pair of regular expressions be decided whether at least
one string matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

Just a feeling-based statement: I don't think that is easily possible.

Every RE can potentially match an infinite number of statements.

Just have a look at

re1 = re.compile('A43.*')
re2 = re.compile('.*[0-9][A-F]')

It can easily seen that the area these REs work on is different; they
are orthogonal.

So there is an infinite number of strings these REs match.

Thomas

Mark H Harris · Jun 11, 2014

Anyways, to your new problem, yes it's possible. Search for "regular
expression intersection" for possible approaches.

I agree, I would not use a decision (decision tree) but would consider
a set of filters from most specific to least specific.

marcus

Michael Torrie · Jun 11, 2014

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in
order to match names even if they are spelled differently, I will build
regular expressions, each of which is supposed to match a number of
similar names.

I guess that there will be a few hundred such regular expressions
covering most popular names.

Now, my problem: Is there a way to decide whether any two - or more - of
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least
one string matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

You might want to search for fuzzy matching algorithms. Years ago, there
was an algorithm called soundex that would generate fuzzy fingerprints
for words that would hide differences in spelling, etc. Unfortunately
such an algorithm would be language dependent. The problem you are
trying to solve is one of those very hard problems in computers and math.

Nick Cash · Jun 11, 2014

You might want to search for fuzzy matching algorithms. Years ago, there
was an algorithm called soundex that would generate fuzzy fingerprints
for words that would hide differences in spelling, etc. Unfortunately
such an algorithm would be language dependent. The problem you are
trying to solve is one of those very hard problems in computers and math.

Soundex is actually not horrible, but it is definitely only for English
names. Newer variants of Metaphone
(http://en.wikipedia.org/wiki/Metaphone) are significantly better, and
support quite a few other languages. Either one would most likely be
better than the regex approach.

Side note: if your data happens to be in MySQL then it has a builtin
"sounds_like()" function that compares strings using soundex.

Simon Ward · Jun 11, 2014

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in
order to match names even if they are spelled differently, I will build

regular expressions, each of which is supposed to match a number of
similar names.

As has been mentioned, you probably want to look at fuzzy matching algorithms rather than aiming at regular expressions, although a quick search suggests there has been some work on fuzzy matching with regular expressions[1].

Now, my problem: Is there a way to decide whether any two - or more -
of
those regular expressions will match the same string?

If your regexes are truly regular expressions (see [2]*) then they represent regular languages[3], which are really sets. The intersection of these, is another regular language. If you test the string against this it will also match both original languages.

(*this only mentions back references, but I think the look-ahead/behind assertions are also non-regular)

[1]: http://laurikari.net/tre/about/
[2]: https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
[3]: https://en.wikipedia.org/wiki/Regular_language

Simon

Vlastimil Brom · Jun 11, 2014

2014-06-11 14:23 GMT+02:00 BrJohan said:
For some genealogical purposes I consider using Python's re module.
...

Now, my problem: Is there a way to decide whether any two - or more - of
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least one
string matching both of those regular expressions, can be constructed?

Hi,
i guess, you could reuse some available generators for strings
matching a given regular expression, see e.g.:
http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python/
for example a pyparsing recipe:
http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python/5006339#5006339

which might be general enough for your needs - of course, you cannot
use unbound quantifiers, backreferences, etc.

Then you can test for identical strings in the generated outputs -
e.g. using the set(...) and its intersection method.

You might also check a much more powerful regex library
https://pypi.python.org/pypi/regex

which, beyond other features, also supports the mentioned fuzzy matches, cf.

regex.findall(r"\bSm(?:ith){e<3}\b", "Smith Smithe Smyth Smythe Smijth") ['Smith', 'Smithe', 'Smyth', 'Smythe', 'Smijth']

Click to expand...

Click to expand...

(but, of course, you will have to be careful with this feature in
order to reduce false positives)

hth,
vbr

BrJohan · Jun 13, 2014

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in
order to match names even if they are spelled differently, I will build
regular expressions, each of which is supposed to match a number of
similar names.

I guess that there will be a few hundred such regular expressions
covering most popular names.

Now, my problem: Is there a way to decide whether any two - or more - of
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least
one string matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

Thank you all for valuable input and interesting thoughts.

After having reconsidered my problem, it might be better to approach it
a little differently.

Either to state the regexps simply like:
"(Kristina)|(Christina)|(Cristine)|(Kristine)"
instead of "((K|(Ch))ristina)|([CK]ristine)"

Or to put the namevariants in some sequence of sets having elements like:
("Kristina", "Christina", "Cristine", "Kristine")
Matching is then just applying the 'in' operator.

I see two distinct advantages.
1. Readability and maintainability
2. Any namevariant occurring in just one regexp or set means no risk of
erroneous matching.

Comments?

Peter Otten · Jun 13, 2014

BrJohan said:
For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in
order to match names even if they are spelled differently, I will build
regular expressions, each of which is supposed to match a number of
similar names.

I guess that there will be a few hundred such regular expressions
covering most popular names.

Now, my problem: Is there a way to decide whether any two - or more - of
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least
one string matching both of those regular expressions, can be
constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

Click to expand...

Thank you all for valuable input and interesting thoughts.

After having reconsidered my problem, it might be better to approach it
a little differently.

Either to state the regexps simply like:
"(Kristina)|(Christina)|(Cristine)|(Kristine)"
instead of "((K|(Ch))ristina)|([CK]ristine)"

Or to put the namevariants in some sequence of sets having elements like:
("Kristina", "Christina", "Cristine", "Kristine")
Matching is then just applying the 'in' operator.

I see two distinct advantages.
1. Readability and maintainability
2. Any namevariant occurring in just one regexp or set means no risk of
erroneous matching.

Comments?

I like the simple variant

kristinas = ("Kristina", "Christina", "Cristine", "Kristine")

But instead of matching with "in" you could build a dict that maps the name
variants to a normalised name

normalized_names = {
"Kristina": "Kristina",
"Christina": "Kristina",
...
"John": "John",
"Johann": "John",
...
}
def normalized(name):
return normalized_names.get(name, name)

If you put persons in another dict or a database indexed by the normalised
name

lookup = {
"Kristina": ["Kristina Smith", "Christina Miller"],
...
}

you can find all Kristinas with two look-ups:

lookup[normalized("Kristine")]

Click to expand...

Click to expand...

['Kristina Smith', 'Christina Miller']

PS: A problem with this approach might be that (name in nameset_A) and (name
in nameset_B) implies nameset_A == nameset_B

Tony the Tiger · Jun 14, 2014

I guess that there will be a few hundred such regular expressions
covering most popular names.

Perhaps

http://code.activestate.com/recipes/52213-soundex-algorithm/
https://en.wikipedia.org/wiki/Soundex

will work?

/Grrr

Hooking into Python's memory management	2	May 4, 2011
parsing an Excel formula with the re module	20	Jan 5, 2010
Re for Apache log file format	4	Oct 8, 2013
New implementation of re module	14	Jul 27, 2009
Python's doc problems: sort	11	Apr 29, 2008
Python's Reference And Internal Model Of Computing Languages	7	Feb 2, 2010
On re / regex replacement	3	Aug 28, 2011
Working with named groups in re module	2	Jan 10, 2007

Python's re module and genealogy problem

BrJohan

Robert Kern

Thomas Rachel

Mark H Harris

Michael Torrie

Nick Cash

Simon Ward

Vlastimil Brom

BrJohan

Peter Otten

Tony the Tiger

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads