Python's re module and genealogy problem

B

BrJohan

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in
order to match names even if they are spelled differently, I will build
regular expressions, each of which is supposed to match a number of
similar names.

I guess that there will be a few hundred such regular expressions
covering most popular names.

Now, my problem: Is there a way to decide whether any two - or more - of
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least
one string matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?
 
R

Robert Kern

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in order to
match names even if they are spelled differently, I will build regular
expressions, each of which is supposed to match a number of similar names.

I guess that there will be a few hundred such regular expressions covering most
popular names.

Now, my problem: Is there a way to decide whether any two - or more - of those
regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least one string
matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

And if that isn't the best straight line for the old saying, I don't know what is.

http://en.wikiquote.org/wiki/Jamie_Zawinski

Anyways, to your new problem, yes it's possible. Search for "regular expression
intersection" for possible approaches. You will probably have to translate the
regular expression to a different formalism or at least a different library to
implement this.

Consider just listing out the different possibilities. All of your regexes
should be "well-behaved" given the constraints of the domain (tightly bounded,
at least). There are tools that help generate matching strings from a Python
regex. This will help you QA your regexes, too, to be sure that they match what
you expect them to and not match non-names.

https://github.com/asciimoo/exrex

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
T

Thomas Rachel

Am 11.06.2014 14:23 schrieb BrJohan:
Can it, for a pair of regular expressions be decided whether at least
one string matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

Just a feeling-based statement: I don't think that is easily possible.

Every RE can potentially match an infinite number of statements.

Just have a look at

re1 = re.compile('A43.*')
re2 = re.compile('.*[0-9][A-F]')

It can easily seen that the area these REs work on is different; they
are orthogonal.

So there is an infinite number of strings these REs match.


Thomas
 
M

Mark H Harris

Anyways, to your new problem, yes it's possible. Search for "regular
expression intersection" for possible approaches.

I agree, I would not use a decision (decision tree) but would consider
a set of filters from most specific to least specific.

marcus
 
M

Michael Torrie

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in
order to match names even if they are spelled differently, I will build
regular expressions, each of which is supposed to match a number of
similar names.

I guess that there will be a few hundred such regular expressions
covering most popular names.

Now, my problem: Is there a way to decide whether any two - or more - of
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least
one string matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

You might want to search for fuzzy matching algorithms. Years ago, there
was an algorithm called soundex that would generate fuzzy fingerprints
for words that would hide differences in spelling, etc. Unfortunately
such an algorithm would be language dependent. The problem you are
trying to solve is one of those very hard problems in computers and math.
 
N

Nick Cash

You might want to search for fuzzy matching algorithms. Years ago, there
was an algorithm called soundex that would generate fuzzy fingerprints
for words that would hide differences in spelling, etc. Unfortunately
such an algorithm would be language dependent. The problem you are
trying to solve is one of those very hard problems in computers and math.

Soundex is actually not horrible, but it is definitely only for English
names. Newer variants of Metaphone
(http://en.wikipedia.org/wiki/Metaphone) are significantly better, and
support quite a few other languages. Either one would most likely be
better than the regex approach.

Side note: if your data happens to be in MySQL then it has a builtin
"sounds_like()" function that compares strings using soundex.
 
S

Simon Ward

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in
order to match names even if they are spelled differently, I will build

regular expressions, each of which is supposed to match a number of
similar names.

As has been mentioned, you probably want to look at fuzzy matching algorithms rather than aiming at regular expressions, although a quick search suggests there has been some work on fuzzy matching with regular expressions[1].
Now, my problem: Is there a way to decide whether any two - or more -
of
those regular expressions will match the same string?

If your regexes are truly regular expressions (see [2]*) then they represent regular languages[3], which are really sets. The intersection of these, is another regular language. If you test the string against this it will also match both original languages.

(*this only mentions back references, but I think the look-ahead/behind assertions are also non-regular)

[1]: http://laurikari.net/tre/about/
[2]: https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
[3]: https://en.wikipedia.org/wiki/Regular_language

Simon
 
V

Vlastimil Brom

2014-06-11 14:23 GMT+02:00 BrJohan said:
For some genealogical purposes I consider using Python's re module.
...

Now, my problem: Is there a way to decide whether any two - or more - of
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least one
string matching both of those regular expressions, can be constructed?

Hi,
i guess, you could reuse some available generators for strings
matching a given regular expression, see e.g.:
http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python/
for example a pyparsing recipe:
http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python/5006339#5006339

which might be general enough for your needs - of course, you cannot
use unbound quantifiers, backreferences, etc.

Then you can test for identical strings in the generated outputs -
e.g. using the set(...) and its intersection method.

You might also check a much more powerful regex library
https://pypi.python.org/pypi/regex

which, beyond other features, also supports the mentioned fuzzy matches, cf.
regex.findall(r"\bSm(?:ith){e<3}\b", "Smith Smithe Smyth Smythe Smijth") ['Smith', 'Smithe', 'Smyth', 'Smythe', 'Smijth']
(but, of course, you will have to be careful with this feature in
order to reduce false positives)

hth,
vbr
 
B

BrJohan

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in
order to match names even if they are spelled differently, I will build
regular expressions, each of which is supposed to match a number of
similar names.

I guess that there will be a few hundred such regular expressions
covering most popular names.

Now, my problem: Is there a way to decide whether any two - or more - of
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least
one string matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

Thank you all for valuable input and interesting thoughts.

After having reconsidered my problem, it might be better to approach it
a little differently.

Either to state the regexps simply like:
"(Kristina)|(Christina)|(Cristine)|(Kristine)"
instead of "((K|(Ch))ristina)|([CK]ristine)"

Or to put the namevariants in some sequence of sets having elements like:
("Kristina", "Christina", "Cristine", "Kristine")
Matching is then just applying the 'in' operator.

I see two distinct advantages.
1. Readability and maintainability
2. Any namevariant occurring in just one regexp or set means no risk of
erroneous matching.

Comments?
 
P

Peter Otten

BrJohan said:
For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in
order to match names even if they are spelled differently, I will build
regular expressions, each of which is supposed to match a number of
similar names.

I guess that there will be a few hundred such regular expressions
covering most popular names.

Now, my problem: Is there a way to decide whether any two - or more - of
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least
one string matching both of those regular expressions, can be
constructed?

If it is possible to make such a decision, then how? Anyone aware of an
algorithm for this?

Thank you all for valuable input and interesting thoughts.

After having reconsidered my problem, it might be better to approach it
a little differently.

Either to state the regexps simply like:
"(Kristina)|(Christina)|(Cristine)|(Kristine)"
instead of "((K|(Ch))ristina)|([CK]ristine)"

Or to put the namevariants in some sequence of sets having elements like:
("Kristina", "Christina", "Cristine", "Kristine")
Matching is then just applying the 'in' operator.

I see two distinct advantages.
1. Readability and maintainability
2. Any namevariant occurring in just one regexp or set means no risk of
erroneous matching.

Comments?

I like the simple variant

kristinas = ("Kristina", "Christina", "Cristine", "Kristine")

But instead of matching with "in" you could build a dict that maps the name
variants to a normalised name

normalized_names = {
"Kristina": "Kristina",
"Christina": "Kristina",
...
"John": "John",
"Johann": "John",
...
}
def normalized(name):
return normalized_names.get(name, name)

If you put persons in another dict or a database indexed by the normalised
name

lookup = {
"Kristina": ["Kristina Smith", "Christina Miller"],
...
}

you can find all Kristinas with two look-ups:
['Kristina Smith', 'Christina Miller']

PS: A problem with this approach might be that (name in nameset_A) and (name
in nameset_B) implies nameset_A == nameset_B
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,056
Messages
2,570,446
Members
47,101
Latest member
vaibhav87

Latest Threads

Top