Python's re module and genealogy problem

Discussion in 'Python' started by BrJohan, Jun 11, 2014.

  1. BrJohan

    BrJohan Guest

    For some genealogical purposes I consider using Python's re module.

    Rather many names can be spelled in a number of similar ways, and in
    order to match names even if they are spelled differently, I will build
    regular expressions, each of which is supposed to match a number of
    similar names.

    I guess that there will be a few hundred such regular expressions
    covering most popular names.

    Now, my problem: Is there a way to decide whether any two - or more - of
    those regular expressions will match the same string?

    Or, stated a little differently:

    Can it, for a pair of regular expressions be decided whether at least
    one string matching both of those regular expressions, can be constructed?

    If it is possible to make such a decision, then how? Anyone aware of an
    algorithm for this?
     
    BrJohan, Jun 11, 2014
    #1

  2. Robert Kern

    Robert Kern Guest

    And if that isn't the best straight line for the old saying, I don't know what is.

    http://en.wikiquote.org/wiki/Jamie_Zawinski

    Anyway, to your actual problem: yes, it's possible. Search for "regular
    expression intersection" for possible approaches. You will probably have to
    translate the regular expressions into a different formalism, or at least
    into a different library, to implement this.
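    A minimal sketch of that "different formalism", assuming each pattern has
    already been compiled by hand into a DFA (the toy patterns "ab*" and "a*b"
    and all names here are illustrative, not from any particular library):
    a breadth-first search over the product automaton decides whether the
    intersection is empty.

```python
from collections import deque

# A DFA here is a tuple (start_state, accepting_states, transitions),
# where transitions maps (state, symbol) -> state; missing entries reject.

def intersects(dfa1, dfa2, alphabet):
    """Return True if some string is accepted by both DFAs,
    by exploring the product automaton breadth-first."""
    start = (dfa1[0], dfa2[0])
    seen = {start}
    queue = deque([start])
    while queue:
        s1, s2 = queue.popleft()
        if s1 in dfa1[1] and s2 in dfa2[1]:
            return True          # reached a state accepting in both DFAs
        for symbol in alphabet:
            t1 = dfa1[2].get((s1, symbol))
            t2 = dfa2[2].get((s2, symbol))
            if t1 is not None and t2 is not None and (t1, t2) not in seen:
                seen.add((t1, t2))
                queue.append((t1, t2))
    return False                 # product automaton has no reachable accept state

# Hand-built DFAs for the toy patterns "ab*" and "a*b" over the alphabet {a, b}:
dfa_ab_star = (0, {1}, {(0, "a"): 1, (1, "b"): 1})
dfa_a_star_b = (0, {1}, {(0, "a"): 0, (0, "b"): 1})

print(intersects(dfa_ab_star, dfa_a_star_b, "ab"))  # True: "ab" matches both
```

    For a few hundred name regexes this pairwise check stays cheap, since each
    product automaton is small.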

    Consider just listing out the different possibilities. All of your regexes
    should be "well-behaved" given the constraints of the domain (tightly bounded,
    at least). There are tools that help generate matching strings from a Python
    regex. This will also help you QA your regexes, to be sure that they match
    what you expect them to and do not match non-names.

    https://github.com/asciimoo/exrex

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
     
    Robert Kern, Jun 11, 2014
    #2

  3. On 11.06.2014 14:23, BrJohan wrote:
    Just a feeling-based statement: I don't think that is easily possible.

    Every RE can potentially match an infinite number of strings.

    Just have a look at

    re1 = re.compile('A43.*')
    re2 = re.compile('.*[0-9][A-F]')

    It can easily be seen that the areas these REs constrain are different; they
    are orthogonal (one pins down the start of the string, the other the end).

    So there is an infinite number of strings that both REs match.
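    This can be checked directly with the re module: "A433F" is one witness
    string satisfying both constraints, and inserting more characters before
    the final digit-letter pair yields infinitely many more.

```python
import re

re1 = re.compile('A43.*')
re2 = re.compile('.*[0-9][A-F]')

# One concrete string satisfying both patterns at once:
s = 'A433F'
print(bool(re1.fullmatch(s)), bool(re2.fullmatch(s)))  # True True
```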


    Thomas
     
    Thomas Rachel, Jun 11, 2014
    #3
  4. I agree, I would not use a decision (decision tree) but would consider
    a set of filters from most specific to least specific.

    marcus
     
    Mark H Harris, Jun 11, 2014
    #4
  5. You might want to search for fuzzy matching algorithms. Years ago, there
    was an algorithm called soundex that would generate fuzzy fingerprints
    for words that would hide differences in spelling, etc. Unfortunately
    such an algorithm would be language dependent. The problem you are
    trying to solve is one of those very hard problems in computers and math.
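    For reference, a minimal Python sketch of the classic American Soundex
    algorithm mentioned above, including the usual rule that H and W do not
    separate adjacent duplicate codes (a from-scratch illustration, not any
    library's implementation):

```python
def soundex(name):
    """Classic 4-character Soundex code, e.g. 'Robert' -> 'R163'."""
    name = name.upper()
    codes = {**{c: "1" for c in "BFPV"}, **{c: "2" for c in "CGJKQSXZ"},
             **{c: "3" for c in "DT"}, "L": "4",
             **{c: "5" for c in "MN"}, "R": "6"}
    result = name[0]                    # keep the first letter as-is
    prev = codes.get(name[0], "")       # its code still blocks duplicates
    for ch in name[1:]:
        if ch in "HW":
            continue                    # H/W do not reset the previous code
        code = codes.get(ch, "")        # vowels map to "" and do reset it
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]         # pad or truncate to 4 characters

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -- same fingerprint
```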
     
    Michael Torrie, Jun 11, 2014
    #5
  6. Nick Cash

    Nick Cash Guest

    Soundex is actually not horrible, but it is definitely only for English
    names. Newer variants of Metaphone
    (http://en.wikipedia.org/wiki/Metaphone) are significantly better, and
    support quite a few other languages. Either one would most likely be
    better than the regex approach.

    Side note: if your data happens to be in MySQL, it has a built-in
    SOUNDS LIKE comparison (based on its SOUNDEX() function) for comparing
    strings by sound.
     
    Nick Cash, Jun 11, 2014
    #6
  7. Simon Ward

    Simon Ward Guest

    As has been mentioned, you probably want to look at fuzzy matching
    algorithms rather than aiming at regular expressions, although a quick
    search suggests there has been some work on fuzzy matching with regular
    expressions[1].

    If your regexes are truly regular expressions (see [2]*) then they represent
    regular languages[3], which are really just sets. The intersection of two
    regular languages is another regular language. Any string that matches the
    intersection also matches both of the original languages.

    (*this only mentions back-references, but I think the look-ahead/behind
    assertions are also non-regular)

    [1]: http://laurikari.net/tre/about/
    [2]: https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
    [3]: https://en.wikipedia.org/wiki/Regular_language

    Simon
     
    Simon Ward, Jun 11, 2014
    #7
  8. Hi,
    I guess you could reuse some of the available generators for strings
    matching a given regular expression, see e.g.:
    http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python/
    for example a pyparsing recipe:
    http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python/5006339#5006339

    which might be general enough for your needs - of course, you cannot
    use unbounded quantifiers, backreferences, etc.

    Then you can test for identical strings in the generated outputs -
    e.g. using set(...) and its intersection method.
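    Staying with the standard library, that set-intersection idea can be
    sketched directly for strictly finite patterns (expand_alternation is a
    hypothetical helper that only handles a pure top-level alternation of
    literal branches, nothing more):

```python
def expand_alternation(pattern):
    """Very naive expansion: split a pure alternation pattern such as
    "(Kristina)|(Christina)" into its literal branches.
    No quantifiers, character classes, or nested groups supported."""
    return {branch.strip("()") for branch in pattern.split("|")}

p1 = "(Kristina)|(Christina)|(Cristine)|(Kristine)"
p2 = "(Christina)|(Christine)|(Kristine)"

# Names matched by both patterns -- a nonempty result means they overlap:
common = expand_alternation(p1) & expand_alternation(p2)
print(sorted(common))  # ['Christina', 'Kristine']
```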

    You might also check the much more powerful regex library:
    https://pypi.python.org/pypi/regex

    which, beyond other features, also supports the mentioned fuzzy matching
    (but, of course, you will have to be careful with this feature in
    order to reduce false positives).

    hth,
    vbr
     
    Vlastimil Brom, Jun 11, 2014
    #8
  9. BrJohan

    BrJohan Guest

    Thank you all for valuable input and interesting thoughts.

    After having reconsidered my problem, it might be better to approach it
    a little differently.

    Either state the regexps simply, like:
    "(Kristina)|(Christina)|(Cristine)|(Kristine)"
    instead of "((K|(Ch))ristina)|([CK]ristine)".

    Or put the name variants into some sequence of sets having elements like:
    ("Kristina", "Christina", "Cristine", "Kristine")
    Matching is then just a matter of applying the 'in' operator.
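    That set-based matching could look like this (NAME_GROUPS and same_name
    are illustrative names, not from the original post):

```python
# Each group collects spellings considered equivalent.
NAME_GROUPS = [
    frozenset({"Kristina", "Christina", "Cristine", "Kristine"}),
    frozenset({"John", "Jon", "Johan", "Johann"}),
]

def same_name(a, b):
    """True if both spellings occur in the same variant group."""
    return any(a in group and b in group for group in NAME_GROUPS)

print(same_name("Kristina", "Cristine"))  # True
print(same_name("Kristina", "Johann"))    # False
```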

    I see two distinct advantages:
    1. Readability and maintainability.
    2. As long as each name variant occurs in just one regexp or set, there is
    no risk of an erroneous match.

    Comments?
     
    BrJohan, Jun 13, 2014
    #9
  10. Peter Otten

    Peter Otten Guest

    I like the simple variant

    kristinas = ("Kristina", "Christina", "Cristine", "Kristine")

    But instead of matching with "in" you could build a dict that maps the name
    variants to a normalised name:

    normalized_names = {
        "Kristina": "Kristina",
        "Christina": "Kristina",
        ...
        "John": "John",
        "Johann": "John",
        ...
    }

    def normalized(name):
        return normalized_names.get(name, name)

    If you put persons in another dict or a database indexed by the normalised
    name

    lookup = {
        "Kristina": ["Kristina Smith", "Christina Miller"],
        ...
    }

    you can find all Kristinas with two look-ups:

    lookup[normalized("Christina")]
    ['Kristina Smith', 'Christina Miller']

    PS: A problem with this approach might be that (name in nameset_A) and (name
    in nameset_B) implies nameset_A == nameset_B
     
    Peter Otten, Jun 13, 2014
    #10
  11. Tony the Tiger, Jun 14, 2014
    #11
