Counting elements in a list wildcard

hawkesed · Apr 25, 2006

Say I have a list of names, and I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (i.e.,
Susy, Susi, and Susie all count). How would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed

Ben Finney · Apr 25, 2006

Ryan Ginstrom said:
If there are specific spellings you want to allow, you could just
create a list of them and see if your Suzy is in there:

>>> possible_suzys = [ 'Susy', 'Susi', 'Susie' ]
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Jane' ]
>>> for line in my_strings:
...     if line in possible_suzys: print(line)
...
Susi

If you wanted to do something later, rather than only during the scan
over the list, getting a list of suzies would probably be more useful:

>>> possible_suzys = [ 'Susy', 'Susi', 'Susie' ]
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane' ]
>>> found_suzys = [s for s in my_strings if s in possible_suzys]
>>> found_suzys
['Susi', 'Susy']
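
Since the original question asked for a count rather than a list, the same membership test can feed sum(). This is only a minimal sketch, assuming the accepted spellings are known up front; the helper name count_variants is made up for illustration, and the variants are put in a set so each lookup is cheap.

```python
def count_variants(names, variants):
    """Count how many entries of names appear among the accepted variants."""
    accepted = set(variants)  # set membership is O(1) on average
    return sum(1 for name in names if name in accepted)


possible_suzys = ['Susy', 'Susi', 'Susie']
my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']
print(count_variants(my_strings, possible_suzys))  # prints 2
```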

Dave Hughes · Apr 25, 2006

You might want to check out the SoundEx and MetaPhone algorithms which
provide approximations of the "sound" of a word based on spelling
(assuming English pronunciations).

Apparently a soundex module used to be built into Python but was
removed in 2.0. You can find several implementations on the 'net, for
example:

http://orca.mojam.com/~skip/python/soundex.py
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213
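
For readers who just want to see the idea, here is a rough sketch of the commonly described American Soundex rules (keep the first letter, map the remaining consonants to digits, collapse repeats, pad to four characters). It is not the code from either link above, the function name soundex is my own choice, and it assumes plain ASCII names: accented letters are simply dropped.

```python
# Digit assigned to each consonant group; vowels, H, W and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return a 4-character Soundex key, or '' if name has no ASCII letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            key += code
        if c not in "HW":  # H and W do not separate equal codes; vowels do
            prev = code
    return (key + "000")[:4]


names = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']
target = soundex('Susie')
print(sum(1 for n in names if soundex(n) == target))  # prints 2
```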

MetaPhone is generally considered better than SoundEx for "sounds-like"
matching, although it's considerably more complex (IIRC, although it's
been a long time since I wrote an implementation of either in any
language). A Python MetaPhone implementation (there must be more than
this one?):

http://joelspeters.com/awesomecode/

Another algorithm that might interest isn't based on "sounds-like" but
instead computes the number of transforms necessary to get from one
word to another: the Levenshtein distance. A C based implementation
(with Python interface) is available:

http://trific.ath.cx/resources/python/levenshtein/
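
To make the idea concrete, a pure-Python sketch of the textbook dynamic-programming edit distance follows. It is not the C implementation linked above and will be much slower on long strings; the function name levenshtein is just an illustrative choice.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


print(levenshtein('Susi', 'Susy'))   # prints 1
print(levenshtein('Sally', 'Susy'))  # prints 3
```

The cut-off matters a lot on names this short: 'Sally' is only three edits away from 'Susy', so a tight threshold (a distance of 1 or 2) is probably the safer choice.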

Whichever algorithm you go with, you'll wind up with some sort of
"similar" function which could be applied in a similar manner to Ben's
example (I've just mocked up the following -- it's not an actual
session):

>>> import soundex
>>> import metaphone
>>> import levenshtein
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane' ]
>>> found_suzys = [s for s in my_strings if soundex.sounds_similar(s, 'Susy')]
>>> found_suzys = [s for s in my_strings if metaphone.sounds_similar(s, 'Susy')]
>>> found_suzys = [s for s in my_strings if levenshtein.distance(s, 'Susy') < 4]
>>> found_suzys
['Susi', 'Susy'] (one hopes anyway!)

HTH,

Dave.

Edward Elliott · Apr 25, 2006

Dave said:
Another algorithm that might interest isn't based on "sounds-like" but
instead computes the number of transforms necessary to get from one
word to another: the Levenshtein distance. A C based implementation
(with Python interface) is available:

I don't know what algorithm it uses, but the difflib module looks similar.
I've had good results using the get_close_matches function to locate
similarly-named mp3 files.
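
A small sketch of what that looks like with the standard library's difflib.get_close_matches. The name list, which adds 'Suzy' to the earlier examples, is only illustrative; n caps the number of results and cutoff is the minimum similarity ratio (0.6 is the documented default).

```python
import difflib

names = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Suzy', 'Jane']

# Ask for every candidate whose similarity ratio to 'Susie' is at least 0.6,
# best match first.
matches = difflib.get_close_matches('Susie', names, n=len(names), cutoff=0.6)
print(matches)
print(len(matches))
```

On this list 'Susi' and 'Susy' clear the cutoff but 'Suzy' does not, which hints at how sensitive ratio-based matching is on very short strings.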

However I don't think "close enough" is well suited for this application.
The sequences are short and non-distinct. Difference matching needs longer
sequences to be effective. Phoneme matching seems overly complex and might
grab things like Tsu-zi. I'd just use a list of alternate spellings like
Ben suggested.

Dennis Lee Bieber · Apr 25, 2006

If this were a genealogy exercise, the suggestion would be to
convert all the names to Soundex, then look for Soundex matches.

And guess what -- it (the Soundex part) has already been written <G>

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

Iain King · Apr 25, 2006

Dare I suggest using REs? This looks like something they'd be good
for:

import re

def countMatches(names, namePattern):
    count = 0
    for name in names:
        if namePattern.match(name):
            count += 1
    return count

susie = re.compile("Su(s|z)(i|ie|y)")

print(countMatches(["John", "Suzy", "Peter", "Steven", "Susie",
                    "Susi"], susie))

Some other patterns:

iain = re.compile("(Ia(i)?n|Eoin)")
steven = re.compile("Ste(v|ph|f)(e|a)n")
john = re.compile("Jo(h)?n")
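
One caveat worth sketching: pattern.match() only anchors at the start of the string, so longer names that merely begin with a Susie spelling are counted too. The variant below assumes Python 3.4 or later for fullmatch(); the snake_case helper name and the made-up entry "Susieanna" are there purely for illustration.

```python
import re

# Character class plus a non-capturing group: Sus/Suz followed by ie, i or y.
SUSIE = re.compile(r"Su[sz](?:ie|i|y)")


def count_matches(names, pattern):
    """Count names that match the pattern in full, not just as a prefix."""
    return sum(1 for name in names if pattern.fullmatch(name))


names = ["John", "Suzy", "Peter", "Steven", "Susie", "Susi", "Susieanna"]
print(count_matches(names, SUSIE))                    # prints 3
print(sum(1 for n in names if SUSIE.match(n)))        # prints 4 (prefix match)
```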

Iain

John Machin · Apr 25, 2006

Edward said:
Phoneme matching seems overly complex and might
grab things like Tsu-zi.

It might *only* if somebody had a rush of blood to the head and devised
yet another phonetic key "algorithm". Tsuzi does *not* give the same
result as any of Suzi, Suzie, Susi, and Susie when pushed through any of
the following: Soundex, NYSIIS, Metaphone, Dolby, and Caverphone. None
of them throw away the 'T' sound.

John Machin · Apr 25, 2006

Iain King said:
some other patterns:

iain = re.compile("(Ia(i)?n|Eoin)")
steven = re.compile("Ste(v|ph|f)(e|a)n")

What about Steffan, Etienne, Esteban, István, ... ?

john = re.compile("Jo(h)?n")

IMHO, the amount of hand-crafting that goes into a *general-purpose*
phonetic matching algorithm is already bordering on overkill. Your
method using REs would not appear to scale well at all.

Iain King · Apr 25, 2006

John said:
snip

What about Steffan, Etienne, Esteban, István, ... ?

well, obviously these could be included:
"(Ste(v|ph|f)(e|a)n|Steffan|Etienne|Esteban)", but the OP never said he
wanted to translate anything into another language. He just wanted to
catch variable spellings.

Iain

John Machin · Apr 25, 2006

Iain said:
well, obviously these could be included:
"(Ste(v|ph|f)(e|a)n|Steffan|Etienne|Esteban)", but the OP never said he
wanted to translate anything into another language.

Neither did I. But if you have to cope with a practical situation like
where the birth certificate says István and the job application says
Steven and the foreman calls him Steve, you won't be stuffing about with
hand-crafted REs, one per popular given name. Could be worse: the punter
could have looked up a dictionary and changed his surname from Kovács to
Smith; believe me -- it happens.

Oh and if you cast your net as wide as the Pacific islands, chuck in
Sitiveni. That's enough examples. We won't go near Benjamin.

Edward Elliott · Apr 25, 2006

John said:
IMHO, the amount of hand-crafting that goes into a *general-purpose*
phonetic matching algorithm is already bordering on overkill. Your
method using REs would not appear to scale well at all.

Also compare the readability of regular expressions in this case to a simple
list:
["Steven", "Stephen", "Stefan", "Stephan", ...]

Edward Elliott · Apr 25, 2006

John said:
It might *only* if somebody had a rush of blood to the head and devised
yet another phonetic key "algorithm". Tsuzi does *not* give the same
result as any of Suzi, Suzie, Susi, and Susie when pushed through any of
the following: Soundex, NYSIIS, Metaphone, Dolby, and Caverphone. None
of them throw away the 'T' sound.

Spelling isn't phonetic. The 't' character doesn't necessarily affect
pronunciation. Or it may affect pronunciation in a way the soundex
doesn't understand (think tonal languages). Latinizing foreign languages
raises all sorts of problems.

A soundex is only as good as its pronunciation database. It may work well
in many situations, but it isn't foolproof.

Iain King · Apr 26, 2006

Edward said:
Also compare the readability of regular expressions in this case to a simple
list:
["Steven", "Stephen", "Stefan", "Stephan", ...]

Somehow I'm the advocate for REs here, which: erg. But you have some
mighty convenient ellipses there...
Compare:

steven = re.compile("Ste(v|ph|f|ff)(e|a)n")
steven = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan",
"Stephan", "Stefan", "Steffan"]

I know which I'd rather type. 'Course, if you can use a ready-built
list of names...
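
A possible middle ground, offered only as a sketch: generate the explicit list from the same stems and endings with itertools.product, so it is short to type and still readable as plain data. The variable names here are made up, and the final check simply confirms that the regular expression above accepts every generated spelling.

```python
import re
from itertools import product

stems = ["Stev", "Steph", "Stef", "Steff"]
endings = ["en", "an"]

# Every stem combined with every ending: 4 x 2 = 8 spellings.
spellings = [stem + ending for stem, ending in product(stems, endings)]
print(spellings)

# Sanity check against the pattern form of the same idea.
pattern = re.compile(r"Ste(v|ph|f|ff)(e|a)n")
print(all(pattern.fullmatch(s) for s in spellings))  # prints True
```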

Iain

Edward Elliott · Apr 26, 2006

Iain said:
steven = re.compile("Ste(v|ph|f|ff)(e|a)n")
steven = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan",
"Stephan", "Stefan", "Steffan"]

I know which I'd rather type. 'Course, if you can use a ready-built
list of names...

Oh I agree, I'd rather *type* the former, but I'd rather *read* the
latter.

Edward Elliott · Apr 26, 2006

Iain said:
steven = re.compile("Ste(v|ph|f|ff)(e|a)n")

Also you can expand the RE a bit to improve readability:

re.compile("(Stev|Steph|Stef|Steff)(en|an)")
