Counting elements in a list wildcard

hawkesed

If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed
 
Ben Finney

Ryan Ginstrom said:
If there are specific spellings you want to allow, you could just
create a list of them and see if your Suzy is in there:
>>> possible_suzys = [ 'Susy', 'Susi', 'Susie' ]
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Jane' ]
>>> for line in my_strings:
...     if line in possible_suzys: print(line)
...
Susi

If you wanted to do something later, rather than only during the scan
over the list, getting a list of suzies would probably be more useful:
>>> possible_suzys = [ 'Susy', 'Susi', 'Susie' ]
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane' ]
>>> found_suzys = [s for s in my_strings if s in possible_suzys]
>>> found_suzys
['Susi', 'Susy']
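
Since the original question was about counting rather than listing, here is a minimal sketch of the same idea that returns a number. The helper name count_variants is made up for illustration, and matching case-insensitively is an assumption, not something the OP asked for.

```python
# Hypothetical helper: count names that appear in a set of accepted spellings.
def count_variants(names, variants):
    # Lower-casing both sides is an assumption; drop it for exact matching.
    accepted = set(v.lower() for v in variants)
    return sum(1 for name in names if name.lower() in accepted)

possible_suzys = ['Susy', 'Susi', 'Susie']
my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']
print(count_variants(my_strings, possible_suzys))  # prints 2
```

Using a set keeps each membership test cheap even if the list of spellings grows.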
 
Dave Hughes

hawkesed said:
If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed

You might want to check out the SoundEx and MetaPhone algorithms which
provide approximations of the "sound" of a word based on spelling
(assuming English pronunciations).

Apparently a soundex module used to be built into Python but was
removed in 2.0. You can find several implementations on the 'net, for
example:

http://orca.mojam.com/~skip/python/soundex.py
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213
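
For anyone who wants to see the idea without fetching a module, below is a minimal sketch of the classic American Soundex rules as commonly described (keep the first letter, code the remaining consonants as digits, collapse repeats, pad to four characters). The function name soundex and the sample names are illustrative and are not taken from either of the linked implementations.

```python
# Sketch of American Soundex; not the code from the links above.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")      # vowels, h, w and y have no code
        if code and code != prev:
            result += code
        if ch not in "hw":             # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]        # pad or truncate to 4 characters

my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']
target = soundex('Susie')
print(sum(1 for s in my_strings if soundex(s) == target))  # prints 2
```

Comparing keys rather than spellings means 'Suzy' and 'Susie' land in the same bucket, while 'Sally' does not.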

MetaPhone is generally considered better than SoundEx for "sounds-like"
matching, although it's considerably more complex (IIRC, although it's
been a long time since I wrote an implementation of either in any
language). A Python MetaPhone implementation (there must be more than
this one?):

http://joelspeters.com/awesomecode/

Another algorithm that might interest you isn't based on "sounds-like" but
instead computes the number of transforms necessary to get from one
word to another: the Levenshtein distance. A C-based implementation
(with Python interface) is available:

http://trific.ath.cx/resources/python/levenshtein/

Whichever algorithm you go with, you'll wind up with some sort of
"similar" function which could be applied in a similar manner to Ben's
example (I've just mocked up the following -- it's not an actual
session):
>>> import soundex
>>> import metaphone
>>> import levenshtein
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane' ]
>>> found_suzys = [s for s in my_strings if soundex.sounds_similar(s, 'Susy')]
>>> found_suzys = [s for s in my_strings if metaphone.sounds_similar(s, 'Susy')]
>>> found_suzys = [s for s in my_strings if levenshtein.distance(s, 'Susy') < 4]
>>> found_suzys
['Susi', 'Susy'] (one hopes anyway!)
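
The session above is a mock-up, so the module and function names in it may not exist as written. As a rough, self-contained stand-in for the Levenshtein line, here is a pure-Python sketch of the standard dynamic-programming edit distance. The function name levenshtein and the cutoff of 1 are assumptions chosen for this example; note that with a cutoff of "< 4", 'Sally' (distance 3 from 'Susy') would also be picked up.

```python
# Pure-Python edit distance; a sketch, not the C module linked above.
def levenshtein(a, b):
    previous = list(range(len(b) + 1))   # distances from '' to prefixes of b
    for i, ca in enumerate(a, 1):
        current = [i]                    # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']
# The cutoff (<= 1) is an arbitrary choice for these short names.
found_suzys = [s for s in my_strings if levenshtein(s, 'Susy') <= 1]
print(found_suzys)  # prints ['Susi', 'Susy']
```

For names this short the cutoff matters a lot: each extra allowed edit admits many unrelated names.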


HTH,

Dave.
 
Edward Elliott

Dave said:
Another algorithm that might interest isn't based on "sounds-like" but
instead computes the number of transforms necessary to get from one
word to another: the Levenshtein distance. A C based implementation
(with Python interface) is available:

I don't know what algorithm it uses, but the difflib module looks similar.
I've had good results using the get_close_matches function to locate
similarly-named mp3 files.
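
As a small illustration of the difflib approach, the sketch below uses get_close_matches from the standard library on the sample list from earlier in the thread. The cutoff of 0.6 is the function's documented default, written out explicitly here; whether that value suits real name data is an open question.

```python
import difflib

names = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']

# n caps how many matches are returned; cutoff is the minimum similarity
# ratio (0.0 to 1.0) a candidate needs in order to be kept.
close = difflib.get_close_matches('Susie', names, n=len(names), cutoff=0.6)
print(close)
print(len(close))
```

The function returns the matches themselves, so counting is just a len() on the result.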

However I don't think "close enough" is well suited for this application.
The sequences are short and non-distinct. Difference matching needs longer
sequences to be effective. Phoneme matching seems overly complex and might
grab things like Tsu-zi. I'd just use a list of alternate spellings like
Ben suggested.
 
Dennis Lee Bieber

If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:

If this were a genealogy exercise, the suggestion would be to
convert all the names to Soundex, then look for Soundex matches.

And guess what -- it (the Soundex part) has already been written <G>

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213
 
Iain King

hawkesed said:
If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed

Dare I suggest using REs? This looks like something they'd be good
for:

import re

def countMatches(names, namePattern):
    count = 0
    for name in names:
        if namePattern.match(name):
            count += 1
    return count

susie = re.compile("Su(s|z)(i|ie|y)")

print(countMatches(["John", "Suzy", "Peter", "Steven", "Susie",
                    "Susi"], susie))


some other patterns:

iain = re.compile("(Ia(i)?n|Eoin)")
steven = re.compile("Ste(v|ph|f)(e|a)n")
john = re.compile("Jo(h)?n")
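
One thing worth knowing about the RE approach: match() only anchors at the start of the string, so a longer name that merely begins with a Susie spelling also counts. The sketch below is a variation, not Iain's code; it uses fullmatch() (available from Python 3.4) so the whole name has to fit the pattern. The name count_matches and the extra test name 'Susieanne' are made up for illustration.

```python
import re

# Same spellings as the pattern above, written with a character class.
susie = re.compile(r"Su[sz](?:ie|i|y)")

def count_matches(names, pattern):
    # fullmatch() requires the entire string to match the pattern.
    return sum(1 for name in names if pattern.fullmatch(name))

names = ["John", "Suzy", "Peter", "Steven", "Susie", "Susi", "Susieanne"]
print(count_matches(names, susie))  # prints 3; 'Susieanne' is not counted

# With match(), the prefix 'Susi' is enough for the original pattern:
print(bool(re.compile("Su(s|z)(i|ie|y)").match("Susieanne")))  # prints True
```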


Iain
 
John Machin

Phoneme matching seems overly complex and might
grab things like Tsu-zi.

It might *only* if somebody had a rush of blood to the head and devised
yet another phonetic key "algorithm". Tsuzi does *not* give the same
result as any of Suzi, Suzie, Susi, and Susie when pushed through any of
the following; Soundex, NYSIIS, Metaphone, Dolby, and Caverphone. None
of them throw away the 'T' sound.
 
John Machin

hawkesed said:
If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed

Dare I suggest using REs? This looks like something they'd be good
for:

import re

def countMatches(names, namePattern):
    count = 0
    for name in names:
        if namePattern.match(name):
            count += 1
    return count

susie = re.compile("Su(s|z)(i|ie|y)")

print(countMatches(["John", "Suzy", "Peter", "Steven", "Susie",
                    "Susi"], susie))


some other patterns:

iain = re.compile("(Ia(i)?n|Eoin)")
steven = re.compile("Ste(v|ph|f)(e|a)n")

What about Steffan, Etienne, Esteban, István, ... ?
john = re.compile("Jo(h)?n")

IMHO, the amount of hand-crafting that goes into a *general-purpose*
phonetic matching algorithm is already bordering on overkill. Your
method using REs would not appear to scale well at all.
 
Iain King

John said:
snip


What about Steffan, Etienne, Esteban, István, ... ?

well, obviously these could be included:
"(Ste(v|ph|f)(e|a)n|Steffan|Etienne|Esteban)", but the OP never said he
wanted to translate anything into another language. He just wanted to
catch variable spellings.
IMHO, the amount of hand-crafting that goes into a *general-purpose*
phonetic matching algorithm is already bordering on overkill. Your
method using REs would not appear to scale well at all.

Iain
 
John Machin

well, obviously these could be included:
"(Ste(v|ph|f)(e|a)n|Steffan|Etienne|Esteban)", but the OP never said he
wanted to translate anything into another language.

Neither did I. But if you have to cope with a practical situation like
where the birth certificate says István and the job application says
Steven and the foreman calls him Steve, you won't be stuffing about with
hand-crafted REs, one per popular given name. Could be worse: the punter
could have looked up a dictionary and changed his surname from Kovács to
Smith; believe me -- it happens.

Oh and if you cast your net as wide as the Pacific islands, chuck in
Sitiveni. That's enough examples. We won't go near Benjamin :)
 
Edward Elliott

John said:
IMHO, the amount of hand-crafting that goes into a *general-purpose*
phonetic matching algorithm is already bordering on overkill. Your
method using REs would not appear to scale well at all.

Also compare the readability of regular expressions in this case to a simple
list:
["Steven", "Stephen", "Stefan", "Stephan", ...]
 
Edward Elliott

John said:
It might *only* if somebody had a rush of blood to the head and devised
yet another phonetic key "algorithm". Tsuzi does *not* give the same
result as any of Suzi, Suzie, Susi, and Susie when pushed through any of
the following; Soundex, NYSIIS, Metaphone, Dolby, and Caverphone. None
of them throw away the 'T' sound.

Spelling isn't phonetic. The 't' character doesn't necessarily affect
pronunciation. Or it may affect pronunciation in a way the soundex
doesn't understand (think tonal languages). Latinizing foreign languages
raises all sorts of problems.

A soundex is only as good as its pronunciation database. It may work well
in many situations, but it isn't fool-proof.
 
Iain King

Edward said:
John said:
IMHO, the amount of hand-crafting that goes into a *general-purpose*
phonetic matching algorithm is already bordering on overkill. Your
method using REs would not appear to scale well at all.

Also compare the readability of regular expressions in this case to a simple
list:
["Steven", "Stephen", "Stefan", "Stephan", ...]

Somehow I'm the advocate for REs here, which: erg. But you have some
mighty convenient ellipses there...
compare:

steven = re.compile("Ste(v|ph|f|ff)(e|a)n")
steven = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan",
"Stephan", "Stefan", "Steffan"]

I know which I'd rather type. 'Course, if you can use a ready-built
list of names...

Iain
 
Edward Elliott

Iain said:
steven = re.compile("Ste(v|ph|f|ff)(e|a)n")
steven = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan",
"Stephan", "Stefan", "Steffan"]

I know which I'd rather type. 'Course, if you can use a ready-built
list of names...

Oh I agree, I'd rather *type* the former, but I'd rather *read* the
latter. :)
 
Edward Elliott

Iain said:
steven = re.compile("Ste(v|ph|f|ff)(e|a)n")

Also you can expand the RE a bit to improve readability:

re.compile("(Stev|Steph|Stef|Steff)(en|an)")
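
As a quick sanity check, and assuming the expanded pattern is meant to have balanced parentheses, i.e. "(Stev|Steph|Stef|Steff)(en|an)", the sketch below confirms that the compact RE, the expanded RE and the explicit list from earlier in the thread all accept the same eight spellings. The variable names and the two extra probe names are illustrative.

```python
import re

compact = re.compile(r"Ste(v|ph|f|ff)(e|a)n")
expanded = re.compile(r"(Stev|Steph|Stef|Steff)(en|an)")
spellings = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan",
             "Stephan", "Stefan", "Steffan"]

# fullmatch() is used so that the whole name must fit each pattern.
for name in spellings + ["Steve", "Esteban"]:
    print(name,
          bool(compact.fullmatch(name)),
          bool(expanded.fullmatch(name)),
          name in spellings)
```

All three forms agree on these inputs, so the choice between them is mostly about which one is easier to read and maintain.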
 
