Comparing 2 similar strings?

John Machin · May 19, 2005

Off-topic now, but you've made me curious.

Why is this a bad idea?

How would you handle the case of "barow" and "marow"? (Barrow and
marrow, naturally.) Without the first letter, they sound identical. Why is
throwing that information away a good thing?

Sorry if that was unclear. By "preserving the first letter", I meant
that in "standard" soundex, the first letter is not transformed into a
digit.

Karen -> K650
Kieran -> K650
(R->6, N->5; vowels->0 and then are squeezed out)

Now compare this:
Aaron -> A650
Erin -> E650

Bearing in mind that the usual application of soundex is "all or
nothing", the result is Karen == Kieran, but Aaron !== Erin, which is
at the very least extremely inconsistent.

A better phonetic-key creator would produce the same result for each
of the first pair, and for each of the second pair -- e.g. KARAN and
ARAN respectively.

Also consider Catherine vs Katherine.

Cheers,
John

Patrick TJ McPhee · May 20, 2005

% None of the other approaches make the mistake of preserving the first
% letter -- this alone is almost enough reason for jettisoning soundex.

Metaphone does, at least in the case of Aaron and Erin.

John Machin · May 20, 2005

% None of the other approaches make the mistake of preserving the first
% letter -- this alone is almost enough reason for jettisoning soundex.

Metaphone does, at least in the case of Aaron and Erin.

You're quite correct; moreover NYSIIS does that too.

Dunno what came over me, must be gamma rays or something

Steve Holden · May 20, 2005

Dennis said:
Since "soundex" is initial letter (consonant?) and a code for
the next three syllables (or close to it), really long multi-syllabic
names are effectively truncated...

Howe'er... When Maire Brennan releases an album as "Moya",
following sister's "Enya" (Eithne, as I seem to recall reading)... I'd
not attempt to pronounce most of the names you supply... "Dalziell"
doesn't look Celtic... "Culzean" almost looks Aztec/Mayan... "Ceilidh"
=> kay-lee?

Okay, I think I can manage bain sidhe and uisge (after too much
of the latter, I'll be seeing the former)

Well, as an Englishman who has spent a good deal of time in Scotland I'd
essay the following. If there are any Scots reading they can chastise me
with glee for my presumption.

Dalziell: "Da'y yell"
Drumnadrochit: "Dru'mnadro'ckit"
(but the Scots would insist you use
a gutteral for the "ch", I'm not sure
how to render that phonetically. It's
a bit like the sound made before spitting,
or the "G" in Guido in Dutch

.
Culzean: "Ka La'ne"
Ceilidh: "Ca'yli" (once had a border collie called
Ceilidh).
Concobarh: (is this the same as 'Conchobar'?)
Co'nnahwar

promoting-scottish-english-unity-ly y'rs - steve

Steve Holden · May 20, 2005

Chris said:
Fantastic suggestion. Here's a tiny piece of real-life test data:

compare the surnames "Mousaferiadis" and "McPherson".

Click to expand...

I remember a word processor, MultiMate, which used Soundex to do
matching for its suggestions for spelling correction. One of my
cow-orkers typed the word 'agains' -- it was supposed to be 'against'
but 'again' would also have been a sensible suggestion. MultiMate,
however, suggested neither of those reasonable words, it did suggest
"iguanas" amd "Utahns"...

(I wonder what it does with "Talliafero" and "Tolliver", and with
"Featherstone-Howe" and "Fanshaw"...)

The answer to the OP, which someone just pointed out to me on
comp.programming, is "edit distance" or "Levenshtein Distance"[1]. A
google search on either produces some good descriptions in the first few
links, including http://www.merriampark.com/ld.htm which has not only a
description of the algorithm but also source code in Java, C++ and
Visual Basic (no Awk, thought there are links to pages with
implementations in Perl, Python, Delphi and many more)...

[1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein'
in Schwaebisch (south German) fashion, but apparently the author of the
algorithm was one Vladimir I. Levenshtein and that is how he is credited
on the IEEE site. There are also a number of Google hits under the
'stein' spelling, though...

Chris C

And what's the Levenshtein distance between "levenshtein" and
"levenstein"? Ironic if it were large ...

regards
Steve

Skip Montanaro · May 20, 2005

Steve> (is this the same as 'Conchobar'?)

No, that's a trendy pub in Key West...

<wink>

Skip

Oleg Paraschenko · May 23, 2005

Hello William,

William Park said:
How do you compare 2 strings, and determine how much they are "close" to
each other?
...

If your strings are between 1 and 16 Kb, look at GetReuse SDK:

http://getreuse.com/sdk/

It has Perl and Python bindings. You might be interested in the
latter as you posted to comp.lang.python.

I can't disclosure the algorithm. Here is an excerpt from the
home page: The formula for the calculation of the similarity is
based on the scientific research. Any other "good" method of
calculations should produce results that are equivalent in some
terms to the GetReuse results.

Oleg Paraschenko · May 23, 2005

Hello,

Scott David Daniels said:
Here's a really weird idea: Measure the size difference between the
pair of strings compressed together and compressed separately.

The idea isn't weird. The only problem is that naive approach
failed. compress(a,b) != compress(b,a). Having an assumption
that compressing is a good approximation for the Kolmogorov
complexity, the correct formula is a bit more complicated.

nicolas.lehuen · May 25, 2005

William Park a écrit :

How do you compare 2 strings, and determine how much they are "close" to
each other? Eg.
aqwerty
qwertyb
are similar to each other, except for first/last char. But, how do I
quantify that?

I guess you can say for the above 2 strings that
- at max, 6 chars out of 7 are same sequence --> 85% max

But, for
qawerty
qwerbty
max correlation is
- 3 chars out of 7 are the same sequence --> 42% max

(Crossposted to 3 of my favourite newsgroup.)

Hi,

If you want to use phonetic comparison, here are some algorithms that
are reportedly more efficient than Soundex :

Double-Metaphone
NYSIIS
Phonex

Of course, phonetic algorithms have a lot of disadvantages, the main
one being that they know about one way to pronounce words (usually a
rather rigid, anglo-saxon way) which may not be the right way (hence
the examples given before for Gaellic surnames). But these ones are far
"better" than soundex.

Regards,

Nicolas Lehuen

Roger Binns · May 26, 2005

***Check out difflib, it's in the library.*** Perfect package for what
the OP wants AFAICT.

The method in difflib is okay, but doesn't do that good a job. It
is also relatively slow. My need for this was matching records in
BitPim (eg which entries just read from a cell phone correspond to
which entries just read from Outlook)

Here are some links for the OP as well as ready to run code at the
end.

A paper on various algorithms and their effectiveness on various data
sets.

http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf

They do have a Java library of all the algorithms by the same authors.

http://secondstring.sourceforge.net/

There is a group at the Australian National University with a project named
Febrl - Freely extensible biomedical record linkage. The idea is to be able
to work out if two records from the same or different sources are the same
person. All their code is in Python:

http://datamining.anu.edu.au/software/febrl/febrldoc/

The solution I settled on is using Jaro-Winkler. This is a quote from the
paper above:

A surprisingly good distance metric is a fast heuristic scheme, proposed
by Jaro (Jaro 1995; 1989) and later extended by Winkler (Winkler 1999).
This works almost as well as the Monge-Elkan scheme, but
is an order of magnitude faster

I did find a Python module that implemented this (it is part of the
python-levenshtein package). Unfortunately there were some issues
at the time dealing with some strings that may be unicode and some
not. There was one implementation I saw that took copies of both
strings and was replacing characters with asterisks. My data sets
include meaningful asterisks so that seriously munged things.

Ultimately I wrote my own implementation with the following properties:

- The arguments can be any combination of unicode and ordinary strings
- There are no special characters nor ones used internally as replacements
- A bitset is used for tracking so you use an eigth of the memory
- There is both Python and C code implementations so you don't have
to compile anything but can if you want

The code can be grabbed from:

http://cvs.sf.net/viewcvs.py/bitpim/bitpim/native/strings/

There is a detailed description at the begining of the Python implementation:

http://cvs.sf.net/viewcvs.py/bitpim/bitpim/native/strings/jarowpy.py?view=markup

Roger

Rob W. W. Hooft · May 28, 2005

After reading this thread, I have wrapped up a different approach,
probably not what you were looking for, but it is very good for what I
wanted: comparing a command string typed by a user with all possible
commands a program can accept, to be able to do typo-correction. The
method will return a floating point number between 0.0 (no match) and
len(s)+1.0 (if the strings are exactly equal). You can see the score as
"the number of equal letters".

I put the module at http://starship.python.net/crew/hooft/
It is called comparestrings.py.

It is a comparison algorithm as implemented by Reinhard Schneider and
Chris Sander for comparison of protein sequences, but implemented to
compare two ASCII strings.

The comparison makes use of a similarity matrix for letters: in the
protein case this is normally a chemical functionality similarity, for
our case this is a matrix based on keys next to each other on a US
Qwerty keyboard and on "capital letters are similar to their lowercase
equivalent"

The algorithm does not cut corners: it is sure to find the absolute best
match between the two strings.

No attempt has been made to make this efficient in time or memory. Time
taken and memory used is proportional to the product of the lengths of
the input strings. Use for strings longer than 25 characters is entirely
for your own risk.

Regards,

Rob Hooft

Irmen de Jong · May 28, 2005

Rob said:
After reading this thread, I have wrapped up a different approach,
probably not what you were looking for, but it is very good for what I
wanted: comparing a command string typed by a user with all possible
commands a program can accept, to be able to do typo-correction. The
method will return a floating point number between 0.0 (no match) and
len(s)+1.0 (if the strings are exactly equal). You can see the score as
"the number of equal letters".

[...]
Interesting, but the result is a bit puzzling.
Wouldn't it be easier if it is normalized to a float between 0.0 and 1.0?

Furthermore, I can also produce wrong(?) results:

$ python comparestrings.py
s1: test
s2: x
Score: -0.4

Minus 0.4... Is this a bug?

--Irmen

Rob W. W. Hooft · May 29, 2005

Irmen said:
Wouldn't it be easier if it is normalized to a float between 0.0 and 1.0?

Maybe. You could do that by ignoring "negative" values, and by dividing
by min(len(s1),len(s2))+1. For my application this is irrelevant, I only
need a scale to compare a single word to many different words, and pick
the one that stands out. I can relate the (unnormalized) score to the
number of typos in the word.

Furthermore, I can also produce wrong(?) results:

$ python comparestrings.py
s1: test
s2: x
Score: -0.4

Minus 0.4... Is this a bug?

Not in the procedure, it is a bug in my description. It was late at
night when I wrote it. What you are doing here is aligning the two words
like this:

test
--x-

the t opens a gap (score -0.3), e elongates the gap (-0.2), s and x are
adjacent on the keyboard (score 0.4) and t opens a gap (-0.3). Total
score is -0.4. If you compare "test" with "l" you even get a score of
-0.7. You can make arbitrarily low numbers by comparing long strings of
"a" characters with a single "l".

What this is telling you is that not only are the letters all different
(score 0.0), but the strings are not even equally long (hence <0.0). The
gap-open and gap-elongation penalties can be adjusted in the module. The
-0.3 and -0.2 values were set after some experimentation to make the
relative scores of different matches "feel" natural.

Comparing strings from the back?	91	Sep 4, 2012
C program: memory leak/ segmentation fault/ memory limit exceeded	0	Nov 12, 2022
Minimum Total Difficulty	0	Nov 15, 2023
comparing two lists, ndiff performance	3	Jan 30, 2008
Problem with comparing huge amount of strings	10	Oct 6, 2008
API design for Python 2 / 3 compatibility	3	Apr 13, 2013
Interning own classes like strings for speed and size?	11	Dec 27, 2010
How to connect villages with a specific amount of roads?	0	May 29, 2022

Comparing 2 similar strings?

John Machin

Patrick TJ McPhee

John Machin

Steve Holden

Steve Holden

Skip Montanaro

Oleg Paraschenko

Oleg Paraschenko

nicolas.lehuen

Roger Binns

Rob W. W. Hooft

Irmen de Jong

Rob W. W. Hooft

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads