Finding similar strings

D

Davor Cengija

Does anyone know of a (preferably Java) implementation of some algorithm
which compares two strings, or a string and set of strings, and returns some
kind of similarity factor or the most similar string, and everything
regarding to some rules (e.g. PC keyboard, cell phone, graphity palm,
etc...) The idea is to take a string and if marked as incorrect, to offer a
list of possible corrections? Maybe some kind of spelling checker
implementation/algorithm could help as well. Anything available in Java?

I have looked at Soundex and Metaphone but they are for english words and
how they sound, which is not exactly what I'm looking for. I need e.g.

Keyboard input:
Jaca
Should be
Java (c is close to v on an IBM keyboard)

Cellphone input:
Jaua
Should be
Java (u is close to v on a cellphone keyboard)

Thanks
 
M

Michael Borgwardt

Davor said:
I have looked at Soundex and Metaphone but they are for english words and
how they sound, which is not exactly what I'm looking for. I need e.g.

Keyboard input:
Jaca
Should be
Java (c is close to v on an IBM keyboard)

Seems to me like what you want is to calculate the Levenshtein distance. This
website has an explanation and even a Java implementation:
http://www.merriampark.com/ld.htm

Of course it would be very inefficient to run it on your complete dictionary
for every word. Maybe there's a variation somewhere that allows some sort of
lookup in a precompiled search tree.
 
D

Davor Cengija

Michael said:
Seems to me like what you want is to calculate the Levenshtein
distance. This website has an explanation and even a Java
implementation: http://www.merriampark.com/ld.htm

Will check, thanks!
(Checking....)

Yes, it might help. I could ignore the keyboard similarity for now.
Of course it would be very inefficient to run it on your complete
dictionary for every word. Maybe there's a variation somewhere that
allows some sort of lookup in a precompiled search tree.

Of course. The idea is to help people with fingers like sis-kebabs .-)) to
enter some predefined keywords more efficiently.

Thank you,
Davor
 
M

Michael Borgwardt

Davor said:
Will check, thanks!
(Checking....)

Yes, it might help. I could ignore the keyboard similarity for now.

I think it could be integrated by giving each difference a "weight"
that depends on your "keyboard similarity".
Of course. The idea is to help people with fingers like sis-kebabs .-)) to
enter some predefined keywords more efficiently.

If the number of keywords is small (i.e. up to a few hundred) and you want to
check only entries of one or a few words, then forget my comment about
efficiency, it won't be a problem.
 
T

Thomas Schodt

Michael said:
I think it could be integrated by giving each difference a "weight"
that depends on your "keyboard similarity".

Looking only at standard ASCII letters, there are some variations to
consider; QWERTY, AZERTY, QWERTZ, not to mention Dvorak...
 
R

Roedy Green

Looking only at standard ASCII letters, there are some variations to
consider; QWERTY, AZERTY, QWERTZ, not to mention Dvorak...

and for many more, see http://mindprod.com/jgloss/dsk.html

You can tell which sort of keyboard layout someone uses by his typos.
For example you will see DSK typists often confusing it/in/is
that/than which a spell checker cannot filter.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,767
Messages
2,569,570
Members
45,045
Latest member
DRCM

Latest Threads

Top