Luc The Perverse wrote:
....
For instance if I want to search for rückenwind, it will take me at least
several additional seconds to experiment with the ALT keys to find the right
character combination for ü. (Although learning the character codes is
easier than attempting to learn alternate German spellings.) It doesn't
mean I don't know that there is an accent there, it is just easier to type
ruckenwind.
And if you cringe at that, you might faint to see the approximate
transliterations that I have used to name my Russian MP3's!
You decided to go for the "accent-removing" approach, though there
are many different ways to achieve the functionality you're after.
To me this looks very similar to a spelling algorithm (even though
you're using several languages at once).
For every search word entered you could:
-check exact matches
-check "one typo" matches (eg "Pixes" instead of "Pixies")
-check "two typos" matches (eg "Spellshaker" instead of "Spellchecker")
-check all "sound-alike" words (using eg Soundex or Double-Metaphone)
-rank all the suggestions found using these various techniques by
their "closeness" to the search term entered (using Levenshtein
edit distance, implemented using a DP algorithm).
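
To make the ranking step concrete, here is a minimal sketch in
Python (my choice of language, nothing in the thread requires it).
The names levenshtein, rank_candidates and max_distance are made up
for the example, and the cutoff of 2 is just an assumption you would
tune:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    # prev[j] is the distance between the previous prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def rank_candidates(query, names, max_distance=2):
    """Return the names within max_distance of query, closest first."""
    q = query.lower()
    scored = [(levenshtein(q, name.lower()), name) for name in names]
    return [name for dist, name in sorted(scored) if dist <= max_distance]


names = ["Pixies", "Pink Floyd", "Rückenwind"]
print(rank_candidates("Pixes", names))  # -> ['Pixies']
```

Only two rows of the DP table are kept at any time, which is enough
when you need the distance itself and not the list of edits.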
This should let users enter really badly spelled band names/song
names and still find the song(s) they're after.
For example, my home-made spellchecker's first suggestion for
"paulitiquale" ("political" spelled in a strange French phonetic
way) is, sure enough, "political" (note that Un*x's aspell/ispell
work the same way).
Now of course this is a lot of work for a simple functionality, but
you could still use something similar but much, much simpler
(that could be implemented very easily):
- for every correct name you have in your database, you add an
entry to a hashmap, keyed by the name with all vowels removed
Pixies -> Pxs
Pink Floyd -> Pnk Fld
rückenwind -> rcknwnd
etc.
Then, when someone enters a search term, *if you don't find
an exact match*, you remove every vowel from the entered
search term and see whether the result corresponds to
something in your hashmap.
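
A rough sketch of that lookup in Python could look like the
following. The function names (skeleton, build_index, search) are
invented for the example. I am also assuming that 'y' counts as a
vowel (as in "Pink Floyd -> Pnk Fld"), that accented letters are
folded to their base letter first, and that keys are lowercased,
which the examples above don't do:

```python
import unicodedata
from collections import defaultdict

VOWELS = set("aeiouy")  # assumption: 'y' is treated as a vowel


def skeleton(text: str) -> str:
    """Lowercase, strip accents, then drop all vowels."""
    # NFKD splits "ü" into "u" plus a combining diaeresis, which is discarded
    decomposed = unicodedata.normalize("NFKD", text.lower())
    letters = (c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in letters if c not in VOWELS)


def build_index(names):
    """Map each vowel-less key to the list of names that produce it."""
    index = defaultdict(list)
    for name in names:
        index[skeleton(name)].append(name)
    return index


def search(query, names, index):
    """Exact (case-insensitive) match first, vowel-less lookup as a fallback."""
    exact = [n for n in names if n.lower() == query.lower()]
    if exact:
        return exact
    return index.get(skeleton(query), [])


names = ["Pixies", "Pink Floyd", "rückenwind"]
index = build_index(names)
print(skeleton("rückenwind"))              # -> rcknwnd
print(search("ruckenwind", names, index))  # -> ['rückenwind']
```

Each key maps to a list because several names can collapse to the
same consonant skeleton.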
You can get a little more fancy by checking if the entry (or
entries) in the hashmap really correspond to something close
to the entered search term by calculating the "edit distance"
between the two strings.
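
As a sketch of that extra check (again in Python, with made-up
names): confirm keeps only the hashmap candidates whose edit
distance to the search term is small relative to their length. The
max_ratio cutoff of 0.4 is purely an assumption to be tuned, and a
compact edit distance is included so the block runs on its own:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def confirm(query, candidates, max_ratio=0.4):
    """Keep candidates close enough to query, closest first."""
    kept = []
    for name in candidates:
        dist = edit_distance(query.lower(), name.lower())
        # the allowed distance grows with the length of the longer string
        if dist <= max_ratio * max(len(query), len(name)):
            kept.append((dist, name))
    return [name for dist, name in sorted(kept)]


# "Pixies" and "Epoxies" both lose their vowels to "pxs",
# but only one of them is close to what was typed
print(confirm("Pixes", ["Pixies", "Epoxies"]))  # -> ['Pixies']
```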
It really looks like what a spellchecker would do: the whole
point is not using a single technique, but a variety of
techniques, each one increasing the probability that
the search gives back a meaningful result.
FWIW,
Alex
P.S: this post directly edited in groups.google.com, without
bothering to copy&paste in a spellchecker