Hi Hendrik!
Hendrik Maryns wrote:
....
Nce idea, but in the end you end up removing all letters: how about ñ
(Spanish), ý, ð, þ (Icelandic), ç (French), ß (German), s (Czech?), and
those with a v on top of them, I can't find them either...
I know... Which is why I said that the idea of a spellchecker is not
to use a single technique but a variety of techniques.
You've got a "result List<SortableString>", each technique adds
zero or more "propositions" to this List.
- You first try all correct matches
- You then try all "one letter typos" (one wrong letter, one letter
not entered, one letter typed too much)
- You then try all "two letters typos" (still easily doable)
- You then try all close "no vowels" words
- etc.
Then you rank all those propositions according to their "edit
distance" to correct band names/song names.
It's not computationally intensive: at billions operations/second
you can get "creative". (it can be memory consuming though if
you've got hundreds of thousands of songs, then you need to
bypass Java's object [HashMap and String just won't cut it] and
use your own data structure. I know, for I got to 22 bits per
word on a spellchecker able to deal with dictionaries handling
hundreds of thousands of words -- try to do that with Jazzy
And there's no "bad technique" either. Each trick simply can
lead to propositions.
Imagine a really bad technique adds "gigolo" to the list of
proposition when the user entered "gogle" (looking for "google"),
well... The "one letter typo" check will have found "google".
So propositions will contains ?????, ?????, gigolo, ?????, google, ????
Then they get sorted by edit distance to the user's entered search
string: google is close to gogle (edit distance of 1) while gigolo is
not that close to gogle. It doesn't mean that "gigolo" isn't proposed
but simply that "google" is proposed first.
I'll re-write what I wrote in the previous post: the whole point is
not using a single technique, but a variety of techniques. Each one
augmenting the probability that the search gives back a meaningfull
result.
And what to do if you want to look for this song: second on
http://en.wikipedia.org/wiki/Windowlicker#Track_listing ??
Ah, Aphex Twin
Well, for this one I admit I'm not so sure... I'd add a "special case"
and I would have the only spellchecker/search engine in the world
able to deal with that case!