kcobra
The company I work for is building a website in German. The website has
a search engine for its content. My question is how to recognize the
various "ways" a user might represent an ISO-8859-1 character in his
search string.
Various articles suggest normalizing the content via one of the
Unicode Normalization Forms when the content is indexed into the
search engine. Then when a search is run, you normalize the user's
search string via the same algorithm.
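As an illustration of that approach (a minimal sketch, not any particular search engine's implementation), Python's standard `unicodedata` module can apply the same normalization form to both the indexed text and the query, so that different code-point sequences for the same character compare equal:

```python
import unicodedata

# The same visible string "Äffin" can arrive as two different
# code-point sequences:
indexed = "\u00C4ffin"        # precomposed U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
query = "A\u0308ffin"         # "A" followed by U+0308 COMBINING DIAERESIS

# Raw comparison fails even though both render as "Äffin":
print(indexed == query)  # False

# Normalizing both sides with the same form (NFC here) makes them
# byte-for-byte identical, so the lookup succeeds:
nfc_indexed = unicodedata.normalize("NFC", indexed)
nfc_query = unicodedata.normalize("NFC", query)
print(nfc_indexed == nfc_query)  # True
```

The same idea holds for any of the normalization forms (NFC, NFD, NFKC, NFKD), as long as the indexer and the query processor agree on one.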
I pretty much understand all that, but what I don't understand is how
to map some of the user-entered representations of "special"
characters to the normalized form. For example, according to our
German product owner, if a user wants to search for "Äffin" he might
enter "Äffin" or he might enter "Aeffin". The normalized form of
"Aeffin" is still "Aeffin", so how would the search engine ever find a
hit, since it indexed the normalized form of "Äffin"?
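To make the gap concrete (a small demonstration of the problem, not a fix): even the compatibility decomposition form NFKD only splits "Ä" into "A" plus a combining diaeresis; no Unicode normalization form rewrites it as the German transliteration "Ae", so the two spellings never converge:

```python
import unicodedata

# NFKD decomposes "Ä" into "A" + U+0308 COMBINING DIAERESIS...
decomposed = unicodedata.normalize("NFKD", "\u00C4ffin")
print(decomposed == "A\u0308ffin")  # True

# ...but it never produces the transliterated spelling "Aeffin",
# so normalization alone cannot match the two forms:
print(decomposed == "Aeffin")  # False
```

Bridging that gap takes a separate transliteration or "folding" step (e.g. a mapping table for ä→ae, ö→oe, ü→ue, ß→ss) applied on top of normalization, which is the part the question is asking about.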
Thanks for the help.
-Wade