Understanding Unicode Normalization Forms?

kcobra

The company I work for is building a website in German. The website has a
search engine for its content. My question is how to recognize the
various "ways" a user might represent an ISO-8859-1 character in his
search string.

Various articles suggest normalizing the content via one of the
Unicode Normalization Forms when the content is indexed into the
search engine. Then, when the search is run, you normalize the user's
search string via the same algorithm.
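
To make the idea above concrete, here is a minimal sketch using `java.text.Normalizer` from the JDK. The class name `NormalizeDemo` and the helper `toIndexForm` are made up for illustration, and NFC is just one possible choice of form. It shows that normalization reconciles the precomposed and decomposed spellings of "Ä", but does nothing for the "Ae" spelling.

```java
import java.text.Normalizer;

final class NormalizeDemo {
    // Hypothetical helper: apply the same normalization form (NFC here)
    // to indexed content and to the user's search string.
    static String toIndexForm(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00C4ffin";  // "Ä" as a single code point
        String decomposed = "A\u0308ffin";  // "A" followed by a combining diaeresis

        // The raw strings differ, but their NFC forms are identical.
        System.out.println(precomposed.equals(decomposed));
        System.out.println(toIndexForm(precomposed).equals(toIndexForm(decomposed)));

        // "Aeffin" is plain ASCII, so normalization leaves it unchanged
        // and it still does not match the indexed form of "Äffin".
        System.out.println(toIndexForm("Aeffin").equals(toIndexForm(precomposed)));
    }
}
```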

I pretty much understand all that, but what I don't understand is how
to map some of the user-entered representations of "special"
characters to the normalized form. For example, according to our
German product owner, if a user wants to search for "Äffin" he might
enter "Äffin" or he might enter "Aeffin". The normalized form of
"Aeffin" is still "Aeffin", so how would the search engine ever find a
hit, since it indexed the normalized form of "Äffin"?

Thanks for the help.
-Wade
 
kcobra

Thinking about this some more, I could certainly do a "one-off" where
I run some logic up front to convert 'Ä' to "ae" before I run the
content through the Unicode Normalization libraries. This seems a bit
hackish, though, as I would need to know the alternate version of every
high-bit character in ISO-8859-1.
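
A rough sketch of that one-off, assuming only the German letters ä, ö, ü and ß need an alternate spelling. The class `GermanFold` and the method `searchKey` are hypothetical names, and the mapping table is an assumption rather than a complete list. Unlike the description above, this version composes to NFC first, so that a decomposed "A" + U+0308 is also caught by the replacement.

```java
import java.text.Normalizer;
import java.util.Locale;

final class GermanFold {
    // Hypothetical helper: builds the key used for both indexing and querying.
    static String searchKey(String s) {
        // Compose first so "A" + U+0308 becomes the single code point U+00C4.
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        // Lowercase so that one small table covers both cases.
        String lower = nfc.toLowerCase(Locale.GERMAN);
        // Assumed German transliterations; other languages may fold differently.
        return lower.replace("\u00E4", "ae")   // ä
                    .replace("\u00F6", "oe")   // ö
                    .replace("\u00FC", "ue")   // ü
                    .replace("\u00DF", "ss");  // ß
    }

    public static void main(String[] args) {
        // Both spellings produce the same key, so the query can match the index.
        System.out.println(searchKey("\u00C4ffin"));
        System.out.println(searchKey("Aeffin"));
    }
}
```

Because the same function is applied to the content and to the search string, it does not matter which spelling either side used.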

Also, I'm wondering if different languages that share the same
high-bit characters have different alternate versions. If so, this would
make the problem much harder. As you can probably tell, I am no
character-encoding expert or multilingual speaker.

I know these questions are not directly related to Java but I was
hoping some of you had experience with these issues.

Thanks,
-Wade
 
Roedy Green

Thinking about this some more, I could certainly do a "one-off" where
I run some logic up front to convert 'Ä' to "ae" before I run the
content through the Unicode Normalization libraries

That would be safer than going the other way. Otherwise you might
convert Israel to Isräl.
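
A small sketch of why the reverse direction is risky. The class `ReverseFoldHazard` and the method `unfold` are illustrative names only, standing in for a naive "ae" to "ä" rewrite.

```java
final class ReverseFoldHazard {
    // Naive reverse mapping: rewrites every "ae", "oe" and "ue" as an umlaut,
    // even in words where those letter pairs are not an umlaut spelling.
    static String unfold(String s) {
        return s.replace("ae", "\u00E4")
                .replace("oe", "\u00F6")
                .replace("ue", "\u00FC");
    }

    public static void main(String[] args) {
        System.out.println(unfold("Israel"));   // the "ae" here is not an umlaut
        System.out.println(unfold("aktuell"));  // neither is the "ue" here
    }
}
```

Folding 'Ä' to "ae" on both the index side and the query side never has to guess which letter pairs were meant as umlauts.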
 
