Understanding Unicode Normalization Forms?

kcobra · Jun 2, 2008

The company I work for is building a website in German. The website
has a search engine for its content. My question is how to recognize
the various ways a user might represent an ISO-8859-1 character in his
search string.

Various articles suggest normalizing the content via one of the
Unicode Normalization Forms when the content is indexed into the
search engine. Then, when the search is run, you normalize the user's
search string via the same algorithm.
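
A minimal sketch of that index-time and query-time step, assuming Java 6 or later, where java.text.Normalizer is part of the standard library. The class name SearchKeys and the method toKey are hypothetical names chosen for illustration, and NFC is just one of the four forms; the important part is that both sides use the same one.

```java
import java.text.Normalizer;

final class SearchKeys {
    // Hypothetical helper: apply the SAME normalization form to indexed
    // content and to the user's query so equivalent spellings compare equal.
    static String toKey(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00C4ffin";  // single code point U+00C4
        String decomposed = "A\u0308ffin";  // 'A' followed by combining diaeresis U+0308

        System.out.println(precomposed.equals(decomposed));               // false
        System.out.println(toKey(precomposed).equals(toKey(decomposed))); // true
    }
}
```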

I pretty much understand all that, but what I don't understand is how
to map some of the user-entered representations of "special"
characters to the normalized form. For example, according to our
German product owner, a user who wants to search for "Äffin" might
enter "Äffin" or he might enter "Aeffin". The normalized form of
"Aeffin" is still "Aeffin", so how would the search engine ever find a
hit, since it indexed the normalized form of "Äffin"?
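
To make the mismatch concrete, here is a small sketch that compares two strings under each of the four normalization forms Java exposes. The class and method names are hypothetical. Normalization unifies the precomposed and decomposed spellings of "Äffin", but none of the forms turns "Aeffin" into "Äffin", because "ae" is a spelling convention rather than a Unicode equivalence.

```java
import java.text.Normalizer;

final class NormalizationLimits {
    // Returns true if the two strings become equal under at least one of
    // NFC, NFD, NFKC or NFKD.
    static boolean matchesUnderAnyForm(String a, String b) {
        for (Normalizer.Form form : Normalizer.Form.values()) {
            String na = Normalizer.normalize(a, form);
            String nb = Normalizer.normalize(b, form);
            if (na.equals(nb)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Precomposed vs. decomposed A-umlaut: canonically equivalent.
        System.out.println(matchesUnderAnyForm("\u00C4ffin", "A\u0308ffin")); // true
        // A-umlaut vs. the "Ae" spelling: not equivalent under any form.
        System.out.println(matchesUnderAnyForm("\u00C4ffin", "Aeffin"));      // false
    }
}
```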

Thanks for the help.
-Wade

kcobra · Jun 3, 2008

Thinking about this some more, I could certainly do a "one-off" where
I run some logic up front to convert 'Ä' to "ae" before I run the
content through the Unicode Normalization libraries. This seems a bit
hackish though as I would need to know the alternate version of every
high bit character in ISO-8859-1.
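
A minimal sketch of such an up-front step, limited to the German umlauts and the sharp s. The class name GermanFolding, the method fold, and the table contents are assumptions made for illustration. The sketch normalizes to NFC first so that decomposed input (a base letter plus combining diaeresis) is caught by the same table; that ordering is a choice of this sketch, not something the libraries require. Case folding is left out.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class GermanFolding {
    // Illustrative table: German convention for writing umlauts without them.
    private static final Map<Character, String> GERMAN = new HashMap<>();
    static {
        GERMAN.put('\u00C4', "Ae"); // A with diaeresis
        GERMAN.put('\u00D6', "Oe"); // O with diaeresis
        GERMAN.put('\u00DC', "Ue"); // U with diaeresis
        GERMAN.put('\u00E4', "ae"); // a with diaeresis
        GERMAN.put('\u00F6', "oe"); // o with diaeresis
        GERMAN.put('\u00FC', "ue"); // u with diaeresis
        GERMAN.put('\u00DF', "ss"); // sharp s
    }

    // Apply to both the indexed content and the query string.
    static String fold(String text) {
        // Compose first so "A" + U+0308 becomes U+00C4 and hits the table.
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(nfc.length() + 4);
        for (int i = 0; i < nfc.length(); i++) {
            char c = nfc.charAt(i);
            String replacement = GERMAN.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00C4ffin"));  // Aeffin
        System.out.println(fold("A\u0308ffin")); // Aeffin
        System.out.println(fold("Aeffin"));      // Aeffin
    }
}
```

With this in place, all three spellings of the search term produce the same key, so the index and the query agree.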

Also, I'm wondering if different languages that share the same
high-bit characters have different alternate versions. If so, this
would make the problem much harder. As you can probably tell, I am no
character encoding expert or multilingual speaker.
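
If the alternate spellings do differ by language, one way to cope is a separate table per language, selected by the language of the content being indexed. The sketch below is only an illustration of that structure: the class name, the language codes used as keys, and the table contents are assumptions. In particular, the "sv" entries reflect the common informal practice of simply dropping the dots in Swedish, and should be checked with a native speaker before being relied on.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class PerLanguageFolding {
    // Hypothetical per-language tables, keyed by ISO 639-1 language code.
    private static final Map<String, Map<Character, String>> TABLES = new HashMap<>();
    static {
        Map<Character, String> de = new HashMap<>();
        de.put('\u00E4', "ae");
        de.put('\u00F6', "oe");
        de.put('\u00FC', "ue");
        TABLES.put("de", de);

        // Assumption: illustrative only, not an authoritative Swedish rule.
        Map<Character, String> sv = new HashMap<>();
        sv.put('\u00E4', "a");
        sv.put('\u00F6', "o");
        TABLES.put("sv", sv);
    }

    static String fold(String text, String language) {
        Map<Character, String> table = TABLES.get(language);
        if (table == null) {
            throw new IllegalArgumentException("No folding table for: " + language);
        }
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(nfc.length() + 4);
        for (int i = 0; i < nfc.length(); i++) {
            char c = nfc.charAt(i);
            String replacement = table.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("b\u00E4r", "de")); // baer
        System.out.println(fold("b\u00E4r", "sv")); // bar
    }
}
```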

I know these questions are not directly related to Java but I was
hoping some of you had experience with these issues.

Thanks,
-Wade

Roedy Green · Jun 4, 2008

> Thinking about this some more, I could certainly do a "one-off" where
> I run some logic up front to convert 'Ä' to "ae" before I run the
> content through the Unicode Normalization libraries

That would be safer than going the other way. Otherwise you might
convert "Israel" to "Isräl".
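
A short sketch of why the reverse direction is risky. The class and method names are hypothetical. A blind replacement of "ae", "oe" and "ue" with umlauts cannot tell a real umlaut spelling from two letters that merely sit next to each other.

```java
final class ReverseMappingPitfall {
    // Naive reverse mapping: treats every "ae", "oe", "ue" as an umlaut.
    static String digraphsToUmlauts(String text) {
        return text.replace("ae", "\u00E4")
                   .replace("oe", "\u00F6")
                   .replace("ue", "\u00FC");
    }

    public static void main(String[] args) {
        // Intended case: "Aeffin" is untouched here because only lowercase
        // digraphs are mapped, while ordinary words are corrupted.
        System.out.println(digraphsToUmlauts("Israel").equals("Isr\u00E4l"));   // true
        System.out.println(digraphsToUmlauts("aktuell").equals("akt\u00FCll")); // true
    }
}
```

Mapping umlauts to their two-letter spellings only lengthens words that really contain an umlaut, so words such as "Israel" pass through unchanged.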
