Text search with accented characters

Mickey Segal · Dec 15, 2005

Does Java have a method to take a string with accented characters and
convert it to unaccented characters? I want to search a big string for a
test string, ignoring accents on characters.

Doing the equivalent ignoring of case is simple:

String actualTestString = testString.toLowerCase();
String actualBigString = bigString.toLowerCase();
if (actualBigString.lastIndexOf(actualTestString) >= 0)
{
    // do stuff
}

In the Collator class I see a way of checking if two strings are equivalent,
disregarding both case and accents:

Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY); // ignore both case and accents
if (c.compare(oneString, otherString) == 0)
{
    // do stuff
}

However, I don't see a way of reducing the accented string to a simpler
string so I could search in a bigger string using a "toUnaccentedForm"
method instead of the toLowerCase method in the code above.

Is there a built-in method like "toUnaccentedForm" or some other approach
simpler than writing one's own version of lastIndexOf to ignore accents?

Oliver Wong · Dec 15, 2005

AFAIK, there is no built-in "toUnaccentedForm()". What you can do that
might be less painful than implementing your own lastIndexOf() is to build a
Map of characters that goes from the accented version to the unaccented
version, transform both strings using that map, and THEN do the
comparison.

- Oliver
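
The Map approach suggested above might look roughly like the sketch below. The class name AccentMapFolder, its method names, and the ten sample characters are assumptions made for illustration; they are not taken from the thread. Lower-casing first means the map only needs lower-case entries.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: names and the character list are illustrative assumptions.
class AccentMapFolder {
    // Maps an accented lower-case character to its unaccented replacement.
    private static final Map<Character, Character> UNACCENTED = new HashMap<>();
    static {
        // Assumed sample of ten characters, written as Unicode escapes so the
        // source compiles regardless of file encoding.
        String accented = "\u00e0\u00e2\u00e7\u00e9\u00e8\u00ea\u00eb\u00ee\u00f4\u00fb";
        String plain    = "aaceeeeiou";
        for (int i = 0; i < accented.length(); i++) {
            UNACCENTED.put(accented.charAt(i), plain.charAt(i));
        }
    }

    // Lower-cases the string, then replaces every mapped character.
    static String fold(String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            Character mapped = UNACCENTED.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }

    // Searches the folded big string for the folded test string.
    static int lastIndexOfFolded(String bigString, String testString) {
        return fold(bigString).lastIndexOf(fold(testString));
    }
}
```

A call such as AccentMapFolder.lastIndexOfFolded(bigString, testString) >= 0 would then take the place of the toLowerCase() comparison in the original question. Characters missing from the map pass through unchanged, so the map has to cover every accented character that can occur in the data.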

Mickey Segal · Dec 16, 2005

I came to the same conclusion, mapping the 10 non-standard lower-case
characters likely to come up in our database. Since I was also using
toLowerCase this also covered the upper-case forms.

I also fiddled around with writing my own equivalent of lastIndexOf() using
CollationElementIterator after finding an example at
http://icu.sourceforge.net/docs/papers/efficient_text_searching_in_java.html.
However, in practice that approach turned out to be painfully slow
when searching 1000 strings. In contrast, the approach of mapping 10
characters was very fast because the characters are very rare in our
database so the handling of accented characters did not slow down the
program much.
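
For comparison, a CollationElementIterator search can be sketched as below. This is not the code from the linked paper or from the post above; it is a simplified variant that reduces both strings to their primary collation orders and looks for one sequence inside the other. The class and method names are hypothetical, and the cast to RuleBasedCollator assumes the collator returned by the JDK is of that type. Elements whose primary order is zero are skipped, which drops accents but may also drop other characters the collator treats as ignorable.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a primary-strength substring test.
class PrimaryKeySearch {
    // Assumption: Collator.getInstance returns a RuleBasedCollator on this JDK.
    private static final RuleBasedCollator COLLATOR =
            (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);

    // Returns the non-zero primary orders of the string's collation elements.
    static List<Integer> primaryKeys(String s) {
        List<Integer> keys = new ArrayList<>();
        CollationElementIterator it = COLLATOR.getCollationElementIterator(s);
        int element;
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            int primary = CollationElementIterator.primaryOrder(element);
            if (primary != 0) {
                keys.add(primary); // case and accent differences are not primary
            }
        }
        return keys;
    }

    // True if the test string's primary keys occur contiguously in the big string's.
    static boolean containsIgnoringAccents(String bigString, String testString) {
        return Collections.indexOfSubList(primaryKeys(bigString), primaryKeys(testString)) >= 0;
    }
}
```

Every character of every string goes through the collator here, whether or not it carries an accent, which is one plausible reason this style of search costs more than a small character map applied to mostly unaccented data.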

Roedy Green · Dec 16, 2005

Mickey Segal said:
Does Java have a method to take a string with accented characters and
convert it to unaccented characters? I want to search a big string for a
test string, ignoring accents on characters.

There is one in Abundance, but I don't think I have seen one in Java.
The way you implement it is with a translate table. You index by
accented char to get unaccented. You might just implement it for low
numbered chars.
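
A translate table restricted to low-numbered chars could be sketched as follows. The class name TranslateTable, the method name unaccent, and the sample pairs are assumptions for illustration; the table covers only chars 0 to 255 and leaves everything else untouched.

```java
// Hypothetical sketch of an array-based translate table for chars 0..255.
class TranslateTable {
    // Index = original char, value = replacement char.
    private static final char[] TABLE = new char[256];
    static {
        // Start with the identity mapping so unlisted chars are unchanged.
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = (char) i;
        }
        // Assumed sample pairs: first char is accented, second is the replacement.
        String[] pairs = {
            "\u00c0A", "\u00c9E", "\u00e0a", "\u00e1a", "\u00e2a", "\u00e4a",
            "\u00e7c", "\u00e8e", "\u00e9e", "\u00eae", "\u00ebe", "\u00eei",
            "\u00efi", "\u00f1n", "\u00f4o", "\u00f6o", "\u00f9u", "\u00fbu",
            "\u00fcu"
        };
        for (String p : pairs) {
            TABLE[p.charAt(0)] = p.charAt(1);
        }
    }

    // Replaces each char below 256 by its table entry; higher chars pass through.
    static String unaccent(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            if (out[i] < TABLE.length) {
                out[i] = TABLE[out[i]];
            }
        }
        return new String(out);
    }
}
```

Compared with a Map, the array lookup avoids boxing and hashing, at the cost of handling only the range the table was sized for.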
