Text search with accented characters

Mickey Segal

Does Java have a method to take a string with accented characters and
convert it to unaccented characters? I want to search a big string for a
test string, ignoring accents on characters.

Doing the equivalent ignoring of case is simple:

String actualTestString = testString.toLowerCase();
String actualBigString = bigString.toLowerCase();
if (actualBigString.lastIndexOf(actualTestString) >= 0)
{
    // do stuff
}

In the Collator class I see a way of checking if two strings are equivalent,
disregarding both case and accents:

Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY); // ignore both case and accents
if (c.compare(oneString, otherString) == 0)
{
    // do stuff
}

However, I don't see a way of reducing the accented string to a simpler
string so I could search in a bigger string using a "toUnaccentedForm"
method instead of the toLowerCase method in the code above.

Is there a built-in method like "toUnaccentedForm" or some other approach
simpler than writing one's own version of lastIndexOf to ignore accents?
 

Oliver Wong


AFAIK, there is no built-in "toUnaccentedForm()". What might be less
painful than implementing your own lastIndexOf() is to build a Map of
characters that goes from the accented version to the unaccented version,
transform your strings using that map, and THEN do the comparison.

- Oliver
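
A minimal sketch of the Map idea above, not code from the thread. The class name AccentMap, the method names, and the particular characters in the table are all hypothetical; a real table would hold whatever accented characters actually occur in your data.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class AccentMap {
    // Hypothetical table: a few lower-case accented letters (written as
    // Unicode escapes) mapped to their unaccented forms. Extend as needed.
    private static final Map<Character, Character> UNACCENTED = new HashMap<>();
    static {
        UNACCENTED.put('\u00e0', 'a'); // a with grave
        UNACCENTED.put('\u00e1', 'a'); // a with acute
        UNACCENTED.put('\u00e7', 'c'); // c with cedilla
        UNACCENTED.put('\u00e8', 'e'); // e with grave
        UNACCENTED.put('\u00e9', 'e'); // e with acute
        UNACCENTED.put('\u00ef', 'i'); // i with diaeresis
        UNACCENTED.put('\u00f1', 'n'); // n with tilde
        UNACCENTED.put('\u00f6', 'o'); // o with diaeresis
        UNACCENTED.put('\u00fc', 'u'); // u with diaeresis
    }

    // Replaces every character found in the table; leaves the rest alone.
    static String toUnaccentedForm(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Character plain = UNACCENTED.get(c);
            sb.append(plain != null ? plain.charValue() : c);
        }
        return sb.toString();
    }

    // Lower-case first so the table only needs the lower-case forms.
    static boolean containsIgnoringCaseAndAccents(String big, String test) {
        String b = toUnaccentedForm(big.toLowerCase(Locale.ROOT));
        String t = toUnaccentedForm(test.toLowerCase(Locale.ROOT));
        return b.lastIndexOf(t) >= 0;
    }
}
```

Because both strings are lower-cased before the lookup, the table does not need separate entries for the upper-case accented letters.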
 

Mickey Segal

Oliver Wong said:
AFAIK, there is no built-in "toUnaccentedForm()". What might be less
painful than implementing your own lastIndexOf() is to build a Map of
characters that goes from the accented version to the unaccented version,
transform your strings using that map, and THEN do the comparison.

I came to the same conclusion, mapping the 10 non-standard lower-case
characters likely to come up in our database. Since I was also using
toLowerCase, this covered the upper-case forms as well.

I also fiddled around with writing my own equivalent of lastIndexOf() using
CollationElementIterator after finding an example at
http://icu.sourceforge.net/docs/papers/efficient_text_searching_in_java.html.
However, in the real world that approach turned out to be painfully slow
when searching 1000 strings. In contrast, the approach of mapping 10
characters was very fast: the characters are very rare in our database, so
the handling of accented characters did not slow down the program much.
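
For comparison, here is a rough sketch of what a CollationElementIterator-based search can look like. It is not the code from the linked paper or the code described above; the class and method names are hypothetical, and it assumes Collator.getInstance(Locale.ENGLISH) returns a RuleBasedCollator, which the JDK does not guarantee. It compares only the primary order of each collation element, so case and accent differences drop out.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PrimaryStrengthSearch {
    // Collects the primary order of every collation element in s,
    // skipping elements whose primary order is 0 (ignorable at this level).
    static List<Integer> primaryOrders(RuleBasedCollator collator, String s) {
        CollationElementIterator it = collator.getCollationElementIterator(s);
        List<Integer> orders = new ArrayList<>();
        int element;
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            int primary = CollationElementIterator.primaryOrder(element);
            if (primary != 0) {
                orders.add(primary);
            }
        }
        return orders;
    }

    // True if test occurs somewhere in big when only primary
    // differences (base letters) are taken into account.
    static boolean contains(String big, String test) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        if (!(c instanceof RuleBasedCollator)) {
            // Assumption violated: no element iterator available.
            throw new UnsupportedOperationException("not a RuleBasedCollator");
        }
        RuleBasedCollator rbc = (RuleBasedCollator) c;
        List<Integer> haystack = primaryOrders(rbc, big);
        List<Integer> needle = primaryOrders(rbc, test);
        return Collections.lastIndexOfSubList(haystack, needle) >= 0;
    }
}
```

This version walks every collation element of the big string on each call and boxes each one into a list, so it does far more work per search than a plain lastIndexOf() on a pre-mapped string.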
 

Roedy Green

Does Java have a method to take a string with accented characters and
convert it to unaccented characters? I want to search a big string for a
test string, ignoring accents on characters.

There is one in Abundance, but I don't think I have seen one in Java.
The way you implement it is with a translate table: you index by
accented char to get the unaccented one. You might just implement it for
low-numbered chars.
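
A translate table along those lines might be sketched as below. This is an illustration, not the Abundance implementation; the class name TranslateTable and its helper are hypothetical, and the table is assumed to cover only chars 0 to 255 (Latin-1), with everything above that passed through unchanged.

```java
class TranslateTable {
    // One entry per char in the range 0..255; index by char to get
    // the unaccented replacement.
    private static final char[] TABLE = new char[256];
    static {
        // Start with the identity mapping.
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = (char) i;
        }
        // Latin-1 accented letters, by code point range.
        mapRange(0x00C0, 0x00C5, 'A');
        mapRange(0x00C7, 0x00C7, 'C');
        mapRange(0x00C8, 0x00CB, 'E');
        mapRange(0x00CC, 0x00CF, 'I');
        mapRange(0x00D1, 0x00D1, 'N');
        mapRange(0x00D2, 0x00D6, 'O');
        mapRange(0x00D9, 0x00DC, 'U');
        mapRange(0x00DD, 0x00DD, 'Y');
        mapRange(0x00E0, 0x00E5, 'a');
        mapRange(0x00E7, 0x00E7, 'c');
        mapRange(0x00E8, 0x00EB, 'e');
        mapRange(0x00EC, 0x00EF, 'i');
        mapRange(0x00F1, 0x00F1, 'n');
        mapRange(0x00F2, 0x00F6, 'o');
        mapRange(0x00F9, 0x00FC, 'u');
        mapRange(0x00FD, 0x00FD, 'y');
        mapRange(0x00FF, 0x00FF, 'y');
    }

    // Points every char in [from, to] at the same plain letter.
    private static void mapRange(int from, int to, char plain) {
        for (int i = from; i <= to; i++) {
            TABLE[i] = plain;
        }
    }

    // Chars outside the table are copied through untouched.
    static String translate(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            if (out[i] < TABLE.length) {
                out[i] = TABLE[out[i]];
            }
        }
        return new String(out);
    }
}
```

Since this table keeps upper and lower case apart, you would still call toLowerCase() on both strings before searching if case should be ignored too.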
 
