Text search with accented characters

Discussion in 'Java' started by Mickey Segal, Dec 15, 2005.

  1. Mickey Segal

    Mickey Segal Guest

    Does Java have a method to take a string with accented characters and
    convert it to unaccented characters? I want to search a big string for a
    test string, ignoring accents on characters.

    Doing the equivalent ignoring of case is simple:

    String actualTestString = testString.toLowerCase();
    String actualBigString = bigString.toLowerCase();
    if (actualBigString.lastIndexOf(actualTestString) >= 0)
    {
        // do stuff
    }

    In the Collator class I see a way of checking if two strings are equivalent,
    disregarding both case and accents:

    // requires: import java.text.Collator;
    Collator c = Collator.getInstance();
    c.setStrength(Collator.PRIMARY); // ignore both case and accents
    if (c.compare(oneString, otherString) == 0)
    {
        // do stuff
    }

    However, I don't see a way of reducing the accented string to a simpler
    string so I could search in a bigger string using a "toUnaccentedForm"
    method instead of the toLowerCase method in the code above.

    Is there a built-in method like "toUnaccentedForm" or some other approach
    simpler than writing one's own version of lastIndexOf to ignore accents?
     
    Mickey Segal, Dec 15, 2005
    #1

  2. Oliver Wong

    Oliver Wong Guest

    "Mickey Segal" <> wrote in message
    news:...
    > Does Java have a method to take a string with accented characters and
    > convert it to unaccented characters? I want to search a big string for a
    > test string, ignoring accents on characters.
    >
    > Doing the equivalent ignoring of case is simple:
    >
    > String actualTestString = testString.toLowerCase();
    > String actualBigString = bigString.toLowerCase();
    > if (actualBigString.lastIndexOf(actualTestString) >= 0)
    > {
    > // do stuff
    > }
    >
    > In the Collator class I see a way of checking if two strings are
    > equivalent, disregarding both case and accents:
    >
    > Collator c = Collator.getInstance();
    > c.setStrength(Collator.PRIMARY); // ignore both case and accents
    > if (c.compare(oneString, otherString) == 0)
    > {
    > //do stuff
    > }
    >
    > However, I don't see a way of reducing the accented string to a simpler
    > string so I could search in a bigger string using a "toUnaccentedForm"
    > method instead of the toLowerCase method in the code above.
    >
    > Is there a built-in method like "toUnaccentedForm" or some other approach
    > simpler than writing one's own version of lastIndexOf to ignore accents?


    AFAIK, there is no built-in "toUnaccentedForm()". What you can do that
    might be less painful than implementing your own lastIndexOf() is to build a
    Map of characters that goes from the accented version to the unaccented
    version, then transform your string using that map, and THEN do the
    comparison.

    - Oliver
     
    Oliver Wong, Dec 15, 2005
    #2
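
    A minimal sketch of the Map approach suggested above, not code from any
    poster. The class name AccentMapSearch, the method names, and the ten
    mapped characters are all hypothetical; the table should be replaced with
    whatever accented characters actually occur in the data being searched.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class AccentMapSearch {
    // Hypothetical table of ten lower-case accented characters:
    // a-grave, a-acute, a-circumflex, a-diaeresis, c-cedilla,
    // e-grave, e-acute, e-circumflex, e-diaeresis, n-tilde.
    private static final Map<Character, Character> UNACCENTED = new HashMap<>();
    static {
        String accented = "\u00e0\u00e1\u00e2\u00e4\u00e7\u00e8\u00e9\u00ea\u00eb\u00f1";
        String plain    = "aaaaceeeen";
        for (int i = 0; i < accented.length(); i++) {
            UNACCENTED.put(accented.charAt(i), plain.charAt(i));
        }
    }

    // Lower-cases first, so the table only needs lower-case entries.
    static String toUnaccentedForm(String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            Character mapped = UNACCENTED.get(c);
            if (mapped != null) {
                sb.append(mapped.charValue());
            } else {
                sb.append(c); // characters not in the table pass through unchanged
            }
        }
        return sb.toString();
    }

    // Same shape as the toLowerCase() search in the original question.
    static boolean containsIgnoringAccents(String bigString, String testString) {
        return toUnaccentedForm(bigString).lastIndexOf(toUnaccentedForm(testString)) >= 0;
    }
}
```

    Because every mapping here is one character to one character, an index
    found in the transformed string is also a valid index into the original.
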

  3. Mickey Segal

    Mickey Segal Guest

    "Oliver Wong" <> wrote in message
    news:oymof.2297$lv3.1552@clgrps12...
    > AFAIK, there is no built-in "toUnaccentedForm()". What you can do that
    > might be less painful than implementing your own lastIndexOf() is to build
    > a Map of characters that goes from the accented version to the unaccented
    > version, then transform your string using that map, and THEN do the
    > comparison.


    I came to the same conclusion, mapping the 10 non-standard lower-case
    characters likely to come up in our database. Since I was also using
    toLowerCase, this also covered the upper-case forms.

    I also fiddled around with writing my own equivalent of lastIndexOf() using
    CollationElementIterator after finding an example at
    http://icu.sourceforge.net/docs/papers/efficient_text_searching_in_java.html.
    However, in practice that approach turned out to be painfully slow
    when searching 1000 strings. In contrast, the approach of mapping 10
    characters was very fast: the characters are very rare in our
    database, so the handling of accented characters did not slow down the
    program much.
     
    Mickey Segal, Dec 16, 2005
    #3
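
    For comparison, a rough sketch of what a CollationElementIterator-based
    search can look like. This is not the code from the linked paper or from
    the post above; it is a naive illustration whose class and method names
    are hypothetical. It assumes that Collator.getInstance(Locale.ENGLISH)
    returns a RuleBasedCollator, and it restarts the comparison at every
    offset, which hints at why this style of search can be slow.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CollatorSearch {
    // Collects the non-zero primary orders of a string; case and accent
    // differences live in the lower-strength parts and are dropped here.
    private static int[] primaryOrders(CollationElementIterator it) {
        List<Integer> keys = new ArrayList<>();
        int element;
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            int primary = CollationElementIterator.primaryOrder(element);
            if (primary != 0) {
                keys.add(primary);
            }
        }
        int[] result = new int[keys.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = keys.get(i);
        }
        return result;
    }

    // Returns the first offset in text where pattern matches at primary
    // strength, or -1 if there is no match.
    static int indexOfIgnoringAccents(String text, String pattern) {
        // Assumption: the returned Collator is a RuleBasedCollator.
        RuleBasedCollator collator =
                (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        int[] patternKeys = primaryOrders(collator.getCollationElementIterator(pattern));
        if (patternKeys.length == 0) {
            return 0;
        }
        CollationElementIterator it = collator.getCollationElementIterator(text);
        for (int start = 0; start < text.length(); start++) {
            it.setOffset(start);
            int matched = 0;
            while (matched < patternKeys.length) {
                int element = it.next();
                if (element == CollationElementIterator.NULLORDER) {
                    break; // ran out of text
                }
                int primary = CollationElementIterator.primaryOrder(element);
                if (primary == 0) {
                    if (matched == 0) {
                        break; // do not begin a match on an ignorable element
                    }
                    continue; // skip ignorable elements inside a match
                }
                if (primary != patternKeys[matched]) {
                    break;
                }
                matched++;
            }
            if (matched == patternKeys.length) {
                return start;
            }
        }
        return -1;
    }
}
```
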

  4. Roedy Green

    Roedy Green Guest

    On Thu, 15 Dec 2005 14:57:53 -0500, "Mickey Segal"
    <> wrote, quoted or indirectly quoted someone
    who said :

    >Does Java have a method to take a string with accented characters and
    >convert it to unaccented characters? I want to search a big string for a
    >test string, ignoring accents on characters.


    There is one in Abundance, but I don't think I have seen one in Java.
    The way you implement it is with a translate table. You index by
    accented char to get unaccented. You might just implement it for low
    numbered chars.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Dec 16, 2005
    #4
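
    A minimal sketch of the translate-table idea described above, limited to
    low-numbered characters (code points below 256). The class name, method
    name, and choice of characters are hypothetical, not taken from Abundance
    or from any poster's code.

```java
class TranslateTable {
    // Indexed by char value; entries default to the character itself.
    private static final char[] TABLE = new char[256];
    static {
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = (char) i;
        }
        // Upper-case accented Latin-1 letters and their unaccented bases,
        // listed group by group so the two strings line up.
        String accented = "\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5" + "\u00c7"
                + "\u00c8\u00c9\u00ca\u00cb" + "\u00cc\u00cd\u00ce\u00cf" + "\u00d1"
                + "\u00d2\u00d3\u00d4\u00d5\u00d6" + "\u00d9\u00da\u00db\u00dc" + "\u00dd";
        String plain = "AAAAAA" + "C" + "EEEE" + "IIII" + "N" + "OOOOO" + "UUUU" + "Y";
        for (int i = 0; i < accented.length(); i++) {
            char upper = accented.charAt(i);
            char base = plain.charAt(i);
            TABLE[upper] = base;
            // For these letters the lower-case form sits 0x20 above the
            // upper-case one, in Latin-1 as in ASCII.
            TABLE[upper + 0x20] = (char) (base + 0x20);
        }
    }

    // Replaces each character below 256 by its table entry; everything
    // else is left untouched.
    static String translate(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            if (out[i] < TABLE.length) {
                out[i] = TABLE[out[i]];
            }
        }
        return new String(out);
    }
}
```

    Both strings would be passed through translate() (and toLowerCase(), as
    in the original question) before calling lastIndexOf().
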
