RegExp to strip accents while ignoring case

Discussion in 'ASP .Net' started by Jon Maz, Jun 14, 2004.

  1. Jon Maz

    Jon Maz Guest

    Hi All,

    I want to strip the accents off characters in a string so that, for example,
    the (Spanish) word "práctico" comes out as "practico" - but ignoring case,
    so that "PRÁCTICO" comes out as "PRACTICO".

    What's the best way to do this?

    TIA,

    JON

    --------------------------------------------------

    PS First posted to aspmessageboard -
    http://www.aspmessageboard.com/forum/regularExpressions.asp?M=705936&T=705936&F=34&P=1 -
    no answers yet

    PPS The Javascript function that I'm porting to C# looks like this:

    function quitaAcentos(a) {
    re=new RegExp("á", "gi")
    a=a.replace(re, "A")
    re=new RegExp("é", "gi")
    a=a.replace(re, "E")
    re=new RegExp("í", "gi")
    a=a.replace(re, "I")
    re=new RegExp("ó", "gi")
    a=a.replace(re, "O")
    re=new RegExp("ú", "gi")
    a=a.replace(re, "U")
    re=new RegExp("à", "gi")
    a=a.replace(re, "A")
    re=new RegExp("è", "gi")
    a=a.replace(re, "E")
    re=new RegExp("é", "gi")
    a=a.replace(re, "E")
    re=new RegExp("ì", "gi")
    a=a.replace(re, "I")
    re=new RegExp("ò", "gi")
    a=a.replace(re, "O")
    re=new RegExp("ó", "gi")
    a=a.replace(re, "O")
    re=new RegExp("ù", "gi")
    a=a.replace(re, "U")
    re=new RegExp("â", "gi")
    a=a.replace(re, "A")
    re=new RegExp("ê", "gi")
    a=a.replace(re, "E")
    re=new RegExp("î", "gi")
    a=a.replace(re, "I")
    re=new RegExp("ô", "gi")
    a=a.replace(re, "O")
    re=new RegExp("û", "gi")
    a=a.replace(re, "U")
    re=new RegExp("ä", "gi")
    a=a.replace(re, "A")
    re=new RegExp("ë", "gi")
    a=a.replace(re, "E")
    re=new RegExp("ï", "gi")
    a=a.replace(re, "I")
    re=new RegExp("ö", "gi")
    a=a.replace(re, "O")
    re=new RegExp("ü", "gi")
    a=a.replace(re, "U")
    re=new RegExp(" ", "gi")
    a=a.replace(re, "")
    re=new RegExp("_", "gi")
    a=a.replace(re, "")
    re=new RegExp("ñ", "gi")
    a=a.replace(re, "N")

    return a
    }
    Jon Maz, Jun 14, 2004
    #1
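For reference, the per-character replacements in the function above can be collapsed into one lookup table and a single pass. This is a minimal JavaScript sketch, not the original poster's code: the names `ACCENT_MAP` and `stripAccentsKeepCase` are hypothetical, and unlike `quitaAcentos` it preserves the case of each letter (as the question asks) and leaves spaces and underscores alone.

```javascript
// Hypothetical lookup table: lowercase accented letter -> unaccented base letter.
const ACCENT_MAP = {
  "á": "a", "à": "a", "â": "a", "ä": "a",
  "é": "e", "è": "e", "ê": "e", "ë": "e",
  "í": "i", "ì": "i", "î": "i", "ï": "i",
  "ó": "o", "ò": "o", "ô": "o", "ö": "o",
  "ú": "u", "ù": "u", "û": "u", "ü": "u",
  "ñ": "n",
};

// Replace every accented letter in one pass, keeping its original case.
function stripAccentsKeepCase(text) {
  return text.replace(/[áàâäéèêëíìîïóòôöúùûüñ]/gi, (ch) => {
    const lower = ch.toLowerCase();
    const base = ACCENT_MAP[lower];
    if (base === undefined) {
      return ch; // not in the table: leave the character as it is
    }
    // The match was uppercase if lowercasing changed it.
    return ch === lower ? base : base.toUpperCase();
  });
}
```

The table only covers the letters listed in the original function, so any other accented letter passes through unchanged.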

  2. Hi Jon,

    I have no idea whether this works for all your cases, but what you essentially want is the basic ASCII characters from a string. I believe that accented characters are all in the extended ASCII set, and that just stripping away the most significant bit will leave you with the unaccented basic character. However, this varies with the code page. For all the characters I have tested, code page 1251 converts correctly, except for æ and Æ.

    string s = "áàäãâåéèëêíìïîóòöõôøúùüûýÿ";
    byte[] b = Encoding.GetEncoding(1251).GetBytes(s); // 8 bit characters
    string t = Encoding.ASCII.GetString(b); // 7 bit characters

    t == aaaaaaeeeeiiiioooooouuuuyy



    --
    Happy coding!
    Morten Wennevik [C# MVP]
    Morten Wennevik, Jun 14, 2004
    #2

  3. Hans Kesting

    Hans Kesting Guest

    "Morten Wennevik" <> wrote in message news:opr9k4y6o1klbvpo@morten_x.edunord...
    > Hi Jon,
    >
    > I have no idea if this works for all your cases, but, what you essentially want is the basic ASCII characters from a string. I
    > believe that accented characters are all in the extended ascii set and just stripping away the most significant bit will leave you
    > with the unaccented basic character. However, this varies with different code pages. But for all characters I have tested,
    > codepage 1251 will convert correctly except æ Æ
    >
    > string s = "áàäãâåéèëêíìïîóòöõôøúùüûýÿ";
    > byte[] b = Encoding.GetEncoding(1251).GetBytes(s); // 8 bit characters
    > string t = Encoding.ASCII.GetString(b); // 7 bit characters
    >
    > t == aaaaaaeeeeiiiioooooouuuuyy
    >
    >
    >
    > --
    > Happy coding!
    > Morten Wennevik [C# MVP]


    Morten,

    I have not tried your code, so it could still work. But the reason will then be the
    conversion within GetBytes/GetString, not your explanation!

    If it were just a matter of "stripping the most significant bit", then that bit could be
    thought of as meaning "use an accent". But that would mean that there is just one accented
    "a" (and clearly there are more).
    Or, to put it another way: stripping that bit is the same as subtracting 128 from the
    character code. If you start out with different codes (for the various accents), then you
    can't end up with just one "a".

    Hans Kesting
    Hans Kesting, Jun 14, 2004
    #3
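The objection in the post above is easy to check numerically. The sketch below is JavaScript rather than C#, and it assumes the Latin-1 values of the characters (which JavaScript's `charCodeAt` also returns for them); `clearHighBit` is a hypothetical helper. Each accented "a" has its own code, so clearing the top bit gives a different result for each.

```javascript
// Clear bit 7 of a single character's code, i.e. subtract 128 when that bit is set.
function clearHighBit(ch) {
  return String.fromCharCode(ch.charCodeAt(0) & 0x7f);
}

// Four different accented forms of "a": 0xE1, 0xE0, 0xE2, 0xE4 in Latin-1.
const accentedAs = ["á", "à", "â", "ä"];
const cleared = accentedAs.map(clearHighBit);

// Only "á" (0xE1) happens to land on "a" (0x61).
// The others become "`" (0x60), "b" (0x62) and "d" (0x64).
```

So whatever produces the unaccented letters in the C# snippet, it cannot be a plain bit mask.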
  4. You are correct. In fact, the conversion to 7-bit is entirely irrelevant, as the byte array already contains the unaccented characters. This strikes me as slightly odd, as I would expect the byte array to contain the characters in 8-bit form, using the 1251 character set.

    --
    Happy coding!
    Morten Wennevik [C# MVP]
    Morten Wennevik, Jun 14, 2004
    #4
  5. Jon Maz

    Jon Maz Guest

    Hi,

    >But for all characters I have tested,
    >codepage 1251 will convert correctly
    >except æ Æ


    Any reason to think there might be some other characters not covered by
    Morten's method?

    Thanks to all for the help!

    JON
    Jon Maz, Jun 15, 2004
    #5
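One way to look for characters that a given approach misses is to compare against Unicode decomposition, which is a different technique from the code-page conversion discussed in this thread. The JavaScript sketch below (the helper name `stripCombiningMarks` is hypothetical) decomposes each character with `normalize("NFD")` and drops the combining marks in the U+0300 to U+036F range. Letters without a canonical decomposition, such as æ, ø and ß, pass through unchanged and would need an explicit mapping.

```javascript
// Decompose accented letters into base letter + combining mark, then drop the marks.
function stripCombiningMarks(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Letters that have no canonical decomposition survive the round trip unchanged.
const candidates = [..."æøßłđ"];
const untouched = candidates.filter((ch) => stripCombiningMarks(ch) === ch);
```

In C#, `string.Normalize(NormalizationForm.FormD)` offers the same decomposition step.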