removing diacritical marks

Discussion in 'Ruby' started by Paul Barry, Mar 17, 2006.

  1. Paul Barry

    Paul Barry Guest

    Hello Rubyists,

    I would like to remove the accent marks (a.k.a. diacritical marks) from a
    String. Assuming "line" is a String, this gets most of them:

    line.gsub!(/[ÀÁÂÃÄ]/,"A")
    line.gsub!(/[âãäàá]/,"a")
    line.gsub!(/[ÈÉÊË]/,"E")
    line.gsub!(/[êëèé]/,"e")
    line.gsub!(/[ÌÍÎÏ]/,"I")
    line.gsub!(/[îïìí]/,"i")
    line.gsub!(/[ÒÓÔÕÖ]/,"O")
    line.gsub!(/[ôõöòó]/,"o")
    line.gsub!(/[ÙÚÛÜ]/,"U")
    line.gsub!(/[ûüùú]/,"u")
    line.gsub!(/Ý/,"Y")
    line.gsub!(/ý/,"y")
    line.gsub!(/ñ/,"n")

    Is there an easier/better way to do this?
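
One way to shorten the list of substitutions above is to drive them from a single lookup table. The following is a minimal sketch, assuming Ruby 1.9 or later with UTF-8 source and strings; `GROUPS`, `ACCENT_MAP`, and `strip_accents_table` are illustrative names that do not appear in the thread, and the table covers only the characters listed in the post.

```ruby
# Sketch only: assumes Ruby 1.9+ with UTF-8 source and strings.
# GROUPS, ACCENT_MAP and strip_accents_table are illustrative names.
GROUPS = {
  "A" => "ÀÁÂÃÄ", "a" => "âãäàá",
  "E" => "ÈÉÊË",  "e" => "êëèé",
  "I" => "ÌÍÎÏ",  "i" => "îïìí",
  "O" => "ÒÓÔÕÖ", "o" => "ôõöòó",
  "U" => "ÙÚÛÜ",  "u" => "ûüùú",
  "Y" => "Ý",     "y" => "ý",
  "n" => "ñ"
}

# Invert the groups into a per-character lookup: "À" => "A", "Á" => "A", ...
ACCENT_MAP = GROUPS.each_with_object({}) do |(plain, accented), map|
  accented.each_char { |char| map[char] = plain }
end

def strip_accents_table(line)
  # Visit each non-ASCII character once; leave unmapped ones unchanged.
  line.gsub(/[^\x00-\x7F]/) { |char| ACCENT_MAP.fetch(char, char) }
end

puts strip_accents_table("Iñtërnâtiôn")  # prints "Internation"
```

Unlike the `gsub!` calls in the post, this sketch returns a new string instead of modifying `line` in place, and characters missing from the table (such as "Ñ") pass through untouched.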

     
    Paul Barry, Mar 17, 2006
    #1

  2. Dave Burt

    Dave Burt Guest

    Paul Barry wrote:
    > I would like to remove the accent marks (a.k.a. diacritical marks) from a
    > String. Assuming "line" is a String, this gets most of them:
    >
    > line.gsub!(/[ÀÁÂÃÄ]/,"A")
    > ...
    > Is there an easier/better way to do this?


    Yes. There's a potential problem with your way: if the accented characters
    are more than one byte each (as they are in a multi-byte encoding such as
    UTF-8) and the regexp is matching byte by byte, each byte will be replaced
    with an A: "À" => "AA".

    This is safer: line.gsub!(/À|Á|Â|Ã|Ä/,"A")
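
The byte-versus-character distinction described above can be illustrated with a short sketch. It assumes Ruby 2.0 or later (for `String#b`) and a UTF-8 source file; the byte-oriented regexp is written out by hand here to mimic a regexp engine that treats each byte separately, and is not code from the thread.

```ruby
# Sketch only: assumes Ruby 2.0+ (String#b) and a UTF-8 source file.
utf8   = "À"
latin1 = utf8.encode("ISO-8859-1")

puts utf8.bytesize    # 2 bytes in UTF-8 (0xC3 0x80)
puts latin1.bytesize  # 1 byte in ISO-8859-1 (0xC0)

# Byte-oriented matching: the class holds the individual bytes that make up
# "ÀÁÂÃÄ" in UTF-8 (0xC3 plus 0x80..0x84), so both bytes of "À" match.
bytewise = utf8.b.gsub(/[\xC3\x80-\x84]/n, "A")

# Character-oriented matching: the class holds five whole characters.
charwise = utf8.gsub(/[ÀÁÂÃÄ]/, "A")

puts bytewise  # AA
puts charwise  # A
```

On Ruby versions whose regexps are encoding-aware, the character class in the original post therefore matches whole characters, and the doubling only appears when matching is done on raw bytes.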

    I translated a method to do this from PHP earlier this year:
    http://tinyurl.com/q8hlg [Google Groups]

    Cheers,
    Dave
     
    Dave Burt, Mar 17, 2006
    #2

  3. Paul Battley

    Paul Battley Guest

    > I translated a method to do this from PHP earlier this year:
    > http://tinyurl.com/q8hlg [Google Groups]


    Here's a simpler version (hard-coded for UTF-8; it would need some
    tweaking for other encodings). It has a side effect of transliterating
    punctuation to ASCII as well, which may or may not be desirable.

    Paul

    ----

    $KCODE = 'u'
    require 'iconv'

    class String
      def strip_diacritics
        self.gsub(/[^\x20-\x7f]/) {
          Iconv.iconv('us-ascii//IGNORE//TRANSLIT', 'utf-8',
                      $&)[0].sub(/^[\^`'"~](?=[a-z])/i, '')
        }
      end
    end

    require 'test/unit'
    class TestStripDiacritics < Test::Unit::TestCase

      def test_upper_case
        assert_equal('AAAAA', 'ÀÁÂÃÄ'.strip_diacritics)
        assert_equal('EEEE', 'ÈÉÊË'.strip_diacritics)
        assert_equal('IIII', 'ÌÍÎÏ'.strip_diacritics)
        assert_equal('OOOOO', 'ÒÓÔÕÖ'.strip_diacritics)
        assert_equal('UUUU', 'ÙÚÛÜ'.strip_diacritics)
        assert_equal('Y', 'Ý'.strip_diacritics)
        assert_equal('N', 'Ñ'.strip_diacritics)
      end

      def test_lower_case
        assert_equal('aaaaa', 'âãäàá'.strip_diacritics)
        assert_equal('eeee', 'êëèé'.strip_diacritics)
        assert_equal('iiii', 'îïìí'.strip_diacritics)
        assert_equal('ooooo', 'ôõöòó'.strip_diacritics)
        assert_equal('uuuu', 'ûüùú'.strip_diacritics)
        assert_equal('y', 'ý'.strip_diacritics)
        assert_equal('n', 'ñ'.strip_diacritics)
      end

      def test_words
        assert_equal('Internationalizaetion',
                     'Iñtërnâtiônàlizætiøn'.strip_diacritics)
      end

      def test_punctuation
        # U+2014 EM DASH as UTF-8 bytes (byte 0x97 in the Windows-1252 mail)
        assert_equal('-', "\xE2\x80\x94".strip_diacritics)
        assert_equal("''", "''".strip_diacritics)
      end
    end
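
The code above relies on `$KCODE` and the `iconv` library, both of which date from Ruby 1.8. As a rough alternative for later versions, the sketch below uses Unicode decomposition instead of transliteration. It assumes Ruby 2.2 or later (for `String#unicode_normalize`) and UTF-8 input; `strip_diacritics_nfd` is an illustrative name, not a method from the thread, and this is a different technique from the Iconv one.

```ruby
# Sketch only: assumes Ruby 2.2+ (String#unicode_normalize) and UTF-8 input.
# strip_diacritics_nfd is an illustrative name, not from the thread.
def strip_diacritics_nfd(text)
  # NFD splits "é" into "e" + U+0301; \p{Mn} then matches the combining mark.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts strip_diacritics_nfd("ÀÁÂÃÄ")        # AAAAA
puts strip_diacritics_nfd("Iñtërnâtiôn")  # Internation
```

Because this only removes combining marks, characters with no canonical decomposition (such as "æ" and "ø") and non-ASCII punctuation are left as they are, so it would not pass `test_words` or `test_punctuation` above without an extra mapping step.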
     
    Paul Battley, Mar 17, 2006
    #3
