removing diacritical marks

Paul Barry

Hello Rubyists,

I would like to remove the accent marks (a.k.a. diacritical marks) from a
String. Assuming "line" is a String, this gets most of them:

line.gsub!(/[ÀÁÂÃÄ]/,"A")
line.gsub!(/[âãäàá]/,"a")
line.gsub!(/[ÈÉÊË]/,"E")
line.gsub!(/[êëèé]/,"e")
line.gsub!(/[ÌÍÎÏ]/,"I")
line.gsub!(/[îïìí]/,"i")
line.gsub!(/[ÒÓÔÕÖ]/,"O")
line.gsub!(/[ôõöòó]/,"o")
line.gsub!(/[ÙÚÛÜ]/,"U")
line.gsub!(/[ûüùú]/,"u")
line.gsub!(/Ý/,"Y")
line.gsub!(/ý/,"y")
line.gsub!(/ñ/,"n")

Is there an easier/better way to do this?

Dave Burt

Paul said:
I would like to remove the accent marks (a.k.a. diacritical marks) from a
String. Assuming "line" is a String, this gets most of them:

line.gsub!(/[ÀÁÂÃÄ]/,"A")
...
Is there an easier/better way to do this?

Yes. There's a potential problem with your way: if the accented characters
are encoded as more than one byte (as they are in UTF-8, for example) and the
regexp matches byte by byte, each byte will be replaced with an A: "À" => "AA".

This is safer: line.gsub!(/À|Á|Â|Ã|Ä/,"A")
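
To make the byte-versus-character point concrete, here is a minimal sketch for a newer Ruby (2.0 or later, where String#b exists), not something from the original posts. The helper names replace_bytewise and replace_charwise are made up for illustration, and the \u escapes are only there to keep the source ASCII-only.

```ruby
# "À" (U+00C0) is one character, but two bytes in UTF-8 and one in ISO-8859-1.
a_grave = "\u00C0"
puts a_grave.bytesize                        # UTF-8 byte count
puts a_grave.encode("ISO-8859-1").bytesize   # Latin-1 byte count

# Hypothetical helper: treat the string as raw bytes, the way a regexp with
# no multibyte awareness would. Both UTF-8 bytes of "À" (0xC3, 0x80) are in
# the class, so each one is replaced separately.
def replace_bytewise(str)
  str.b.gsub(/[\xC3\x80]/n, "A")
end

# Hypothetical helper: match whole characters. The range U+00C0..U+00C4
# covers À Á Â Ã Ä, so the character is replaced exactly once.
def replace_charwise(str)
  str.gsub(/[\u00C0-\u00C4]/, "A")
end

puts replace_bytewise(a_grave)
puts replace_charwise(a_grave)
```

The first helper reproduces the "AA" result described above; the second shows what a character-aware match gives.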

I translated a method to do this from PHP earlier this year:
http://tinyurl.com/q8hlg [Google Groups]

Cheers,
Dave
 
Paul Battley

I translated a method to do this from PHP earlier this year:

Here's a simpler version (hard-coded for UTF-8; it would need some
tweaking for other encodings). It has a side effect of transliterating
punctuation to ASCII as well, which may or may not be desirable.

Paul

----

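$KCODE = 'u'
require 'iconv'

class String
  def strip_diacritics
    # Transliterate every character outside printable ASCII, one at a time.
    self.gsub(/[^\x20-\x7f]/) {
      # Some iconv implementations render an accent as a leading ASCII mark
      # (e.g. 'e); drop that mark when it precedes a letter.
      Iconv.iconv('us-ascii//IGNORE//TRANSLIT', 'utf-8',
                  $&)[0].sub(/^[\^`'"~](?=[a-z])/i, '')
    }
  end
end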

require 'test/unit'
class TestStripDiacritics < Test::Unit::TestCase

  def test_upper_case
    assert_equal('AAAAA', 'ÀÁÂÃÄ'.strip_diacritics)
    assert_equal('EEEE', 'ÈÉÊË'.strip_diacritics)
    assert_equal('IIII', 'ÌÍÎÏ'.strip_diacritics)
    assert_equal('OOOOO', 'ÒÓÔÕÖ'.strip_diacritics)
    assert_equal('UUUU', 'ÙÚÛÜ'.strip_diacritics)
    assert_equal('Y', 'Ý'.strip_diacritics)
    assert_equal('N', 'Ñ'.strip_diacritics)
  end

  def test_lower_case
    assert_equal('aaaaa', 'âãäàá'.strip_diacritics)
    assert_equal('eeee', 'êëèé'.strip_diacritics)
    assert_equal('iiii', 'îïìí'.strip_diacritics)
    assert_equal('ooooo', 'ôõöòó'.strip_diacritics)
    assert_equal('uuuu', 'ûüùú'.strip_diacritics)
    assert_equal('y', 'ý'.strip_diacritics)
    assert_equal('n', 'ñ'.strip_diacritics)
  end

  def test_words
    assert_equal('Internationalizaetion',
                 'Iñtërnâtiônàlizætiøn'.strip_diacritics)
  end

  def test_punctuation
    # Em dash (U+2014, byte 0x97 in Windows-1252), written as UTF-8 bytes.
    assert_equal('-', "\xE2\x80\x94".strip_diacritics)
    assert_equal("''", "''".strip_diacritics)
  end
end
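
For anyone on a newer Ruby: Iconv was removed from the standard library in Ruby 2.0, so the code above will not load there as written. One possible alternative, sketched here under the assumption of Ruby 2.2+ and UTF-8 input, is to decompose the string with String#unicode_normalize and then delete the combining marks. The method name strip_diacritics_nfd is made up for this example, and this is a different technique from the Iconv transliteration above, not a drop-in replacement.

```ruby
# Hypothetical helper, not from the thread. Assumes Ruby 2.2+ and a string
# in a Unicode encoding such as UTF-8.
def strip_diacritics_nfd(str)
  # NFD splits each precomposed letter into a base letter followed by its
  # combining marks; \p{Mn} then matches those nonspacing marks.
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# "Iñtërnâtiônàlizætiøn", written with \u escapes to keep the source ASCII-only.
word = "I\u00F1t\u00EBrn\u00E2ti\u00F4n\u00E0liz\u00E6ti\u00F8n"
puts strip_diacritics_nfd(word)
```

Unlike the Iconv version, this leaves letters such as æ and ø untouched, because they have no canonical decomposition, and it does not transliterate punctuation.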
 
