Non-English characters

Thiago Arrais · Feb 13, 2007

Has anyone seen a non-english characters library for Ruby walking
around? For now, I need to remove letter decorators (in other words,
'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
characters ('!etter' becomes 'etter').

Those are some pretty simple functions that I could write myself
(actually I already have), but it would be nice to use some better
tested code.

Cheers,

Thiago Arrais
--
Mergulhando no Caos - http://thiagoarrais.wordpress.com
Thoughts, ideas and musings on software development and
technology in general

Carlos · Feb 13, 2007

Thiago said:
Has anyone seen a non-english characters library for Ruby walking
around? For now, I need to remove letter decorators (in other words,
'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
characters ('!etter' becomes 'etter').

Those are some pretty simple functions that I could write myself
(actually I already have), but it would be nice to use some better
tested code.

You could use the "unicode" library, by Yoshida Masato.
http://www.yoshidam.net/Ruby.html

Example:

$ cat uni.rb
require 'unicode'
txt = 'ñ÷åòôùõéïðÁÓÄÆÇÈÊËÌ!@#*$%^&'
puts Unicode.decompose(txt).delete('^0-9A-Za-z')

$ ruby uni.rb
naoouoeiAOACEEEI

Good luck.

Xavier Noria · Feb 13, 2007

Thiago said:
Has anyone seen a non-english characters library for Ruby walking
around? For now, I need to remove letter decorators (in other words,
'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
characters ('!etter' becomes 'etter').

Those are some pretty simple functions that I could write myself
(actually I already have), but it would be nice to use some better
tested code.

The best approach I've seen[*] is to decompose and map to ASCII:

Iconv.iconv('ascii//ignore//translit', 'utf-8', str)

and then sanitize.

I think this is better than the technique that passes through Unicode
decomposition because it also handles ß (ss), € (EUR), æ (ae), œ (oe),
etc.
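
To illustrate the difference, here is a minimal sketch that uses String#unicode_normalize from Ruby's standard library (Ruby 2.2 and later, so newer than the Ruby used in this thread). It is not what Iconv does internally. The EXTRA table and the helper name asciify are assumptions made up for this example; the table covers only the four characters mentioned above.

```ruby
# Hypothetical table for letters that have no canonical decomposition,
# so decomposing alone would simply drop them.
EXTRA = {
  "\u00DF" => 'ss',  # ß
  "\u20AC" => 'EUR', # €
  "\u00E6" => 'ae',  # æ
  "\u0153" => 'oe'   # œ
}.freeze

# Hypothetical helper: expand the special cases first, then decompose
# and keep only ASCII letters and digits.
def asciify(str)
  expanded = str.gsub(Regexp.union(EXTRA.keys), EXTRA)
  expanded.unicode_normalize(:nfd).delete('^0-9A-Za-z')
end

# Decomposition alone loses the ß:
puts "stra\u00DFe".unicode_normalize(:nfd).delete('^0-9A-Za-z') # prints "strae"
# With the extra table it survives as "ss":
puts asciify("stra\u00DFe")                                     # prints "strasse"
```

The hand-written table is the weak point of this sketch: every non-decomposable letter you care about has to be listed explicitly, which is the work a transliterating converter does for you.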

-- fxn

[*] Seen in the source of the Rails plugin acts_as_friendly_param,
which in turn takes the idea from Mephisto.

Paul Duncan · Feb 13, 2007

Thiago Arrais said:
Has anyone seen a non-english characters library for Ruby walking
around? For now, I need to remove letter decorators (in other words,
'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
characters ('!etter' becomes 'etter').

I thought iconv transliteration might do this, but it doesn't:

require 'iconv'
i = Iconv.new('ascii//TRANSLIT//IGNORE', 'iso-8859-1')
i.iconv('ñ')
=> "?"

A bit of googling turns up the following:

* Text::Unaccent, a Perl module available via CPAN
(http://search.cpan.org/~ldachary/Text-Unaccent-1.08/Unaccent.pm)
* unac, a GNU utility (and library) that removes accents from
characters. (http://home.gna.org/unac/unac-man3.en.html)

Both work roughly the same way; they use iconv to convert the source
string to UTF-16BE, followed by a mapping table to map accented
characters to their non-accented equivalents.

The unac link above has a bit more information about how these mapping
tables are generated; basically they have a script that parses a Unicode
data file at build time and generates the mapping table.

The Unicode data file is available here:
http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt
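
As a rough illustration of the idea (not unac's actual script), the sketch below builds such a table from a few lines in the UnicodeData.txt format. The names SAMPLE, build_unaccent_table and unaccent are hypothetical, the sample is a small hand-picked excerpt rather than the whole file, and the decomposition is applied one level deep only.

```ruby
# Excerpt in the UnicodeData.txt format (semicolon-separated fields):
# field 0 is the code point, field 2 the general category,
# field 5 the decomposition mapping.
SAMPLE = <<~DATA
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  00E2;LATIN SMALL LETTER A WITH CIRCUMFLEX;Ll;0;L;0061 0302;;;;N;LATIN SMALL LETTER A CIRCUMFLEX;;00C2;;00C2
  00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;LATIN SMALL LETTER N TILDE;;00D1;;00D1
  0302;COMBINING CIRCUMFLEX ACCENT;Mn;230;NSM;;;;;N;NON-SPACING CIRCUMFLEX;;;;
  0303;COMBINING TILDE;Mn;230;NSM;;;;;N;NON-SPACING TILDE;;;;
DATA

# Build { accented code point => [base code points] } from the data.
def build_unaccent_table(data)
  marks = {}   # code points whose general category is a mark (Mn, Mc, Me)
  decomps = {} # code point => canonical decomposition
  data.each_line do |line|
    fields = line.chomp.split(';', -1)
    next if fields.size < 6
    code = fields[0].hex
    marks[code] = true if fields[2].start_with?('M')
    mapping = fields[5]
    # Tagged mappings such as "<compat> ..." are compatibility forms; skip them.
    next if mapping.empty? || mapping.start_with?('<')
    decomps[code] = mapping.split.map(&:hex)
  end
  table = {}
  decomps.each do |code, parts|
    base = parts.reject { |cp| marks[cp] }
    # Keep only entries where dropping the marks changed something.
    table[code] = base if base.size < parts.size
  end
  table
end

# Replace each character that has a table entry; leave the rest alone.
def unaccent(str, table)
  str.each_char.map { |ch| (table[ch.ord] || [ch.ord]).pack('U*') }.join
end

TABLE = build_unaccent_table(SAMPLE)
puts unaccent("\u00F1 and \u00E2", TABLE) # prints "n and a"
```

With the full data file read via File.read in place of SAMPLE, the same function should yield a much larger table, though characters whose decomposition is itself decomposable would need a recursive pass.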

So anyway, the answer to your question appears to be:

* If you're just converting a couple of characters one time, just
use a regular expression.
* If you're looking to convert an arbitrary number of characters one
time and have access to a machine with GNU tools, just use unac.
* If you're not particular about the language, use the Perl library.
* If you don't mind installing the unac library, I wrote a quick wrapper
for it. See below for more.
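
For the first option in the list, a minimal sketch could look like the following. The constants FROM and TO and the helper quick_clean are hypothetical names, and only the two characters from the original question are mapped; anything else that is not an ASCII letter or digit is dropped.

```ruby
# Hypothetical one-off mapping: each character in FROM is replaced by
# the character at the same position in TO.
FROM = "\u00F1\u00E2" # ñ, â
TO   = 'na'

def quick_clean(str)
  # Map the known accented letters, then drop everything else that is
  # not an ASCII letter or digit.
  str.tr(FROM, TO).gsub(/[^0-9A-Za-z]/, '')
end

puts quick_clean('!etter')      # prints "etter"
puts quick_clean("ma\u00F1ana") # prints "manana"
```

This only scales to a handful of characters; beyond that, one of the table-driven options is likely less error-prone.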

This has been discussed on ruby-talk before:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/96626

I wrote a quick binding for the unac library; you can grab it from here:

http://pablotron.org/files/unac-ruby-0.1.0.tar.gz (tarball)
http://pablotron.org/files/unac-ruby-0.1.0.tar.gz.asc (PGP Signature)
http://hg.pablotron.org/unac-ruby (Mercurial Repository)

If you're interested, I can probably write a pure-Ruby version
relatively quickly too.

--
Paul Duncan, pabs in #ruby-lang (OPN IRC)
http://www.pablotron.org/ OpenPGP Key ID: 0x82C29562


Xavier Noria · Feb 13, 2007

Paul Duncan said:
I thought iconv transliteration might do this, but it doesn't:

require 'iconv'
i = Iconv.new('ascii//TRANSLIT//IGNORE', 'iso-8859-1')
i.iconv('ñ')
=> "?"

Looks like your source code was not iso-8859-1, because it works:

require 'iconv'
puts Iconv.iconv('ascii//ignore//translit', 'iso-8859-1', "ñ")
=> ~n

It works in UTF-8 as well:

$KCODE = 'u'
require 'iconv'

puts Iconv.iconv('ascii//ignore//translit', 'utf-8', "ñ")
=> ~n

-- fxn
