Non-English characters

Thiago Arrais

Has anyone seen a non-English characters library for Ruby walking
around? For now, I need to remove letter decorators (in other words,
'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
characters ('!etter' becomes 'etter').

Those are some pretty simple functions that I could write myself
(actually I already have), but it would be nice to use some better
tested code.

Cheers,

Thiago Arrais
--
Mergulhando no Caos - http://thiagoarrais.wordpress.com
Thoughts, ideas and musings on software development and
technology in general
 
Carlos

Thiago said:
Has anyone seen a non-english characters library for Ruby walking
around? For now, I need to remove letter decorators (in other words,
'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
characters ('!etter' becomes 'etter').

Those are some pretty simple functions that I could write myself
(actually I already have), but it would be nice to use some better
tested code.

You could use the "unicode" library, by Yoshida Masato.
http://www.yoshidam.net/Ruby.html

Example:

$ cat uni.rb
require 'unicode'
txt = 'ñ÷åòôùõéïðÁÓÄÆÇÈÊËÌ!@#*$%^&'
puts Unicode.decompose(txt).delete('^0-9A-Za-z')

$ ruby uni.rb
naoouoeiAOACEEEI
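
For readers on Ruby 2.2 or later, a similar decompose-then-delete approach can probably be sketched with the standard library alone. This is not the code from the post above: it swaps the "unicode" library for the built-in String#unicode_normalize, and the method name strip_accents is made up for illustration.

```ruby
# Sketch only: uses Ruby's built-in String#unicode_normalize (Ruby 2.2+)
# in place of the "unicode" library. strip_accents is a hypothetical name.
def strip_accents(text)
  # NFD splits 'ñ' into 'n' plus a combining tilde; the delete call then
  # drops every character that is not an ASCII letter or digit.
  text.unicode_normalize(:nfd).delete('^0-9A-Za-z')
end

puts strip_accents('ñ÷åòôùõéïðÁÓÄÆÇÈÊËÌ!@#*$%^&')
```

Note that characters without a decomposition, such as ð and Æ, are simply dropped by this approach.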


Good luck.
 
Xavier Noria

Thiago said:
Has anyone seen a non-English characters library for Ruby walking
around? For now, I need to remove letter decorators (in other words,
'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
characters ('!etter' becomes 'etter').

Those are some pretty simple functions that I could write myself
(actually I already have), but it would be nice to use some better
tested code.

The best approach I've seen[*] is to transliterate to ASCII with iconv:

Iconv.iconv('ascii//ignore//translit', 'utf-8', str)

and then sanitize.

I think this is better than the technique that passes through Unicode
decomposition because it also handles ß (ss), € (EUR), æ (ae), œ
(oe), etc.
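
To illustrate the difference without iconv, here is a rough sketch in plain Ruby (2.2+). It is not what iconv does internally: the SPECIAL_CASES table and both method names are hypothetical, and the table covers only the characters mentioned above.

```ruby
# Hypothetical, deliberately tiny table of characters that have no
# Unicode decomposition and therefore need an explicit replacement.
SPECIAL_CASES = {
  'ß' => 'ss', '€' => 'EUR', 'æ' => 'ae', 'Æ' => 'AE', 'œ' => 'oe', 'Œ' => 'OE'
}.freeze

# Decomposition alone: accents are stripped, but ß, æ, œ and € vanish.
def decompose_only(text)
  text.unicode_normalize(:nfd).delete('^0-9A-Za-z')
end

# Replace the special cases first, then strip accents and everything else.
def to_ascii(text)
  replaced = text.gsub(Regexp.union(SPECIAL_CASES.keys), SPECIAL_CASES)
  decompose_only(replaced)
end

puts decompose_only('Straße') # the ß is lost
puts to_ascii('Straße')       # the ß is spelled out as "ss"
puts to_ascii('cœur')
```

A real transliteration table would need many more entries; this only shows why decomposition by itself is not enough for these characters.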

-- fxn

[*] Seen in the source of the Rails plugin acts_as_friendly_param,
which in turn takes the idea from Mephisto.
 
Paul Duncan

* Thiago Arrais said:
Has anyone seen a non-English characters library for Ruby walking
around? For now, I need to remove letter decorators (in other words,
'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
characters ('!etter' becomes 'etter').

I thought iconv transliteration might do this, but it doesn't:

require 'iconv'
i = Iconv.new('ascii//TRANSLIT//IGNORE', 'iso-8859-1')
i.iconv('ñ')
=> "?"

A bit of googling turns up the following:

* Text::Unaccent, a Perl module available via CPAN
(http://search.cpan.org/~ldachary/Text-Unaccent-1.08/Unaccent.pm)
* unac, a GNU utility (and library) that removes accents from
characters. (http://home.gna.org/unac/unac-man3.en.html)

Both work roughly the same way: they use iconv to convert the source
string to UTF-16BE, then apply a mapping table that maps accented
characters to their non-accented equivalents.

The unac link above has a bit more information about how these mapping
tables are generated; basically, a script parses a Unicode
data file at build time and generates the mapping table.

The Unicode data file that the mapping table is generated from is
available here:
http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt
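
As a rough idea of what building such a table involves, here is a simplified sketch in Ruby. It is not unac's actual build script: the method names are hypothetical, the sample records are abbreviated lines in the UnicodeData.txt layout (semicolon-separated, code point in the first field, decomposition in the sixth), and only the first code point of each canonical decomposition is kept.

```ruby
# Abbreviated sample records in the UnicodeData.txt layout. Only field 0
# (code point) and field 5 (decomposition mapping) are used below.
SAMPLE = <<~DATA
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  00E2;LATIN SMALL LETTER A WITH CIRCUMFLEX;Ll;0;L;0061 0302;;;;N;;;00C2;;00C2
  00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;;;00D1;;00D1
DATA

# Hypothetical helper: maps each decomposable character to its base character.
def build_unaccent_table(data)
  table = {}
  data.each_line do |line|
    fields = line.chomp.split(';', -1)
    code = fields[0]
    decomposition = fields[5]
    next if decomposition.nil? || decomposition.empty?
    # Simplification: skip tagged (compatibility) mappings such as "<super> 0061".
    next if decomposition.start_with?('<')
    base = decomposition.split.first.hex
    table[[code.hex].pack('U')] = [base].pack('U')
  end
  table
end

# Hypothetical helper: looks up every character, leaving unknown ones alone.
def unaccent(text, table)
  text.each_char.map { |char| table.fetch(char, char) }.join
end

TABLE = build_unaccent_table(SAMPLE)
puts unaccent('mañana', TABLE)
```

A fuller version would read the whole data file and follow decompositions recursively; this sketch stops at one level to keep the idea visible.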

So anyway, the answer to your question appears to be:

* If you're just converting a couple of characters one time, just
use a regular expression.
* If you're looking to convert an arbitrary number of characters one
time and have access to a machine with GNU tools, just use unac.
* If you're not particular about the language, use the Perl library.
* If you don't mind installing the unac library, I wrote a quick wrapper
for it. See below for more.
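
For the first option, a minimal sketch might look like the following. It assumes Ruby 1.9 or later with UTF-8 source, handles only the two characters from the original question, and the method name simplify is made up for illustration.

```ruby
# Hypothetical helper covering only the two characters from the question.
def simplify(text)
  # tr maps each listed character to its plain counterpart; the regular
  # expression then removes anything that is not an ASCII letter or digit.
  text.tr('ñâ', 'na').gsub(/[^0-9A-Za-z]/, '')
end

puts simplify('!etter')
puts simplify('mañana')
```

This stays manageable for a handful of characters, but the character lists grow quickly once more languages are involved.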

This has been discussed on ruby-talk before:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/96626

I wrote a quick binding for the unac library; you can grab it from here:

http://pablotron.org/files/unac-ruby-0.1.0.tar.gz (tarball)
http://pablotron.org/files/unac-ruby-0.1.0.tar.gz.asc (PGP Signature)
http://hg.pablotron.org/unac-ruby (Mercurial Repository)

If you're interested, I can probably write a pure-Ruby version
relatively quickly too.

--
Paul Duncan, pabs in #ruby-lang (OPN IRC)
http://www.pablotron.org/ OpenPGP Key ID: 0x82C29562

 
Xavier Noria

Paul said:
I thought iconv transliteration might do this, but it doesn't:

require 'iconv'
i = Iconv.new('ascii//TRANSLIT//IGNORE', 'iso-8859-1')
i.iconv('ñ')
=> "?"

Looks like your source file was not saved as ISO-8859-1, because this works:

require 'iconv'
puts Iconv.iconv('ascii//ignore//translit', 'iso-8859-1', "ñ")
=> ~n

It works in UTF8 as well:

$KCODE = 'u'
require 'iconv'

puts Iconv.iconv('ascii//ignore//translit', 'utf-8', "ñ")
=> ~n
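
On current Ruby versions, where Iconv is no longer part of the standard library (it was removed in Ruby 2.0), the point about source encoding can be checked with String#encode. The sketch below is only an illustration: the variable names are made up, and the misread case is one plausible explanation for the "?" result, not something confirmed in this thread.

```ruby
# The single byte 0xF1 is 'ñ' when read as ISO-8859-1.
latin1 = [0xF1].pack('C').force_encoding('ISO-8859-1')
utf8 = latin1.encode('UTF-8')
puts utf8 == 'ñ'

# The same character saved as UTF-8 takes two bytes. Reading those bytes as
# ISO-8859-1 yields two unrelated characters, so a converter that is told
# the input is ISO-8859-1 never sees an 'ñ' at all.
misread = 'ñ'.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts misread

# String#encode can substitute unconvertible characters via :fallback.
puts utf8.encode('US-ASCII', fallback: { 'ñ' => 'n' })
```

The fallback hash here plays a role loosely similar to transliteration, but it only knows the characters you list in it.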

-- fxn
 
