Reliable character encodings conversion

  • Thread starter Hubert Łępicki

Hubert Łępicki

Hi,

I am looking for a reliable and error-resistant way to convert character
encodings to UTF8. Input encodings vary, and I have quite good input
encoding detection in place.

I am using Iconv library wrapper to convert texts to UTF8, but it's
throwing "Iconv::IllegalSequence" exception. The problem is that input
texts are user-generated and sometimes have mixed character
encodings.

Does anyone have any experience with this kind of situation, or can
suggest alternative libraries?

Thanks,
Hubert

--
Regards,
Hubert Łępicki
[ http://hubertlepicki.com ]
 

James Gray

I am using Iconv library wrapper to convert texts to UTF8, but it's
throwing "Iconv::IllegalSequence" exception.

You can add a //TRANSLIT to the end of the "to" encoding to have Iconv
attempt to convert characters to reasonable equivalents in that
encoding. This is usually more helpful when your input is all one
encoding and just has some characters that won't translate well (like
a UTF-8 … going to ISO-8859-1).
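
For readers on Ruby 2.0 or later, where Iconv is no longer part of the standard library, a rough analogue of //TRANSLIT can be sketched with String#encode and its fallback: option. This is a swapped-in technique, not what Iconv does internally; the FALLBACKS table and the to_latin1 name are made up for illustration, and the table would need entries for whatever characters your input really contains.

```ruby
# Hypothetical transliteration table (an assumption, not part of any library):
# maps characters missing from ISO-8859-1 to plain stand-ins.
FALLBACKS = {
  "\u2026" => '...', # horizontal ellipsis
  "\u2019" => "'",   # right single quotation mark
  "\u201C" => '"',   # left double quotation mark
  "\u201D" => '"'    # right double quotation mark
}.freeze

# Converts UTF-8 text to ISO-8859-1. Characters that have no ISO-8859-1
# equivalent are looked up in FALLBACKS; anything not listed becomes "?".
def to_latin1(text)
  text.encode('ISO-8859-1', fallback: ->(char) { FALLBACKS.fetch(char, '?') })
end

latin1 = to_latin1("wait\u2026")
puts latin1.encoding
puts latin1.bytes.inspect
```

Unlike //TRANSLIT, nothing here is chosen automatically: every substitution is whatever the table says, which makes the lossy parts of the conversion explicit.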

Your case of mixed encodings is probably best handled with //IGNORE
instead, which asks Iconv to skip over any characters that cannot be
converted. You will lose some data with this, but it will convert
what it can.

You can also use //TRANSLIT//IGNORE to convert what can be converted
and skip the rest.
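
On Rubies without Iconv (it left the standard library in 2.0), the "skip what cannot be converted" behaviour can be approximated with String#scrub and the invalid:/undef: options of String#encode. The sketch below is only that, an approximation under the assumption that you know (or have guessed) the source encoding; the helper names are hypothetical.

```ruby
# Hypothetical helper: treats +bytes+ as UTF-8 and drops every byte sequence
# that is not valid UTF-8, similar in spirit to //IGNORE.
def utf8_dropping_invalid(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# Hypothetical helper: converts +bytes+ from a known source encoding to UTF-8.
# Anything invalid or unmappable is replaced by +marker+ (empty by default,
# so it is silently dropped) instead of raising an exception.
def to_utf8(bytes, from, marker = '')
  bytes.dup.force_encoding(from)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: marker)
end

puts utf8_dropping_invalid("caf\xE9 ok".b) # the stray Latin-1 byte is dropped
puts to_utf8("caf\xE9".b, 'ISO-8859-1')    # the same byte, converted properly
```

Passing a visible marker such as "?" instead of the empty string makes it easier to spot afterwards how much data was lost.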

Hope that helps.

James Edward Gray II
 

James Gray

Thanks, //IGNORE//TRANSLIT seems to help a bit - but it's not perfect.

You listed those backwards. Is that really what you tried? Does
reversing them make any difference?

James Edward Gray II
 

Marcin Raczkowski

You can use the rchardet library.

Here's what I use:

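require 'iconv'    # stdlib in Ruby 1.8/1.9; a separate gem since Ruby 2.0
require 'rchardet'

class String
  # Lazily guessed (or explicitly assigned) encoding name of this string.
  # Note: on Ruby 1.9+ this replaces the built-in String#encoding.
  def encoding
    @encoding ||= guess_encoding
  end

  def encoding=(new)
    @encoding = new
  end

  # Converts the string in place from its current encoding to +new+.
  def convert_to(new)
    replace(Iconv.iconv(new, encoding, self)[0])
    @encoding = new
    self
  end

  # Asks rchardet for its best guess and remembers it.
  # (The rchardet README documents CharDet.detect; adjust if guess is missing.)
  def guess_encoding
    @encoding = CharDet.guess(self)["encoding"]
  end

  # This enables "foo".convert 'us-ascii' => 'utf-8'
  def convert(hash)
    from = hash.keys[0]
    to = hash[from]
    replace(Iconv.iconv(to, from, self)[0])
    @encoding = to
    self
  end
end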

It handles the conversion pretty well :)
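
The snippet above guesses one encoding for the whole string, which may not be enough when a single text mixes encodings, as in the original question. One possible workaround, sketched here with only core String methods, is to decide line by line: keep lines that are already valid UTF-8 and convert the rest from an assumed legacy encoding. The mixed_to_utf8 name and the Windows-1252 default are assumptions for illustration; in practice the fallback could come from a detector's per-line guess.

```ruby
# Hypothetical helper: converts +bytes+ to UTF-8 one line at a time.
# Lines that are already valid UTF-8 are kept unchanged; every other line is
# assumed to be in +fallback+ and converted, with unmappable bytes replaced
# by U+FFFD rather than raising.
def mixed_to_utf8(bytes, fallback = 'Windows-1252')
  bytes.b.each_line.map do |line|
    utf8 = line.dup.force_encoding(Encoding::UTF_8)
    next utf8 if utf8.valid_encoding?

    line.force_encoding(fallback)
        .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end.join
end

# First line is UTF-8, second line is Windows-1252.
input = "za\u017C\u00F3\u0142\u0107\n".b + "caf\xE9\n".b
puts mixed_to_utf8(input)
```

This only helps when the encoding changes between lines; a line that mixes encodings internally would still be converted as a whole from the fallback.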
 
