[ENCODING] UTF8 hell

Xavier Noëlle · Feb 2, 2010

Hello,
I'm trying to deal with Ruby flaws with encoding, which I thought
would be almost past with Ruby 1.9. I managed to find a solution for
Ruby 1.8 and thought I did for Ruby 1.9...but in fact, no !

I fetch rows from an UTF8 database and try to work with the string. To
do so, I would like it to be UTF8 encoded.

"str.encoding()" gives me "ASCII-8BIT"...so, I thought one of these
lines would solve the problem
str.replace(Iconv.iconv("UTF8", "ascii", self).join())
OR
self.encode!('UTF-8')

But they don't !
First one: in `iconv': "\xE8te pour luth" (Iconv::IllegalSequence)
Second one: in `encode!': "\xE8" from ASCII-8BIT to UTF-8
(Encoding::UndefinedConversionError)

The base string is "Oeuvre complète pour luth" and displays well in
PHPMyAdmin.

Any idea ?
TIA,

--
Xavier NOELLE

Stefano Crocco · Feb 2, 2010

|Hello,
|I'm trying to deal with Ruby flaws with encoding, which I thought
|would be almost past with Ruby 1.9. I managed to find a solution for
|Ruby 1.8 and thought I did for Ruby 1.9...but in fact, no !
|
|I fetch rows from an UTF8 database and try to work with the string. To
|do so, I would like it to be UTF8 encoded.
|
|"str.encoding()" gives me "ASCII-8BIT"...so, I thought one of these
|lines would solve the problem
|str.replace(Iconv.iconv("UTF8", "ascii", self).join())
|OR
|self.encode!('UTF-8')
|
|But they don't !
|First one: in `iconv': "\xE8te pour luth" (Iconv::IllegalSequence)
|Second one: in `encode!': "\xE8" from ASCII-8BIT to UTF-8
|(Encoding::UndefinedConversionError)
|
|The base string is "Oeuvre complète pour luth" and displays well in
|PHPMyAdmin.
|
|Any idea ?
|TIA,

I'm not sure, but based on my experience, it may be that the strings are
indeed stored as UTF-8, but the library you use to read from the database
doesn't inform Ruby of that fact, so Ruby assumes each one is a generic
array of bytes (that is, Ruby thinks the string has encoding ASCII-8BIT,
which is the same as BINARY).

If this is the case, you don't need to transcode the string (which is what
encode does), but simply tell Ruby what the correct encoding is, using the
force_encoding method.
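
To make the difference concrete, here is a minimal sketch. It assumes the database really does hand back UTF-8 bytes that are merely labelled as binary; the sample bytes are made up for illustration and are not taken from the original poster's data.

```ruby
# Bytes of "complète" as UTF-8, labelled BINARY to imitate what the
# database library seems to return (an assumption for this sketch).
raw = "compl\xC3\xA8te".dup.force_encoding(Encoding::BINARY)

# encode tries to *convert* the bytes. BINARY has no character mapping
# for bytes above 127, so the conversion raises.
transcode_error = nil
begin
  raw.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  transcode_error = e.class
end

# force_encoding only changes the label; the bytes are left untouched.
labelled = raw.dup.force_encoding('UTF-8')

puts transcode_error.inspect
puts labelled.valid_encoding?
puts labelled.bytes.to_a == raw.bytes.to_a
```

This only helps when the bytes really are UTF-8. If they are not, the relabelled string fails later, on the first operation that has to interpret characters.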

I hope this helps

Stefano

David Palm · Feb 2, 2010

|I fetch rows from an UTF8 database and try to work with the string. To
|do so, I would like it to be UTF8 encoded.

There are several pieces to this. Even if the DB encoding and collation are utf8, double-check that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app, I think).

|self.encode!('UTF-8')

str.force_encoding('UTF-8') is what you want to use I think.

Xavier Noëlle · Feb 2, 2010

2010/2/2 David Palm said:
There are several pieces to this. Even if the DB encoding and collation is utf8, doublecheck that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app I think).

Not a Rails app

str.force_encoding('UTF-8') is what you want to use I think.

I already tried this method, but it led me to the following error: in
`downcase!': invalid byte sequence in UTF-8 (ArgumentError).

This is due to a call to str.downcase!() later in the application.

Any idea to solve this ?

Robert Klemme · Feb 2, 2010

is utf8, doublecheck that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app I think).

Not a Rails app

I already tried this method, but it lead me to the following error: in
`downcase!': invalid byte sequence in UTF-8 (ArgumentError).

This is due to a call to str.downcase!() later in the application.

Any idea to solve this ?

You probably first want to find out whether the byte sequence is valid
UTF-8 or not. For that you would need to look at the bytes in the
String. I guess chances are that your String's byte sequence is NOT
valid UTF-8 OR you have a character in the string that has no
lowercase representation.
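
One way to do that check is sketched below. The sample string is rebuilt from the "\xE8te pour luth" fragment in the error message earlier in the thread, so treat it as an assumption about what the database actually returned.

```ruby
# Rebuild the suspect string and label it UTF-8, as force_encoding would.
suspect = "compl\xE8te pour luth".dup.force_encoding('UTF-8')

# valid_encoding? reports whether the bytes are well-formed for the label.
puts suspect.valid_encoding?

# Dump every byte in hex so the non-ASCII ones stand out.
puts suspect.bytes.map { |b| format('%02X', b) }.join(' ')

# Keep only the bytes with the high bit set (anything above 127).
high = suspect.bytes.select { |b| b >= 0x80 }
puts high.inspect
```

A single high byte surrounded by plain ASCII, as here, usually points to a single-byte encoding such as ISO-8859-1 rather than UTF-8.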

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Xavier Noëlle · Feb 23, 2010

2010/2/2 Robert Klemme said:
You probably first want to find out whether the byte sequence is valid
UTF-8 or not. For that you would need to look at the bytes in the
String. I guess chances are that your String's byte sequence is NOT
valid UTF-8 OR you have a character in the string that has no
lowercase representation.

Kind regards

robert

I dug into the problem and ended up with this line: self.force_encoding('UTF-8')
Trusting the value reported by String#encoding was the wrong choice, so
I assumed instead that the database provides valid UTF-8 strings.

BUT (because, there's a but...), for some reason I don't understand,
some strings are unwilling to work:

Example:
puts self => médicals
self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115

233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
byte sequence in UTF-8 (ArgumentError).

Where am I wrong ?

TIA,

--
Xavier NOELLE

Marc Heiler · Feb 23, 2010

How does python solve this?

Rick DeNatale · Feb 23, 2010

Hi,

In message "Re: [ENCODING] UTF8 hell"
    on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle@gmail.com> said:
|self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
|
|233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
|self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
|byte sequence in UTF-8 (ArgumentError).

233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.

233 for e accent acute would be valid for ISO-8859-1 encoding, not UTF-8.
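
The two byte sequences can be reproduced with a short sketch. The variable names are only illustrative, and the é is written as a Unicode escape so the result does not depend on the encoding of the source file.

```ruby
word = "m\u00E9dicals"

# The same text, encoded two ways.
utf8_bytes   = word.encode('UTF-8').bytes.to_a
latin1_bytes = word.encode('ISO-8859-1').bytes.to_a

puts utf8_bytes.join(' ')
puts latin1_bytes.join(' ')

# Relabelling the ISO-8859-1 bytes as UTF-8 does not convert them,
# so the result is not well-formed UTF-8.
mislabelled = word.encode('ISO-8859-1').force_encoding('UTF-8')
puts mislabelled.valid_encoding?
```

The ISO-8859-1 bytes match the each_byte output quoted above, which suggests that this particular row is stored as ISO-8859-1 (or a close relative such as Windows-1252).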

--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

Xavier Noëlle · Feb 23, 2010

2010/2/23 Yukihiro Matsumoto said:
233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.

Indeed. In the meantime, I changed the code with this one:
def isUTF8()
  begin
    self.unpack('U*')
  rescue
    return false
  end
  return true
end

if isUTF8()
  self.force_encoding('UTF-8')
else
  self.force_encoding('ISO-8859-1')
  self.encode!('UTF-8')
end

This (ugly) quickfix works for what I need, but I don't know whether this
problem can be resolved some other way. The problem is that
my SQL database has a VARBINARY column with an unknown encoding. Is
there a way to deal with the various possible encodings or to ask MySQL
to return data converted to UTF-8, or is it necessary to clean the data
before inserting it?

--
Xavier NOELLE

Perry Smith · Feb 23, 2010

|A general hint for debugging encoding troubles: the UTF-8 encoding
|*guarantees* that every Unicode codepoint is *either* encoded into a
|*single* octet with its most significant bit cleared to 0 (i.e. a
|decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
|octets, *all* of which have their MSB set to 1 (i.e. a decimal value
|between 128 and 255).

Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
or 6 but not 3 nor 5 octets?

Jörg W Mittag · Feb 23, 2010

Perry said:
Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
or 6 but not 3 nor 5 octets?

Nope.

First off: I was wrong, the longest encoding is actually 4 octets,
not 6. (I was confused by the algorithm: the algorithm actually allows
for up to 8 bytes, but because of the way Unicode characters are
allocated, and UTF-8 is defined, it is guaranteed that there will
never be more than 4.)

The encodings look like this:

0xxxxxxx                            for U+0000 to U+007F (ASCII)
110xxxxx 10xxxxxx                   for U+0080 to U+07FF
1110xxxx 10xxxxxx 10xxxxxx          for U+0800 to U+FFFF and
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx for U+10000 to U+10FFFF

This is actually pretty clever:

* you can always tell whether you are inside a multibyte sequence or
not because of the high bit,
* you can always tell whether a byte in the sequence is the first one
or a later one, because the first one always starts with 11 and the
other ones always start with 10 and
* you can always tell how long a sequence is by the number of 1 bits
in the start byte: two-byte sequences start with two 1s, three-byte
sequences start with three 1s and four-byte sequences start with
four 1s.

This means that you can usually re-synchronize pretty easily from the
middle of a corrupted network transmission, for example. You can also
jump over bytes if you are counting the length.
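
The bit patterns above can be turned into a small classifier. This is only a sketch: utf8_byte_kind is a made-up helper name, and it looks at one byte in isolation, so it does not reject overlong forms or lead bytes that would encode values beyond U+10FFFF.

```ruby
# Classify a single byte (0..255) by its leading bits.
def utf8_byte_kind(byte)
  if    (byte & 0b1000_0000) == 0            then :ascii         # 0xxxxxxx
  elsif (byte & 0b1100_0000) == 0b1000_0000  then :continuation  # 10xxxxxx
  elsif (byte & 0b1110_0000) == 0b1100_0000  then :lead_2        # 110xxxxx
  elsif (byte & 0b1111_0000) == 0b1110_0000  then :lead_3        # 1110xxxx
  elsif (byte & 0b1111_1000) == 0b1111_0000  then :lead_4        # 11110xxx
  else :invalid
  end
end

# The bytes reported earlier in the thread for the problem string.
kinds = [109, 233, 100].map { |b| utf8_byte_kind(b) }
puts kinds.inspect
```

Applied to the bytes 109 233 100 from earlier in the thread, 233 is classified as the start of a three-byte sequence but is followed by an ASCII byte instead of two continuation bytes, which is why Ruby rejects the string as UTF-8.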

jwm

Robert Klemme · Feb 23, 2010

I dug into the problem and ended up with this line: self.force_encoding('UTF-8')
Believing that the string #encoding was right was a wrong choice, then
I assumed the database provided valid UTF8 strings.

The string you show below does not look like UTF-8 encoded, probably
rather ISO-8859-1 or such. If you enforce an encoding you leave the
byte sequence untouched. This leads to the kind of error you describe
below.

BUT (because, there's a but...), for some reason I don't understand,
some strings are unwilling to work:

Example:
puts self => médicals
self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115

233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
byte sequence in UTF-8 (ArgumentError).

Where am I wrong ?

As far as I can see, 233 starts a three-byte sequence in UTF-8, and the
bytes that follow it in your string are not continuation bytes:

http://en.wikipedia.org/wiki/UTF-8#Description

I did not dig deeper but it may be that by forcing UTF-8 on an ISO
something encoded string you broke it.

Kind regards

robert

Michael Fellinger · Feb 24, 2010

Indeed. In the meantime, I changed the code with this one:
def isUTF8()
  begin
    self.unpack('U*')
  rescue
    return false
  end
  return true
end

if isUTF8()
  self.force_encoding('UTF-8')
else
  self.force_encoding('ISO-8859-1')
  self.encode!('UTF-8')
end

string = "\xE8te pour luth"
# "\xE8te pour luth"
string.encoding
# #<Encoding:UTF-8>
string.valid_encoding?
# false
string.force_encoding('ISO-8859-1')
# "ète pour luth"
string.valid_encoding?
# true
string.upcase
# "èTE POUR LUTH"
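
Building on valid_encoding?, the quickfix quoted above could be written as a helper along these lines. This is a sketch, not a general solution: to_utf8 is a hypothetical name, and the ISO-8859-1 fallback is a guess that merely matches the rows shown in this thread.

```ruby
# Return a UTF-8 copy of a string whose encoding label cannot be trusted.
# Assumption: anything that is not well-formed UTF-8 is ISO-8859-1.
def to_utf8(raw)
  candidate = raw.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  # Every byte is a valid ISO-8859-1 character, so this conversion cannot fail.
  raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Both rows come back as BINARY, as a VARBINARY column would give them.
latin1_row = "m\xE9dicals".dup.force_encoding(Encoding::BINARY)
utf8_row   = "m\xC3\xA9dicals".dup.force_encoding(Encoding::BINARY)

puts to_utf8(latin1_row) == to_utf8(utf8_row)
puts to_utf8(latin1_row).gsub('ruby', 'zorglub')
```

Because every byte sequence is valid ISO-8859-1, the fallback never raises; data in some other single-byte encoding would be converted silently to the wrong characters.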

This (ugly) quickfix works for what I need, but I don't know if this
problem can be somehow resolved in another way. The problem being that
my SQL database has a VARBINARY column with an unknown encoding. Is
there a way to deal with the various possible encoding or to ask MySQL
to return UTF8 converted data, or is it necessary to clean data before
inserting them ?

--
Michael Fellinger
CTO, The Rubyists, LLC
972-996-5199
