[ENCODING] UTF8 hell

Discussion in 'Ruby' started by Xavier Noëlle, Feb 2, 2010.

  1. Hello,
    I'm trying to deal with Ruby flaws with encoding, which I thought
    would be almost past with Ruby 1.9. I managed to find a solution for
    Ruby 1.8 and thought I did for Ruby 1.9...but in fact, no !

    I fetch rows from an UTF8 database and try to work with the string. To
    do so, I would like it to be UTF8 encoded.

    "str.encoding()" gives me "ASCII-8BIT"...so, I thought one of these
    lines would solve the problem
    str.replace(Iconv.iconv("UTF8", "ascii", self).join())
    OR
    self.encode!('UTF-8')

    But they don't !
    First one: in `iconv': "\xE8te pour luth" (Iconv::IllegalSequence)
    Second one: in `encode!': "\xE8" from ASCII-8BIT to UTF-8
    (Encoding::UndefinedConversionError)

    The base string is "Oeuvre complète pour luth" and displays well in
    PHPMyAdmin.

    Any idea ?
    TIA,

    --
    Xavier NOELLE
    Xavier Noëlle, Feb 2, 2010
    #1

  2. On Tuesday 02 February 2010, Xavier Noëlle wrote:
    > |Hello,
    > |I'm trying to deal with Ruby flaws with encoding, which I thought
    > |would be almost past with Ruby 1.9. I managed to find a solution for
    > |Ruby 1.8 and thought I did for Ruby 1.9...but in fact, no !
    > |
    > |I fetch rows from an UTF8 database and try to work with the string. To
    > |do so, I would like it to be UTF8 encoded.
    > |
    > |"str.encoding()" gives me "ASCII-8BIT"...so, I thought one of these
    > |lines would solve the problem
    > |str.replace(Iconv.iconv("UTF8", "ascii", self).join())
    > |OR
    > |self.encode!('UTF-8')
    > |
    > |But they don't !
    > |First one: in `iconv': "\xE8te pour luth" (Iconv::IllegalSequence)
    > |Second one: in `encode!': "\xE8" from ASCII-8BIT to UTF-8
    > |(Encoding::UndefinedConversionError)
    > |
    > |The base string is "Oeuvre complète pour luth" and displays well in
    > |PHPMyAdmin.
    > |
    > |Any idea ?
    > |TIA,


    I'm not sure, but based on my experience, it may be that the strings are
    indeed stored as UTF-8, but the library you use to read from the database
    doesn't inform Ruby of that fact, so Ruby assumes each one is a generic
    array of bytes (that is, Ruby thinks the string has encoding ASCII-8BIT,
    which is the same as BINARY).

    If this is the case, you don't need to transcode the string (which is what
    encode does), but simply tell Ruby which encoding is correct, using the
    force_encoding method.
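
    A minimal sketch of that difference, assuming the driver hands back valid
    UTF-8 bytes that are merely labelled as binary. The sample bytes and the
    variable names are made up for illustration:

```ruby
# UTF-8 bytes of "complète", tagged as raw binary the way a driver might return them.
raw = "compl\xC3\xA8te".b            # String#b returns a copy tagged ASCII-8BIT

# encode tries to transcode from BINARY and fails on the first byte >= 0x80.
begin
  raw.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  puts "encode failed: #{e.message}"
end

# force_encoding only changes the label; the bytes are left untouched.
text = raw.dup.force_encoding('UTF-8')
puts text.valid_encoding?
puts text.bytes == raw.bytes
puts text.downcase
```

    This only helps when the bytes really are UTF-8; relabelling bytes in some
    other encoding produces an invalid string.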

    I hope this helps

    Stefano
    Stefano Crocco, Feb 2, 2010
    #2

  3. David Palm wrote:

    > I fetch rows from an UTF8 database and try to work with the string. To
    > do so, I would like it to be UTF8 encoded.


    There are several pieces to this. Even if the DB encoding and collation are utf8, double-check that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app, I think).

    > self.encode!('UTF-8')


    str.force_encoding('UTF-8') is what you want to use I think.

    :)
    David Palm, Feb 2, 2010
    #3
  4. 2010/2/2 David Palm <>:
    > There are several pieces to this. Even if the DB encoding and collation is utf8, doublecheck that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app I think).


    Not a Rails app :)

    > str.force_encoding('UTF-8') is what you want to use I think.


    I already tried this method, but it led me to the following error: in
    `downcase!': invalid byte sequence in UTF-8 (ArgumentError).

    This is due to a call to str.downcase!() later in the application.

    Any idea to solve this ? :)

    --
    Xavier NOELLE
    Xavier Noëlle, Feb 2, 2010
    #4
  5. 2010/2/2 Xavier Noëlle <>:
    > 2010/2/2 David Palm <>:
    >> There are several pieces to this. Even if the DB encoding and collation
    >> is utf8, doublecheck that the client connection is utf8 as well
    >> ("encoding: utf8" in database.yml for a Rails app I think).
    >
    > Not a Rails app :)
    >
    >> str.force_encoding('UTF-8') is what you want to use I think.

    >
    > I already tried this method, but it lead me to the following error: in
    > `downcase!': invalid byte sequence in UTF-8 (ArgumentError).
    >
    > This is due to a call to str.downcase!() later in the application.
    >
    > Any idea to solve this ? :)


    You probably first want to find out whether the byte sequence is valid
    UTF-8 or not. For that you would need to look at the bytes in the
    String. I guess chances are that your String's byte sequence is NOT
    valid UTF-8 OR you have a character in the string that has no
    lowercase representation.
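
    A rough sketch of how one might look at the bytes and check validity. The
    helper name describe_bytes and the sample strings are hypothetical:

```ruby
# Hypothetical helper: report the declared encoding, its validity, and the raw bytes in hex.
def describe_bytes(str)
  hex = str.bytes.map { |b| format('%02X', b) }.join(' ')
  format('%s valid=%s bytes=%s', str.encoding, str.valid_encoding?, hex)
end

good = "compl\u00E8te"   # è stored as the two UTF-8 bytes C3 A8
bad  = "compl\xE8te"     # è stored as the single ISO-8859-1 byte E8, labelled UTF-8

puts describe_bytes(good)
puts describe_bytes(bad)
```

    A lone byte of 0x80 or above in the dump is a strong hint that the data is
    not UTF-8.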

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Feb 2, 2010
    #5
  6. 2010/2/2 Robert Klemme <>:
    > You probably first want to find out whether the byte sequence is valid
    > UTF-8 or not. For that you would need to look at the bytes in the
    > String. I guess chances are that your String's byte sequence is NOT
    > valid UTF-8 OR you have a character in the string that has no
    > lowercase representation.
    >
    > Kind regards
    >
    > robert


    I dug into the problem and ended up with this line:
    self.force_encoding('UTF-8')
    Trusting the value of String#encoding was the wrong choice, so I
    assumed instead that the database provided valid UTF8 strings.

    BUT (because, there's a but...), for some reason I don't understand,
    some strings are unwilling to work:

    Example:
    puts self => médicals
    self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115

    233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
    self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
    byte sequence in UTF-8 (ArgumentError).

    Where am I wrong ?

    TIA,

    --
    Xavier NOELLE
    Xavier Noëlle, Feb 23, 2010
    #6
  7. Xavier Noëlle

    Marc Heiler Guest

    Marc Heiler, Feb 23, 2010
    #7
  8. On Tue, Feb 23, 2010 at 9:41 AM, Yukihiro Matsumoto <> wrote:
    > Hi,
    >
    > In message "Re: [ENCODING] UTF8 hell"
    >    on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle@gmail.com> writes:
    >
    > |self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
    > |
    > |233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
    > |self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
    > |byte sequence in UTF-8 (ArgumentError).
    >
    > 233 is not a valid UTF-8 character. The byte sequence for médicals is
    > <109 195 169 100 105 99 97 108 115>.


    233 for e accent acute would be valid for ISO-8859-1 encoding, not UTF-8.
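
    For illustration, the two byte sequences can be reproduced by encoding the
    same word both ways. The variable names are arbitrary:

```ruby
word = "m\u00E9dicals"                      # é is code point U+00E9

utf8_bytes   = word.encode('UTF-8').bytes
latin1_bytes = word.encode('ISO-8859-1').bytes

p utf8_bytes    # [109, 195, 169, 100, 105, 99, 97, 108, 115]
p latin1_bytes  # [109, 233, 100, 105, 99, 97, 108, 115]
```

    The bytes reported in the thread match the second line, which suggests the
    column holds ISO-8859-1 data for that row.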


    --
    Rick DeNatale

    Blog: http://talklikeaduck.denhaven2.com/
    Twitter: http://twitter.com/RickDeNatale
    WWR: http://www.workingwithrails.com/person/9021-rick-denatale
    LinkedIn: http://www.linkedin.com/in/rickdenatale
    Rick DeNatale, Feb 23, 2010
    #8
  9. 2010/2/23 Yukihiro Matsumoto <>:
    > 233 is not a valid UTF-8 character. The byte sequence for médicals is
    > <109 195 169 100 105 99 97 108 115>.


    Indeed. In the meantime, I changed the code with this one:
    def isUTF8()
      begin
        self.unpack('U*')
      rescue
        return false
      end
      return true
    end

    if isUTF8()
      self.force_encoding('UTF-8')
    else
      self.force_encoding('ISO-8859-1')
      self.encode!('UTF-8')
    end

    This (ugly) quick fix works for what I need, but I don't know whether
    the problem can be solved another way. The underlying issue is that my
    SQL database has a VARBINARY column with an unknown encoding. Is there
    a way to deal with the various possible encodings, or to ask MySQL to
    return data converted to UTF8, or is it necessary to clean the data
    before inserting it ?

    --
    Xavier NOELLE
    Xavier Noëlle, Feb 23, 2010
    #9
  10. Perry Smith wrote:

    Re: UTF8 hell

    > A general hint for debugging encoding troubles: the UTF-8 encoding
    > *guarantees* that every Unicode codepoint is *either* encoded into a
    > *single* octet with its most significant bit cleared to 0 (i.e. a
    > decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
    > octets, *all* of which have their MSB set to 1 (i.e. a decimal value
    > between 128 and 255).


    Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
    or 6 but not 3 nor 5 octets?

    --
    Posted via http://www.ruby-forum.com/.
    Perry Smith, Feb 23, 2010
    #10
  11. Re: UTF8 hell

    Perry Smith wrote:
    >> A general hint for debugging encoding troubles: the UTF-8 encoding
    >> *guarantees* that every Unicode codepoint is *either* encoded into a
    >> *single* octet with its most significant bit cleared to 0 (i.e. a
    >> decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
    >> octets, *all* of which have their MSB set to 1 (i.e. a decimal value
    >> between 128 and 255).

    > Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
    > or 6 but not 3 nor 5 octets?


    Nope.

    First off: I was wrong, the longest encoding is actually 4 octets,
    not 6. (I was confused by the algorithm: the algorithm actually allows
    for up to 8 bytes, but because of the way Unicode characters are
    allocated, and UTF-8 is defined, it is guaranteed that there will
    never be more than 4.)

    The encodings look like this:

    0xxxxxxx for U+0 to U+7F (ASCII)
    110xxxxx 10xxxxxx for U+80 to U+7FF
    1110xxxx 10xxxxxx 10xxxxxx for U+800 to U+FFFF and
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx for U+10000 to U+10FFFF

    This is actually pretty clever:

    * you can always tell whether you are inside a multibyte sequence or
    not because of the high bit,
    * you can always tell whether a byte in the sequence is the first one
    or a later one, because the first one always starts with 11 and the
    other ones always start with 10 and
    * you can always tell how long a sequence is by the number of leading
    1 bits in the start byte: two-byte sequences start with two 1s,
    three-byte sequences start with three 1s and four-byte sequences
    start with four 1s.

    This means that you can usually re-synchronize pretty easily from the
    middle of a corrupted network transmission, for example. You can also
    jump over bytes if you are counting the length.
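
    The bit patterns above can be sketched as a small classifier. The name
    utf8_byte_kind is hypothetical, and the sketch is simplified: it looks
    only at the leading bits and does not reject lead bytes that can never
    occur in valid UTF-8 (such as 0xC0 or 0xF5):

```ruby
# Hypothetical helper: classify one byte (0..255) by its leading bits.
# Returns [kind, sequence_length]; the length is 0 when the byte starts no sequence.
def utf8_byte_kind(byte)
  return [:single, 1]       if (byte & 0x80) == 0x00   # 0xxxxxxx
  return [:continuation, 0] if (byte & 0xC0) == 0x80   # 10xxxxxx
  return [:lead, 2]         if (byte & 0xE0) == 0xC0   # 110xxxxx
  return [:lead, 3]         if (byte & 0xF0) == 0xE0   # 1110xxxx
  return [:lead, 4]         if (byte & 0xF8) == 0xF0   # 11110xxx
  [:invalid, 0]
end

# Print each byte of "mé" (UTF-8) in binary next to its classification.
"m\u00E9".bytes.each { |b| puts format('%08b %p', b, utf8_byte_kind(b)) }
```

    To re-synchronize in the middle of a stream, skip bytes while the kind is
    :continuation and resume at the next :single or :lead byte.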

    jwm
    Jörg W Mittag, Feb 23, 2010
    #11
  12. On 23.02.2010 12:10, Xavier Noëlle wrote:
    > 2010/2/2 Robert Klemme <>:
    >> You probably first want to find out whether the byte sequence is valid
    >> UTF-8 or not. For that you would need to look at the bytes in the
    >> String. I guess chances are that your String's byte sequence is NOT
    >> valid UTF-8 OR you have a character in the string that has no
    >> lowercase representation.


    > I dug into the problem and ended up with this line: self.force_encoding('UTF-8')
    > Believing that the string #encoding was right was a wrong choice, then
    > I assumed the database provided valid UTF8 strings.


    The string you show below does not look like UTF-8 encoded, probably
    rather ISO-8859-1 or such. If you enforce an encoding you leave the
    byte sequence untouched. This leads to the kind of error you describe
    below.

    > BUT (because, there's a but...), for some reason I don't understand,
    > some strings are unwilling to work:
    >
    > Example:
    > puts self => médicals
    > self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
    >
    > 233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
    > self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
    > byte sequence in UTF-8 (ArgumentError).
    >
    > Where am I wrong ?


    As far as I can see 233 starts a three byte sequence

    http://en.wikipedia.org/wiki/UTF-8#Description

    I did not dig deeper but it may be that by forcing UTF-8 on an ISO
    something encoded string you broke it.
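
    A small sketch of that breakage, using the byte values quoted in this
    thread. Variable names are arbitrary, and whether gsub raises on the
    broken string may depend on the Ruby version, so the sketch handles both
    outcomes:

```ruby
# The bytes reported in the thread (ISO-8859-1 for "médicals").
latin1 = [109, 233, 100, 105, 99, 97, 108, 115].pack('C*')

# Relabelling as UTF-8 changes no bytes, so the string is now malformed.
broken = latin1.dup.force_encoding('UTF-8')
puts format('%08b', 233)        # 1110xxxx pattern: announces a three-byte sequence
puts format('%08b', 100)        # next byte is not a 10xxxxxx continuation byte
puts broken.valid_encoding?

begin
  broken.gsub(/ruby/, 'zorglub')
  puts 'gsub did not raise on this Ruby version'
rescue ArgumentError => e
  puts "gsub raised: #{e.message}"
end

# Labelling the bytes with their real encoding first, then transcoding, works.
fixed = latin1.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts fixed.gsub(/ruby/, 'zorglub')
```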

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Feb 23, 2010
    #12
  13. On Wed, Feb 24, 2010 at 12:18 AM, Xavier Noëlle <> wrote:
    > 2010/2/23 Yukihiro Matsumoto <>:
    >> 233 is not a valid UTF-8 character. The byte sequence for médicals is
    >> <109 195 169 100 105 99 97 108 115>.
    >
    > Indeed. In the meantime, I changed the code with this one:
    > def isUTF8()
    >   begin
    >     self.unpack('U*')
    >   rescue
    >     return false
    >   end
    >   return true
    > end
    >
    > if isUTF8()
    >   self.force_encoding('UTF-8')
    > else
    >   self.force_encoding('ISO-8859-1')
    >   self.encode!('UTF-8')
    > end


    string = "\xE8te pour luth"
    # "\xE8te pour luth"
    string.encoding
    # #<Encoding:UTF-8>
    string.valid_encoding?
    # false
    string.force_encoding('ISO-8859-1')
    # "ète pour luth"
    string.valid_encoding?
    # true
    string.upcase
    # "èTE POUR LUTH"
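
    The valid_encoding? check from the session above could be folded into a
    helper along these lines. This is only a sketch: to_utf8 is a hypothetical
    name, and the ISO-8859-1 fallback is an assumption about what the
    non-UTF-8 rows contain:

```ruby
# Hypothetical helper: return a UTF-8 copy of a byte string of unknown encoding.
# Assumes anything that is not valid UTF-8 is in the fallback encoding.
def to_utf8(bytes, fallback = 'ISO-8859-1')
  candidate = bytes.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?   # already UTF-8: relabel only
  bytes.dup.force_encoding(fallback).encode('UTF-8')
end

puts to_utf8("m\xC3\xA9dicals".b)   # UTF-8 bytes, relabelled
puts to_utf8("m\xE9dicals".b)       # ISO-8859-1 bytes, transcoded
```

    Every byte sequence is valid ISO-8859-1, so the fallback never fails, but
    it will silently produce wrong characters if the data was in yet another
    encoding.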


    > This (ugly) quickfix works for what I need, but I don't know if this
    > problem can be somehow resolved in another way. The problem being that
    > my SQL database has a VARBINARY column with an unknown encoding. Is
    > there a way to deal with the various possible encoding or to ask MySQL
    > to return UTF8 converted data, or is it necessary to clean data before
    > inserting them ?
    >
    > --
    > Xavier NOELLE
    >
    >




    --
    Michael Fellinger
    CTO, The Rubyists, LLC
    972-996-5199
    Michael Fellinger, Feb 24, 2010
    #13