Discussion in 'Ruby' started by Atoli Atoli, Nov 18, 2010.

  Atoli Atoli

    Atoli Atoli Guest


    My Ruby 1.9.2 does some strange things when manipulating encodings,
    especially when reading and writing text files.
    I have a file (attached) with broken UTF-8 characters: E5 AD 97 E6 99 2E.
    The offending sequence is E6 99: E6 opens a three-byte character, but
    only one continuation byte follows before the 2E.

    And here's the example I'm using to try to figure out how the heck
    this Encoding stuff works:
    data = File.open("broken.txt", "r:UTF-8") { |f| f.read }
    puts data.valid_encoding?

    utf_a = data.encode("UTF-8", invalid: :replace, undef: :replace,
                        replace: "_")
    puts utf_a.valid_encoding?

    utf_b = data.encode("UTF-8", "UTF-8", invalid: :replace,
                        undef: :replace, replace: "_")
    puts utf_b.valid_encoding?
    puts((utf_a == utf_b) && (data == utf_b))
    File.open("valid.txt", "w:UTF-8") { |f| f.write(utf_a) }

    The output is:

    Basically I'm trying to replace the broken sequences with "_", but the
    encode method doesn't seem to do any replacements, maybe because the
    forced encoding is already set to UTF-8?

    I've read James Edward Gray II's posts on string encodings in Ruby 1.9
    and also candlerb's doc, but didn't find anything.

    This can't be so complicated, right? I'm sure I'm missing something.

    Thank you.


    Atoli Atoli, Nov 18, 2010
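
On the question of why encode leaves the broken bytes alone: one plausible explanation, which the thread itself does not confirm, is that Ruby 1.9 skips the conversion when the source and destination encodings are identical, so the invalid: option is never consulted. A commonly used workaround is to transcode through a different encoding and back. The sketch below assumes that behaviour; the helper name replace_invalid_utf8 is made up for illustration.

```ruby
# Hypothetical helper (not from the thread): force a real transcoding pass
# by going UTF-8 -> UTF-16LE -> UTF-8, so that invalid: :replace is applied.
def replace_invalid_utf8(str, replacement = "_")
  utf16 = str.encode("UTF-16LE", "UTF-8",
                     invalid: :replace, undef: :replace, replace: replacement)
  utf16.encode("UTF-8", "UTF-16LE")
end

# The byte sequence from the original post: E5 AD 97 E6 99 2E
broken = [0xE5, 0xAD, 0x97, 0xE6, 0x99, 0x2E].pack("C*").force_encoding("UTF-8")

fixed = replace_invalid_utf8(broken)
puts broken.valid_encoding?  # the truncated E6 99 sequence makes this false
puts fixed.valid_encoding?   # the round trip yields valid UTF-8
```

On Ruby 2.1 and later, String#scrub should make the round trip unnecessary, since it replaces invalid byte sequences directly.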
  brabuhr

    brabuhr Guest

    For the case of fixing broken files, I would probably use iconv from the shell:

    $ cat broken.txt
    $ iconv -f UTF8 -t UTF8 --byte-subst=_ broken.txt

    (I don't know if Ruby's Iconv module supports the subst options.)

    For short strings, this seems to work:

    irb(main):001:0> s = "\xE5\xAD\x97\xE6\x99\x2E"
    => "字\xE6\x99."
    irb(main):002:0> s.encoding
    => #<Encoding:UTF-8>
    irb(main):003:0> s.valid_encoding?
    => false
    irb(main):004:0> t = s.chars.map{|c| c.valid_encoding? ? c : '_'}.join
    => "字__."
    irb(main):005:0> t.valid_encoding?
    => true
    irb(main):006:0> t.encoding
    => #<Encoding:UTF-8>
    brabuhr, Nov 18, 2010
  Atoli Atoli

    Atoli Atoli Guest

    Thanks for the tip.

    For now, making Ruby "think" the encoding is valid seems to work (it
    doesn't break regular expressions at least).
    So I just call encode("UTF-8", "UTF-8", invalid: :replace, undef:
    :replace, replace: "_") each time I read my files.
    Atoli Atoli, Nov 18, 2010
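
The read-then-clean routine described in the last post could be wrapped in a small helper so it is not repeated at every call site. This is only a sketch: the name read_utf8_lenient is hypothetical, it reads the raw bytes first rather than relying on "r:UTF-8", and it assumes String#scrub is available (Ruby 2.1 and later), falling back to the per-character check shown earlier in the thread when it is not.

```ruby
require "tempfile"

# Hypothetical helper (not from the thread): read a file as UTF-8 and
# replace every invalid byte sequence with `replacement`.
def read_utf8_lenient(path, replacement = "_")
  data = File.open(path, "rb") { |f| f.read }  # raw bytes, tagged ASCII-8BIT
  data.force_encoding("UTF-8")                 # relabel only, bytes unchanged
  return data if data.valid_encoding?

  if data.respond_to?(:scrub)
    data.scrub(replacement)                    # Ruby 2.1 and later
  else
    # Per-character fallback, as in the irb session earlier in the thread
    data.each_char.map { |c| c.valid_encoding? ? c : replacement }.join
  end
end

# Usage: write the broken bytes from the thread to a temp file, read them back.
Tempfile.create("broken") do |tmp|
  tmp.binmode
  tmp.write([0xE5, 0xAD, 0x97, 0xE6, 0x99, 0x2E].pack("C*"))
  tmp.flush
  text = read_utf8_lenient(tmp.path)
  puts text.valid_encoding?  # valid after replacement
end
```

The two branches may not produce the same number of replacement characters for a given broken sequence, so the exact output should be checked on the Ruby version in use.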
