Forcing a string to valid UTF-8

Discussion in 'Ruby' started by Phrogz, Apr 26, 2010.

  1. Phrogz

    I have some legacy text data that's gone through several databases and
    web services in its life, playing promiscuously with dirty web
    servers, browsers, and encodings.

    It's coming out of the source database as ASCII-8BIT. I'm trying to
    bring it all into UTF-8. I've found ways to coerce many of the bad
    entries into compliance, but now I've hit one that is simply bad. I
    want to just delete the minimum necessary to make it valid UTF-8. What
    I'm trying isn't working. Here's my code:

    if new_value.is_a? String
      begin
        utf8 = new_value.force_encoding('UTF-8')
        if utf8.valid_encoding?
          new_value = utf8
        else
          new_value.encode!( 'UTF-8', 'Windows-1252' )
        end
      rescue EncodingError => e
        puts "Bad encoding: #{old_table}.#{pk}:#{old_row[pk]} - #{new_value.inspect}"
        new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace, replace: '' )
        p new_value.encoding unless new_value.valid_encoding?
      end
    end

    When I fall into the rescue clause, I'm getting out:
    Bad encoding: bugs.id:2469 - "Indexing C:\\\\コピ\xE3\x81E \x81E \x81EZCa_zu5.264"
    #<Encoding:UTF-8>
    The conversion resulted in an invalid UTF-8 string (which happens to be
    the same as the original, as far as I can tell). I'm surprised,
    because I thought the purpose of the invalid/undef replace options was
    to clean things up.

    How do I force it into a valid UTF-8 encoding, losing as little data
    as possible but happily throwing out the senseless bits?
    Phrogz, Apr 26, 2010
    #1

  2. Gavin Kistner wrote:
    > How do I force it into a valid UTF-8 encoding, losing as little data
    > as possible but happily throwing out the senseless bits?


    AFAICS, the trouble with your rescue clause is that the string failed to
    be transcoded from Windows-1252, so it keeps its existing UTF-8 tag,
    and so an attempt to "re-encode" it as UTF-8 is silently ignored because
    it's already tagged UTF-8, even though it contains invalid byte sequences.

    For example, this doesn't do anything:

    >> a = "abc\xffdef".force_encoding("UTF-8")
    => "abc\xFFdef"
    >> b = a.encode("UTF-8", :invalid=>:replace, :replace=>"?")
    => "abc\xFFdef"

    but this does:

    >> b = a.encode("UTF-16BE", :invalid=>:replace, :replace=>"?").encode("UTF-8")
    => "abc?def"

    Proviso: ruby 1.9 string handling is undocumented and subject to
    continuous change. I tested the above with

    >> RUBY_DESCRIPTION

    => "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

    so it may or may not work with your version, or with future versions of
    Ruby.
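
For anyone who wants the round trip above in reusable form, here is a minimal sketch. The helper name `strip_invalid_utf8` and its `replacement` parameter are made up for illustration and do not come from this thread. It assumes the input bytes are meant to be UTF-8 and that dropping (or substituting) the invalid sequences is acceptable. On Ruby 2.1 and later, `String#scrub` should give the same result without the UTF-16BE detour.

```ruby
# Hypothetical helper (name and signature are illustrative only).
# Retags a copy of the string as UTF-8, then round-trips it through
# UTF-16BE so the transcoder actually runs and can replace invalid bytes.
def strip_invalid_utf8(str, replacement = '')
  tagged = str.dup.force_encoding('UTF-8')   # dup: leave the caller's string alone
  tagged.encode('UTF-16BE', invalid: :replace, undef: :replace, replace: replacement)
        .encode('UTF-8')
end

cleaned = strip_invalid_utf8("abc\xFFdef")
puts cleaned                    # abcdef
puts cleaned.valid_encoding?    # true

marked = strip_invalid_utf8("abc\xFFdef", '?')
puts marked                     # abc?def

# Ruby 2.1+ only: String#scrub removes or replaces invalid sequences directly.
puts "abc\xFFdef".scrub('') if ''.respond_to?(:scrub)
```

The empty replacement string deletes the offending bytes; passing `'?'` keeps a visible marker where data was lost.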
    --
    Posted via http://www.ruby-forum.com/.
    Brian Candler, Apr 27, 2010
    #2

  3. Phrogz

    On Apr 27, 4:19 am, Brian Candler wrote:
    > Gavin Kistner wrote:
    > > How do I force it into a valid UTF-8 encoding, losing as little data
    > > as possible but happily throwing out the senseless bits?

    >
    > AFAICS, the trouble with your rescue clause is that the string failed to
    > be transcoded from Windows-1252, so it keeps its existing UTF-8 tag,
    > and so an attempt to "re-encode" it as UTF-8 is silently ignored because
    > it's already tagged UTF-8, even though it contains invalid byte sequences.


    Excellent point. Fixing that led me to a similar error earlier: I had
    assumed that
    s2 = s1.force_encoding(...)
    left s1 intact. In fact, it modifies and returns s1. Thank you very
    much, Brian.
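
A small sketch of that pitfall, using made-up values (the variable names are illustrative, not from my real code). It assumes Ruby 2.0 or later for `String#b`, which returns an ASCII-8BIT copy of the string.

```ruby
# Bytes of "café" in UTF-8, but tagged ASCII-8BIT like the legacy data.
raw = "caf\xC3\xA9".b

# force_encoding retags the receiver in place and returns that same object.
same_object = raw.force_encoding('UTF-8')
puts same_object.equal?(raw)                     # true: not a copy
puts raw.encoding == Encoding::UTF_8             # true: raw itself was retagged

# Calling dup first keeps the original tag intact.
original = "caf\xC3\xA9".b
copy = original.dup.force_encoding('UTF-8')
puts original.encoding == Encoding::ASCII_8BIT   # true: untouched
puts copy.valid_encoding?                        # true
```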

    For those that care or stumble upon this via Google, here's a modified
    version that works:

    # Converting ASCII-8BIT to UTF-8 based on domain-specific guesses
    if new_value.is_a? String
      begin
        # Try it as UTF-8 directly
        cleaned = new_value.dup.force_encoding('UTF-8')
        unless cleaned.valid_encoding?
          # Some of it might be old Windows code page
          cleaned = new_value.encode( 'UTF-8', 'Windows-1252' )
        end
        new_value = cleaned
      rescue EncodingError
        # Force it to UTF-8, replacing the bytes that cannot be converted
        new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace )
      end
    end
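
A possible variation, offered only as a sketch: the same three-step fallback wrapped in a method, with the last step using the UTF-16BE round trip from earlier in the thread so that valid multi-byte characters survive and only the broken sequences are dropped. The method name `to_valid_utf8` is hypothetical, and the approach assumes the input is either mostly UTF-8 or Windows-1252.

```ruby
# Hypothetical helper (not part of the original code): returns a new,
# valid UTF-8 string and never modifies its argument.
def to_valid_utf8(raw)
  # 1. Try the bytes as UTF-8 directly (on a copy, since force_encoding mutates).
  utf8 = raw.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?

  begin
    # 2. Otherwise treat the bytes as Windows-1252 and transcode them.
    raw.encode('UTF-8', 'Windows-1252')
  rescue EncodingError
    # 3. Last resort: round-trip through UTF-16BE so the transcoder runs,
    #    dropping invalid sequences and keeping every valid character.
    utf8.encode('UTF-16BE', invalid: :replace, undef: :replace, replace: '')
        .encode('UTF-8')
  end
end

puts to_valid_utf8("caf\xC3\xA9".b)    # already UTF-8 bytes
puts to_valid_utf8("caf\xE9".b)        # Windows-1252 byte for e-acute
puts to_valid_utf8("abc\x81def".b)     # 0x81 is undefined in Windows-1252
```

The trade-off is the one in my posted code: a string that is invalid as UTF-8 but happens to be convertible from Windows-1252 is taken as Windows-1252, which may or may not match where the data really came from.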

    > Proviso: ruby 1.9 string handling is undocumented and subject to
    > continuous change. I tested the above with


    FWIW my new code works on ruby 1.9.1p243 (2009-07-16 revision 24175)
    [i386-mingw32]

    Thanks again!
    Phrogz, Apr 27, 2010
    #3