Forcing a string to valid UTF-8

Phrogz

I have some legacy text data that's gone through several databases and
web services in its life, playing promiscuously with dirty web
servers, browsers, and encodings.

It's coming out of the source database as ASCII-8BIT. I'm trying to
bring it all into UTF-8. I've found ways to coerce many of the bad
entries into compliance, but now I've hit one that is simply bad. I
want to just delete the minimum necessary to make it valid UTF-8. What
I'm trying isn't working. Here's my code:

if new_value.is_a? String
  begin
    utf8 = new_value.force_encoding('UTF-8')
    if utf8.valid_encoding?
      new_value = utf8
    else
      new_value.encode!( 'UTF-8', 'Windows-1252' )
    end
  rescue EncodingError => e
    puts "Bad encoding: #{old_table}.#{pk}:#{old_row[pk]} - #{new_value.inspect}"
    new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace, replace: '' )
    p new_value.encoding unless new_value.valid_encoding?
  end
end

When I fall into the rescue clause, I'm getting out:
Bad encoding: bugs.id:2469 - "Indexing C:\\\\コピ\xE3\x81E \x81E
\x81EZCa_zu5.264"
#<Encoding:UTF-8>
The conversion resulted in an invalid UTF-8 string (that happens to be
the same as the original, as far as I can tell.) I'm surprised,
because I thought the purpose of invalid/undef replace was to clean
things up.

How do I force it into a valid UTF-8 encoding, losing as little data
as possible but happily throwing out the senseless bits?
 
Brian Candler

Gavin said:
How do I force it into a valid UTF-8 encoding, losing as little data
as possible but happily throwing out the senseless bits?

AFAICS, the trouble with your rescue clause is that the string failed to
be encoded into Windows-1252, so it remains with its existing UTF-8 tag,
and so an attempt to "re-encode" as UTF-8 is silently ignored because
it's already UTF-8, even though it contains invalid characters.

For example, this doesn't do anything:
=> "abc\xFFdef"

but this does:
=> "abc?def"
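
Only the return values are shown above, so the exact calls are an assumption. A minimal sketch of the contrast being described might look like this: asking for UTF-8 when the string is already tagged UTF-8, versus going through a different encoding (UTF-16BE is chosen here arbitrarily) so that a real conversion takes place. Newer Ruby versions may handle the same-encoding call differently from the 1.9 builds discussed in this thread, so the sketch only relies on the round trip.

```ruby
# Hypothetical reconstruction; the thread shows only the return values.
s = "abc\xFFdef".b.force_encoding('UTF-8')  # \xFF is never valid in UTF-8
puts s.valid_encoding?                      # false

# Same-encoding request: described above as silently ignored on Ruby 1.9.
same = s.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')

# Round trip through another encoding so the bytes are actually transcoded
# and the invalid byte is replaced along the way.
fixed = s.encode('UTF-16BE', invalid: :replace, undef: :replace, replace: '?')
         .encode('UTF-8')
p fixed
```

Passing `replace: ''` instead of `'?'` drops the bad bytes rather than marking them.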

Proviso: ruby 1.9 string handling is undocumented and subject to
continuous change. I tested the above with
>> RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

so it may or may not work with your version, or with future versions of
Ruby.
 
Phrogz

Brian said:
AFAICS, the trouble with your rescue clause is that the string failed to
be encoded into Windows-1252, so it remains with its existing UTF-8 tag,
and so an attempt to "re-encode" as UTF-8 is silently ignored because
it's already UTF-8, even though it contains invalid characters.

Excellent point. Fixing that led me to a similar error earlier: I had
assumed that
s2 = s1.force_encoding(...)
left s1 intact. In fact, it modifies and returns s1. Thank you very
much, Brian.
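
A small sketch of that pitfall, for anyone who wants to see it in isolation. The variable names and sample bytes are made up for illustration: `force_encoding` retags the receiver in place and returns it, so calling it on a `dup` is what keeps the original untouched.

```ruby
# Binary (ASCII-8BIT) sample data; the bytes happen to be valid UTF-8.
s1 = "caf\xC3\xA9".b
s2 = s1.force_encoding('UTF-8')

p s2.equal?(s1)   # s2 is the very same object as s1
p s1.encoding     # s1 itself has been retagged

# Working on a copy leaves the original's tag alone.
s3 = "caf\xC3\xA9".b
s4 = s3.dup.force_encoding('UTF-8')

p s3.encoding     # still the binary encoding
p s4.encoding
```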

For those that care or stumble upon this via Google, here's a modified
version that works:

# Converting ASCII-8BIT to UTF-8 based on domain-specific guesses
if new_value.is_a? String
  begin
    # Try it as UTF-8 directly
    cleaned = new_value.dup.force_encoding('UTF-8')
    unless cleaned.valid_encoding?
      # Some of it might be old Windows code page
      cleaned = new_value.encode( 'UTF-8', 'Windows-1252' )
    end
    new_value = cleaned
  rescue EncodingError
    # Force it to UTF-8, throwing out invalid bits
    new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace )
  end
end
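
For reuse, the same fallback chain can be wrapped in a method. This is a sketch only, not the code that was run against the database: `to_utf8` is a hypothetical name, and its last step is a variation. Instead of converting from the binary tag, it treats the bytes as UTF-8 and round-trips them through UTF-16BE with an empty replacement, on the assumption that valid UTF-8 sequences should survive and only the broken bytes should be dropped.

```ruby
# Hypothetical helper; the name and the final fallback are assumptions.
def to_utf8(raw)
  return raw unless raw.is_a?(String)

  # 1. Try the bytes as UTF-8, on a copy so the caller's string is untouched.
  candidate = raw.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  # 2. Otherwise guess that it is old Windows code page data.
  raw.encode('UTF-8', 'Windows-1252')
rescue EncodingError
  # 3. Some byte has no Windows-1252 mapping: keep whatever is valid UTF-8
  #    and drop the rest by transcoding through another encoding and back.
  raw.dup.force_encoding('UTF-8')
     .encode('UTF-16BE', invalid: :replace, undef: :replace, replace: '')
     .encode('UTF-8')
end

p to_utf8("caf\xC3\xA9".b)          # already valid UTF-8 bytes
p to_utf8("caf\xE9".b)              # Windows-1252 bytes
p to_utf8("\xE3\x82\xB3\x81E".b)    # valid UTF-8 followed by a stray \x81
```

The guess in step 2 is domain-specific, as in the code above: any byte string that is not valid UTF-8 but maps cleanly from Windows-1252 will be accepted as Windows-1252, whether or not that is what it really was.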

Brian said:
Proviso: ruby 1.9 string handling is undocumented and subject to
continuous change. I tested the above with

FWIW my new code works on ruby 1.9.1p243 (2009-07-16 revision 24175)
[i386-mingw32]

Thanks again!
 
