Phrogz
I have some legacy text data that's gone through several databases and
web services in its life, playing promiscuously with dirty web
servers, browsers, and encodings.
It's coming out of the source database as ASCII-8BIT. I'm trying to
bring it all into UTF-8. I've found ways to coerce many of the bad
entries into compliance, but now I've hit one that is simply bad. I
want to just delete the minimum necessary to make it valid UTF-8. What
I'm trying isn't working. Here's my code:
if new_value.is_a? String
  begin
    utf8 = new_value.force_encoding('UTF-8')
    if utf8.valid_encoding?
      new_value = utf8
    else
      new_value.encode!( 'UTF-8', 'Windows-1252' )
    end
  rescue EncodingError => e
    puts "Bad encoding: #{old_table}.#{pk}:#{old_row[pk]} - #{new_value.inspect}"
    new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace, replace: '' )
    p new_value.encoding unless new_value.valid_encoding?
  end
end
When I fall into the rescue clause, I get this output:
Bad encoding: bugs.id:2469 - "Indexing C:\\\\コピ\xE3\x81E \x81E
\x81EZCa_zu5.264"
#<Encoding:UTF-8>
The conversion resulted in an invalid UTF-8 string (which is the same
as the original, as far as I can tell). I'm surprised, because I
thought the purpose of the invalid/undef replace options was to clean
things up.
How do I force it into a valid UTF-8 encoding, losing as little data
as possible but happily throwing out the senseless bits?
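
For illustration, here is a minimal sketch of one way this kind of cleanup might be done. It assumes Ruby 2.1 or later, where String#scrub is available; scrub replaces each invalid byte sequence and leaves the valid characters alone. The helper name clean_utf8 and the sample bytes are made up for this example and are not the actual row from the database.

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of +str+, dropping only
# the bytes that cannot be decoded. Assumes Ruby 2.1+ (String#scrub).
def clean_utf8(str)
  # Work on a copy so the caller's string is not relabelled in place.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Replace every invalid byte sequence with the empty string.
  utf8.scrub('')
end

# Made-up sample: two valid katakana characters followed by a truncated
# three-byte sequence (\xE3\x81 with its final byte missing).
raw = "Indexing \xE3\x82\xB3\xE3\x83\x94\xE3\x81 done".b

cleaned = clean_utf8(raw)
puts cleaned.valid_encoding?  # => true
puts cleaned                  # => Indexing コピ done
```

Because the helper duplicates its argument, the original binary string stays available for logging. Whether dropping the bad bytes or substituting a visible marker such as "?" is preferable depends on how the data will be used.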