Ruby 1.9.2: How to sanitize text with invalid characters?

Andreas S.

I process a lot of text files whose encoding I know, but which might
contain a few invalid bytes (i.e., bytes that make gsub fail with
"ArgumentError: invalid byte sequence in US-ASCII" or "... in UTF-8").
What's the best way to handle this situation gracefully, by ignoring or
removing the invalid characters?
 
Scott Gonyea

Code / Data samples?

r_string = 'blah'.encode('UTF-8')
r_regex = /#{r_string}/
text = "wahlahblahblahwahbalablah".encode("UTF-8")
text.gsub!(r_regex, '')

That's a horrible example. Still, if you have ASCII in one place, and
UTF-8 in another, it's conceivable that the matcher may just throw up
its hands. Force the encoding and try again. If it doesn't work,
please post more information (preferably with a Gist / pastie). If
that helps, please mention it so that Google can direct other poor
souls to this post.

Scott
 
Andreas S.

Scott Gonyea wrote in post #949026:
> Code / Data samples?

Trivial example:
"#{0xFF.chr} abcde".force_encoding("utf-8").gsub(/a/,'')
ArgumentError: invalid byte sequence in UTF-8
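
For anyone reproducing this: the string above is only labelled as UTF-8, its bytes are never checked. A minimal sketch of how one might detect that before running a regexp over the text, using the core methods String#force_encoding and String#valid_encoding?. The helper name safe_for_regexp? is made up for illustration and is not from this thread.

```ruby
# Hypothetical helper: true when regexp methods such as gsub can run on +text+.
def safe_for_regexp?(text)
  text.valid_encoding?
end

# force_encoding only relabels the bytes; it neither checks nor converts them.
bad  = (0xFF.chr + " abcde").force_encoding("UTF-8")
good = "abcde"

puts safe_for_regexp?(good)
puts safe_for_regexp?(bad)

# gsub raises as soon as it meets the mislabelled string.
begin
  bad.gsub(/a/, "")
rescue ArgumentError => e
  puts "gsub failed: #{e.message}"
end
```

Checking valid_encoding? first makes it possible to clean only the files that actually need it.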
 
Scott Gonyea

Will this work?

blah1 = "#{0xFF.chr} abcde"
blah2 = blah1.split(/[^[:print:]]/).join
 
Andreas S.

Using iconv to clean the string works:
Iconv.conv('utf-8//IGNORE','utf-8',"#{0xFF.chr} abcde")
=> " abcde"

However, it would be nicer if there was a way to do this with the
built-in encoding functions of Ruby 1.9.
 
Andreas S.

Scott Gonyea wrote in post #949256:
> Will this work?
>
> blah1 = "#{0xFF.chr} abcde"
> blah2 = blah1.split(/[^[:print:]]/).join

Only if the desired encoding is ASCII.
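
To illustrate the limitation: on a binary (ASCII-8BIT) string, [[:print:]] matches printable ASCII only, so the split removes every byte from 0x80 upwards, including the bytes of perfectly valid UTF-8 characters. A small sketch, with the suggestion wrapped in a method whose name (strip_non_printable) is invented for this example:

```ruby
# Scott's split/join suggestion as a method (the name is hypothetical).
def strip_non_printable(bytes)
  bytes.split(/[^[:print:]]/).join
end

# A stray 0xFF byte followed by plain ASCII text.
broken = 0xFF.chr + " abcde"

# Valid UTF-8 ("caf" plus e-acute), viewed as raw bytes.
accented = "caf\u00E9".dup.force_encoding("ASCII-8BIT")

p strip_non_printable(broken)     # the stray byte is gone
p strip_non_printable(accented)   # but the accented letter is gone too
```

So the approach cleans the sample string, but it also throws away legitimate non-ASCII text.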
 
Quintus

On 12.10.2010 at 01:16, Andreas S. wrote:
> Using iconv to clean the string works:
> Iconv.conv('utf-8//IGNORE','utf-8',"#{0xFF.chr} abcde")
> => " abcde"
>
> However, it would be nicer if there was a way to do this with the
> built-in encoding functions of Ruby 1.9.

String#encode can do this much more nicely:
============================================
$ irb
irb(main):001:0> RUBY_DESCRIPTION
=> "ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]"
irb(main):002:0> str = "#{0xFF.chr}"
=> "\xFF"
irb(main):003:0> str.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> str.encode("UTF-8")
Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8
from (irb):4:in `encode'
from (irb):4
from /opt/rubies/ruby-1.9.2-p0/bin/irb:12:in `<main>'
irb(main):005:0> str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "?"
irb(main):006:0>
============================================
In order to remove invalid chars completely, use an empty string instead
of "?".

Vale,
Marvin
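
One caveat when applying this to the original question. The irb session converts from ASCII-8BIT, where every byte from 0x80 upwards counts as undefined, so :undef => :replace would also drop valid multibyte characters. For a string that is already labelled with its real encoding, encode to that same encoding is, as far as I know, a no-op on 1.9. A commonly used workaround (an assumption here, not something posted in this thread) is a round trip through UTF-16. The helper name remove_invalid_bytes is made up for the sketch.

```ruby
# Hypothetical helper: drop bytes that are invalid in the string's own encoding.
# The detour via UTF-16BE forces encode to really transcode, so that
# :invalid => :replace gets a chance to act on the broken bytes.
def remove_invalid_bytes(text)
  text.encode("UTF-16BE", :invalid => :replace, :replace => "")
      .encode(text.encoding)
end

# 0xFF is invalid in UTF-8; "\xC3\xA9" is a valid e-acute and should survive.
dirty = "\xFF caf\xC3\xA9".dup.force_encoding("UTF-8")
clean = remove_invalid_bytes(dirty)

puts clean.valid_encoding?
puts clean
```

On Ruby 2.1 and later, String#scrub does the same job directly, e.g. dirty.scrub("").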
 