How can I make Ruby 1.9 strings ignore invalid UTF-8 byte sequences in split?


Stanley Xu

[Note: parts of this message were removed to make it a legal post.]

Dear buddies,

I am using Ruby to run some MapReduce jobs in Hadoop Streaming.
Unfortunately, our input contains some dirty data with invalid byte
sequences. So when running things like

line.chomp.split("\t")

I will get

Best wishes,
Stanley Xu
 
Stanley Xu

Sorry, I accidentally sent that half-typed mail with a Gmail shortcut.

I have just re-sent a mail describing the problem.

Best wishes,
Stanley Xu
 
Robert Klemme

> Sorry, I accidentally sent that half-typed mail with a Gmail shortcut.
>
> I have just re-sent a mail describing the problem.

Did you? I can't seem to find it.

Cheers

robert
 
Stanley Xu

Anyway, let me send it again.


Dear buddies,

I am using Ruby to run some MapReduce jobs in Hadoop Streaming.
Unfortunately, our input contains some dirty data with invalid byte
sequences. So when running things like

line.chomp.split("\t")

I get errors like
:in `split': invalid byte sequence in UTF-8 (ArgumentError)

I searched a little and tried using Iconv to drop the invalid sequences:

require 'iconv'

unless line.valid_encoding?
  # //IGNORE asks iconv to drop byte sequences it cannot convert
  ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
  line = ic.iconv(line)
end

This fixes most of the invalid lines, but a couple of lines still raise
the same error.

Is there a way to make String#split work in Ruby 1.9 on strings that
contain invalid byte sequences?

Thanks in advance

Best wishes,
Stanley Xu
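
One way to drop the invalid bytes without Iconv is to transcode through another Unicode encoding and back. The following is a minimal sketch, not something proposed in the thread: the helper name `drop_invalid_utf8` and the sample line are hypothetical, and the detour through UTF-16BE is there because a UTF-8 to UTF-8 transcode is, as far as I know, skipped as a no-op in Ruby 1.9.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the thread): remove bytes that are not
# valid UTF-8 by transcoding to UTF-16BE and back to UTF-8.
def drop_invalid_utf8(line)
  return line if line.valid_encoding?
  utf16 = line.encode('UTF-16BE', 'UTF-8', :invalid => :replace, :replace => '')
  utf16.encode('UTF-8')
end

# Sample record with one stray 0xFF byte in the second field.
dirty = "id\tna\xFFme\tvalue\n"
fields = drop_invalid_utf8(dirty).chomp.split("\t")
p fields
```

Note that this silently discards data. On Ruby 2.1 and later, `String#scrub` does the same job directly.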
 
Joey Zhou

Maybe you should not transcode the data from its external encoding to
UTF-8.
I once got stuck on an encoding problem where some GBK characters
could not be converted to UTF-8.

# encoding: utf-8
File.open('file.txt', 'r:gbk').each_line do |line| # not 'r:gbk:utf-8'
  arr = line.chomp.split("\t".encode('gbk')) # encode "\t" to gbk
  # blah blah
end

Joey
 
Stanley Xu

Hi Joey,

I don't think that's the problem. The file is most likely UTF-8:
out of about 1 million lines, nearly all split fine, but around 1000 of
them raise the "invalid byte sequence" error.

Now I have a temporary solution like the following:

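unless line.valid_encoding?
  # Reinterpret every byte as a code point of the same value (Latin-1 style),
  # which always yields valid UTF-8 but also rewrites valid multibyte characters
  line = line.unpack('C*').pack('U*')
end
fields = line.chomp.split("\t")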

But I really doubt it is a good solution, because the invalid bytes might
be a valid sequence in GBK or some other encoding.

Isn't there a way to split the string in Ruby 1.9 the old 1.8 "dirty
way"?

Best wishes,
Stanley Xu
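
A byte-wise split in the Ruby 1.8 style can be approximated by tagging the line as binary before splitting. This is a minimal sketch under assumptions, not an answer given in the thread: the helper name `split_bytes` and the sample line are hypothetical, and it relies on the tab byte (0x09) never occurring inside a multibyte sequence, which holds for UTF-8 and GBK.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the thread): split on raw bytes by treating
# the line as binary, then tag each field as UTF-8 again without transcoding.
def split_bytes(line, sep = "\t")
  raw = line.dup.force_encoding('BINARY')
  raw.chomp.split(sep).map { |field| field.force_encoding('UTF-8') }
end

# Sample record with one stray 0xFF byte in the second field.
dirty = "id\tna\xFFme\tvalue\n"
fields = split_bytes(dirty)

# Fields keep their original bytes; only the dirty one reports as invalid.
p fields.map { |field| field.valid_encoding? }
```

No bytes are lost this way, so a field that fails `valid_encoding?` can still be retried later under another encoding such as GBK.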
 
Joey Zhou

I am working with Chinese character radicals.
I came across a radical with the code point U+E839 ("\uE839").

-----

# encoding: utf-8
[ STDIN, STDOUT, STDERR ].each do |stdio|
  stdio.set_encoding( 'gbk', 'utf-8' )
end
char = "\uE839"
puts char # Encoding::UndefinedConversionError

-----
f.rb:7:in `write': U+E839 from UTF-8 to GBK
(Encoding::UndefinedConversionError)
from f.rb:7:in `puts'
from f.rb:7:in `puts'
from f.rb:7:in `<main>'

But the equivalent Perl works:

-----
use utf8;
use open ":encoding(gbk)", ":std";

$char = "\N{U+E839}";
print $char;
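
Until the conversion itself is sorted out, a lenient conversion can keep a script from aborting. The following is a minimal sketch, not something proposed in the thread: it uses the `:undef` and `:replace` options of `String#encode`, and the `'?'` placeholder is an arbitrary choice. Whether U+E839 maps to GBK may depend on the Ruby version, so the strict branch handles both outcomes.

```ruby
# encoding: utf-8

char = "\uE839"

# Strict conversion: raises if the target encoding has no mapping.
begin
  strict = char.encode('GBK')
rescue Encoding::UndefinedConversionError
  strict = nil # no GBK mapping in this Ruby's conversion tables
end

# Lenient conversion: substitute '?' for characters without a mapping.
lenient = char.encode('GBK', :undef => :replace, :replace => '?')
p lenient.encoding
```

The placeholder loses the original character, so this only suits output where a visible substitute is acceptable.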
 
Ryan Davis

This sounds like it might be a legitimate bug. Can you file a ticket on
redmine with this code sample?
 
