How can I make a Ruby 1.9 string ignore invalid UTF-8 byte sequences in split?

Stanley Xu · Mar 22, 2011

[Note: parts of this message were removed to make it a legal post.]

Dear buddies,

I am using Ruby to run some MapReduce jobs in Hadoop Streaming.
Unfortunately, some of our input is dirty data containing invalid byte
sequences. So when running things like

line.chomp.split("\t")

I will get

Best wishes,
Stanley Xu

Stanley Xu · Mar 22, 2011

Sorry, I just sent that half-typed mail by accident with a Gmail shortcut.

I have just re-sent a mail describing the problem.

Best wishes,
Stanley Xu

Robert Klemme · Mar 22, 2011

> Sorry, I just sent that half-typed mail by accident with a Gmail shortcut.
>
> I have just re-sent a mail describing the problem.

Did you? I can't seem to find it.

Cheers

robert

Stanley Xu · Mar 22, 2011

Anyway, let me send it again.

Dear buddies,

I am using Ruby to run some MapReduce jobs in Hadoop Streaming.
Unfortunately, some of our input is dirty data containing invalid byte
sequences. So when running things like

line.chomp.split("\t")

I will get errors like
:in `split': invalid byte sequence in UTF-8 (ArgumentError)

I searched a bit and tried using Iconv to ignore the invalid sequences:

require 'iconv'

if !line.valid_encoding?
  ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
  line = ic.iconv(line)
end

That fixes most of the invalid lines, but a few still raise the same
error.

Is there a way to make String#split work in Ruby 1.9 on strings that
contain invalid byte sequences?

Thanks in advance

Best wishes,
Stanley Xu
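
One way to drop the offending bytes without Iconv, sketched here as an illustration rather than as the fix adopted in this thread, is to round-trip the line through another Unicode encoding with String#encode and its :invalid => :replace option. The helper name drop_invalid_utf8 is hypothetical. The round trip is used because the Ruby 1.9 documentation describes a plain UTF-8 to UTF-8 encode as a no-op; on Ruby 2.1 and later, String#scrub does the same job directly, so the sketch prefers it when available.

```ruby
# Hypothetical helper: remove bytes that are not valid UTF-8.
def drop_invalid_utf8(line)
  return line if line.valid_encoding?

  if line.respond_to?(:scrub)
    # Ruby 2.1+: replace each invalid byte sequence with an empty string.
    line.scrub('')
  else
    # Ruby 1.9/2.0: transcode to UTF-16BE and back, dropping invalid bytes.
    line.encode('UTF-16BE', 'UTF-8', :invalid => :replace, :replace => '')
        .encode('UTF-8')
  end
end

dirty  = "id\t42\xFF\tname\n"          # \xFF can never appear in valid UTF-8
fields = drop_invalid_utf8(dirty).chomp.split("\t")
```

Note that this silently discards data, which may or may not be acceptable for a given job.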

Joey Zhou · Mar 22, 2011

Maybe you should not transcode the data from its external encoding to
UTF-8.
I got stuck on a similar encoding problem: some GBK characters
cannot be converted to UTF-8.

# encoding: utf-8
File.open('file.txt', 'r:gbk').each_line do |line| # not 'r:gbk:utf-8'
  arr = line.chomp.split("\t".encode('gbk')) # encode "\t" to gbk
  # blah blah
end

Joey
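
The idea above can be tried without a file on disk. In this sketch the sample line is built in memory and tagged as GBK, split with a GBK-encoded tab, and only the resulting fields are transcoded to UTF-8. The sample text and variable names are illustrative assumptions.

```ruby
# Stands in for a line read with 'r:gbk': two Chinese characters, a tab, "abc".
gbk_line = "\u4E2D\u6587\tabc\n".encode('GBK')

tab    = "\t".encode('GBK')            # separator in the same encoding as the data
fields = gbk_line.chomp.split(tab)     # fields stay tagged as GBK

# Transcode per field, only when UTF-8 is actually needed.
utf8_fields = fields.map { |f| f.encode('UTF-8') }
```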

Stanley Xu · Mar 22, 2011

Hi Joey,

I don't think that's the problem. The file is most likely UTF-8: about a
million lines split fine, but roughly 1,000 of them raise the "invalid
byte sequence" error.

Now I have a temporary solution like the following:

if !line.valid_encoding?
  line = line.unpack('C*').pack('U*')
end
fields = line.chomp.split("\t")

But I doubt this is a good solution, because the invalid bytes might
be a valid sequence in GBK or some other encoding.

Isn't there a way to split the string in Ruby 1.9 the old 1.8 "dirty"
way?

Best wishes,
Stanley Xu
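
The 1.8-style byte-oriented behaviour asked about above can be approximated by tagging the line as binary before splitting, so that Ruby does not validate it as UTF-8. This is a minimal sketch, not something proposed in the thread; split_bytes is a hypothetical name. The fields are re-tagged as UTF-8 afterwards without being repaired, so individual fields may still be invalid and can be checked one by one.

```ruby
# Hypothetical helper: split on raw bytes, ignoring UTF-8 validity.
def split_bytes(line, sep = "\t")
  # dup so the caller's string keeps its original encoding tag
  raw = line.dup.force_encoding(Encoding::BINARY)
  # An ASCII-only separator is compatible with a binary string.
  raw.chomp.split(sep).map { |f| f.force_encoding(Encoding::UTF_8) }
end

# First field contains a stray \xFF; second field ends in a valid 3-byte character.
fields = split_bytes("a\xFFb\tc\xE4\xB8\xAD\td\n")
bad    = fields.reject(&:valid_encoding?)   # fields that still need attention
```

Because no bytes are reinterpreted, a field that turns out to be GBK can still be re-tagged with force_encoding later.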

Joey Zhou · Mar 22, 2011

I am working with Chinese character radicals.
I came across a radical which has a codepoint "\uE839".

-----

# encoding: utf-8
[ STDIN, STDOUT, STDERR ].each do |stdio|
  stdio.set_encoding( 'gbk', 'utf-8' )
end
char = "\uE839"
puts char # Encoding::UndefinedConversionError

-----
f.rb:7:in `write': U+E839 from UTF-8 to GBK
(Encoding::UndefinedConversionError)
from f.rb:7:in `puts'
from f.rb:7:in `puts'
from f.rb:7:in `<main>'

But the equivalent Perl works:

-----
use utf8;
use open ":encoding(gbk)", ":std";

$char = "\N{U+E839}";
print $char;
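
One possible stopgap on the Ruby side for the error above is to transcode explicitly and substitute characters that GBK cannot represent, using the :undef => :replace option of String#encode. This is a sketch, not a fix for the reported behaviour; the replacement character ? is an arbitrary choice, and the label text is illustrative.

```ruby
char = "\uE839"

# Characters with no GBK mapping become "?" instead of raising
# Encoding::UndefinedConversionError.
safe = "radical: #{char}".encode('GBK', :undef => :replace, :replace => '?')
```

The result is already tagged as GBK, so it can be written to a stream that does no further transcoding.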

Ryan Davis · Mar 22, 2011

This sounds like it might be a legitimate bug. Can you file a ticket on
redmine with this code sample?
