How can I make Ruby 1.9's String#split ignore invalid UTF-8 byte sequences?

Discussion in 'Ruby' started by Stanley Xu, Mar 22, 2011.

  1. Stanley Xu

    Stanley Xu Guest

    [Note: parts of this message were removed to make it a legal post.]

    Dear buddies,

    I am using Ruby to run some MapReduce jobs in Hadoop Streaming.
    Unfortunately, our input contains some dirty data with invalid byte
    sequences. So when running something like

    line.chomp.split("\t")

    I will get

    Best wishes,
    Stanley Xu
     
    Stanley Xu, Mar 22, 2011
    #1

  2. Stanley Xu

    Stanley Xu Guest

    Sorry, I accidentally sent the half-typed mail with a Gmail keyboard shortcut.

    I have just resent a mail describing the problem.

    Best wishes,
    Stanley Xu
     
    Stanley Xu, Mar 22, 2011
    #2

  3. Did you? I can't seem to find it.

    Cheers

    robert
     
    Robert Klemme, Mar 22, 2011
    #3
  4. Stanley Xu

    Stanley Xu Guest

    Anyway, let me resend it again.


    Dear buddies,

    I am using Ruby to run some MapReduce jobs in Hadoop Streaming.
    Unfortunately, our input contains some dirty data with invalid byte
    sequences. So when running something like

    line.chomp.split("\t")

    I will get errors like
    :in `split': invalid byte sequence in UTF-8 (ArgumentError)

    I searched a little and tried to use Iconv to drop the invalid sequences
    like this:

    require 'iconv'

    if !line.valid_encoding?
      # //IGNORE asks iconv to skip bytes it cannot convert
      ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
      line = ic.iconv(line)
    end

    It fixes most of the invalid lines, but a couple of lines still raise
    the same error.

    Is there a way to make String#split work in Ruby 1.9 on strings that
    contain invalid byte sequences?

    Thanks in advance

    Best wishes,
    Stanley Xu
     
    Stanley Xu, Mar 22, 2011
    #4
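
One way to get a similar effect without Iconv is Ruby 1.9's own String#encode with the :invalid => :replace option. The sketch below is only an illustration, not something proposed in the thread: the helper name scrub_utf8 and the "?" replacement are assumptions. It goes through UTF-16LE and back because a conversion from an encoding to itself may be skipped, which would leave the bad bytes in place. On Ruby 2.1 and later, String#scrub does this job directly.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the thread): returns a copy of `line` in which
# every invalid UTF-8 byte is replaced by `replacement`, so String#split no
# longer raises ArgumentError.
def scrub_utf8(line, replacement = "?")
  return line if line.valid_encoding?

  # Round-trip through UTF-16LE so that a real conversion takes place and the
  # :invalid => :replace option is applied to the bad bytes.
  line.encode("UTF-16LE", :invalid => :replace, :undef => :replace,
                          :replace => replacement).encode("UTF-8")
end

# Example input: 0xFF can never occur in well-formed UTF-8.
dirty  = "ab\xFFcd\tef".dup.force_encoding("UTF-8")
fields = scrub_utf8(dirty).chomp.split("\t")
```

Note that this discards the offending bytes, so it is only appropriate when the dirty fields do not need to be recovered later.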
  5. Joey Zhou

    Joey Zhou Guest

    Maybe you should not transcode the data from its external encoding to
    UTF-8.
    I ran into a similar encoding problem where some GBK characters
    could not be converted to UTF-8.

    # encoding: utf-8
    File.open('file.txt', 'r:gbk').each_line do |line| # not 'r:gbk:utf-8'
      arr = line.chomp.split("\t".encode('gbk')) # encode "\t" to gbk
      # blah blah
    end

    Joey
     
    Joey Zhou, Mar 22, 2011
    #5
  6. Stanley Xu

    Stanley Xu Guest

    Hi Joey,

    I don't think that's the problem. The file is most likely UTF-8: about
    1 million lines split fine, but around 1000 of them raise the
    "invalid byte sequence" error.

    Now I have a temporary solution like the following:

    if !line.valid_encoding?
      # treat every byte as a code point and re-encode it as UTF-8
      line = line.unpack('C*').pack('U*')
    end
    fields = line.chomp.split("\t")

    But I really doubt it is a good solution, because the invalid bytes might
    be a valid sequence in GBK or some other encoding.

    Isn't there a way I could split the string in ruby 1.9 in the old 1.8 "dirty
    way"?

    Best wishes,
    Stanley Xu
     
    Stanley Xu, Mar 22, 2011
    #6
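
A sketch of the "1.8-style" split asked about above, under the assumption that the fields only need to be separated and not interpreted as characters. Tagging the line as binary (ASCII-8BIT) makes Ruby treat it as raw bytes, so split cannot raise an encoding error. This is safe for tab-separated data because the tab byte 0x09 never appears inside a multi-byte UTF-8 sequence. The helper name split_fields_binary is hypothetical. Unlike unpack('C*').pack('U*'), which rewrites every byte above 0x7F as a two-byte character, this leaves the original bytes untouched.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the thread): splits a tab-separated line
# byte-wise, without validating its encoding.
def split_fields_binary(line)
  # dup so the caller's string keeps its original encoding tag
  raw = line.dup.force_encoding(Encoding::BINARY)
  tab = "\t".dup.force_encoding(Encoding::BINARY)
  raw.chomp.split(tab)
end

# Example input: 0xFF can never occur in well-formed UTF-8.
line   = "a\xFFb\tc\n".dup.force_encoding("UTF-8")
fields = split_fields_binary(line)

# Fields come back tagged as binary; re-tag them if the rest of the job
# expects UTF-8. Fields containing bad bytes still report
# valid_encoding? == false and can be handled individually.
fields.each { |f| f.force_encoding("UTF-8") }
```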
  7. Joey Zhou

    Joey Zhou Guest

    I am working with Chinese character radicals.
    I came across a radical which has a codepoint "\uE839".

    -----

    # encoding: utf-8
    [ STDIN, STDOUT, STDERR ].each do |stdio|
      stdio.set_encoding( 'gbk', 'utf-8' )
    end
    char = "\uE839"
    puts char # Encoding::UndefinedConversionError

    -----
    f.rb:7:in `write': U+E839 from UTF-8 to GBK
    (Encoding::UndefinedConversionError)
    from f.rb:7:in `puts'
    from f.rb:7:in `puts'
    from f.rb:7:in `<main>'

    But the equivalent Perl works:

    -----
    use utf8;
    use open ":encoding(gbk)", ":std";

    $char = "\N{U+E839}";
    print $char;
     
    Joey Zhou, Mar 22, 2011
    #7
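
If the goal is only to keep the program from aborting on characters that the target encoding cannot represent, String#encode accepts :undef => :replace. The sketch below is an assumption-laden illustration, not a fix for the behaviour reported above: the helper name to_gbk_lossy and the "?" replacement are hypothetical, and the unmappable character is silently lost.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the thread): converts text to GBK and
# substitutes `replacement` for anything GBK has no mapping for, instead of
# raising Encoding::UndefinedConversionError.
def to_gbk_lossy(text, replacement = "?")
  text.encode("GBK", :invalid => :replace, :undef => :replace,
                     :replace => replacement)
end

# Example: U+1F600 is used here as a code point with no GBK mapping.
unmappable = [0x1F600].pack("U")
converted  = to_gbk_lossy("ok" + unmappable)
```

Whether replacing is acceptable depends on the data; for the radical in the post above it would discard exactly the character of interest.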
  8. Ryan Davis

    Ryan Davis Guest

    This sounds like it might be a legitimate bug. Can you file a ticket on
    redmine with this code sample?
     
    Ryan Davis, Mar 22, 2011
    #8
