How to use String.split to split a mixed-encoding string (part encoded in GBK, part encoded in UTF-8)

Discussion in 'Ruby' started by Stanley Xu, Mar 23, 2011.

  1. Stanley Xu

    Dear Buddies,

    Yesterday I sent a mail asking how to make split ignore invalid UTF-8
    byte sequences. I have since checked the string I want to parse in Java and
    found that part of it is encoded in GBK and part of it is encoded in
    UTF-8.

    I am wondering whether I can still split the string with the split
    method and then call force_encoding on the parts that might be
    encoded in GBK to resolve the problem.

    Is there a way to do this without getting the "invalid byte
    sequence" error?

    Thanks.

    Best wishes,
    Stanley Xu
     
    Stanley Xu, Mar 23, 2011
    #1

  2. A string with mixed encodings is difficult to handle. I think you
    have these options:

    1. Ensure that the string does *not* contain mixed encoding (this
    would be the first and best choice IMHO).

    2. If you can't because you get the data from somewhere else, take a
    detour via the BINARY encoding:

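    # Treat the string as raw bytes so split does not validate the encoding
    mixed_content.force_encoding(Encoding::BINARY)
    chunks = mixed_content.split(/\t/)
    # Then tag each chunk with its real encoding
    chunks[0].force_encoding(Encoding::UTF_8)
    chunks[1].force_encoding(Encoding::GBK)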

    or

    mixed_content.force_encoding(Encoding::BINARY)
    a, b = mixed_content.split(/\t/)
    a.force_encoding(Encoding::UTF_8)
    b.force_encoding(Encoding::GBK)
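
    The snippets above assume you already know which field uses which
    encoding. If you don't, here is a rough sketch of one way to guess per
    chunk: try UTF-8 first and fall back to GBK. The sample data and the
    tag_encoding helper are made up for illustration and are not from the
    original post.

```ruby
# Sample line (made up): a UTF-8 field and a GBK field joined by a tab.
utf8_part = "\u4F60\u597D"                        # UTF-8 bytes
gbk_part  = "\u4E16\u754C".encode(Encoding::GBK)  # same kind of text, GBK bytes
mixed_content = utf8_part.b + "\t".b + gbk_part.b # raw bytes, as read from a file

# Hypothetical helper: tag a binary chunk as UTF-8 if its bytes form valid
# UTF-8, otherwise tag it as GBK. Works on a copy, so the input is untouched.
def tag_encoding(chunk)
  candidate = chunk.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?
  chunk.dup.force_encoding(Encoding::GBK)
end

# Split on the raw bytes. A tab byte never occurs inside a multibyte
# character in either UTF-8 or GBK, so this cannot cut a character in half.
chunks = mixed_content.split(/\t/)

# Tag each chunk, then transcode everything to UTF-8 for further processing.
tagged  = chunks.map { |chunk| tag_encoding(chunk) }
as_utf8 = tagged.map { |s| s.encode(Encoding::UTF_8) }

p tagged.map(&:encoding)
p as_utf8
```

    Keep in mind that this is only a heuristic: some GBK byte sequences
    also happen to be valid UTF-8, so a GBK chunk can be mistaken for
    UTF-8. If the field layout is known, tagging by position as shown
    above is more reliable.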

    Kind regards

    robert
     
    Robert Klemme, Mar 23, 2011
    #2

  3. Stanley Xu

    Thanks a lot, Robert. Your solution really helps.

    Best wishes,
    Stanley Xu



     
    Stanley Xu, Mar 23, 2011
    #3
