How to use String.split to split a mixed encoding string (part encoded in GBK, part encoded in UTF-8)

Discussion in 'Ruby' started by Stanley Xu, Mar 23, 2011.

  1. Stanley Xu

    Stanley Xu Guest

    Dear Buddies,

    Yesterday I sent a mail asking how to make split ignore invalid UTF-8 byte
    sequences. I have since checked the string I want to parse in Java and found
    that part of it is encoded in GBK and part of it is encoded in UTF-8.

    I am wondering whether I can still split the string with the split method
    and then call force_encoding on the parts that might be encoded in GBK to
    resolve the problem.

    Is there a way to do so without hitting the "invalid byte sequence" error?

    Thanks.

    Best wishes,
    Stanley Xu
    Stanley Xu, Mar 23, 2011
    #1

  2. On Wed, Mar 23, 2011 at 4:53 AM, Stanley Xu wrote:
    > Yesterday, I sent a mail of let the split ignore the error utf-8 bytes
    > sequences. And I checked the string I wanted to parse in Java and found out
    > that the string is encoded in gbk and part of the string is encoded in
    > utf-8.
    >
    > I am wondering if I could find a way to still split the string by split
    > method, and then I could try to force_encoding part of the string that might
    > encoded in gbk and resolve the problem.
    >
    > I am wondering if there is a way I could do so without the "invalid bytes
    > sequence" error?


    A string with mixed encodings is difficult to handle. I think you
    have these options:

    1. Ensure that the string does *not* contain mixed encodings (this
    would be the first and best choice IMHO).

    2. If you can't, because you get the data from somewhere else, use
    encoding BINARY as a detour:

    # Treat the string as raw bytes so that split does not validate characters.
    mixed_content.force_encoding(Encoding::BINARY)
    chunks = mixed_content.split(/\t/)
    # Tag each chunk with the encoding it is actually in.
    chunks[0].force_encoding(Encoding::UTF_8)
    chunks[1].force_encoding(Encoding::GBK)

    or

    mixed_content.force_encoding(Encoding::BINARY)
    a, b = mixed_content.split(/\t/)
    a.force_encoding(Encoding::UTF_8)
    b.force_encoding(Encoding::GBK)
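
    The two options can also be combined: split on raw bytes, tag each field,
    then transcode everything to one encoding. Below is a minimal sketch, not a
    definitive implementation. The helper name split_mixed, the sample data and
    the field order (UTF-8 first, GBK second) are assumptions made up for
    illustration. Splitting on a tab at byte level should be safe here because
    neither UTF-8 nor GBK uses the byte 0x09 inside a multi-byte character.

```ruby
# Hypothetical helper (not from the thread): split on raw bytes, then tag each
# field with the encoding the caller expects at that position.
def split_mixed(line, encodings, separator = /\t/)
  raw = line.dup.force_encoding(Encoding::BINARY) # work on a copy of the input
  raw.split(separator, -1).each_with_index.map do |field, i|
    # Fields without a listed encoding stay BINARY.
    field.force_encoding(encodings.fetch(i, Encoding::BINARY))
  end
end

# Made-up sample data: a UTF-8 field and a GBK field joined by a tab.
utf8_field = "caf\u00E9"
gbk_field  = "\u4F60\u597D".encode(Encoding::GBK)
mixed      = utf8_field.b + "\t" + gbk_field.b

name, greeting = split_mixed(mixed, [Encoding::UTF_8, Encoding::GBK])

# valid_encoding? checks whether each guessed encoding fits the bytes.
puts name.valid_encoding?
puts greeting.valid_encoding?

# Transcode the GBK field so that everything downstream is UTF-8.
greeting_utf8 = greeting.encode(Encoding::UTF_8)
puts greeting_utf8 == "\u4F60\u597D"
```

    If valid_encoding? returns false for a field, the guessed encoding for that
    position is probably wrong and the field is better left as BINARY until its
    real encoding is known.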

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Mar 23, 2011
    #2

  3. Stanley Xu

    Stanley Xu Guest

    Thanks a lot, Robert. Your solution really helps.

    Best wishes,
    Stanley Xu



    Stanley Xu, Mar 23, 2011
    #3