How to use String.split to split a mixed-encoding string (part encoded in GBK, part encoded in UTF-8)

Stanley Xu

[Note: parts of this message were removed to make it a legal post.]

Dear Buddies,

Yesterday I sent a mail asking how to make split ignore invalid UTF-8 byte
sequences. Since then I have checked the string I want to parse in Java and
found that part of it is encoded in GBK and part of it is encoded in
UTF-8.

I am wondering whether I can still split the string with the split
method and then call force_encoding on the parts that might be
encoded in GBK to resolve the problem.

Is there a way to do so without the "invalid byte sequence" error?

Thanks.

Best wishes,
Stanley Xu
 
Robert Klemme

A string with mixed encodings is difficult to handle. I think you
have these options:

1. Ensure that the string does *not* contain mixed encodings (this
would be the first and best choice IMHO).

2. If you can't, because you get the data from somewhere else, use
encoding BINARY as a detour: tag the whole string as BINARY, split it,
and then tag each chunk with its real encoding:

mixed_content.force_encoding(Encoding::BINARY)
chunks = mixed_content.split(/\t/)
chunks[0].force_encoding(Encoding::UTF_8)
chunks[1].force_encoding(Encoding::GBK)

or

mixed_content.force_encoding(Encoding::BINARY)
a, b = mixed_content.split(/\t/)
a.force_encoding(Encoding::UTF_8)
b.force_encoding(Encoding::GBK)
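
If you do not know in advance which chunk uses which encoding, something like the following might work as a rough sketch. The sample data and the decode_chunk helper are made up for illustration; they are not from your program. The "try UTF-8 first, then fall back to GBK" rule is only a heuristic, since some GBK byte sequences also happen to be valid UTF-8.

```ruby
# Hypothetical sample data: a UTF-8 field and a GBK field joined by a tab.
utf8_part = "caf\u00E9"                            # UTF-8 bytes
gbk_part  = "\u4E2D\u6587".encode(Encoding::GBK)   # the same kind of text, as GBK bytes

# String#b returns a copy tagged as BINARY (ASCII-8BIT), so the
# concatenation is just a sequence of raw bytes.
mixed_content = utf8_part.b + "\t".b + gbk_part.b

# No byte sequence is invalid in BINARY, so split does not complain.
chunks = mixed_content.split(/\t/)

# Hypothetical helper: treat the bytes as UTF-8 if they are valid UTF-8,
# otherwise assume GBK and transcode to UTF-8.
def decode_chunk(bytes)
  as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?

  bytes.dup.force_encoding(Encoding::GBK).encode(Encoding::UTF_8)
end

decoded = chunks.map { |chunk| decode_chunk(chunk) }
decoded.each { |s| puts "#{s.encoding}: #{s}" }
```

Note that force_encoding only changes the label on the bytes, while encode actually converts them, which is why the GBK branch needs both.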

Kind regards

robert
 
Stanley Xu

[Note: parts of this message were removed to make it a legal post.]

Thanks a lot, Robert. Your solution really helps.

Best wishes,
Stanley Xu



