Encounter troubles with Regex in Chinese text splitting

Mike Meng · Dec 3, 2005

Hi All,
I'm a Ruby newbie. I'm writing a program to process a big chunk of
Chinese text. The first step is to split the chunk of text into a list
of sentences. In Chinese, the characters are written one after another
without any natural boundary marker like the space in English. Sentences are
separated by one of three special characters (。？！). So at
first glance, I thought it was a simple task:

# $chunk stores the text body
$sentences = $chunk.split(/。|？|！/)
# now $sentences holds the list of sentences.

But when I checked the result, I found that some of the sentences weren't
split correctly. For instance, here is a sentence:
"你没病，他呢？" (meaning "You are not sick, how about him?"). In
GB2312, "病，" is encoded as (hex) b2a1 a3ac, and "。" happens to be
encoded as (hex) a1a3. So the String#split method finds a
"。" in the middle of the sentence and incorrectly splits there.

Certainly this is because String#split (and the Ruby regex
engine) is byte-oriented instead of truly character-oriented, and it's a
frequent problem in the i18n domain. Is there any way in Ruby to split
Chinese text correctly?
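
The collision described above can be reproduced in a few lines. This is a minimal sketch, assuming Ruby 1.9 or later (where every string carries an encoding); the variable names are illustrative, and GBK is used for decoding on the assumption that it is an acceptable superset of GB2312.

```ruby
# Raw GB2312 bytes quoted in the post, held as binary (byte-oriented) strings.
bing_comma = "\xB2\xA1\xA3\xAC".b  # "病，" = B2A1 A3AC
full_stop  = "\xA1\xA3".b          # "。"  = A1A3

# A byte-oriented search finds the full stop straddling the two characters:
# the second byte of "病" plus the first byte of "，".
byte_offset = bing_comma.index(full_stop)
byte_pieces = bing_comma.split(full_stop)

# Decoding the bytes first gives a character-level view of the same data,
# in which no full stop is present.
decoded  = bing_comma.dup.force_encoding("GBK").encode("UTF-8")
char_hit = decoded.include?("。")

puts byte_offset
p byte_pieces.map { |s| s.unpack("H*").first }
puts char_hit
```

The byte-level search reports a match at byte offset 1 and cuts both characters in half, while the decoded string contains no full stop at all.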

Thanks in advance.

myan

Park Heesob · Dec 3, 2005

Hi,

Try the script with $KCODE = "E"

Hope this helps,

Park Heesob

Mike Meng · Dec 3, 2005

Hi Park,
It works. Thank you very much!

Could you please tell me the reason, and where I can find the relevant
documentation?

Thank you.

myan

Park Heesob · Dec 3, 2005

Hi,

$KCODE is the character coding system Ruby handles. If the first character
of $KCODE is `e' or `E', Ruby handles EUC. If it is `s' or `S', Ruby handles
Shift_JIS. If it is `u' or `U', Ruby handles UTF-8. If it is `n' or `N',
Ruby doesn't handle multi-byte characters. The default value is "NONE".
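
For readers on Ruby 1.9 or later, where $KCODE no longer has any effect, a roughly equivalent approach is to tell Ruby the encoding of the data and decode it before splitting. The sketch below is not from the thread: split_sentences is a hypothetical helper, and it assumes the input is GB2312 text (normally stored in its EUC-CN form, which is presumably why the "E" setting helps), decoded here as GBK.

```ruby
# Hypothetical helper, not from the thread: split Chinese text into sentences
# by decoding it to UTF-8 first, so the regex matches characters, not bytes.
# The default encoding name "GBK" is an assumption (a superset of GB2312).
def split_sentences(raw_bytes, encoding: "GBK")
  # force_encoding only relabels the bytes; encode performs the conversion.
  text = raw_bytes.dup.force_encoding(encoding).encode("UTF-8")
  # The three sentence terminators named in the original post.
  text.split(/[。？！]/)
end

# Simulate text read from a GB2312 file as raw bytes.
raw = "你没病，他呢？他很好。".encode("GBK").b
p split_sentences(raw)
```

Because the split runs on decoded characters, the bytes a1 a3 that straddle "病" and "，" are never treated as a full stop.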

Regards,

Park Heesob
