separate Chinese and English! with Ruby

Nanyang Zhan · May 7, 2007

Don't get me wrong: I just want to know how to separate English
words from a string with Ruby.
There are strings (UTF-8 encoded) that record people's names,
like:

摩根·弗里曼 Morgan Freeman
布鲁斯·威利斯 Bruce Willis
李小明 Lee xiao ming

These strings contain a Chinese name (without spaces between the characters),
followed by a space and then an English name,

or
Frank Darabont
just an English name.

Would you give me an idea of how to separate out the Chinese characters (if
any)?

akbarhome · May 7, 2007

a = File.open('a.txt')
a.each {|x| puts x.split(' ', 2) }
Output:
摩根·弗里曼
Morgan Freeman
布鲁斯·威利斯
Bruce Willis
李小明
Lee xiao ming

akbarhome · May 7, 2007

Sorry. Fixed version:
a.each {|x|
  if x[0].to_i > 128 then
    puts x.split(' ', 2)
  else
    puts x
  end
}

This code is quick and dirty.
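
On Ruby 1.9 and later, x[0] returns a one-character String rather than a byte value, so the test above no longer works as written. A minimal sketch of the same byte test for newer Rubies follows; `split_name` is a hypothetical helper name, not something from the thread, and the check still flags any non-ASCII first byte, not only Chinese.

```ruby
# encoding: utf-8
# Same "quick and dirty" idea: look at the first byte of the line.
# split_name is a hypothetical helper, not part of any library.
def split_name(line)
  line = line.chomp
  first_byte = line.getbyte(0)        # Integer 0..255, or nil for an empty string
  if first_byte && first_byte > 127   # non-ASCII lead byte: assume a leading non-English name
    line.split(' ', 2)
  else
    [line]
  end
end

p split_name("李小明 Lee xiao ming")   # split into two parts
p split_name("Frank Darabont")        # left as one part
p split_name("Ôkami: chi")            # also split, because Ô is non-ASCII too
```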

Mariusz Pękala · May 7, 2007

Maybe a regexp similar to
/^([^qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP ]+)/
would help?

Does [a-zA-Z] include Chinese characters? In the Polish locale it includes
Polish non-ASCII characters, so I guess it might include Chinese ones.

I guess you want to split a given string into words (separated by spaces),
and then check whether the first word starts with or includes at least one
Chinese character.
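
As far as Ruby is concerned, a bracket range such as [a-zA-Z] is a plain codepoint range, so it matches ASCII letters only, whatever the locale. The sketch below checks that and condenses the regexp suggested above; the constant and method names are made up for illustration.

```ruby
# encoding: utf-8
# [a-zA-Z] covers ASCII letters only in Ruby regexps.
p "中" =~ /[a-zA-Z]/   # nil
p "Ô" =~ /[a-zA-Z]/    # nil
p "M" =~ /[a-zA-Z]/    # 0

# Condensed form of the suggested regexp: a leading run that contains
# no ASCII letters and no spaces. The names here are hypothetical.
LEADING_NON_ASCII_RUN = /\A([^a-zA-Z ]+)/

def leading_run(line)
  m = LEADING_NON_ASCII_RUN.match(line)
  m && m[1]
end

p leading_run("摩根·弗里曼 Morgan Freeman")   # the Chinese name
p leading_run("Frank Darabont")              # nil
```

Note that this captures any leading non-ASCII run, so an accented Latin letter at the start of a line is captured as well.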

Nanyang Zhan · May 7, 2007

Thanks.
But I was wrong. The strings contain more than just Chinese and English
characters. Now I see characters like Ô, é, á... If x starts with one of
these, x[0] > 128 just as it is for Chinese, but I only want to separate out the Chinese.

So do you know exactly what range of values Chinese characters will
return? Or can you tell me where I can find this kind of information?

Harry Kakueki · May 7, 2007

Try something like this.

t = str.split(//).partition {|x| x=~/[a-z]|[A-Z]/ }
p t[0].join
p t[1].join

Harry

akbarhome · May 7, 2007

These:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
http://www.khngai.com/chinese/charmap/tbluni.php

should get you done.

ustr
=> +"摩根·弗里曼"
irb(main):027:0> ustr[0]
=> U+6469 <CJK Ideograph>
irb(main):028:0> format "%X", ustr[0].to_i.to_s
=> "6469"

Nanyang Zhan · May 7, 2007

Harry said:
Try something like this.

t = str.split(//).partition {|x| x=~/[a-z]|[A-Z]/ }
p t[0].join
p t[1].join

Harry

Thanks, Kakueki, but (the code below was tested in the Ruby on Rails console):

str1 = "中文 English Words"
=> "中文 English Words"
str2 = "Ôkami: chi"
=> "Ôkami: chi"
t = str2.split(//).partition { |x| x=~/[a-z]|[A-Z]/}
=> [["k", "a", "m", "i", "c", "h", "i"], ["Ô", ":", " "]]
p t[0].join
"kamichi"   # I want all non-Chinese characters to remain.
=> nil

t = str1.split(//).partition { |x| x=~/[a-z]|[A-Z]/}
=> [["E", "n", "g", "l", "i", "s", "h", "W", "o", "r", "d", "s"], ["中",
"文", " ", " "]]
p t[0].join
"EnglishWords"   # no space
=> nil

Harry said:
Or this

str.split(//).partition {|x| x.length == 1 }

Harry

This time the spaces are kept (for str1):
=> [[" ", "E", "n", "g", "l", "i", "s", "h", " ", "W", "o", "r", "d",
"s"], ["中", "文"]]

t[0].join
=> " English Words"
t = str2.split(//).partition {|x| x.length == 1 }
=> [["k", "a", "m", "i", ":", " ", "c", "h", "i"], ["Ô"]]
t[0].join
=> "kami: chi"

I think "Ô" behaves just like the Chinese characters here, so it is hard to
pick it out.
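
The length test groups "Ô" with the Chinese characters because both take more than one byte in UTF-8. One way around that on Ruby 1.9 or later is to partition by script rather than by byte length, using the `\p{Han}` regexp property. This is only a sketch, and `split_han` is a hypothetical helper name.

```ruby
# encoding: utf-8
# Partition characters by script: Han ideographs on one side,
# everything else (including accented Latin letters) on the other.
# split_han is a hypothetical helper, not from the thread.
def split_han(str)
  han, rest = str.chars.partition { |ch| ch =~ /\p{Han}/ }
  [han.join, rest.join.strip]
end

p split_han("中文 English Words")   # Han part and the rest, spaces preserved
p split_han("Ôkami: chi")           # no Han part, the string stays whole
```

Punctuation such as the middle dot in transliterated names is not a Han character, so with this sketch it stays with the non-Chinese part.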

John Joyce · May 7, 2007

You could identify the encoding or just make it Unicode, then check
if the characters fall into a range in Unicode; that will identify them.
One shortcut is checking for leading zeros in the Unicode character's
code.
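
The range check described above can be sketched with `String#unpack("U*")`, which decodes a UTF-8 string into Unicode codepoints. The range used here is the main CJK Unified Ideographs block (U+4E00 to U+9FFF); treating that block alone as "Chinese" is an assumption of this sketch, since other CJK blocks exist, and the helper names are made up.

```ruby
# encoding: utf-8
# Main CJK Unified Ideographs block only; extension blocks are not covered.
CJK_UNIFIED = 0x4E00..0x9FFF

# Hypothetical helper: is the first character of +char+ inside the block?
def cjk?(char)
  CJK_UNIFIED.cover?(char.unpack("U").first)
end

# Hypothetical helper: list the codepoints of a string in U+XXXX form.
def codepoints_hex(str)
  str.unpack("U*").map { |cp| format("U+%04X", cp) }
end

p codepoints_hex("摩根")
p cjk?("摩")   # inside the block
p cjk?("Ô")    # U+00D4, outside the block
p cjk?("M")    # ASCII, outside the block
```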

Nanyang Zhan · May 7, 2007

Akbar said:
These:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
http://www.khngai.com/chinese/charmap/tbluni.php

should get you done.

str1 = "中文 English Words"
=> "中文 English Words"
str1[0]
=> 228
str2 = "Ôkami: chi"
=> "Ôkami: chi"
str2[0]
=> 195
str3 = "English Words"
=> "English Words"
str3[0]
=> 69

If only I knew where the numbers for Chinese characters start and end...

Nanyang Zhan · May 7, 2007

John said:
You could identify the encoding or just make it unicode, then check
if the characters fall into a range in unicode, that will identify them.
One shortcut is checking for leading zeros in the unicode character's
code.

John Joyce, thank you for your explanation.
Now I get akbarhome's idea. So I need to download the unicode lib from
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
then convert the strings into Unicode, and then compare the characters
with the CJK Unicode table from here:
http://www.khngai.com/chinese/charmap/tbluni.php?page=5
Yes, it must work!

But look at this:

str1 = "中文 English Words"
=> "中文 English Words"
str1[0]
=> 228
str2 = "Ôkami: chi"
=> "Ôkami: chi"
str2[0]
=> 195
str3 = "English Words"
=> "English Words"
str3[0]
=> 69

Maybe there are numbers that are specific to Chinese.
If only I knew where the numbers for Chinese characters start and end, there
would be a much simpler solution.

John Joyce · May 7, 2007

Yes, that's pretty much how Unicode is supposed to work.
In theory you could even take a sample range of characters to guess the
document language.
The problem is that Unicode allows multilingual documents, which in
some cases is difficult because of fonts and systems' implementations.
But yes, you're on the right track now (IMHO).

And yes, the overhead will be greater, but that's just a fact of
Unicode and large character sets like Chinese and Japanese.
You will also want to check which Chinese!
Chinese is split into two (politically safe) names: Traditional and
Simplified.
If you were doing Japanese text, separating English or other western
languages wouldn't be so easy, since Japanese essentially includes a
number of other languages' character sets in its Unicode set and in
everyday usage.

Nanyang Zhan · May 7, 2007

John said:
And yes, the overhead will be greater, but that's just a fact of
unicode and large character sets like chinese and japanese.
You will also want to check which chinese!
Chinese is split into two (politically safe) names: Traditional and
Simplified.
If you were doing Japanese text, separating English or other western
languages wouldn't be so easy, since Japanese essentially includes a
number of other languages' character sets in its unicode set and in
everyday usage.

You are right. And leaving the characters aside, there is also a different
set of punctuation marks!

So you don't think there is a document describing the range of numbers that
string[0] returns for a given language?

I wonder what those numbers mean...

John Joyce · May 7, 2007

There is a doc.
Go to
www.unicode.org
There should be a PDF (many, actually).
I don't know if the two main Chinese sets are encoded as different
ranges or simply declared in some way.
In general, in Unicode a character is the same character even when it
appears in a different language.

John Joyce · May 7, 2007

NZ,
You might want to check the RubyGems gem unihan
At the command line type:
gem list --remote uni
and it will show up.
then
gem install unihan --include-dependencies

I haven't checked it out yet, but after installing it, check the
documentation.
It seems to be an API to the Unihan online database.
Could be quite useful.

John Joyce

John Joyce · May 7, 2007

NZ
another English site on Unicode that may be easier to understand (it
was for me)
http://www.alanwood.net/unicode/index.html

There must surely be some docs in Chinese somewhere.
I know here in Japan there are many books on the subject. (in
Japanese) Since computer science in Japan does deal with it a lot.
I've been interested in this subject myself, but it is a big one.
Unicode.org published the print version of 5.0 and I have browsed the
book in the bookstore, it is worth checking out. Maybe a nearby
university library would have it also.

It certainly seems like a point where a compiled language, such as C,
would be helpful.
Most interpreted languages are only now reaching partial Unicode support
because of the overhead of processing many languages, the
sheer volume of material to deal with, and the various algorithms
necessary for languages whose writing depends on context (Arabic,
Hebrew, Indic languages, etc.).

Perhaps Perl, Ruby, Python, and PHP should get hooks from Apple
and Microsoft to help these languages be more productive by using
their implementations.

Jaypee · May 8, 2007

John Joyce wrote:

NZ
another English site on Unicode that may be easier to understand (it was
for me)
http://www.alanwood.net/unicode/index.html

May I also suggest this plain-English introduction; I'm quoting the title:
The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)
<http://www.joelonsoftware.com/articles/Unicode.html>
J-P

eden li · May 8, 2007

There is documentation:

ri String#[]

Although it is a little vague about what "character code" means. By
default (in Ruby 1.8.x) the number returned by some_string[0] is a
Fixnum in the range [0,255], even for UTF-8 encoded strings. Ruby
will just treat the string as a string of 8-bit bytes and give you
back whatever byte you asked for.

irb(main):001:0> s = "大智若愚"
=> "\345\244\247\346\231\272\350\213\245\346\204\232"
irb(main):002:0> s[0]
=> 229
irb(main):003:0> s.length
=> 12
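
For comparison, Ruby 1.9 and later separate the byte view from the character view of a string. A short sketch with the same string as above, assuming the source file is UTF-8:

```ruby
# encoding: utf-8
s = "大智若愚"

p s.bytesize     # 12, the number of bytes (what 1.8's length reported)
p s.length       # 4, the number of characters
p s.getbyte(0)   # 229, the first byte (what 1.8's s[0] reported)
p s[0]           # "大", the first character
p s[0].ord       # 22823, the codepoint U+5927
```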

Zev Blut · May 8, 2007

John said:
If you were doing Japanese text, separating English or other western
languages wouldn't be so easy, since Japanese essentially includes a
number of other languages' character sets in its unicode set and in
everyday usage.

If the goal is to separate the western languages from the Japanese
Kanji and Kana, then it appears to not be too bad when using a lib
like this:

http://raa.ruby-lang.org/project/moji/

http://gimite.net/gimite/rubymess/moji.html

Zev

Nanyang Zhan · May 8, 2007

John said:
I don't know if the two main chinese sets are encoded as different
ranges or simply declared in some way.
In general in Unicode a character is the same character even when it
appears in a different language.

Many characters in these two sets of Chinese (and, in fact, the Chinese
characters used in Japanese and Korean) are the same. Aren't they encoded
with the same codes when they are identical?

Gary said:
I believe the range is (in hex) 3400 to 97A5

You must mean Unicode range.
http://www.khngai.com/chinese/charmap/tbluni.php?page=0

John said:
You might want to check the RubyGems gem unihan

.... hmmmmm.. if only I could find out what it does...

John said:
http://www.alanwood.net/unicode/index.html

I've been interested in this subject myself, but it is a big one.

Interesting subject indeed it is.

Today I tried this(!!!!under RoR console!!!!):=> ["â€œ", "â€ã€‚", "ï¼Œ", "ï¼", "ï¼œ", "ï½›", "ï¼›", "â€˜", "ï¼", "ï¼ ", "ï¼ƒ", "ï¼„", "ï¼…",
"â€¦", "ï¼Š", "ï¼ˆ", "ï¼‰", "ä¸€", "ä¿¿", "å€€", "å‡¿", "å‹¿", "å¿", "å“¿", "å›¿", "å§¿", " å¯¿",
"å´", "å¿„å¿¿", "æ˜", "æ‰‰", "æŽµ", "æ›†", "æ¡¶", "æª—", "æ³—", "æ¿—", "ç€–", "ç‡¿", "ç‹§", "ç—",
"ç—¿", "çœ€", "ç§Š", "ç«—", "ç¯¿", "ç´€", "ç¿¹", "é€€", "é‡½", "éŽ·", "é–ˆ", "é˜€", "éŸ—", "é¥§",
"éª ", "é¶†", "é¾¥"]

c.collect.map{|o| o[0]}
=> [226, 226, 239, 239, 239, 239, 239, 226, 239, 239, 239, 239, 239,
226, 239, 239, 239, 228, 228, 229, 229, 229, 229, 229, 229, 229, 229,
229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233, 233,
233, 233, 233]

c.collect.map{|o| o[0]}.sort
=> [226, 226, 226, 226, 228, 228, 229, 229, 229, 229, 229, 229, 229,
229, 229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233,
233, 233, 233, 233, 239, 239, 239, 239, 239, 239, 239, 239, 239, 239,
239, 239, 239]

c.collect.map{|o| o[0]}.sort.uniq
=> [226, 228, 229, 230, 231, 233, 239]

These punctuation marks are the ones commonly used in China.
The Chinese characters were picked at random from
http://www.khngai.com/chinese/charmap/tbluni.php?page=0
(from all six pages).

Maybe 226 to 239 is the range I need.
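
A note on what those numbers are: 226 to 239 (0xE2 to 0xEF) are lead bytes of three-byte UTF-8 sequences, and each lead byte covers a slice of 4096 codepoints, so the first byte alone cannot single out Chinese. The sketch below maps a lead byte to its codepoint slice; `lead_byte_range` is a hypothetical helper that follows the standard UTF-8 layout and ignores the surrogate gap under 0xED.

```ruby
# encoding: utf-8
# Codepoint slice covered by one lead byte of a 3-byte UTF-8 sequence.
# Hypothetical helper; the surrogate gap under 0xED is ignored.
def lead_byte_range(byte)
  raise ArgumentError, "not a 3-byte lead byte" unless (0xE0..0xEF).cover?(byte)
  base = (byte & 0x0F) << 12
  low = byte == 0xE0 ? 0x0800 : base   # below U+0800 shorter encodings are used
  [low, base | 0x0FFF]
end

[226, 228, 233, 239].each do |b|
  lo, hi = lead_byte_range(b)
  puts format("%d -> U+%04X..U+%04X", b, lo, hi)
end
```

By this layout 226 covers U+2000..U+2FFF (general punctuation and symbols), 227 covers U+3000..U+3FFF (which includes Japanese kana), and 239 covers U+F000..U+FFFF (which includes the fullwidth forms), so a 226..239 byte test accepts far more than Chinese ideographs.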
