Separate Chinese and English with Ruby

Nanyang Zhan

Don't get me wrong: I just want to know how to separate the English
words from a string with Ruby.
I have UTF-8 encoded strings that record people's names, like:

摩根·弗里曼 Morgan Freeman
布鲁斯·威利斯 Bruce Willis
李小明 Lee xiao ming

Each of these strings contains a Chinese name (with no spaces between the
characters), followed by a space and then an English name.

Or:

Frank Darabont

Just an English name.

Could you give me an idea of how to separate out the Chinese characters (if
any)?
 
akbarhome

a = File.open('a.txt')
a.each {|x| puts x.split(' ', 2) }

Output:
摩根·弗里曼
Morgan Freeman
布鲁斯·威利斯
Bruce Willis
李小明
Lee xiao ming
 
akbarhome

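Sorry. Fixed version:

# Ruby 1.8: x[0] is the first byte of the line as an integer.
# A first byte above 128 means the line starts with a multi-byte character.
a.each {|x|
  if x[0].to_i > 128 then
    puts x.split(' ', 2)
  else
    puts x
  end
}

This code is quick and dirty.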
 
Mariusz Pękala

Maybe a regexp similar to
/^([^qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP ]+)/
would help?

Does [a-zA-Z] include Chinese characters? In a Polish locale it includes
Polish non-ASCII characters, so I guess it might include Chinese ones.

I guess you want to split a given string into words (separated by spaces),
and then check whether the first word starts with or includes at least one
Chinese character.
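
A minimal sketch of that last idea, assuming Ruby 1.9 or later, where regular expressions understand the \p{Han} script property. The helper name split_name is made up for illustration and does not come from the thread:

```ruby
# Split a "ChineseName EnglishName" line on the first space, but only when
# the first word contains at least one Han (Chinese) character.
# split_name is a hypothetical helper; \p{Han} needs Ruby 1.9 or later.
def split_name(line)
  first, rest = line.strip.split(' ', 2)
  if first.to_s =~ /\p{Han}/
    [first, rest]        # Chinese name, then whatever follows it
  else
    [nil, line.strip]    # no Chinese name: keep the whole line together
  end
end

p split_name("摩根·弗里曼 Morgan Freeman")  # => ["摩根·弗里曼", "Morgan Freeman"]
p split_name("Frank Darabont")              # => [nil, "Frank Darabont"]
p split_name("Ôkami: chi")                  # => [nil, "Ôkami: chi"]
```

Because the test is "contains a Han character" rather than "contains a non-ASCII byte", accented Latin letters such as Ô do not trigger the split.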

 
Nanyang Zhan

Thanks.
But I was wrong. The strings contain more than just Chinese and English
characters. Now I see characters like Ô, é, á... If a string starts with one of
these, x[0] > 128 just as it is for Chinese, but I only want to separate out the Chinese.

So do you know exactly what range of values Chinese characters
return? Or can you tell me where I can find this kind of information?
 
Harry Kakueki

Try something like this.

t = str.split(//).partition {|x| x=~/[a-z]|[A-Z]/ }
p t[0].join
p t[1].join

Harry
 
akbarhome

These:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
http://www.khngai.com/chinese/charmap/tbluni.php

should get you done.

ustr
=> +"摩根·弗里曼"
irb(main):027:0> ustr[0]
=> U+6469 <CJK Ideograph>
irb(main):028:0> format "%X", ustr[0].to_i.to_s
=> "6469"
irb(main):029:0>
 
Nanyang Zhan

Harry said:
Try something like this.

t = str.split(//).partition {|x| x=~/[a-z]|[A-Z]/ }
p t[0].join
p t[1].join

Harry

Thanks, Kakueki, but (the code below was tested in the Ruby on Rails console):

str1 = "中文 English Words"
=> "中文 English Words"
str2 = "Ôkami: chi"
=> "Ôkami: chi"
t = str2.split(//).partition { |x| x=~/[a-z]|[A-Z]/}
=> [["k", "a", "m", "i", "c", "h", "i"], ["Ô", ":", " "]]
p t[0].join
"kamichi"        # I want all the non-Chinese characters to remain.
=> nil
t = str1.split(//).partition { |x| x=~/[a-z]|[A-Z]/}
=> [["E", "n", "g", "l", "i", "s", "h", "W", "o", "r", "d", "s"], ["中", "文", " ", " "]]
p t[0].join
"EnglishWords"   # no space
=> nil

Harry said:
Or this

str.split(//).partition {|x| x.length == 1 }

Harry

This time the spaces are kept:

t = str1.split(//).partition {|x| x.length == 1 }
=> [[" ", "E", "n", "g", "l", "i", "s", "h", " ", "W", "o", "r", "d", "s"], ["中", "文"]]
t[0].join
=> " English Words"
t = str2.split(//).partition {|x| x.length == 1 }
=> [["k", "a", "m", "i", ":", " ", "c", "h", "i"], ["Ô"]]
t[0].join
=> "kami: chi"

I think "Ô" behaves just like a Chinese character here, so it is hard to
filter out.
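
One way around the "Ô" problem is to partition on script instead of on byte length. The following is only a sketch, assuming Ruby 1.9 or later (for each_char and \p{Han}); separate_han is a made-up name, not something from the thread:

```ruby
# Partition the characters of a string into Han (Chinese) characters and
# everything else. separate_han is a hypothetical helper name.
def separate_han(str)
  han, other = str.each_char.partition { |c| c =~ /\p{Han}/ }
  [han.join, other.join.strip]
end

p separate_han("中文 English Words")  # => ["中文", "English Words"]
p separate_han("Ôkami: chi")          # => ["", "Ôkami: chi"]
```

Note that punctuation such as the middle dot in 摩根·弗里曼 is not a Han character, so a per-character partition moves it into the second string. For names like that, splitting on whole words may be the better fit.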
 
John Joyce

You could identify the encoding, or just convert the text to Unicode, and then check
whether the characters fall into a given Unicode range; that will identify them.
One shortcut is checking for leading zeros in the Unicode character's
code.
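
A rough sketch of such a range check, assuming Ruby 1.9 or later (for String#ord). The list of ranges is an assumption about what should count as "Chinese" here: it covers the main CJK ideograph blocks of the Basic Multilingual Plane and nothing beyond it. HAN_RANGES and han_char? are illustrative names:

```ruby
# Unicode blocks treated as "Chinese" in this sketch (an assumption; extend
# the list if supplementary-plane ideographs matter for your data).
HAN_RANGES = [
  0x3400..0x4DBF,  # CJK Unified Ideographs Extension A
  0x4E00..0x9FFF,  # CJK Unified Ideographs
  0xF900..0xFAFF   # CJK Compatibility Ideographs
]

# True when the single-character string falls inside one of the ranges.
def han_char?(char)
  code = char.ord
  HAN_RANGES.any? { |range| range.include?(code) }
end

p han_char?("摩")  # => true  (U+6469)
p han_char?("Ô")   # => false (U+00D4)
p han_char?("M")   # => false (U+004D)
```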
 
Nanyang Zhan

John Joyce, thank you for your explanation.
Now I get akbarhome's idea. So I need to download the Unicode library from here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
Then convert the strings to Unicode, and then compare the characters
with the CJK Unicode table from here:
http://www.khngai.com/chinese/charmap/tbluni.php?page=5
Yes, it must work!

But look at this:

str1 = "中文 English Words"
=> "中文 English Words"
str1[0]
=> 228
str2 = "Ôkami: chi"
=> "Ôkami: chi"
str2[0]
=> 195
str3 = "English Words"
=> "English Words"
str3[0]
=> 69

Maybe there are numbers that are right for Chinese.
If only I knew at which numbers Chinese characters start and end, there
would be a much simpler solution.
 
John Joyce

Yes, that's pretty much how Unicode is supposed to work.
In theory you could even take a sample range of characters to guess the
document's language.
The problem is that Unicode allows multilingual documents, which in
some cases is difficult because of fonts and systems' implementations.
But yes, you're on the right track now (IMHO).

And yes, the overhead will be greater, but that's just a fact of
Unicode and large character sets like Chinese and Japanese.
You will also want to check which Chinese!
Chinese is split into two (politically safe) names: Traditional and
Simplified.
If you were doing Japanese text, separating English or other Western
languages wouldn't be so easy, since Japanese essentially includes a
number of other languages' character sets in its Unicode set and in
everyday usage.
 
Nanyang Zhan

You are right. And besides the characters, there is a different set of
punctuation marks!

So, you don't think there is a document about the range of numbers that string[0]
returns for a given language?

I wonder what those numbers mean...
 
John Joyce

There is a doc.
Go to
www.unicode.org
There should be a PDF (many, actually).
I don't know if the two main Chinese sets are encoded as different
ranges or simply declared in some way.
In general, in Unicode a character is the same character even when it
appears in a different language.
 
John Joyce

NZ,
You might want to check the RubyGems gem unihan
At the command line type:
gem list --remote uni
and it will show up.
then
gem install unihan --include-dependencies

I haven't checked it out yet, but after installing it, check the
documentation.
It seems to be an API to the Unihan online database.
Could be quite useful.

John Joyce
 
John Joyce

NZ,
Another English site on Unicode that may be easier to understand (it
was for me):
http://www.alanwood.net/unicode/index.html

There must surely be some docs in Chinese somewhere.
I know that here in Japan there are many books on the subject (in
Japanese), since computer science in Japan deals with it a lot.
I've been interested in this subject myself, but it is a big one.
Unicode.org published the print version of 5.0 and I have browsed the
book in the bookstore; it is worth checking out. Maybe a nearby
university library would have it too.

It certainly seems like a point where a compiled language, such as C, would be
helpful.
Most interpreted languages are only now reaching partial Unicode support
because of the overhead of processing many languages, the
sheer volume of material to deal with, and the various algorithms
necessary for languages whose writing depends on context (Arabic,
Hebrew, Indic languages, etc.).

Perhaps Perl, Ruby, Python, and PHP should get hooks from Apple
and Microsoft to help these languages be more productive by using
their implementations.
 
eden li

There is documentation:

ri String#[]

Although it is a little vague about what "character code" means. By
default (in Ruby 1.8.x) the number returned by some_string[0] is a
Fixnum in the range [0, 255], even for UTF-8 encoded strings. Ruby
just treats the string as a sequence of 8-bit bytes and gives you
back whatever byte you asked for.

irb(main):001:0> s = "大智若愚"
=> "\345\244\247\346\231\272\350\213\245\346\204\232"
irb(main):002:0> s[0]
=> 229
irb(main):003:0> s.length
=> 12
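
To see the difference between those raw bytes and the Unicode code points they encode, here is a small sketch that assumes Ruby 1.9 or later. The helper name describe is made up; unpack('U*') itself also works on Ruby 1.8 and is one way to get real code points there:

```ruby
# Summarize a UTF-8 string as characters, bytes and code points.
# describe is a hypothetical helper name used only for this illustration.
def describe(str)
  {
    :chars       => str.length,       # characters (Ruby 1.8 counts bytes here)
    :bytes       => str.bytesize,     # raw UTF-8 bytes
    :first_byte  => str.bytes.first,  # the number Ruby 1.8's str[0] returns
    :code_points => str.unpack('U*').map { |cp| format('U+%04X', cp) }
  }
end

info = describe("大智若愚")
p info[:chars]        # => 4
p info[:bytes]        # => 12
p info[:first_byte]   # => 229
p info[:code_points]  # => ["U+5927", "U+667A", "U+82E5", "U+611A"]
```

The code points, not the individual bytes, are the numbers to compare against a Unicode chart.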

Zev Blut

Nanyang Zhan

John said:
I don't know if the two main chinese sets are encoded as different
ranges or simply declared in some way.
In general in Unicode a character is the same character even when it
appears in a different language.

Many characters in these two sets of Chinese (in fact, including the Chinese
characters used in Japanese and Korean...) are the same. Aren't they encoded
with the same codes when they are identical?

Gary said:
I believe the range is (in hex) 3400 to 97A5

You must mean the Unicode range.
http://www.khngai.com/chinese/charmap/tbluni.php?page=0

John said:
You might want to check the RubyGems gem unihan

.... Hmmmmm... if only I could find out what it does...

John said:
I've been interested in this subject myself, but it is a big one.

It is an interesting subject indeed.

Today I tried this (in the RoR console):
=> ["“", "â€ã€‚", ",", "ï¼", "<", "ï½›", "ï¼›", "‘", "ï¼", "ï¼ ", "#", "$", "ï¼…",
"…", "*", "(", ")", "一", "ä¿¿", "倀", "凿", "å‹¿", "å¿", "å“¿", "囿", "å§¿", " 寿",
"å´", "å¿„å¿¿", "æ˜", "扉", "掵", "曆", "æ¡¶", "檗", "æ³—", "æ¿—", "瀖", "燿", "ç‹§", "ç—",
"痿", "眀", "秊", "竗", "篿", "紀", "翹", "退", "釽", "鎷", "閈", "阀", "韗", "饧",
"骠", "鶆", "龥"]
c.collect.map{|o| o[0]}
=> [226, 226, 239, 239, 239, 239, 239, 226, 239, 239, 239, 239, 239,
226, 239, 239, 239, 228, 228, 229, 229, 229, 229, 229, 229, 229, 229,
229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233, 233,
233, 233, 233]
c.collect.map{|o| o[0]}.sort
=> [226, 226, 226, 226, 228, 228, 229, 229, 229, 229, 229, 229, 229,
229, 229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233,
233, 233, 233, 233, 239, 239, 239, 239, 239, 239, 239, 239, 239, 239,
239, 239, 239]
c.collect.map{|o| o[0]}.sort.uniq
=> [226, 228, 229, 230, 231, 233, 239]

These punctuation marks are the ones commonly used in China.
The Chinese characters were picked at random from
http://www.khngai.com/chinese/charmap/tbluni.php?page=0
(from all six pages).

Maybe 226 to 239 is the range I need.
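
A caveat on that range, shown as a short sketch (Ruby 1.9 or later assumed; lead_byte is a made-up helper name). In UTF-8, every three-byte sequence starts with a byte from 224 to 239, and those sequences cover all of U+0800 to U+FFFF, so the first byte alone cannot tell Chinese apart from other scripts and symbols in that span:

```ruby
# Return the first UTF-8 byte of a string, i.e. what Ruby 1.8's str[0] gives.
# lead_byte is a hypothetical helper name.
def lead_byte(char)
  char.bytes.first
end

p lead_byte("中")  # => 228 (U+4E2D, Chinese)
p lead_byte("€")   # => 226 (U+20AC, euro sign)
p lead_byte("あ")  # => 227 (U+3042, Japanese hiragana)
p lead_byte("한")  # => 237 (U+D55C, Korean hangul)
p lead_byte("Ô")   # => 195 (U+00D4, a two-byte sequence)
```

A test on 226..239 does keep out two-byte characters like Ô, but it also accepts the euro sign, kana and hangul. Checking the decoded code point against the CJK ranges is more selective.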
 
