John said:
I don't know if the two main Chinese sets are encoded as different
ranges or simply declared in some way.
In general in Unicode a character is the same character even when it
appears in a different language.
Many characters in these two sets of Chinese (in fact, also the Chinese
characters used in Japanese and Korean...) are the same. Aren't they encoded
with the same codes when they are identical?
Gary said:
I believe the range is (in hex) 3400 to 97A5
You must mean Unicode range.
http://www.khngai.com/chinese/charmap/tbluni.php?page=0
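To make the code point range idea concrete, here is a minimal Ruby sketch (Ruby 1.9 or later assumed) that tests a character against two Unicode blocks: CJK Unified Ideographs (U+4E00 to U+9FFF) and CJK Unified Ideographs Extension A (U+3400 to U+4DBF). The names `CJK_RANGES` and `cjk_ideograph?` are made up for illustration, and the later extension blocks are deliberately left out.

```ruby
# Hypothetical helper; the names are illustrative, not from any library.
CJK_RANGES = [
  0x3400..0x4DBF, # CJK Unified Ideographs Extension A
  0x4E00..0x9FFF  # CJK Unified Ideographs
].freeze

# True if the first character of +char+ falls inside one of the ranges above.
def cjk_ideograph?(char)
  cp = char.ord # Unicode code point of the first character
  CJK_RANGES.any? { |range| range.cover?(cp) }
end

puts cjk_ideograph?("一") # U+4E00
puts cjk_ideograph?("A")  # U+0041
```

Because the check works on code points rather than bytes, it does not depend on how the string happens to be encoded on disk, as long as Ruby knows the string's encoding.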
John said:
You might want to check the RubyGems gem unihan
.... hmmmmm.. if only I could find out what it does...
John said:
I've been interested in this subject myself, but it is a big one.
Interesting subject indeed it is.
Today I tried this (under the RoR console!!):
=> ["“", "â€ã€‚", "，", "ï¼", "＜", "｛", "；", "‘", "ï¼", "＠", "＃", "＄", "％",
"…", "＊", "（", "）", "一", "俿", "倀", "凿", "勿", "å¿", "哿", "囿", "姿", "寿",
"å´", "å¿„å¿¿", "æ˜", "扉", "掵", "曆", "桶", "檗", "泗", "濗", "瀖", "燿", "狧", "ç—",
"痿", "眀", "秊", "竗", "篿", "紀", "翹", "退", "釽", "鎷", "閈", "阀", "韗", "饧",
"骠", "鶆", "龥"]
=> [226, 226, 239, 239, 239, 239, 239, 226, 239, 239, 239, 239, 239,
226, 239, 239, 239, 228, 228, 229, 229, 229, 229, 229, 229, 229, 229,
229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233, 233,
233, 233, 233]
c.collect.map{|o| o[0]}.sort
=> [226, 226, 226, 226, 228, 228, 229, 229, 229, 229, 229, 229, 229,
229, 229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233,
233, 233, 233, 233, 239, 239, 239, 239, 239, 239, 239, 239, 239, 239,
239, 239, 239]
c.collect.map{|o| o[0]}.sort.uniq
=> [226, 228, 229, 230, 231, 233, 239]
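A note for anyone repeating this on a newer Ruby: in Ruby 1.8 `o[0]` returned the first byte as an integer, but from 1.9 on it returns a one-character string. A rough equivalent, assuming UTF-8 strings, is `String#bytes`. The array below is only a short stand-in for `c`, using a few characters of the kind tested here.

```ruby
# Shortened stand-in for the array `c` (assumed to hold UTF-8 strings).
c = ["“", "…", "一", "寿", "扉", "痿", "退", "；"]

# String#bytes returns the raw bytes; keep the leading byte of each string.
first_bytes = c.map { |s| s.bytes.first }

p first_bytes.sort.uniq
# => [226, 228, 229, 230, 231, 233, 239]
```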
These punctuation marks are the ones commonly used in China.
The Chinese characters were picked at random from
http://www.khngai.com/chinese/charmap/tbluni.php?page=0
(from all six pages).
Maybe 226 to 239 is the range of first bytes I need.
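One caveat about a first-byte test: in UTF-8, leading bytes 226 to 239 (0xE2 to 0xEF) begin every three-byte sequence from U+2000 to U+FFFF, so the test also accepts symbols such as "€" and Japanese kana. The sketch below compares it with the `\p{Han}` script property of Ruby's regex engine (Ruby 1.9 or later assumed). Both method names are hypothetical, and this is only one possible way to narrow the check.

```ruby
# First-byte test from the post: leading UTF-8 byte between 226 and 239.
def first_byte_in_range?(str)
  (226..239).cover?(str.bytes.first)
end

# Script-based test: true if the string contains a Han-script character.
def han?(str)
  str =~ /\p{Han}/ ? true : false
end

["一", "龥", "€", "あ", "；"].each do |ch|
  puts "#{ch}: first byte in range=#{first_byte_in_range?(ch)}, Han=#{han?(ch)}"
end
```

Note that the Chinese punctuation marks tested above are not Han-script characters, so if they need to be matched as well, they would have to be listed or ranged separately.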