How to get the ascii code of Chinese characters?

many_years_after · Aug 19, 2006

Hi, everyone:

Do you have any ideas?

Say whatever you know about this.

Thanks.

Philippe Martin · Aug 19, 2006

many_years_after said:
Hi, everyone:

Do you have any ideas?

Say whatever you know about this.

Thanks.

Hi,

You mean Unicode, I assume:
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

Regards,

Philippe

John Machin · Aug 19, 2006

many_years_after said:
Hi, everyone:

Do you have any ideas?

Say whatever you know about this.

Perhaps you had better explain what you mean by "ascii code of Chinese
characters". Chinese characters ("hanzi") can be represented in many
ways on a computer, in Unicode as well as many different "legacy"
encodings, such as GB, GBK, big5, two different 4-digit telegraph
codes, etc etc. They can also be spelled out in "roman" letters with or
without tone indications (digits or "accents") in the pinyin system --
is that what you mean by "ascii code"?

Perhaps you might like to tell us what you want to do in Python with
hanzi and "ascii codes", so that we can give you a specific answer.
With examples, please -- like what are the "ascii codes" for the two
characters in the common greeting that comes across in toneless pinyin
as "ni hao"?
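
To make the point about multiple representations concrete, here is a minimal sketch, assuming Python 3 and the codecs that ship with CPython. The sample string (the two hanzi of "ni hao") and the variable names are chosen for illustration only; nothing here is a guess at what the OP actually wants.

```python
# The same two hanzi ("ni hao") produce different bytes under different encodings.
text = "\u4f60\u597d"  # the characters U+4F60 and U+597D

encodings = ["utf-8", "utf-16-be", "gbk", "big5"]
encoded = {name: text.encode(name) for name in encodings}

for name, raw in encoded.items():
    # bytes.hex() renders the raw byte values as hexadecimal digits
    print(f"{name:10} {raw.hex()}")
```

Every encoding round-trips back to the same text, but the byte values differ, which is why "the code" of a hanzi is ambiguous until an encoding is named.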

Cheers,
John

Philippe Martin · Aug 19, 2006

Philippe said:
Hi,

You mean Unicode, I assume:
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

Regards,

Philippe

Hi,

I have received a personal email on this:

Kanji is indeed a Japanese subset of the Chinese character set.

I just thought it would be relevant as it includes ~47000 characters.

If I hurt anyone's feelings, sorry.

Regards,

Philippe

many_years_after · Aug 19, 2006

Hi:

What I want to do is generate a number whenever people input a Chinese
character (hanzi, I mean). The same character should always produce the
same number, so I think an ASCII code can do this very well.

Marc 'BlackJack' Rintsch · Aug 19, 2006

What I want to do is generate a number whenever people input a Chinese
character (hanzi, I mean). The same character should always produce the
same number, so I think an ASCII code can do this very well.

No, it can't. ASCII doesn't contain Chinese characters.

http://en.wikipedia.org/wiki/ASCII

Ciao,
Marc 'BlackJack' Rintsch

Thorsten Kampe · Aug 19, 2006

* many_years_after (2006-08-19 12:18 +0100)

Hi, everyone:

Do you have any ideas?

Say whatever you know about this.

contradictio in adiecto (a contradiction in terms)

Gerhard Fiedler · Aug 19, 2006

No, it can't. ASCII doesn't contain Chinese characters.

Well, ASCII can represent the Unicode numerically -- if that is what the OP
wants. For example, "U+81EC" (all ASCII) is one possible -- not very
readable though <g> -- representation of a Hanzi character (see
http://www.cojak.org/index.php?function=code_lookup&term=81EC).

(I don't know anything about Hanzi or Mandarin... but that's Unicode, so
this works.)

Gerhard

Peter Maas · Aug 19, 2006

Gerhard said:
Well, ASCII can represent the Unicode numerically -- if that is what the OP
wants.

No. The ASCII character range is 0..127, while the Unicode character range
is at least 0..65535.

For example, "U+81EC" (all ASCII) is one possible -- not very
readable though <g> -- representation of a Hanzi character (see
http://www.cojak.org/index.php?function=code_lookup&term=81EC).

U+81EC means a Unicode character which is represented by the number
0x81EC. There are some encodings defined which map Unicode sequences
to byte sequences: UTF-8 maps Unicode strings to sequences of bytes in
the range 0..255, UTF-7 maps Unicode strings to sequences of bytes in
the range 0..127. You *could* read the latter as ASCII sequences
but this is not correct.

How to do it in Python? Let chinesePhrase be a Unicode string with
Chinese content. Then

chinesePhrase_7bit = chinesePhrase.encode('utf-7')

will produce a sequence of bytes in the range 0..127 representing
chinesePhrase and *looking like* a (meaningless) ASCII sequence.

chinesePhrase_16bit = chinesePhrase.encode('utf-16be')

will produce a sequence with Unicode numbers packed in a byte
string in big endian order. This is probably closest to what
the OP wants.
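
As a runnable illustration of the two calls above, here is a short sketch that assumes Python 3 (where str is already Unicode) and uses U+81EC, the character mentioned earlier in the thread, as sample input. The variable names are illustrative.

```python
chinese_phrase = "\u81ec"  # sample input: the hanzi U+81EC

# UTF-7: every resulting byte falls in the range 0..127
phrase_7bit = chinese_phrase.encode("utf-7")

# UTF-16 big endian: the code point packed into bytes, high byte first
phrase_16bit = chinese_phrase.encode("utf-16-be")

print(phrase_7bit)
print(phrase_16bit.hex())  # 81ec
```

For this character the UTF-16 bytes are simply the code point 0x81EC, while the UTF-7 bytes only become meaningful again after decoding with the same codec.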

Peter Maas, Aachen

John Machin · Aug 19, 2006

hi:

What I want to do is generate a number whenever people input a Chinese
character (hanzi, I mean). The same character should always produce the
same number, so I think an ASCII code can do this very well.

*What* characters make *what* numbers? Stop thinking and give us some
*examples*.

Gerhard Fiedler · Aug 19, 2006

No. The ASCII character range is 0..127, while the Unicode character range
is at least 0..65535.

Actually, Unicode goes beyond 65535. But right in this sentence, you
represented the number 65535 with ASCII characters, so it doesn't seem to
be impossible.

U+81EC means a Unicode character which is represented by the number
0x81EC.

Exactly. Both versions are represented in ASCII right in your message.

UTF-8 maps Unicode strings to sequences of bytes in the range 0..255,
UTF-7 maps Unicode strings to sequences of bytes in the range 0..127.
You *could* read the latter as ASCII sequences but this is not correct.

Of course not "correct". I guess the only "correct" representation is the
original Chinese character. But the OP doesn't seem to want this... so a
non-"correct" representation is necessary anyway.

How to do it in Python? Let chinesePhrase be a Unicode string with
Chinese content. Then

chinesePhrase_7bit = chinesePhrase.encode('utf-7')

will produce a sequence of bytes in the range 0..127 representing
chinesePhrase and *looking like* a (meaningless) ASCII sequence.

Actually, no. There are quite a few code positions in the range 0..127 that
don't "look like" anything (non-printable). And, as you say, this is rather
meaningless.

chinesePhrase_16bit = chinesePhrase.encode('utf-16be')

will produce a sequence with Unicode numbers packed in a byte
string in big endian order. This is probably closest to what
the OP wants.

That's what you think... but it's not really ASCII. If you want this in
ASCII, and readable, I still suggest transforming this sequence of 2-byte
values (for most Chinese characters, those in the Basic Multilingual Plane,
it will be 2 bytes per character) into a sequence of something like U+81EC
(or 0x81EC if you are a C fan, or 81EC if you can imply the rest)... that's
where we come back to my original suggestion.
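
A minimal sketch of that transformation, assuming Python 3; the helper name to_u_notation is hypothetical and only serves the illustration.

```python
def to_u_notation(text):
    """Return each character of text as an ASCII label such as 'U+81EC'."""
    # ord() yields the code point; :04X pads it to at least four hex digits
    return [f"U+{ord(ch):04X}" for ch in text]


labels = to_u_notation("\u81ec\u4f60")
print(" ".join(labels))  # U+81EC U+4F60
```

Working from code points rather than from 2-byte units also covers characters outside the Basic Multilingual Plane, which simply get a longer hex label.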

Gerhard

John Machin · Aug 19, 2006

many_years_after said:
hi:

What I want to do is generate a number whenever people input a Chinese
character (hanzi, I mean). The same character should always produce the
same number, so I think an ASCII code can do this very well.

Possibly you have "create" upside-down. Could you possibly be talking
about an "input method", in which people type in ascii letters (and
maybe numbers) and the *result* is a Chinese character? In other words,
what *everybody* uses to input Chinese characters?

Perhaps you could ask on the Chinese Python newsgroup.

*GIVE* *EXAMPLES* of what you want to do.

many_years_after · Aug 19, 2006

John said:
Possibly you have "create" upside-down. Could you possibly be talking
about an "input method", in which people type in ascii letters (and
maybe numbers) and the *result* is a Chinese character? In other words,
what *everybody* uses to input Chinese characters?

Perhaps you could ask on the Chinese Python newsgroup.

*GIVE* *EXAMPLES* of what you want to do.

Well, people may type on the keyboard. They enter some Chinese
characters, and then I want to create a number. The same number should
be created whenever they enter the same Chinese characters.

Dennis Lee Bieber · Aug 19, 2006

Well, people may type on the keyboard. They enter some Chinese
characters, and then I want to create a number. The same number should
be created whenever they enter the same Chinese characters.

Still meaningless... Are they using some big keyboard with some 5000
individual keys (one per character/character-component)? Are they using
the "Arial Unicode MS" font from the character map and cut&pasting
selected CJK characters? Are they holding down an alt key and entering
four digits from the numeric pad? (and are the four digits what
character map displays for the glyph -- and is it the U+xxxx or the
0Xxxxx number; though both of those are in Hex, and the numeric pad is
plain decimal).

Are you asking for the Unicode character, of however many bytes, to
be treated as an N-byte integer and converted to a decimal
representation of the integer value?
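
If that reading is the intended one, it could be sketched as follows, assuming Python 3. The sample character and variable names are illustrative. The result depends on which byte encoding is chosen, which is why the question matters.

```python
char = "\u81ec"  # sample character U+81EC

# Treat the encoded bytes as one big-endian integer.
utf16_bytes = char.encode("utf-16-be")
utf8_bytes = char.encode("utf-8")

as_int_utf16 = int.from_bytes(utf16_bytes, "big")
as_int_utf8 = int.from_bytes(utf8_bytes, "big")

print(as_int_utf16)  # 33260, the same value as ord(char) for this character
print(as_int_utf8)   # a different number, because UTF-8 uses three bytes here
```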
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/

Ben Finney · Aug 19, 2006

many_years_after said:
Well, people may type on the keyboard. They enter some Chinese
characters, and then I want to create a number. The same number should
be created whenever they enter the same Chinese characters.

You seem to be looking for a hash.

<URL:http://docs.python.org/lib/module-md5>
<URL:http://docs.python.org/lib/module-sha>

If not, please tell us what your *purpose* is. It's not at all clear
from your questions what you are trying to achieve.
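
For readers on current Python, here is a hedged sketch of the hashing idea. The md5 and sha modules linked above were superseded by hashlib, so this uses hashlib with SHA-256 as a stand-in; the function name text_to_number and the choice of UTF-8 are assumptions for illustration.

```python
import hashlib


def text_to_number(text):
    """Map text to an integer; equal input always yields the same integer."""
    # Hash functions work on bytes, so the text is encoded first (UTF-8 assumed).
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


first = text_to_number("\u4f60\u597d")
second = text_to_number("\u4f60\u597d")
print(first == second)  # True
```

Unlike a code point, a hash cannot be turned back into the original characters, so it only fits if the number is needed for comparison or lookup.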

Fredrik Lundh · Aug 20, 2006

Gerhard said:
Actually, Unicode goes beyond 65535.

you may want to look up "at least" in a dictionary.

</F>

Fredrik Lundh · Aug 20, 2006

many_years_after said:
Well, people may type on the keyboard. They enter some Chinese
characters, and then I want to create a number. The same number should
be created whenever they enter the same Chinese characters.

assuming you mean "code point" rather than "ASCII code" (ASCII is a
specific encoding that *doesn't* include Chinese characters), "ord" is
what you want:

char = read_from_some_input_device()
code = ord(char)

see:

http://pyref.infogami.com/ord
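
A self-contained variant of the snippet above, assuming Python 3; the literal string stands in for whatever the input device returns, and the helper name code_points is illustrative.

```python
def code_points(text):
    """Return the Unicode code point of every character in text."""
    return [ord(ch) for ch in text]


user_input = "\u4f60\u597d"  # stands in for text read from the user
numbers = code_points(user_input)
print(numbers)  # [20320, 22909]

# chr() is the inverse of ord(), so the mapping is reversible.
restored = "".join(chr(n) for n in numbers)
```

The same character always yields the same number, which matches the requirement stated in the question.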

</F>

Lawrence D'Oliveiro · Aug 20, 2006

you may want to look up "at least" in a dictionary.

Maybe you need to do the same for "actually".

Gerhard Fiedler · Aug 20, 2006

you may want to look up "at least" in a dictionary.

As homework, try to parse "at least until" and "goes beyond" and compare
the two (a dictionary is not necessarily of help with this).

"range is at least 0..65535" : upper_bound >= 65535
"goes beyond 65535" : upper_bound > 65535

For some discussions (like how to represent code points etc) this
distinction is crucial.

Gerhard

Fredrik Lundh · Aug 20, 2006

Gerhard said:
As homework, try to parse "at least until" and "goes beyond" and compare
the two (a dictionary is not necessarily of help with this).

"range is at least 0..65535" : upper_bound >= 65535
"goes beyond 65535" : upper_bound > 65535

For some discussions (like how to represent code points etc) this
distinction is crucial.

do you know anything about how Unicode is used in real life, or are you
just squabbling?

</F>
