[perl-python] unicode study with unicodedata module

Xah Lee

Python has a nice unicodedata module that gives access to the Unicode
character database.

#-*- coding: utf-8 -*-
# python

from unicodedata import *

# each unicode char has a unique name.
# one can use the “lookup” func to get the char from its name

mychar=lookup('greek cApital letter sIgma')
# note letter case doesn't matter
print mychar.encode('utf-8')

m=lookup('CJK UNIFIED IDEOGRAPH-5929')
# for some reason, case must be right here.
print m.encode('utf-8')

# to find a char's name, use the “name” function
print name(u'天')
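
For readers on Python 3, where every str is already Unicode and print is
a function, the same lookups can be sketched roughly as below. This is an
assumed Python 3 equivalent, not the code from the post, and the variable
names are only for illustration. It prints code points and names rather
than the chars themselves, so the output stays ASCII on any terminal.

```python
from unicodedata import lookup, name

# lookup() maps a character name to the character; for ordinary
# names the letter case of the name does not matter
sigma = lookup('greek cApital letter sIgma')

# CJK ideograph names embed the code point in hex
tian = lookup('CJK UNIFIED IDEOGRAPH-5929')

# name() goes the other way, from character to name
for ch in (sigma, tian):
    print('%04x' % ord(ch), name(ch))
```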

Basically, in Unicode, each char has a number of attributes (called
properties) besides its name. These attributes provide the info needed
to form letters and words, and to do processing such as sorting and
capitalization, for the various human scripts. For example, the Latin
alphabet has two forms, upper case and lower case. Korean letters are
stacked together into syllable blocks. Many symbols correspond to
numbers, and there are also combining forms, used for example to put a
bar over any letter or character. Also, some writing systems are
directional (for example, written right to left). In order to form
these symbols for display, or to process them in computing, this info
is needed for each char.

The rest of the functions in unicodedata return these attributes.
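
As a rough illustration of those attributes, here is a small Python 3
sketch that collects a few of them for sample chars. The helper name
describe and the choice of samples are assumptions made for this
example; the functions it calls (category, bidirectional, combining,
numeric, normalize) are the ones the unicodedata module provides.

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode properties of a single character."""
    return {
        'name': unicodedata.name(ch, None),
        'category': unicodedata.category(ch),            # e.g. Lu, Ll, Mn
        'bidirectional': unicodedata.bidirectional(ch),  # e.g. L, R, NSM
        'combining': unicodedata.combining(ch),          # 0 = not a combining mark
        'numeric': unicodedata.numeric(ch, None),        # None = no numeric value
    }

# upper case, lower case, combining macron, Hebrew alef, one half
samples = ['A', 'a', '\u0304', '\u05d0', '\u00bd']
for ch in samples:
    print('%04x' % ord(ch), describe(ch))

# Korean letters (jamo) compose into one syllable block under NFC
syllable = unicodedata.normalize('NFC', '\u1112\u1161\u11ab')
print('%04x' % ord(syllable), unicodedata.name(syllable))
```

The combining macron is the kind of "bar over a letter" char mentioned
above, and the Hebrew letter shows a right-to-left bidirectional class.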

see unicodedata doc:
http://python.org/doc/2.4/lib/module-unicodedata.html

Official word on unicode character properties:
http://www.unicode.org/uni2book/ch04.pdf

--
I don't know the state of Perl's Unicode support. Is there something
similar?

--
this post is archived at
http://xahlee.org/perl-python/unicodedata_module.html

Xah
(e-mail address removed)
http://xahlee.org/PageTwo_dir/more.html
 
Xah Lee

How do I get a Unicode char's number?

e.g. 03ba for Greek lowercase kappa? (or in decimal form)

Xah
 
Christos TZOTZIOY Georgiou

> How do I get a Unicode char's number?
>
> e.g. 03ba for Greek lowercase kappa? (or in decimal form)

You get the character with:

>>> uc = u"\N{GREEK SMALL LETTER KAPPA}"

or with

>>> import unicodedata
>>> uc = unicodedata.lookup("GREEK SMALL LETTER KAPPA")

and you get the ordinal with:

>>> ord(uc)
954

ord works for both str and unicode objects, as long as they hold a
single character.
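
To get the hex form asked about (03ba), the ordinal can be run through
string formatting. A minimal Python 3 sketch, with variable names chosen
only for this example:

```python
import unicodedata

uc = unicodedata.lookup('GREEK SMALL LETTER KAPPA')

code_point = ord(uc)           # the number as an int
as_hex = '%04x' % code_point   # zero-padded 4-digit hex string

print(code_point, as_hex)      # prints: 954 03ba

# chr() is the inverse of ord() in Python 3 (unichr() in Python 2)
assert chr(code_point) == uc
```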
 
Xah Lee

Here's a snippet of code that prints a range of Unicode chars, along
with their ordinal in hex, and their name.

Chars without a name are skipped. (Some of these are unassigned code
points.)

On Microsoft Windows the encoding might need to be changed to utf-16.

Change the range to see different unicode chars.

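# -*- coding: utf-8 -*-

from unicodedata import *

# build the list of chars in the range (the end value is excluded)
l=[]
for i in range(0x0000, 0x0fff):
    l.append(unichr(i))  # unichr gives the char directly, no eval needed

# print char, hex ordinal, and name; skip chars that have no name
for x in l:
    if name(x,'-')!='-':
        print x.encode('utf-8'),'|', "%04x"%(ord(x)), '|', name(x,'-')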
--
http://xahlee.org/perl-python/unicodedata_module.html
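
The same listing can be sketched for Python 3, where chr() replaces
unichr() and no explicit encode is needed. This is an assumed rewrite,
not the code from the post; named_chars is a hypothetical helper name,
and the demo loop prints only the hex ordinal and the name so the output
stays ASCII.

```python
import unicodedata

def named_chars(start, stop):
    """Yield (char, hex ordinal, name) for each named code point in range(start, stop)."""
    for i in range(start, stop):
        ch = chr(i)
        # name() returns the given default for code points without a name
        nm = unicodedata.name(ch, None)
        if nm is not None:
            yield ch, '%04x' % i, nm

# a few Greek small letters, alpha through epsilon
for ch, hex_ord, nm in named_chars(0x03b1, 0x03b6):
    print(hex_ord, '|', nm)
```

Change the two range arguments to see other parts of the code space.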

Does anyone want to supply a Perl version?

Xah
(e-mail address removed)
http://xahlee.org/PageTwo_dir/more.html
 
