[perl-python] unicode study with unicodedata module

Discussion in 'Python' started by Xah Lee, Mar 15, 2005.

  1. Xah Lee

    Xah Lee Guest

    python has this nice unicodedata module that deals with unicode nicely.

    #-*- coding: utf-8 -*-
    # python

    from unicodedata import *

    # each unicode char has a unique name.
    # one can use the “lookup” func to find it

    mychar=lookup('greek cApital letter sIgma')
    # note letter case doesn't matter
    print mychar.encode('utf-8')

    m=lookup('CJK UNIFIED IDEOGRAPH-5929')
    # for some reason, case must be right here.
    print m.encode('utf-8')

    # to find a char's name, use the “name” function
    print name(u'天')
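
For readers on Python 3 (an assumption; the snippet above targets Python 2.4), a rough equivalent might look like the sketch below. Strings are already unicode there, so the u prefix and the explicit encode before printing are not needed. The variable names are only illustrative, and the prints stick to code points and names so the output stays ASCII.

```python
import unicodedata

# lookup() finds a character by name; ordinary names are matched
# without regard to letter case
sigma = unicodedata.lookup('greek cApital letter sIgma')
print('U+%04X' % ord(sigma), unicodedata.name(sigma))

# CJK unified ideographs are named after their code point in hex
tian = unicodedata.lookup('CJK UNIFIED IDEOGRAPH-5929')
print('U+%04X' % ord(tian), unicodedata.name(tian))

# name() raises ValueError for a char without a name
# unless a default is passed as the second argument
print(unicodedata.name('\n', '(no name)'))
```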

    basically, in unicode, each char has a number of attributes (called
    properties) besides its name. These attributes provide the info needed
    to form letters and words, and for processing such as sorting and
    capitalization, across various human scripts. For example, the Latin
    alphabet has two forms, upper case and lower case. Korean letters are
    stacked together into syllable blocks. Many symbols correspond to
    numbers, and there are also combining forms, used for example to put a
    bar over any letter or character. Some writing systems are also
    directional. To form these symbols for display, or to process them for
    computing, this info is needed for each char.

    the rest of the functions in unicodedata return these attributes.
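
As a rough illustration of the properties mentioned above (case, numeric value, combining marks, direction, Korean syllable composition), here is a small sketch. It assumes Python 3, and the sample characters and variable names are picked only for the example.

```python
import unicodedata

# general category: 'Lu' = uppercase letter, 'Ll' = lowercase letter
upper_cat = unicodedata.category('\u03a3')       # GREEK CAPITAL LETTER SIGMA
lower_cat = unicodedata.category('\u03c3')       # GREEK SMALL LETTER SIGMA

# numeric value of a symbol that stands for a number
half = unicodedata.numeric('\u00bd')             # VULGAR FRACTION ONE HALF

# canonical combining class: nonzero for combining marks
macron_class = unicodedata.combining('\u0304')   # COMBINING MACRON

# bidirectional class: 'R' = right-to-left, 'L' = left-to-right
alef_bidi = unicodedata.bidirectional('\u05d0')  # HEBREW LETTER ALEF

# NFC normalization composes three Korean jamo into one syllable block
jamo = '\u1112\u1161\u11ab'
syllable = unicodedata.normalize('NFC', jamo)

print(upper_cat, lower_cat, half, macron_class, alef_bidi)
print(len(jamo), len(syllable), unicodedata.name(syllable))
```

Each call takes a single character and returns one property, so a loop over a string can collect whichever attributes a task needs.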

    see unicodedata doc:
    http://python.org/doc/2.4/lib/module-unicodedata.html

    Official word on unicode character properties:
    http://www.unicode.org/uni2book/ch04.pdf

    --
    i don't know what's the state of Perl's unicode. Is there something
    similar?

    --
    this post is archived at
    http://xahlee.org/perl-python/unicodedata_module.html

    Xah

    http://xahlee.org/PageTwo_dir/more.html
     
    Xah Lee, Mar 15, 2005
    #1

  2. Xah Lee

    Xah Lee Guest

    Re: unicode study with unicodedata module

    how do i get a unicode char's number (its code point)?

    e.g. 03ba for greek lowercase kappa? (or in decimal form)

    Xah


     
    Xah Lee, Mar 15, 2005
    #2

  3. Re: unicode study with unicodedata module

    On 15 Mar 2005 04:55:17 -0800, rumours say that "Xah Lee" <> might
    have written:

    >how do i get a unicode's number?
    >
    >e.g. 03ba for greek lowercase kappa? (or in decimal form)


    you get the character with:

    >>> uc = u"\N{GREEK SMALL LETTER KAPPA}"

    or with

    >>> import unicodedata
    >>> uc = unicodedata.lookup("GREEK SMALL LETTER KAPPA")

    and you get the ordinal with:

    >>> ord(uc)
    954

    ord works for both byte strings and unicode strings of length 1.
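
To get the hex form asked about (03ba) as well as the decimal one, something like the following sketch could be used. It assumes Python 3, where chr() plays the role that unichr() has in Python 2; the variable names are illustrative.

```python
import unicodedata

kappa = unicodedata.lookup('GREEK SMALL LETTER KAPPA')

code_point = ord(kappa)             # the code point as an int
as_hex = format(code_point, '04x')  # zero-padded lowercase hex
back = chr(code_point)              # chr() is the inverse of ord()

print(code_point, as_hex, back == kappa)
```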
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
     
    Christos TZOTZIOY Georgiou, Mar 15, 2005
    #3
  4. Xah Lee wrote:

    > i don't know what's the state of Perl's unicode.


    perldoc perlunicode
     
    Brian McCauley, Mar 15, 2005
    #4
  5. Xah Lee

    Xah Lee Guest

    Re: unicode study with unicodedata module

    here's a snippet of code that prints a range of unicode chars, along
    with their ordinal in hex, and name.

    chars without a name are skipped. (some of these are unassigned code
    points; control characters have no name either.)

    On Microsoft Windows the encoding might need to be changed to utf-16.

    Change the range to see different unicode chars.

    # -*- coding: utf-8 -*-

    from unicodedata import *

    # collect the chars U+0000 through U+0FFE (the end of range is excluded)
    l=[]
    for i in range(0x0000, 0x0fff):
        l.append(unichr(i))

    # name(x,'-') returns '-' when the char has no name
    for x in l:
        if name(x,'-')!='-':
            print x.encode('utf-8'),'|', "%04x"%(ord(x)), '|', name(x,'-')
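
A Python 3 take on the same loop might look like the sketch below. The generator name named_chars is made up for this example, and it yields rows instead of printing the chars directly, so the caller decides how to display or encode them.

```python
import unicodedata

def named_chars(start, stop):
    """Yield (char, hex code point, name) for each named char in range(start, stop)."""
    for i in range(start, stop):
        ch = chr(i)
        nm = unicodedata.name(ch, None)  # None when the char has no name
        if nm is not None:
            yield ch, '%04x' % i, nm

# print only the ASCII-safe columns for a small sample range
for ch, code, nm in named_chars(0x0041, 0x0044):
    print(code, '|', nm)
```

Passing None as the default avoids treating a real name as the "no name" marker, which the '-' sentinel in the original could in principle do.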
    --
    http://xahlee.org/perl-python/unicodedata_module.html

    anyone wants to supply a Perl version?

    Xah

    http://xahlee.org/PageTwo_dir/more.html



    Brian McCauley wrote:
    > Xah Lee wrote:
    >
    > > i don't know what's the state of Perl's unicode.

    >
    > perldoc perlunicode
     
    Xah Lee, Mar 16, 2005
    #5
  6. Xah Lee

    Xah Lee Guest

    Xah Lee, Mar 16, 2005
    #6