trying to understand unicode

Discussion in 'Python' started by F. Petitjean, Apr 20, 2005.

  1. F. Petitjean

    F. Petitjean Guest

    Python has very good support for Unicode, UTF-8, encodings ... But I
    have some difficulty with the concepts and the vocabulary. The
    documentation is not bad, but, for example, when reading
    http://docs.python.org/lib/module-unicodedata.html
    it took me a long time to figure out what unicodedata.digit(unichr)
    means; a simple example is badly lacking.
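
    As a rough illustration of what unicodedata.digit() returns, and how it
    differs from decimal() and numeric(), here is a minimal sketch. It is
    written in Python 3 syntax, which is an assumption on my part (chr() there
    plays the role of unichr()); the three sample characters are only chosen
    for illustration.

```python
import unicodedata

# In Python 3, chr() plays the role of Python 2's unichr().
seven = chr(0x37)            # DIGIT SEVEN
superscript_two = chr(0xB2)  # SUPERSCRIPT TWO
one_half = chr(0xBD)         # VULGAR FRACTION ONE HALF
samples = (seven, superscript_two, one_half)

# digit() returns the digit value as an int, or the default if there is none.
digit_values = [unicodedata.digit(ch, None) for ch in samples]

# decimal() is stricter: only true decimal digits (category Nd) have a value.
decimal_values = [unicodedata.decimal(ch, None) for ch in samples]

# numeric() is the most general of the three and returns a float.
numeric_values = [unicodedata.numeric(ch, None) for ch in samples]

print(digit_values)    # [7, 2, None]
print(decimal_values)  # [7, None, None]
print(numeric_values)  # [7.0, 2.0, 0.5]
```

    Without the second argument, each function raises ValueError instead of
    returning a default when the character has no such value.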

    So I wrote the following script:

    #!/usr/bin/env python

    """Example of use of the unicodedata module
    http://docs.python.org/lib/module-unicodedata.html
    """

    import unicodedata
    import sys

    # outcodec = 'latin_1'
    outcodec = 'iso8859_15'
    if len(sys.argv) > 1:
        outcodec = sys.argv[1]

    for c in range(256):
        uc = unichr(c)
        uname = unicodedata.name(uc, None)
        if uname:
            unfd = unicodedata.normalize('NFD', uc).encode(outcodec, 'replace')
            unfc = unicodedata.normalize('NFC', uc).encode(outcodec, 'replace')
            print str(c).ljust(3), uname.ljust(42), unfd.ljust(2), \
                unfc.ljust(2), unicodedata.category(uc), \
                unicodedata.numeric(uc, None)


    Here are some samples of the output:
    44 COMMA , , Po None
    45 HYPHEN-MINUS - - Pd None
    46 FULL STOP . . Po None
    47 SOLIDUS / / Po None
    48 DIGIT ZERO 0 0 Nd 0.0
    49 DIGIT ONE 1 1 Nd 1.0
    50 DIGIT TWO 2 2 Nd 2.0

    It seems that the 'Nd' category means numeric digit. Doh!

    64 COMMERCIAL AT @ @ Po None
    65 LATIN CAPITAL LETTER A A A Lu None
    66 LATIN CAPITAL LETTER B B B Lu None

    Should 'Lu' be read as 'Letter, upper'?

    94 CIRCUMFLEX ACCENT ^ ^ Sk None
    95 LOW LINE _ _ Pc None
    96 GRAVE ACCENT ` ` Sk None
    97 LATIN SMALL LETTER A a a Ll None
    98 LATIN SMALL LETTER B b b Ll None
    'Ll' == Letter lower

    124 VERTICAL LINE | | Sm None
    125 RIGHT CURLY BRACKET } } Pe None
    126 TILDE ~ ~ Sm None
    160 NO-BREAK SPACE     Zs None
    161 INVERTED EXCLAMATION MARK ¡ ¡ Po None

    What a gap!
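
    The gap is where unicodedata.name() has nothing to report, so the script
    skips those code points. A small sketch, assuming Python 3 (chr() instead
    of unichr()), that lists the code points below 256 without a name and
    checks which category they fall in:

```python
import unicodedata

# Code points below 256 for which unicodedata.name() has no name to report.
unnamed = [c for c in range(256) if unicodedata.name(chr(c), None) is None]

# Collect the general categories of those unnamed code points.
categories = {unicodedata.category(chr(c)) for c in unnamed}

print(unnamed[:3], unnamed[-3:])  # [0, 1, 2] [157, 158, 159]
print(categories)                 # {'Cc'}
```

    The unnamed code points are 0 to 31 and 127 to 159, all of them control
    characters (category 'Cc'), which is why the output jumps from 126 to 160.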

    245 LATIN SMALL LETTER O WITH TILDE o? õ Ll None
    246 LATIN SMALL LETTER O WITH DIAERESIS o? ö Ll None
    247 DIVISION SIGN ÷ ÷ Sm None
    248 LATIN SMALL LETTER O WITH STROKE ø ø Ll None
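
    The 'o?' in the NFD column for 245 and 246 comes from decomposition: NFD
    splits the letter into a base letter plus a combining mark, and the
    combining mark cannot be encoded in iso8859_15, so 'replace' turns it into
    '?'. A minimal sketch of that, assuming Python 3 (where encode() returns
    bytes):

```python
import unicodedata

o_tilde = chr(245)  # LATIN SMALL LETTER O WITH TILDE

nfd = unicodedata.normalize('NFD', o_tilde)  # decomposed form
nfc = unicodedata.normalize('NFC', o_tilde)  # composed form

# NFD yields two code points: the base letter and the combining mark.
nfd_names = [unicodedata.name(ch) for ch in nfd]

# The combining mark has no slot in iso8859_15, so it is replaced by '?'.
nfd_bytes = nfd.encode('iso8859_15', 'replace')
nfc_bytes = nfc.encode('iso8859_15', 'replace')

print(nfd_names)  # ['LATIN SMALL LETTER O', 'COMBINING TILDE']
print(nfd_bytes)  # b'o?'
print(nfc_bytes)  # b'\xf5'
```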

    Should 'Sm' be read as 'sign, mathematical'?
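
    For reference, the two-letter codes returned by unicodedata.category() are
    General_Category values from the Unicode Character Database. The sketch
    below (Python 3 assumed) maps only the codes that appear in the output
    above to their long names; CATEGORY_NAMES and describe() are hypothetical
    helper names, not part of the unicodedata module.

```python
import unicodedata

# Long names of the General_Category values seen in the sample output.
# This is a partial table, not the full list defined by Unicode.
CATEGORY_NAMES = {
    'Lu': 'Uppercase_Letter',
    'Ll': 'Lowercase_Letter',
    'Nd': 'Decimal_Number',
    'Po': 'Other_Punctuation',
    'Pd': 'Dash_Punctuation',
    'Pc': 'Connector_Punctuation',
    'Pe': 'Close_Punctuation',
    'Sk': 'Modifier_Symbol',
    'Sm': 'Math_Symbol',
    'Zs': 'Space_Separator',
}


def describe(ch):
    """Return (code, long name) for the general category of ch."""
    code = unicodedata.category(ch)
    return code, CATEGORY_NAMES.get(code, 'not in this partial table')


for ch in ('A', 'a', '0', ',', '-', '_', '}', '\u00f7', '\u00a0'):
    print(repr(ch), *describe(ch))
```

    So 'Nd', 'Lu', 'Ll' and 'Sm' stand for decimal number, uppercase letter,
    lowercase letter and math symbol respectively.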

    I think that such code snippets should be included in the documentation
    or in a Wiki.

    Regards

    Sorry for my bad English; I'm not a native speaker.
     
    F. Petitjean, Apr 20, 2005

  2. John Machin

    John Machin Guest

    You're not alone there. But I don't expect the docs for the Python
    implementation of Unicode to explain the concepts and vocabulary of
    Unicode. That's the job of the Unicode consortium, and they do a
    not-unreasonable job of it; see www.unicode.org and in particular

    http://www.unicode.org/Public/UNIDATA/UCD.html

    which explains all the things that the Python unicodedata module
    implements.

    Any effort should be directed (IMESHO) towards (a) keeping the URL in
    the Python documentation up to date [it's not] and (b) using the *LATEST*
    version of the UCD file when each version of Python is released [it is
    still stuck on 3.2.0, while the current version available from
    unicode.org is 4.1.0].
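
    To see which version of the Unicode Character Database a given Python was
    built with, the unicodedata module exposes it as a string. A minimal
    sketch (Python 3 syntax assumed; the value printed depends on the
    interpreter, so no output is shown):

```python
import unicodedata

# Version of the Unicode Character Database compiled into this interpreter.
ucd_version = unicodedata.unidata_version
print(ucd_version)

# The string is dot-separated, so the major version is easy to extract.
major = int(ucd_version.split('.')[0])
print(major)
```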

    [Exit, pursued by a bear.]
    [Noises off.]

    OK OK don't hit me, Martin, how about instructions on how to DIY,
    then?

    Cheers,
    John
     
    John Machin, Apr 20, 2005
