trying to understand unicode

F. Petitjean

Python has very good support for Unicode, UTF-8 and encodings in general,
but I have some difficulty with the concepts and the vocabulary. The
documentation is not bad, but when reading, for example,
http://docs.python.org/lib/module-unicodedata.html
it took me a long time to figure out what unicodedata.digit(unichr)
means; a simple example is badly lacking.
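
As an illustration of the kind of example that is missing, here is a minimal sketch of what digit() returns next to decimal() and numeric(). It assumes Python 3 (where unichr() is spelled chr() and every str is Unicode); the numeric_properties() helper is a hypothetical name used only for this illustration. Passing None as the second argument avoids the ValueError that is raised when a character has no such property.

```python
import unicodedata

def numeric_properties(ch):
    """Return (decimal, digit, numeric) for one character, None where undefined."""
    return (
        unicodedata.decimal(ch, None),   # only true decimal digits (category Nd)
        unicodedata.digit(ch, None),     # also digit-like characters such as superscripts
        unicodedata.numeric(ch, None),   # any numeric value, returned as a float
    )

for ch in ["7", "\u00b2", "\u00bd", "A"]:
    print(unicodedata.name(ch).ljust(24), numeric_properties(ch))
```

With this, DIGIT SEVEN has all three values, SUPERSCRIPT TWO has a digit and a numeric value but no decimal value, VULGAR FRACTION ONE HALF has only a numeric value (0.5), and a letter has none.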

So I wrote the following script:

#!/usr/bin/env python

"""Example of use of the unicodedata module
http://docs.python.org/lib/module-unicodedata.html
"""

import unicodedata
import sys

# Output encoding; override it with the first command-line argument.
# outcodec = 'latin_1'
outcodec = 'iso8859_15'
if len(sys.argv) > 1:
    outcodec = sys.argv[1]

for c in range(256):
    uc = unichr(c)
    # Code points without a name (the control characters) are skipped.
    uname = unicodedata.name(uc, None)
    if uname:
        # 'replace' turns anything outcodec cannot represent into '?'.
        unfd = unicodedata.normalize('NFD', uc).encode(outcodec, 'replace')
        unfc = unicodedata.normalize('NFC', uc).encode(outcodec, 'replace')
        print str(c).ljust(3), uname.ljust(42), unfd.ljust(2), \
            unfc.ljust(2), unicodedata.category(uc), \
            unicodedata.numeric(uc, None)
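
The script above targets Python 2 (unichr and the print statement). For readers on Python 3, a rough port might look like the following. It is only a sketch: the describe_range() name, its parameters, and the choice to return rows instead of printing directly are illustrative assumptions, not part of the original post.

```python
import unicodedata

def describe_range(outcodec="iso8859_15", limit=256):
    """Collect (code, name, NFD bytes, NFC bytes, category, numeric) per named code point."""
    rows = []
    for c in range(limit):
        uc = chr(c)  # Python 3 spelling of unichr()
        uname = unicodedata.name(uc, None)
        if uname is None:
            continue  # unnamed code points (control characters) are skipped
        nfd = unicodedata.normalize("NFD", uc).encode(outcodec, "replace")
        nfc = unicodedata.normalize("NFC", uc).encode(outcodec, "replace")
        rows.append((c, uname, nfd, nfc,
                     unicodedata.category(uc), unicodedata.numeric(uc, None)))
    return rows

if __name__ == "__main__":
    for c, uname, nfd, nfc, cat, num in describe_range():
        # The encoded forms are bytes in Python 3, so they are shown with repr().
        print(str(c).ljust(3), uname.ljust(42), repr(nfd).ljust(8),
              repr(nfc).ljust(8), cat, num)
```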


And here are some samples of the output:
44  COMMA                                      ,  ,  Po None
45  HYPHEN-MINUS                               -  -  Pd None
46  FULL STOP                                  .  .  Po None
47  SOLIDUS                                    /  /  Po None
48  DIGIT ZERO                                 0  0  Nd 0.0
49  DIGIT ONE                                  1  1  Nd 1.0
50  DIGIT TWO                                  2  2  Nd 2.0

It seems that the 'Nd' category means 'Number, decimal digit'. Doh!

64  COMMERCIAL AT                              @  @  Po None
65  LATIN CAPITAL LETTER A                     A  A  Lu None
66  LATIN CAPITAL LETTER B                     B  B  Lu None

'Lu' should read 'Letter, uppercase'?

94  CIRCUMFLEX ACCENT                          ^  ^  Sk None
95  LOW LINE                                   _  _  Pc None
96  GRAVE ACCENT                               `  `  Sk None
97  LATIN SMALL LETTER A                       a  a  Ll None
98  LATIN SMALL LETTER B                       b  b  Ll None
'Ll' == 'Letter, lowercase'
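
The two-letter codes returned by unicodedata.category() are the General_Category abbreviations from the Unicode Character Database: the first letter gives the major class (L letter, N number, P punctuation, S symbol, Z separator) and the second the subclass. A minimal sketch, assuming Python 3. The CATEGORY_NAMES table and the describe() helper are hypothetical and written by hand for this illustration; the table covers only the codes that show up in the samples in this post, and as far as I know the unicodedata module itself has no function that returns the long names.

```python
import unicodedata

# Partial, hand-written table: only the codes seen in the sample output.
CATEGORY_NAMES = {
    "Lu": "Letter, uppercase",
    "Ll": "Letter, lowercase",
    "Nd": "Number, decimal digit",
    "Pc": "Punctuation, connector",
    "Pd": "Punctuation, dash",
    "Pe": "Punctuation, close",
    "Po": "Punctuation, other",
    "Sk": "Symbol, modifier",
    "Sm": "Symbol, math",
    "Zs": "Separator, space",
}

def describe(ch):
    """Return (code, long name) for the general category of one character."""
    code = unicodedata.category(ch)
    return code, CATEGORY_NAMES.get(code, "(not in this partial table)")

for ch in ["A", "a", "0", ",", "-", "_", "}", "~", "^", "\u00a0"]:
    code, long_name = describe(ch)
    print(repr(ch).ljust(8), code, long_name)
```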

124 VERTICAL LINE                              |  |  Sm None
125 RIGHT CURLY BRACKET                        }  }  Pe None
126 TILDE                                      ~  ~  Sm None
160 NO-BREAK SPACE                                   Zs None
161 INVERTED EXCLAMATION MARK                  ¡  ¡  Po None

What a gap!
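
The jump from 126 to 160 comes from the script's `if uname:` test: code points 127 to 159 (like 0 to 31) are control characters, and unicodedata.name() has no name for them, so they are skipped. A minimal sketch, assuming Python 3 (chr() instead of unichr()), that collects the unnamed code points below 256 and the categories they fall into:

```python
import unicodedata

# Code points below 256 for which unicodedata has no character name.
unnamed = [c for c in range(256) if unicodedata.name(chr(c), None) is None]

# General categories of those code points ('Cc' is "Other, control").
categories = sorted({unicodedata.category(chr(c)) for c in unnamed})

print(len(unnamed), "unnamed code points; categories:", categories)
print("lowest:", unnamed[0], "highest:", unnamed[-1])
```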

245 LATIN SMALL LETTER O WITH TILDE            o? õ  Ll None
246 LATIN SMALL LETTER O WITH DIAERESIS        o? ö  Ll None
247 DIVISION SIGN                              ÷  ÷  Sm None
248 LATIN SMALL LETTER O WITH STROKE           ø  ø  Ll None

'Sm' should read 'Symbol, math'?
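
The 'o?' in the NFD column for 245 and 246 is not a bug. NFD splits the letter into a plain 'o' followed by a combining mark, and the combining mark has no equivalent in iso8859_15, so the 'replace' error handler substitutes '?'. NFC keeps the single precomposed character, which the codec can encode. A minimal sketch, assuming Python 3:

```python
import unicodedata

ch = "\u00f5"  # LATIN SMALL LETTER O WITH TILDE
nfd = unicodedata.normalize("NFD", ch)  # decomposed: base letter + combining mark
nfc = unicodedata.normalize("NFC", ch)  # composed: one code point

print([unicodedata.name(part) for part in nfd])
# ['LATIN SMALL LETTER O', 'COMBINING TILDE']
print(nfd.encode("iso8859_15", "replace"))  # b'o?'
print(nfc.encode("iso8859_15", "replace"))  # b'\xf5'
```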

I think that such code snippets should be included in the documentation
or in a Wiki.

Regards

Sorry for my bad English; I'm not a native speaker.
 

John Machin

> Python has very good support for Unicode, UTF-8 and encodings in general,
> but I have some difficulty with the concepts and the vocabulary.

You're not alone there. But I don't expect the docs for the Python
implementation of Unicode to explain the concepts and vocabulary of
Unicode. That's the job of the Unicode Consortium, and they do a
not-unreasonable job of it; see www.unicode.org. In particular,

http://www.unicode.org/Public/UNIDATA/UCD.html

explains all the things that the Python unicodedata module
implements.

> The documentation is not bad, but when reading, for example,
> http://docs.python.org/lib/module-unicodedata.html
> it took me a long time to figure out what unicodedata.digit(unichr)
> means; a simple example is badly lacking.
>
> So I wrote the following script:
[snip]
> I think that such code snippets should be included in the documentation
> or in a Wiki.

Any effort should be directed (IMESHO) towards (a) keeping the URL in
the Python documentation up to date [it's not] and (b) using the *LATEST*
version of the UCD files when each version of Python is released [it's
still stuck on 3.2.0 when the current version available from unicode.org
is 4.1.0].
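
To see which UCD version a given interpreter was built with, the unicodedata module exposes a unidata_version string. A minimal sketch, assuming Python 3; the ucd_version() helper is a hypothetical name for this illustration, and the value printed depends on the interpreter, so no output is shown.

```python
import unicodedata

def ucd_version():
    """Return the bundled Unicode Character Database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))

print("UCD version:", unicodedata.unidata_version)
print("At least 4.1.0:", ucd_version() >= (4, 1, 0))
```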

[Exit, pursued by a bear.]
[Noises off.]

OK OK don't hit me, Martin, how about instructions on how to DIY,
then?

Cheers,
John
 
