F. Petitjean
Python has very good support for Unicode, UTF-8 and encodings in general, but I
have some difficulty with the concepts and the vocabulary. The
documentation is not bad, but, for example, when reading
http://docs.python.org/lib/module-unicodedata.html
it took me a long time to figure out what unicodedata.digit(unichr) is
supposed to mean; a simple example is badly lacking.
So I wrote the following script:
#!/usr/bin/env python
"""Example of use of the unicodedata module
http://docs.python.org/lib/module-unicodedata.html
"""
import unicodedata
import sys

# outcodec = 'latin_1'
outcodec = 'iso8859_15'
if len(sys.argv) > 1:
    outcodec = sys.argv[1]

for c in range(256):
    uc = unichr(c)
    uname = unicodedata.name(uc, None)
    if uname:
        unfd = unicodedata.normalize('NFD', uc).encode(outcodec,
                                                       'replace')
        unfc = unicodedata.normalize('NFC', uc).encode(outcodec,
                                                       'replace')
        print str(c).ljust(3), uname.ljust(42), unfd.ljust(2), \
              unfc.ljust(2), \
              unicodedata.category(uc), unicodedata.numeric(uc, None)
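For anyone on Python 3, here is a rough port of the script above, meant as a sketch and not as a definitive version. It assumes Python 3, where unichr() no longer exists and chr() takes its place. The function name describe is made up for this example, and the encoded bytes are decoded again only so that they print as text.

```python
import sys
import unicodedata


def describe(code_point, outcodec="iso8859_15"):
    """Return one output row for a code point, or None if it has no name."""
    uc = chr(code_point)  # chr() replaces Python 2's unichr()
    uname = unicodedata.name(uc, None)
    if uname is None:
        return None
    # Encode with 'replace' as in the original script, then decode again
    # so that the result is printable text instead of a bytes object.
    unfd = unicodedata.normalize("NFD", uc).encode(outcodec, "replace").decode(outcodec)
    unfc = unicodedata.normalize("NFC", uc).encode(outcodec, "replace").decode(outcodec)
    return (code_point, uname, unfd, unfc,
            unicodedata.category(uc), unicodedata.numeric(uc, None))


if __name__ == "__main__":
    # Optional first argument: the output codec, as in the original script.
    codec = sys.argv[1] if len(sys.argv) > 1 else "iso8859_15"
    for c in range(256):
        row = describe(c, codec)
        if row:
            print(str(row[0]).ljust(3), row[1].ljust(42),
                  row[2].ljust(2), row[3].ljust(2), row[4], row[5])
```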
Here are some samples of the output:
44  COMMA                                      ,  ,  Po None
45  HYPHEN-MINUS                               -  -  Pd None
46  FULL STOP                                  .  .  Po None
47  SOLIDUS                                    /  /  Po None
48  DIGIT ZERO                                 0  0  Nd 0.0
49  DIGIT ONE                                  1  1  Nd 1.0
50  DIGIT TWO                                  2  2  Nd 2.0
It seems that the 'Nd' category means 'Number, decimal digit'. Doh!
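Since unicodedata.digit() was what puzzled me in the first place, here is a small sketch (Python 3 assumed) comparing it with its two neighbours, unicodedata.decimal() and unicodedata.numeric(). The helper name numeric_views and the choice of sample characters are mine; passing None as the second argument just avoids the ValueError that is raised when a character has no such value.

```python
import unicodedata


def numeric_views(ch):
    """Return (decimal, digit, numeric) for one character, None where undefined."""
    return (unicodedata.decimal(ch, None),
            unicodedata.digit(ch, None),
            unicodedata.numeric(ch, None))


# '7' is an ordinary decimal digit, so all three functions give a value.
print(numeric_views("7"))        # (7, 7, 7.0)
# SUPERSCRIPT TWO is a digit but not a decimal digit.
print(numeric_views("\u00b2"))   # (None, 2, 2.0)
# VULGAR FRACTION ONE HALF only has a numeric value.
print(numeric_views("\u00bd"))   # (None, None, 0.5)
# A letter has none of the three.
print(numeric_views("A"))        # (None, None, None)
```

So decimal() is the strictest of the three, digit() also accepts things like superscript digits, and numeric() covers anything with a numeric value, fractions included.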
64  COMMERCIAL AT                              @  @  Po None
65  LATIN CAPITAL LETTER A                     A  A  Lu None
66  LATIN CAPITAL LETTER B                     B  B  Lu None
'Lu' should read 'Letter, uppercase'?
94  CIRCUMFLEX ACCENT                          ^  ^  Sk None
95  LOW LINE                                   _  _  Pc None
96  GRAVE ACCENT                               `  `  Sk None
97  LATIN SMALL LETTER A                       a  a  Ll None
98  LATIN SMALL LETTER B                       b  b  Ll None
'Ll' == 'Letter, lowercase'
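To stop guessing at the two-letter codes, a small lookup table helps. This is only a sketch (Python 3 assumed): as far as I know the unicodedata module returns just the abbreviation, so the dictionary below is hand-written, its name CATEGORY_NAMES is my own, and it covers only the categories that show up in the sample output of this post. The descriptions follow the Unicode general category list.

```python
import unicodedata

# Hand-written table, limited to the categories seen in the sample output.
CATEGORY_NAMES = {
    "Lu": "Letter, uppercase",
    "Ll": "Letter, lowercase",
    "Nd": "Number, decimal digit",
    "Pc": "Punctuation, connector",
    "Pd": "Punctuation, dash",
    "Pe": "Punctuation, close",
    "Po": "Punctuation, other",
    "Sk": "Symbol, modifier",
    "Sm": "Symbol, math",
    "Zs": "Separator, space",
}


def describe_category(ch):
    """Return (code, description) for the general category of one character."""
    code = unicodedata.category(ch)
    return code, CATEGORY_NAMES.get(code, "(not in this table)")


for ch in "A", "a", "5", "_", "~":
    print(repr(ch), *describe_category(ch))
```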
124 VERTICAL LINE                              |  |  Sm None
125 RIGHT CURLY BRACKET                        }  }  Pe None
126 TILDE                                      ~  ~  Sm None
160 NO-BREAK SPACE                                   Zs None
161 INVERTED EXCLAMATION MARK                  ¡  ¡  Po None
What a gap! (Code points 127 to 159 are control characters without a name, so the script skips them.)
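The jump from 126 to 160 can be checked directly. A minimal sketch, assuming Python 3 (the variable name unnamed is mine): it collects every code point below 256 for which unicodedata.name() finds no name, which are exactly the ones the script silently skips.

```python
import unicodedata

# Code points below 256 that have no name in the Unicode database.
unnamed = [c for c in range(256) if unicodedata.name(chr(c), None) is None]

# Every one of them is a control character (category 'Cc').
all_controls = all(unicodedata.category(chr(c)) == "Cc" for c in unnamed)

print(unnamed[:3], "...", unnamed[-3:])  # [0, 1, 2] ... [157, 158, 159]
print(len(unnamed), all_controls)        # 65 True
```

The unnamed code points are 0 to 31 and 127 to 159, which accounts for the gap in the listing.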
245 LATIN SMALL LETTER O WITH TILDE            o? õ  Ll None
246 LATIN SMALL LETTER O WITH DIAERESIS        o? ö  Ll None
247 DIVISION SIGN                              ÷  ÷  Sm None
248 LATIN SMALL LETTER O WITH STROKE           ø  ø  Ll None
'Sm' should read 'Symbol, math'?
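The 'o?' in the rows for 245 and 246 above is not garbage: it is what the NFD form looks like once it is forced into an 8-bit codec. A short sketch of what happens, assuming Python 3 and the same iso8859_15 codec as in the script (the variable names are mine):

```python
import unicodedata

ch = "\u00f5"  # LATIN SMALL LETTER O WITH TILDE

nfd = unicodedata.normalize("NFD", ch)  # decomposed: base letter + combining mark
nfc = unicodedata.normalize("NFC", ch)  # composed: a single code point

# NFD splits the character into two code points.
parts = [unicodedata.name(p) for p in nfd]
print(parts)  # ['LATIN SMALL LETTER O', 'COMBINING TILDE']

# COMBINING TILDE does not exist in ISO 8859-15, so 'replace' turns it into '?'.
encoded_nfd = nfd.encode("iso8859_15", "replace")
encoded_nfc = nfc.encode("iso8859_15", "replace")
print(encoded_nfd)  # b'o?'
print(encoded_nfc)  # b'\xf5'
```

In other words, the NFC column survives the encoding because the precomposed letter is part of the codec, while the NFD column loses its combining mark.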
I think that such code snippets should be included in the documentation
or in a Wiki.
Regards
Sorry for my bad English; I'm not a native speaker.