regenerating unicodedata for py2.7 using py3 makeunicodedata.py?

Vlastimil Brom · Nov 13, 2010

Hi all,
I'd like to ask about a surprising possibility I found while
investigating the new Unicode 6.0 standard for use in Python.
As the Python 2 series won't be updated in this regard
( http://bugs.python.org/issue10400 ),
I tried my "poor man's approach" of compiling the needed pyd file with
the recent Unicode data (cf. the older post
http://mail.python.org/pipermail/python-list/2010-March/1240002.html ).
While checking the changed format, I found to my great surprise that it
is possible to generate the header files using the py3
makeunicodedata.py,
which has already been updated for Unicode 6.0; this is even more
convenient than the previous versions, as the needed data are
downloaded automatically.
http://svn.python.org/view/python/b.../makeunicodedata.py?view=markup&pathrev=85371
It turned out that the resulting headers are accepted by MS Visual
C++ Express along with the py2.7 source files,
and that the generated unicodedata.pyd seems to work, at
least in the cases I have tested so far.
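
A quick smoke test along the following lines could confirm that a rebuilt module really exposes the newer data. This is only a sketch: the function name check_unicode_6_0 is made up for illustration, and U+20B9 INDIAN RUPEE SIGN is used as the probe on the assumption that it is a convenient character first assigned in Unicode 6.0. It checks one character only and says nothing about the rest of the database.

```python
import sys
import unicodedata


def check_unicode_6_0():
    """Return (version, ok) for the unicodedata module in use.

    version is the Unicode database version string reported by the module;
    ok is True if a character first assigned in Unicode 6.0 is known.
    """
    version = unicodedata.unidata_version
    rupee = u"\u20b9"  # INDIAN RUPEE SIGN, assumed probe for 6.0 data
    name = unicodedata.name(rupee, None)  # None instead of ValueError
    category = unicodedata.category(rupee)
    ok = (name == "INDIAN RUPEE SIGN" and category == "Sc")
    return version, ok


version, ok = check_unicode_6_0()
print("Python %s, Unicode database %s, 6.0 data present: %s"
      % (sys.version.split()[0], version, ok))
```

The probe is a BMP character, so the check does not depend on whether the interpreter is a narrow or a wide build.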

Is it intended, or even guaranteed, that these generated files are
compatible across py2.7 and py3, or am I going to be bitten by some
less obvious issues later?

The newly added ranges and characters are available; only in CJK
Unified Ideographs Extension D are the character names missing
(while the categories are present), but this appears to be the same in the
original unicodedata with 5.2 for CJK Unified Ideographs Extension C.
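
The difference between "category present" and "name missing" can be probed with a small helper like the one below. It is a sketch written for Python 3 (it uses chr, which covers the supplementary planes there); the helper name describe is hypothetical. Note that unicodedata.name raises ValueError for a character without a name unless a default is passed as the second argument, which is what the helper relies on.

```python
import unicodedata


def describe(code_point):
    """Return (category, name) for a code point.

    name is None when the database has no name for the character.
    """
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.name(ch, None)


# Last assigned code points of Extension C and Extension D (Unicode 6.0).
for cp in (0x2B734, 0x2B81D):
    category, name = describe(cp)
    print("U+%04X %s %s" % (cp, category, name or "<no name>"))
```

On a build affected by the omission, the category would still be reported while the name column shows the placeholder.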

Could anybody please confirm whether this way of updating the
unicodedata for 2.7 is generally viable, or point out possible problems
it may lead to?
Many thanks in advance,
Vlastimil Brom

Martin v. Loewis · Nov 13, 2010

Vlastimil Brom said:
Is this intended or even guaranteed for these generated files to be
compatible across py2.7 and py3, or am I going to be bitten by some
less obvious issues later?

It works because the generated files are just arrays of structures,
and these structures are the same in 2.7 and 3.2. However, there is
no guarantee about this property: you will need to check for changes
to unicodedata.c to see whether they may affect compatibility.

Regards,
Martin

Vlastimil Brom · Nov 13, 2010

2010/11/13 Martin v. Loewis said:
It works because the generated files are just arrays of structures,
and these structures are the same in 2.7 and 3.2. However, there is
no guarantee about this property: you will need to check for changes
to unicodedata.c to see whether they may affect compatibility.

Regards,
Martin

Thanks for the confirmation Martin!

Do you think the mentioned omission of the character names of some
CJK ranges in unicodedata is intended, or should it be reported to the
tracker?

Regards,
Vlastimil Brom

Martin v. Loewis · Nov 18, 2010

Vlastimil Brom said:
Thanks for the confirmation Martin!

Do you think the mentioned omission of the character names of some
CJK ranges in unicodedata is intended, or should it be reported to the
tracker?

It's certainly a bug. So a bug report would be appreciated, but much
more so a patch. Ideally, the patch would either be completely
forward-compatible (should the CJK ranges change in future Unicode
versions), or at least have a safeguard to detect that the data file
is getting out of sync with the C implementation.
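
One way such a safeguard might look, sketched in Python rather than C: compare the CJK ranges declared in the data file with the ranges the implementation has hard-coded, and complain about any that differ. Everything named here is an assumption made for illustration. HARD_CODED_RANGES is a hypothetical stand-in for whatever unicodedata.c contains (chosen to match the gaps reported in this thread, not copied from the source), and SAMPLE_UNICODE_DATA is a hand-written excerpt in the UnicodeData.txt format that should be checked against the real file.

```python
import re

# Hypothetical stand-in for the ranges hard-coded in the C implementation.
HARD_CODED_RANGES = [
    (0x3400, 0x4DB5),
    (0x4E00, 0x9FBB),
    (0x20000, 0x2A6D6),
]

# Hand-written excerpt in UnicodeData.txt format (assumed 6.0 values).
SAMPLE_UNICODE_DATA = """\
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
20000;<CJK Ideograph Extension B, First>;Lo;0;L;;;;;N;;;;;
2A6D6;<CJK Ideograph Extension B, Last>;Lo;0;L;;;;;N;;;;;
2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
"""

_RANGE_MARKER = re.compile(r"<CJK Ideograph.*, (First|Last)>$")


def cjk_ranges(lines):
    """Collect (first, last) pairs from the First/Last marker lines."""
    ranges = []
    start = None
    for line in lines:
        fields = line.split(";")
        if len(fields) < 2:
            continue
        match = _RANGE_MARKER.match(fields[1])
        if not match:
            continue
        code = int(fields[0], 16)
        if match.group(1) == "First":
            start = code
        else:
            ranges.append((start, code))
            start = None
    return ranges


def out_of_sync(data_ranges, hard_coded):
    """Return the data-file ranges that the hard-coded list does not match."""
    return sorted(set(data_ranges) - set(hard_coded))


missing = out_of_sync(cjk_ranges(SAMPLE_UNICODE_DATA.splitlines()),
                      HARD_CODED_RANGES)
for first, last in missing:
    print("range U+%04X-U+%04X is not covered by the hard-coded list"
          % (first, last))
```

A generator script could run a check like this and abort instead of printing, so that a new Unicode version cannot be imported silently with stale ranges.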

Regards,
Martin

Vlastimil Brom · Nov 19, 2010

2010/11/18 Martin v. Loewis said:
It's certainly a bug. So a bug report would be appreciated, but much
more so a patch. Ideally, the patch would either be completely
forward-compatible (should the CJK ranges change in future Unicode
versions), or at least have a safeguard to detect that the data file
is getting out of sync with the C implementation.

Regards,
Martin

Thanks,
I just created a bug ticket:
http://bugs.python.org/issue10459

The omissions of character names seem to be:

龼 (0x9fbc) - 鿋 (0x9fcb)
(CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])

𪜀 (0x2a700) - 𫜴 (0x2b734)
(CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])

𫝀 (0x2b740) - 𫠝 (0x2b81d)
(CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])

(Also the unprintable ASCII controls, Surrogates and Private use area,
where the missing names are probably ok.)
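
A list like the one above can be produced by scanning a range and grouping the assigned code points that have no name into contiguous runs. The following is a sketch for Python 3, not the script actually used for this report; unnamed_runs and its has_name parameter are hypothetical names, and the parameter exists only so the grouping logic can be exercised with a fake lookup.

```python
import unicodedata


def _has_name(code_point):
    """True if the database in use has a name for the code point."""
    return unicodedata.name(chr(code_point), None) is not None


def unnamed_runs(first, last, has_name=_has_name):
    """Yield (start, end) runs of assigned but nameless code points.

    Unassigned code points (category "Cn") are skipped and end a run.
    """
    run_start = run_end = None
    for cp in range(first, last + 1):
        assigned = unicodedata.category(chr(cp)) != "Cn"
        if assigned and not has_name(cp):
            if run_start is None:
                run_start = cp
            run_end = cp
        elif run_start is not None:
            yield (run_start, run_end)
            run_start = None
    if run_start is not None:
        yield (run_start, run_end)


# Scan the main CJK block; a build without the omission yields no runs.
for start, end in unnamed_runs(0x4E00, 0x9FFF):
    print("no names for U+%04X-U+%04X" % (start, end))
```

Scanning the control characters, surrogates, or private use area the same way would also report runs, which matches the remark above that missing names there are expected.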

Unfortunately, I am not able to provide a patch, mainly because
unicodedata is C code.
A while ago I considered writing some unicodedata enhancements in
Python, which would support the ranges and script names, full category
names etc., but so far direct programmatic lookups in the online
Unicode docs, combined with some simple processing, have worked
sufficiently...
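
As an illustration of what a pure-Python range lookup might look like, here is a minimal sketch using bisect. The BLOCKS table holds only the three blocks listed earlier in this message; a real version would presumably load the full table from the Unicode Blocks.txt file. The names BLOCKS and block_name are made up for this example.

```python
from bisect import bisect_right

# (first, last, block name), sorted by first code point.
# Only the three blocks mentioned in this thread; not a complete table.
BLOCKS = [
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x2A700, 0x2B73F, "CJK Unified Ideographs Extension C"),
    (0x2B740, 0x2B81F, "CJK Unified Ideographs Extension D"),
]
_STARTS = [block[0] for block in BLOCKS]


def block_name(code_point):
    """Return the name of the block containing code_point, or None."""
    index = bisect_right(_STARTS, code_point) - 1
    if index >= 0:
        first, last, name = BLOCKS[index]
        if code_point <= last:
            return name
    return None


print(block_name(0x9FCB))
print(block_name(0x2B740))
```

Because the table is sorted by start code point, each lookup is a binary search rather than a scan over all blocks.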

Regards,
Vlastimil Brom
