Shrinky-dink Python (also, non-Unicode Python build is broken)

Larry Hastings

I'm an indie shareware Windows game developer. In indie shareware
game development, download size is terribly important; conventional
wisdom holds that--even today--your download should be 5MB or less.

I'd like to use Python in my games. However, python24.dll is 1.86MB,
and zips down to 877k. I can't afford to devote 1/6 of my download
to just the scripting interpreter; I've got music, and textures, and
my own crappy code to ship.

Following a friend's suggestion, as an experiment I downloaded the
Python 2.4.2 source, then set about stripping out everything I could.
I removed:
* Unicode support, including the CJK codecs
* All doc strings
* *Every* module written in C

Now when I build, python24.dll is 570k, and zips down to about 260k.
But I learned some things on the way.
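
The size budget above is easy to sanity-check with a few lines of arithmetic. This is only a rough sketch: it assumes "k" means KiB and that the 5MB budget is 5 * 1024 KiB, and the helper name download_share is made up for illustration.

```python
# Rough check of the download-budget figures quoted above.
# Assumption: "k" is KiB and the 5MB budget is 5 * 1024 KiB.
BUDGET_KIB = 5 * 1024


def download_share(zipped_kib, budget_kib=BUDGET_KIB):
    """Return the fraction of the download budget used by one file."""
    return zipped_kib / float(budget_kib)


stock = download_share(877)     # stock python24.dll, zipped
stripped = download_share(260)  # stripped-down build, zipped

print("stock:    %.1f%% of the budget" % (stock * 100))
print("stripped: %.1f%% of the budget" % (stripped * 100))
```

Under those assumptions the stock DLL takes about 17% of the budget (roughly the 1/6 mentioned above), and the stripped build about 5%.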


First and foremost: turning off Py_USING_UNICODE *breaks the build*
on Windows. The following breakages were all fixed with judicious
applications of #ifdef Py_USING_UNICODE:
* The implementation of "multi-byte codecs" (CJK codecs) implicitly
  assumes that it can use all the Unicode facilities. So all the
  files in "Modules/cjkcodecs" fail to build.
* Obviously, the Unicode string object depends on Unicode support,
  so Objects/unicode* doesn't build.
* There are several spots in the code that need to handle Unicode
  strings in some slightly special way, and assume Unicode is turned
  on. E.g.:
  * Modules/posixmodule.c, posix__getfullpathname(), line 1745
  * same file, posix_open(), starting on line 5201
  * Objects/fileobject.c, open_the_file(), starting on line 158
  * _winreg.c, Py2Reg(), starting on lines 724 and 777

In addition, there was one slightly more complicated problem: _winreg.c
assumes it should call PyUnicode_DecodeMBCS() to turn strings pulled
from the registry into Unicode strings. I'm not sure what the correct
thing to do here is; I went with changing the calls from
PyUnicode_DecodeMBCS() to PyString_FromStringAndSize() for non-Unicode
builds.

Of course, it's not the most important thing in the world--after all,
I'm the first person to even *notice*, right? But it seems a shame
that one can break the build so easily. If it pleases the stewards of
Python, I would be happy to submit patches that fix the non-"using
Unicode" build.


Second of all, the dumb-as-a-bag-of-rocks Windows linker (at least
the one used by VC++ under MSVS .Net 2003) *links in unused static
symbols*. If I want to excise the code for a module, it is not
sufficient to comment-out the relevant _inittab line in config.c.
Nor does it help if I comment out the "extern" prototype for the
init function. As far as I can tell, the only way to *really* get
rid of a module, including all its static functions and static data,
is to actually *remove all the code* (with comments, or #if, or
whatnot). What a nosebleed, huh?

So in order to build my *really* minimal python24.dll, I have to hack
up the source something fierce. It would be pleasant if the Python
source code provided an easy facility for turning off modules at
compile-time. I would be happy to propose something / write a PEP
/ submit patches to do such a thing, if there is a chance that such
a thing could make it into the official Python source. However, I
realize that this has terribly limited appeal; that, and the fact
that Python releases are infrequent, makes me think it's not a
terrible hardship if I had to re-hack up each new Python release
by hand.


Whatcha think, froods?


/larry/
 
Steven Bethard

Larry said:
Of course, it's not the most important thing in the world--after all,
I'm the first person to even *notice*, right? But it seems a shame
that one can break the build so easily. If it pleases the stewards of
Python, I would be happy to submit patches that fix the non-"using
Unicode" build.

There was a recent python-dev thread_ suggesting that we drop support
for --disable-unicode, mainly I think because no one was willing to
maintain it. If you're willing to offer patches and some maintenance,
it probably has a decent chance of acceptance.

.. _thread: http://mail.python.org/pipermail/python-dev/2005-October/056897.html

Larry said:
So in order to build my *really* minimal python24.dll, I have to hack
up the source something fierce. It would be pleasant if the Python
source code provided an easy facility for turning off modules at
compile-time. I would be happy to propose something / write a PEP
/ submit patches to do such a thing, if there is a chance that such
a thing could make it into the official Python source. However, I
realize that this has terribly limited appeal; that, and the fact
that Python releases are infrequent, makes me think it's not a
terrible hardship if I had to re-hack up each new Python release
by hand.

My impression is that, for most things like this, python-dev is happy to
accept the patches *if* someone is willing to commit to maintaining
them, and they don't make the codebase too much more complex.

STeVe
 
Giovanni Bajo

Larry said:
First and foremost: turning off Py_USING_UNICODE *breaks the build*
on Windows.

Probably nobody does that nowadays. My own feeling (but I don't have numbers
to back it up) is that the biggest share of the .DLL's size comes from
things like the CJK codecs (which are about 800k). I don't think you're
gaining that much by trying to remove Unicode support at all, especially
since (as you noticed) it's going to be a maintenance headache.

Larry said:
Second of all, the dumb-as-a-bag-of-rocks Windows linker (at least
the one used by VC++ under MSVS .Net 2003) *links in unused static
symbols*. If I want to excise the code for a module, it is not
sufficient to comment-out the relevant _inittab line in config.c.
Nor does it help if I comment out the "extern" prototype for the
init function. As far as I can tell, the only way to *really* get
rid of a module, including all its static functions and static data,
is to actually *remove all the code* (with comments, or #if, or
whatnot). What a nosebleed, huh?

This is off-topic here, but the MSVC linker *can* strip unused symbols, of
course. Look into /OPT:REF.

Larry said:
So in order to build my *really* minimal python24.dll, I have to hack
up the source something fierce. It would be pleasant if the Python
source code provided an easy facility for turning off modules at
compile-time. I would be happy to propose something / write a PEP
/ submit patches to do such a thing, if there is a chance that such
a thing could make it into the official Python source. However, I
realize that this has terribly limited appeal; that, and the fact
that Python releases are infrequent, makes me think it's not a
terrible hardship if I had to re-hack up each new Python release
by hand.

You're not the only one complaining about the size of the Python .DLL:
people developing self-contained programs with tools like PyInstaller or
py2exe (that is, programs which are supposed to run without Python
installed) are also affected by the lack of a clear policy.

I myself complained before, especially after Python 2.4 got those ginormous
CJK codecs within its standard DLL; you can look for the thread in Google.
The bottom line of that discussion was:

- The policy about what must be linked within the Python .DLL and what must
  be kept outside should be proposed as a PEP, and it should provide
  guidelines to be applied to future modules as well.
- There will be some opposition to the obvious policy of "keeping the bare
  minimum inside the DLL" because of inefficiencies in the Python build
  system. Specifically, I was told that maintaining modules outside the DLL
  instead of inside the DLL is more burdensome for some reason (which I have
  not investigated), but surely, with a good build system, switching between
  the two configurations should be a matter of changing a single word in a
  single place, with no code changes required.
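
To get a feel for what is currently linked into the interpreter itself, as opposed to shipped as separate extension files, one can inspect sys.builtin_module_names from the standard sys module. A minimal sketch follows; the helper name is_linked_in is hypothetical, and the list it prints varies by platform and build.

```python
import sys


def is_linked_in(name):
    """Return True if `name` is compiled into the interpreter binary."""
    return name in sys.builtin_module_names


# Modules compiled into the interpreter itself (on Windows, that is
# the pythonXY.dll being discussed here).
for name in sorted(sys.builtin_module_names):
    print(name)
```

Comparing this list across builds is one cheap way to see the effect of moving a module in or out of the DLL.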

Personally, I could find some time to write up a PEP, but surely not to pick
up a lengthy discussion nor to improve the build system myself. Hence, I
mostly decided to give up for now and stick with recompiling Python myself.
The policy I'd propose is that the DLL should contain the minimum set of
modules needed to run the following Python program:
 
LordLaraby

I myself wonder why python.dll can't just load a companion i18n.dll
when and if a script calls for it, for example by holding weak
references to those functions and loading the DLL as needed, and
throwing an exception if it can't be loaded. Most of the CJK
stuff could then be carried in that DLL and in some cases, such as
py2exe, not even be included because it's not used.
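
The load-on-first-use idea can be sketched at the Python level. This is only an analogy for the DLL case, not how python.dll works: the LazyModule class is hypothetical, and json merely stands in for an optional component. On the native side, the MSVC linker's delay-loaded DLL support (/DELAYLOAD) is the closest equivalent I know of.

```python
import importlib


class LazyModule(object):
    """Proxy that imports the real module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only reached for attributes not found on the proxy itself,
        # i.e. anything that lives on the real module.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._name)
            except ImportError as exc:
                raise RuntimeError(
                    "optional component %r is not available: %s"
                    % (self._name, exc))
        return getattr(self._module, attr)


# Nothing is imported here; the import happens on the dumps() call.
json_lazy = LazyModule("json")
print(json_lazy.dumps({"loaded": True}))
```

A script that never touches the proxy never pays for the import, and one that does on a stripped install gets a clear exception instead of a crash.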

Just my 2 cents.

LL
 
Neil Hodgson

Larry Hastings:
First and foremost: turning off Py_USING_UNICODE *breaks the build*
on Windows. The following breakages were all fixed with judicious
applications of #ifdef Py_USING_UNICODE:
* The implementation of "multi-byte codecs" (CJK codecs) implicitly
  assumes that it can use all the Unicode facilities. So all the
  files in "Modules/cjkcodecs" fail to build.
* Obviously, the Unicode string object depends on Unicode support,
  so Objects/unicode* doesn't build.
* There are several spots in the code that need to handle Unicode
  strings in some slightly special way, and assume Unicode is turned
  on. E.g.:
  * Modules/posixmodule.c, posix__getfullpathname(), line 1745
  * same file, posix_open(), starting on line 5201
  * Objects/fileobject.c, open_the_file(), starting on line 158
  * _winreg.c, Py2Reg(), starting on lines 724 and 777

I'm probably responsible for some of the breakage when adding
Unicode file name support to Python. Windows is a Unicode-based
operating system and I expect Unicode calls will eventually infest the
code base to a greater extent than currently. Requiring each
modification that adds a Unicode feature to be safe with
Py_USING_UNICODE turned off will add to the implementation effort for
that feature. I'd prefer to drop support for turning off
Py_USING_UNICODE in Windows-specific code. Well, since it is currently
broken, document that it isn't supported. Other platforms may need to
continue allowing non-Py_USING_UNICODE builds.

Neil
 
Neil Hodgson

Giovanni Bajo:
- There will be some opposition to the obvious policy of "keeping the bare
minimum inside the DLL" because of inefficiencies in the Python build
system.

It is also non-optimal for those who do want the full set of
modules, as separate files can add overhead through block sizing (both
on disk and in memory, executables pad out each section to some block
size), by requiring more load-time inter-module fixups, and by not
allowing the linker to perform some optimizations. It'd be worthwhile
seeing if the DLL would speed up or shrink if whole program optimization
was turned on.

Neil
 
Paul McGuire

Larry Hastings said:
Second of all, the dumb-as-a-bag-of-rocks Windows linker (at least
the one used by VC++ under MSVS .Net 2003) *links in unused static
symbols*. If I want to excise the code for a module, it is not
sufficient to comment-out the relevant _inittab line in config.c.
Nor does it help if I comment out the "extern" prototype for the
init function. As far as I can tell, the only way to *really* get
rid of a module, including all its static functions and static data,
is to actually *remove all the code* (with comments, or #if, or
whatnot). What a nosebleed, huh?

This may not be a linker issue. There is a C++ switch /Gy that "enables
function-level linking". That is, without this option enabled, if any
function in a module needs to be linked, the linker goes ahead and links the
whole module. I guess this is supposed to be some kind of linker
optimization. The problem is that the rest of the module may introduce
additional link dependencies, thus aggravating the problem. Perhaps
changing the C compiler to use function-level linking could address this
problem.

-- Paul
 
Larry Hastings

There are exactly four non-Unicode build breakages in the Python source
tree that are Win32-specific. Two are simply a matter of #if; the other
two also require new alternative code (calls to
PyString_FromStringAndSize()). All told, my changes to Win32-specific
code to fix the build without Py_USING_UNICODE consist of exactly twelve
new lines of code.

As for future development of Windows-specific Python features...
doesn't that generally happen in modules, rather than the Python
interpreter, these days? Either in Mark Hammond's pywin32 (what used
to be called "win32all"), or perhaps done in Python using ctypes.
There haven't been any changes to the three Windows-specific modules
(msvcrt, winreg, and winsound) mentioned in any "What's New in Python
2.x" document, and 2.0 came out more than five years ago.


/larry/
 
Giovanni Bajo

Neil said:
It is also non-optimal for those who do want the full set of
modules, as separate files can add overhead through block sizing (both
on disk and in memory, executables pad out each section to some block
size), by requiring more load-time inter-module fixups

I would be surprised if this showed up in any profile. Importing modules can
already be slow regardless of external stats (see programs like "mercurial"
that, to win benchmarks against C-compiled counterparts, do lazy imports). As
for the overhead at block boundaries, you should be more worried about 800K
of CJK codecs being loaded into your virtual memory (and not fully swapped
out because of block sizing), which are totally useless for most applications.

Anyway, we're picking nits here, but you have a point in being worried. If I
ever write a PEP, I will produce numbers to show beyond any doubt that there is
no performance difference.

Neil said:
... and by not
allowing the linker to perform some optimizations. It'd be worthwhile
seeing if the DLL would speed up or shrink if whole program
optimization was turned on.

There is no way whole program optimization can produce any advantage as the
modules are totally separated and they don't have direct calls that the
compiler can exploit.
 
Neil Hodgson

Larry Hastings:
As for future development of Windows-specific Python features...
doesn't that generally happen in modules, rather than the Python
interpreter, these days? Either in Mark Hammond's pywin32 (what used
to be called "win32all"), or perhaps done in Python using ctypes.
There haven't been any changes to the three Windows-specific modules
(msvcrt, winreg, and winsound) mentioned in any "What's New in Python
2.x" document, and 2.0 came out more than five years ago.

It is in the built-in modules providing OS features that there
should be more use of Unicode. Unicode system calls are more accurate
and have fewer limitations than ANSI system calls. Examples are allowing
Unicode in sys.argv and os.environ or for file paths where the ANSI
versions are limited to less than 260 characters.
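
One way to see why the ANSI calls are less accurate is to check which names survive a single-byte code page at all. The sketch below is an illustration only: cp1252 is an assumed example code page, fits_in_ansi is a hypothetical helper, and no Windows API is actually called.

```python
# Sketch: a single-byte "ANSI" code page cannot represent every name.
# Assumption: cp1252 stands in for whatever code page the system uses.
ANSI_CODEPAGE = "cp1252"


def fits_in_ansi(name, codepage=ANSI_CODEPAGE):
    """Return True if `name` survives a round trip through `codepage`."""
    try:
        return name.encode(codepage).decode(codepage) == name
    except UnicodeEncodeError:
        return False


western = u"r\u00e9sum\u00e9.txt"      # accented Latin letters
japanese = u"\u65e5\u672c\u8a9e.txt"   # CJK characters

print(fits_in_ansi(western))
print(fits_in_ansi(japanese))
```

A name that fails this check cannot be passed faithfully through an ANSI system call under that code page, whereas the Unicode variants of the calls take it as-is.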

Are you willing to monitor and fix new Py_USING_UNICODE issues or
are you proposing just to produce a patch now and then expect
contributors to maintain this feature?

Neil
 
Larry Hastings

Neil said:
Are you willing to monitor and fix new Py_USING_UNICODE issues or
are you proposing just to produce a patch now and then expect
contributors to maintain this feature?

Neither, I suppose, or perhaps both. I am proposing to produce a patch
now which fixes the non-Unicode build under Windows. However, I don't
expect anything out of other contributors, and I don't set Python
contribution policy. (Obviously the stewards of the Python tree don't
care whether contributions break the non-Unicode build. But that's a
fine policy; after all, they've already got enough to do, and in any
case I'm the first person to even notice.) If this patch is accepted,
and some future contribution breaks the non-Unicode build again, and I
discover the breakage, I might very well create a second patch to
re-fix it.

Since I'm seemingly the only person who cares about non-Unicode builds
on Windows, I suggest this approach would work just fine.


/larry/
 
