Shrinky-dink Python (also, non-Unicode Python build is broken)

Discussion in 'Python' started by Larry Hastings, Jan 16, 2006.

  1. I'm an indie shareware Windows game developer. In indie shareware
    game development, download size is terribly important; conventional
    wisdom holds that--even today--your download should be 5MB or less.

    I'd like to use Python in my games. However, python24.dll is 1.86MB,
    and zips down to 877k. I can't afford to devote 1/6 of my download
    to just the scripting interpreter; I've got music, and textures, and
    my own crappy code to ship.

    Following a friend's suggestion, as an experiment I downloaded the
    Python 2.4.2 source, then set about stripping out everything I could.
    I removed:
    * Unicode support, including the CJK codecs
    * All doc strings
    * *Every* module written in C
    Now when I build, python24.dll is 570k, and zips down to about 260k.
    But I learned some things on the way.
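
    As a rough check of those numbers, here is a minimal sketch in modern
    Python 3 (not the 2.4 under discussion). Reading the "5MB" budget as
    5 * 1024 KB is an assumption; the zipped sizes are the ones quoted above.

```python
# Zipped sizes quoted in the post, in KB.
STOCK_ZIPPED_KB = 877      # stock python24.dll
STRIPPED_ZIPPED_KB = 260   # stripped-down build

# Assumption: the "5MB or less" rule of thumb is read as 5 * 1024 KB.
BUDGET_KB = 5 * 1024


def share_of_budget(zipped_kb, budget_kb=BUDGET_KB):
    """Return the fraction of the download budget used by a zipped file."""
    return zipped_kb / budget_kb


stock = share_of_budget(STOCK_ZIPPED_KB)
stripped = share_of_budget(STRIPPED_ZIPPED_KB)

print("stock:    %.1f%% of budget" % (stock * 100))
print("stripped: %.1f%% of budget" % (stripped * 100))
```

    Under that assumption the stock DLL takes roughly 17% of the budget
    (about the "1/6" mentioned above) and the stripped build roughly 5%.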


    First and foremost: turning off Py_USING_UNICODE *breaks the build*
    on Windows. The following list of breakages were all fixed with
    judicious applications of #ifdef Py_USING_UNICODE:
    * The implementation of "multi-byte codecs" (CJK codecs) implicitly
      assumes that they can use all the Unicode facilities. So all the
      files in "Modules/cjkcodecs" fail to build.
    * Obviously, the Unicode string object depends on Unicode support,
      so Objects/unicode* doesn't build.
    * There are several spots in the code that need to handle Unicode
      strings in some slightly special way, and assume Unicode is turned
      on. E.g.:
        * Modules/posixmodule.c, posix__getfullpathname(), line 1745
        * same file, posix_open(), starting on line 5201
        * Objects/fileobject.c, open_the_file(), starting on line 158
        * _winreg.c, Py2Reg(), starting on lines 724 and 777

    In addition, there was one slightly more complicated problem: _winreg.c
    assumes it should call PyUnicode_DecodeMBCS() to turn strings pulled
    from the registry into Unicode strings. I'm not sure what the correct
    thing to do here is; I went with changing the calls from
    PyUnicode_DecodeMBCS() to PyString_FromStringAndSize() for non-Unicode
    builds.

    Of course, it's not the most important thing in the world--after all,
    I'm the first person to even *notice*, right? But it seems a shame
    that one can break the build so easily. If it pleases the stewards of
    Python, I would be happy to submit patches that fix the non-"using
    Unicode" build.


    Second of all, the dumb-as-a-bag-of-rocks Windows linker (at least
    the one used by VC++ under MSVS .Net 2003) *links in unused static
    symbols*. If I want to excise the code for a module, it is not
    sufficient to comment-out the relevant _inittab line in config.c.
    Nor does it help if I comment out the "extern" prototype for the
    init function. As far as I can tell, the only way to *really* get
    rid of a module, including all its static functions and static data,
    is to actually *remove all the code* (with comments, or #if, or
    whatnot). What a nosebleed, huh?

    So in order to build my *really* minimal python24.dll, I have to hack
    up the source something fierce. It would be pleasant if the Python
    source code provided an easy facility for turning off modules at
    compile-time. I would be happy to propose something / write a PEP
    / submit patches to do such a thing, if there is a chance that such
    a thing could make it into the official Python source. However, I
    realize that this has terribly limited appeal; that, and the fact
    that Python releases are infrequent, makes me think it's not a
    terrible hardship if I had to re-hack up each new Python release
    by hand.


    Whatcha think, froods?


    /larry/
    Larry Hastings, Jan 16, 2006
    #1

  2. Larry Hastings wrote:
    > Of course, it's not the most important thing in the world--after all,
    > I'm the first person to even *notice*, right? But it seems a shame
    > that one can break the build so easily. If it pleases the stewards of
    > Python, I would be happy to submit patches that fix the non-"using
    > Unicode" build.


    There was a recent python-dev thread_ suggesting that we drop support
    for --disable-unicode, mainly I think because no one was willing to
    maintain it. If you're willing to offer patches and some maintenance,
    it probably has a decent chance of acceptance.

    .. _thread:
    http://mail.python.org/pipermail/python-dev/2005-October/056897.html

    > So in order to build my *really* minimal python24.dll, I have to hack
    > up the source something fierce. It would be pleasant if the Python
    > source code provided an easy facility for turning off modules at
    > compile-time. I would be happy to propose something / write a PEP
    > / submit patches to do such a thing, if there is a chance that such
    > a thing could make it into the official Python source. However, I
    > realize that this has terribly limited appeal; that, and the fact
    > that Python releases are infrequent, makes me think it's not a
    > terrible hardship if I had to re-hack up each new Python release
    > by hand.


    My impression is that, for most things like this, python-dev is happy to
    accept the patches *if* someone is willing to commit to maintaining
    them, and they don't make the codebase too much more complex.

    STeVe
    Steven Bethard, Jan 16, 2006
    #2

  3. Larry Hastings wrote:

    > First and foremost: turning off Py_USING_UNICODE *breaks the build*
    > on Windows.


    Probably nobody does that nowadays. My own feeling (but I don't have numbers
    for backing it up) is that the biggest size in the .DLL is represented by
    things like the CJK codecs (which are about 800k). I don't think you're
    gaining that much by trying to remove Unicode support at all, especially
    since (as you noticed) it's going to be a maintenance headache.

    > Second of all, the dumb-as-a-bag-of-rocks Windows linker (at least
    > the one used by VC++ under MSVS .Net 2003) *links in unused static
    > symbols*. If I want to excise the code for a module, it is not
    > sufficient to comment-out the relevant _inittab line in config.c.
    > Nor does it help if I comment out the "extern" prototype for the
    > init function. As far as I can tell, the only way to *really* get
    > rid of a module, including all its static functions and static data,
    > is to actually *remove all the code* (with comments, or #if, or
    > whatnot). What a nosebleed, huh?


    This is off-topic here, but the MSVC linker *can* strip unused symbols, of
    course. Look into /OPT:REF.

    > So in order to build my *really* minimal python24.dll, I have to hack
    > up the source something fierce. It would be pleasant if the Python
    > source code provided an easy facility for turning off modules at
    > compile-time. I would be happy to propose something / write a PEP
    > / submit patches to do such a thing, if there is a chance that such
    > a thing could make it into the official Python source. However, I
    > realize that this has terribly limited appeal; that, and the fact
    > that Python releases are infrequent, makes me think it's not a
    > terrible hardship if I had to re-hack up each new Python release
    > by hand.


    You're not the only one complaining about the size of Python .DLL: also
    people developing self-contained programs with tools like PyInstaller or
    py2exe (that is, programs which are supposed to run without Python
    installed) are affected by the lack of a clear policy.

    I myself complained before, especially after Python 2.4 got those ginormous
    CJK codecs within its standard DLL, you can look for the thread in Google.
    The bottom line of that discussion was:

    - The policy about what must be linked within python .dll and what must be
    kept outside should be proposed as a PEP, and it should provide guidelines
    to be applied also for future modules.
    - There will be some opposition to the obvious policy of "keeping the bare
    minimum inside the DLL" because of inefficiencies in the Python build
    system. Specifically, I was told that maintaining modules outside the DLL
    instead of inside the DLL is more burdensome for some reason (which I have
    not investigated), but surely, with a good build system, switching either
    configuration setting should be the matter of changing a single word in a
    single place, with no code changes required.

    Personally, I could find some time to write up a PEP, but surely not to pick
    up a lengthy discussion nor to improve the build system myself. Hence, I
    mostly decided to give up for now and stick with recompiling Python myself.
    The policy I'd propose is that the DLL should contain the minimum set of
    modules needed to run the following Python program:

    -------------------
    print "hello world"
    -------------------

    There's probably some specific exception I'm not aware of, but you get the
    big picture.
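
    One rough way to see which modules such a policy would keep, sketched in
    modern Python 3 rather than 2.4: sys.builtin_module_names lists the
    modules compiled into the interpreter, and sys.modules shows what has
    actually been imported so far. This only approximates the idea; it
    assumes CPython, and the counts vary by build and by how the script is
    launched.

```python
import sys

# Modules compiled directly into the interpreter binary.
builtin = set(sys.builtin_module_names)

# Modules that have actually been imported by the time this line runs.
loaded = set(sys.modules)

# Built-in modules already in use versus those that are only carried along.
needed = sorted(builtin & loaded)
unused = sorted(builtin - loaded)

print("built in and already loaded:", len(needed))
print("built in but not yet loaded:", len(unused))
for name in unused:
    print("  candidate to move out of the DLL:", name)
```

    Anything in the second group is linked into the interpreter without
    being required just to start up and print a line.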
    --
    Giovanni Bajo
    Giovanni Bajo, Jan 16, 2006
    #3
  4. I myself wonder why python.dll can't just load a companion i18n.dll
    when and if it's called for in the script, such as by having weak
    references to those functions and loading the DLL as needed, and
    probably throwing an exception if it can't be loaded. Most of the CJK
    stuff could then be carried in that DLL and in some cases, such as
    py2exe, not even be included because it's not used.
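
    The same load-on-first-use idea can be sketched at the Python level (this
    is an analogy in modern Python 3, not the C-level weak-reference or
    delay-load mechanism suggested above). The LazyModule class and the
    "i18n_companion" name are hypothetical; "json" merely stands in for an
    optional component that happens to be installed.

```python
import importlib


class LazyModule:
    """Defer importing `name` until one of its attributes is first used."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Called only for attributes not found on the wrapper itself,
        # i.e. for the contents of the wrapped module.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._name)
            except ImportError as exc:
                raise RuntimeError(
                    "optional component %r is not installed" % self._name
                ) from exc
        return getattr(self._module, attr)


# Stand-in for an optional component that is present.
present = LazyModule("json")
print(present.dumps({"ok": True}))

# Hypothetical component name, assumed not to be installed.
missing = LazyModule("i18n_companion")
try:
    missing.decode
except RuntimeError as exc:
    print(exc)
```

    Nothing is loaded until the first attribute access, and a missing
    component turns into an exception at that point rather than at startup.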

    Just my 2 cents.

    LL
    LordLaraby, Jan 17, 2006
    #4
  5. Larry Hastings:

    > First and foremost: turning off Py_USING_UNICODE *breaks the build*
    > on Windows. The following list of breakages were all fixed with
    > judicious applications of #ifdef Py_USING_UNICODE:
    > * The implementation of "multi-byte codecs" (CJK codecs) implicitly
    >   assumes that they can use all the Unicode facilities. So all the
    >   files in "Modules/cjkcodecs" fail to build.
    > * Obviously, the Unicode string object depends on Unicode support,
    >   so Objects/unicode* doesn't build.
    > * There are several spots in the code that need to handle Unicode
    >   strings in some slightly special way, and assume Unicode is turned
    >   on. E.g.:
    >     * Modules/posixmodule.c, posix__getfullpathname(), line 1745
    >     * same file, posix_open(), starting on line 5201
    >     * Objects/fileobject.c, open_the_file(), starting on line 158
    >     * _winreg.c, Py2Reg(), starting on lines 724 and 777


    I'm probably responsible for some of the breakage when adding
    Unicode file name support to Python. Windows is a Unicode based
    operating system and I expect Unicode calls will eventually infest the
    code base to a greater extent than currently. Requiring each
    modification that adds a Unicode feature to be safe with
    Py_USING_UNICODE turned off will add to the implementation effort for
    that feature. I'd prefer to drop support for turning off
    Py_USING_UNICODE in Windows specific code. Well, since it is currently
    broken, document that it isn't supported. Other platforms may need to
    continue allowing non Py_USING_UNICODE builds.

    Neil
    Neil Hodgson, Jan 17, 2006
    #5
  6. Giovanni Bajo:

    > - There will be some opposition to the obvious policy of "keeping the bare
    > minimum inside the DLL" because of inefficiencies in the Python build
    > system.


    It is also non-optimal for those who do want the full set of
    modules, as separate files can add overhead through block sizing (both
    on disk and in memory, executables pad out each section to some block
    size), by requiring more load-time inter-module fixups, and by not
    allowing the linker to perform some optimizations. It'd be worthwhile
    seeing if the DLL would speed up or shrink if whole program optimization
    was turned on.
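
    The block-sizing overhead can be illustrated with a little arithmetic,
    sketched here in modern Python 3. The section sizes and the 4096-byte
    alignment are made-up assumptions, and treating the combined DLL as a
    single section is a simplification.

```python
def padded(size, block):
    """Round `size` up to the next multiple of `block`."""
    return -(-size // block) * block


# Assumed alignment and hypothetical section sizes, in bytes, for three
# extension modules; real values depend on the linker settings.
BLOCK = 4096
sections = [10000, 300, 5000]

# Each module shipped as its own file is padded on its own.
separate = sum(padded(size, BLOCK) for size in sections)

# The same code merged into one section is padded only once.
combined = padded(sum(sections), BLOCK)

print("separate files:", separate, "bytes")
print("single DLL:    ", combined, "bytes")
print("padding overhead of splitting:", separate - combined, "bytes")
```

    With these made-up sizes, splitting costs two extra blocks; whether that
    matters in practice is exactly the kind of thing that needs measuring.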

    Neil
    Neil Hodgson, Jan 17, 2006
    #6
  7. "Larry Hastings" wrote:
    > Second of all, the dumb-as-a-bag-of-rocks Windows linker (at least
    > the one used by VC++ under MSVS .Net 2003) *links in unused static
    > symbols*. If I want to excise the code for a module, it is not
    > sufficient to comment-out the relevant _inittab line in config.c.
    > Nor does it help if I comment out the "extern" prototype for the
    > init function. As far as I can tell, the only way to *really* get
    > rid of a module, including all its static functions and static data,
    > is to actually *remove all the code* (with comments, or #if, or
    > whatnot). What a nosebleed, huh?
    >


    This may not be a linker issue. There is a C++ switch /Gy that "enables
    function-level linking". That is, without this option enabled, if any
    function in a module needs to be linked, the linker goes ahead and links the
    whole module. I guess this is supposed to be some kind of linker
    optimization. The problem is that the rest of the module may introduce
    additional link dependencies, thus aggravating the problem. Perhaps
    changing the C compiler to use function-level linking could address this
    problem.
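
    The difference can be shown with a toy model, sketched in modern
    Python 3. This is not how the MSVC linker is implemented; the symbol and
    object-file names are hypothetical, and the model only contrasts keeping
    whole object files with keeping individual functions.

```python
# Hypothetical symbols: the object file defining each one, and what it calls.
defined_in = {
    "init_a": "a.obj",
    "helper_a": "a.obj",
    "unused_a": "a.obj",
    "helper_b": "b.obj",
}
refs = {
    "init_a": ["helper_a"],
    "unused_a": ["helper_b"],   # dead code that still references b.obj
}


def link(roots, function_level):
    """Return the set of symbols kept when linking from `roots`."""
    kept = set()
    stack = list(roots)
    while stack:
        sym = stack.pop()
        if sym in kept:
            continue
        if function_level:
            # Each function is its own unit: keep only what is referenced.
            group = [sym]
        else:
            # Whole-object granularity: one needed symbol keeps its entire
            # object file, including references made by unused functions.
            obj = defined_in[sym]
            group = [s for s, where in defined_in.items() if where == obj]
        for member in group:
            if member not in kept:
                kept.add(member)
                stack.extend(refs.get(member, ()))
    return kept


print("whole objects: ", sorted(link(["init_a"], function_level=False)))
print("function level:", sorted(link(["init_a"], function_level=True)))
```

    In the whole-object case the unused function drags in a second object
    file, which is the aggravating effect described above.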

    -- Paul
    Paul McGuire, Jan 17, 2006
    #7
  8. There are exactly four non-Unicode build breakages in the Python source
    tree that are Win32-specific. Two are simply a matter of #if, two also
    require new alternative code (calls to PyString_FromStringAndSize()).
    All told, my changes to Win32-specific code to fix Py_USING_UNICODE
    consist of exactly twelve new lines of code.

    As for future development of Windows-specific Python features...
    doesn't that generally happen in modules, rather than the Python
    interpreter, these days? Either in Mark Hammond's pywin32 (what used
    to be called "win32all"), or perhaps done in Python using ctypes.
    There haven't been any changes to the three Windows-specific modules
    (msvcrt, winreg, and winsound) mentioned in any "What's New in Python
    2.x" document, and 2.0 came out more than five years ago.


    /larry/
    Larry Hastings, Jan 17, 2006
    #8
  9. Neil Hodgson wrote:

    >> - There will be some opposition to the obvious policy of "keeping
    >> the bare minimum inside the DLL" because of inefficiencies in the
    >> Python build system.

    >
    > It is also non-optimal for those that do want the full set of
    > modules as separate files can add overhead for block sizing (both on
    > disk and in memory, executables pad out each section to some block
    > size), by requiring more load-time inter-module fixups


    I would be surprised if this showed up in any profile. Importing modules can
    already be slow no matter external stats (see programs like "mercurial" that,
    to win benchmarks with C-compiled counterparts, do lazy imports). As for the
    overhead at the border of blocks, you should be more worried about 800K of CJK
    codecs being loaded in your virtual memory (and not fully swapped out because
    of block sizing) which are totally useless for most applications.

    Anyway, we're picking nits here, but you have a point in being worried. If I
    ever write a PEP, I will produce numbers to show beyond any doubt that there is
    no performance difference.

    > , and by not
    > allowing the linker to perform some optimizations. It'd be worthwhile
    > seeing if the DLL would speed up or shrink if whole program
    > optimization was turned on.


    There is no way whole program optimization can produce any advantage as the
    modules are totally separated and they don't have direct calls that the
    compiler can exploit.
    --
    Giovanni Bajo
    Giovanni Bajo, Jan 17, 2006
    #9
  10. Larry Hastings:

    > As for future development of Windows-specific Python features...
    > doesn't that generally happen in modules, rather than the Python
    > interpreter, these days? Either in Mark Hammond's pywin32 (what used
    > to be called "win32all"), or perhaps done in Python using ctypes.
    > There haven't been any changes to the three Windows-specific modules
    > (msvcrt, winreg, and winsound) mentioned in any "What's New in Python
    > 2.x" document, and 2.0 came out more than five years ago.


    It is in the built-in modules providing OS features that there
    should be more use of Unicode. Unicode system calls are more accurate
    and have fewer limitations than ANSI system calls. Examples are allowing
    Unicode in sys.argv and os.environ or for file paths where the ANSI
    versions are limited to less than 260 characters.

    Are you willing to monitor and fix new Py_USING_UNICODE issues or
    are you proposing just to produce a patch now and then expect
    contributors to maintain this feature?

    Neil
    Neil Hodgson, Jan 17, 2006
    #10

  11. > Are you willing to monitor and fix new Py_USING_UNICODE issues or
    > are you proposing just to produce a patch now and then expect
    > contributors to maintain this feature?


    Neither, I suppose, or perhaps both. I am proposing to produce a patch
    now which fixes the non-Unicode build under Windows. However, I don't
    expect anything out of other contributors, and I don't set Python
    contribution policy. (Obviously the stewards of the Python tree don't
    care whether contributions break the non-Unicode build. But that's a
    fine policy; after all, they've already got enough to do, and in any
    case I'm the first person to even notice.) If this patch is accepted,
    and some future contribution breaks the non-Unicode build again, and I
    discover the breakage, I might very well create a second patch to
    re-fix it.

    Since I'm seemingly the only person who cares about non-Unicode builds
    on Windows, I suggest this approach would work just fine.


    /larry/
    Larry Hastings, Jan 17, 2006
    #11