ZipFile - file adding API incomplete?

Discussion in 'Python' started by Glenn Maynard, Nov 17, 2009.

  1. I want to do something fairly simple: read files from one ZIP and add
    them to another, so I can remove and replace files. This led me to a
    couple things that seem to be missing from the API.

    The simple approach would be to open each file in the source ZIP, and
    hand it off to newzip.write(). There's a missing piece, though:
    there's no API that lets me pass in a file-like object and a ZipInfo,
    to preserve metadata. zip.write() only takes the filename and
    compression method, not a ZipInfo; writestr takes a ZipInfo but only
    accepts a string, not a file. Is there an API call I'm missing?
    (This seems like the fundamental API for adding files, that write and
    writestr should be calling.)
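
For readers on current Python: a call of this kind was added later. Since Python 3.6, ZipFile.open() accepts mode="w" together with a ZipInfo and returns a writable file-like object. Below is a minimal sketch of the streaming copy described above, assuming Python 3.6 or newer; copy_members and the member names are made up for illustration.

```python
import copy
import io
import shutil
import zipfile


def copy_members(src_zip, dst_zip, skip=()):
    """Stream members from src_zip into dst_zip, preserving their ZipInfo."""
    for info in src_zip.infolist():
        if info.filename in skip:
            continue
        # Copy the ZipInfo so dst_zip can update offsets and sizes without
        # disturbing the directory that src_zip holds in memory.
        new_info = copy.copy(info)
        with src_zip.open(info) as reader, dst_zip.open(new_info, "w") as writer:
            shutil.copyfileobj(reader, writer)


# Build a small source archive in memory, then copy all but one member.
src_buf, dst_buf = io.BytesIO(), io.BytesIO()
with zipfile.ZipFile(src_buf, "w") as z:
    keep = zipfile.ZipInfo("keep.txt", date_time=(2009, 11, 17, 9, 28, 0))
    keep.compress_type = zipfile.ZIP_DEFLATED
    z.writestr(keep, b"keep me " * 100)
    z.writestr("drop.txt", b"remove me")

with zipfile.ZipFile(src_buf) as src, zipfile.ZipFile(dst_buf, "w") as dst:
    copy_members(src, dst, skip={"drop.txt"})
```

The data is still decompressed and recompressed on the way through; only the metadata is carried over unchanged.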

    The correct approach is to copy the data directly, so it's not
    recompressed. This would need two new API calls: rawopen(), acting
    like open() but returning a direct file slice and not decompressing
    data; and rawwrite(zinfo, file), to pass in pre-compressed data, where
    the compression method in zinfo matches the compression type used.

    I was surprised that I couldn't find the former. The latter is an
    advanced one, important for implementing any tool that modifies large
    ZIPs. Short-term, at least, I'll probably implement these externally.

    --
    Glenn Maynard
    Glenn Maynard, Nov 17, 2009
    #1

  2. Glenn Maynard wrote:
    > I want to do something fairly simple: read files from one ZIP and add
    > them to another, so I can remove and replace files. This led me to a
    > couple things that seem to be missing from the API.
    >
    > The simple approach would be to open each file in the source ZIP, and
    > hand it off to newzip.write(). There's a missing piece, though:
    > there's no API that lets me pass in a file-like object and a ZipInfo,
    > to preserve metadata. zip.write() only takes the filename and
    > compression method, not a ZipInfo; writestr takes a ZipInfo but only
    > accepts a string, not a file. Is there an API call I'm missing?
    > (This seems like the fundamental API for adding files, that write and
    > writestr should be calling.)
    >
    > The correct approach is to copy the data directly, so it's not
    > recompressed. This would need two new API calls: rawopen(), acting
    > like open() but returning a direct file slice and not decompressing
    > data; and rawwrite(zinfo, file), to pass in pre-compressed data, where
    > the compression method in zinfo matches the compression type used.
    >
    > I was surprised that I couldn't find the former. The latter is an
    > advanced one, important for implementing any tool that modifies large
    > ZIPs. Short-term, at least, I'll probably implement these externally.


    No idea why write() doesn't accept an open file. OTOH, since passing a
    string is just

    writestr(info, in_file.read())

    I don't think that's *that* much of an inconvenience.
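
Spelled out for the remove-a-file case, that suggestion might look like the sketch below. The archive contents and names are invented for the example, and each member is held in memory in full while it is copied.

```python
import io
import zipfile

src_buf, dst_buf = io.BytesIO(), io.BytesIO()
with zipfile.ZipFile(src_buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("a.txt", b"alpha")
    z.writestr("b.txt", b"beta")

with zipfile.ZipFile(src_buf) as src, zipfile.ZipFile(dst_buf, "w") as dst:
    for info in src.infolist():
        if info.filename == "b.txt":  # the member being removed
            continue
        # read() decompresses the whole member into memory; writestr()
        # recompresses it using the compression type stored in info.
        dst.writestr(info, src.read(info))
```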

    And regarding your second idea: can that really work? Intuitively, I
    would have thought that compression is adaptive, and based on prior
    additions to the file. I might be wrong with this though.

    Diez
    Diez B. Roggisch, Nov 17, 2009
    #2

  3. Diez B. Roggisch wrote:
    > Glenn Maynard wrote:
    >> I want to do something fairly simple: read files from one ZIP and add
    >> them to another, so I can remove and replace files. This led me to a
    >> couple things that seem to be missing from the API.
    >>
    >> <snip>
    >>
    >> The correct approach is to copy the data directly, so it's not
    >> recompressed. This would need two new API calls: rawopen(), acting
    >> like open() but returning a direct file slice and not decompressing
    >> data; and rawwrite(zinfo, file), to pass in pre-compressed data, where
    >> the compression method in zinfo matches the compression type used.
    >>
    >> I was surprised that I couldn't find the former. The latter is an
    >> advanced one, important for implementing any tool that modifies large
    >> ZIPs. Short-term, at least, I'll probably implement these externally.

    >
    > <snip>
    >
    > And regarding your second idea: can that really work? Intuitively, I
    > would have thought that compression is adaptive, and based on prior
    > additions to the file. I might be wrong with this though.
    >
    >

    I'm pretty sure that the ZIP format uses independent compression for
    each contained file (member). You can add and remove members from an
    existing ZIP, and use several different compression methods within the
    same file. So the adaptive tables start over for each new member.
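
A quick way to see this with the zipfile module is to store the same bytes under different methods in one archive. This is only an illustrative sketch with made-up member names; the two deflated members come out the same size because the second one gains nothing from the first.

```python
import io
import zipfile

buf = io.BytesIO()
data = b"the same bytes, repeated. " * 200
with zipfile.ZipFile(buf, "w") as z:
    # Each member carries its own compression method.
    z.writestr("stored.bin", data, compress_type=zipfile.ZIP_STORED)
    z.writestr("deflated1.bin", data, compress_type=zipfile.ZIP_DEFLATED)
    z.writestr("deflated2.bin", data, compress_type=zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(buf) as z:
    sizes = {i.filename: (i.compress_type, i.compress_size) for i in z.infolist()}

for name, (method, size) in sizes.items():
    print(name, method, size)
```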

    What isn't so convenient is that the sizes are apparently at the end.
    So if you're trying to unzip "over the wire" you can't readily do it
    without somehow seeking to the end. That same feature is a good thing
    when it comes to spanning zip files across multiple disks.

    The zip file format is documented on the net, but I haven't read the
    spec in at least 15 years.

    DaveA
    Dave Angel, Nov 17, 2009
    #3
  4. On Tue, Nov 17, 2009 at 9:28 AM, Dave Angel wrote:
    > I'm pretty sure that the ZIP format uses independent compression for each
    > contained file (member).  You can add and remove members from an existing
    > ZIP, and use several different compression methods within the same file.  So
    > the adaptive tables start over for each new member.


    This is correct. It doesn't do solid compression, which is what you
    get with .tar.gz (and RARs, optionally).

    > What isn't so convenient is that the sizes are apparently at the end.  So if
    > you're trying to unzip "over the wire" you can't readily do it without
    > somehow seeking to the end.  That same feature is a good thing when it comes
    > to spanning zip files across multiple disks.


    Actually, there are two copies of the headers: one immediately before
    the file data (the local file header), and one at the end (the central
    directory); both contain copies of the compressed and uncompressed
    file size. Very few programs actually use the local file headers, but
    it's very nice to have the option. It also helps make ZIPs very
    recoverable. If you've ever run a ZIP recovery tool, they're usually
    just reconstructing the central directory from the local file headers
    (and probably recomputing the CRCs).

    (This is no longer true if bit 3 of the bitflags is set, which puts
    the CRC and filesizes after the data. In that case, it's not possible
    to stream data--largely defeating the benefit of the local headers.)
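
To make the two copies concrete, here is a rough sketch that parses the fixed 30-byte local file header by hand and compares it with what zipfile read from the central directory. read_local_header is a hypothetical helper, not part of zipfile; the field layout follows the ZIP specification.

```python
import io
import struct
import zipfile

LOCAL_HEADER = struct.Struct("<4s5H3L2H")  # 30 fixed bytes, little-endian


def read_local_header(fileobj, info):
    """Parse the local file header that precedes a member's data."""
    fileobj.seek(info.header_offset)
    (sig, version, flags, method, mtime, mdate,
     crc, csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack(
        fileobj.read(LOCAL_HEADER.size))
    if sig != b"PK\x03\x04":
        raise ValueError("not a local file header")
    return {
        "method": method,
        "crc": crc,
        "compress_size": csize,
        "file_size": usize,
        # Bit 3 set means the CRC and sizes follow the data instead.
        "has_data_descriptor": bool(flags & 0x08),
    }


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("hello.txt", b"hello " * 50)

with zipfile.ZipFile(buf) as z:
    info = z.getinfo("hello.txt")  # values from the central directory
    local = read_local_header(buf, info)
```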

    > Define a call to read _portions_ of the raw (compressed, encrypted, whatever) data.


    I think the clean way is to return a file-like object for a specified file, e.g.:

    # Read raw bytes 1024-1152 from each file in the ZIP
    # (rawopen() is the proposed method; it does not exist in zipfile):
    zf = ZipFile("file.zip", "r")
    for info in zf.infolist():
        f = zf.rawopen(info)  # or a filename
        f.seek(1024)
        f.read(128)
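
Until something like rawopen() exists, the same effect can be approximated outside the module. This is a minimal sketch, assuming an unencrypted deflated member and a seekable file; read_raw is a hypothetical helper name.

```python
import io
import struct
import zipfile
import zlib


def read_raw(fileobj, info):
    """Return a member's bytes exactly as stored (hypothetical helper)."""
    fileobj.seek(info.header_offset)
    header = fileobj.read(30)
    if header[:4] != b"PK\x03\x04":
        raise ValueError("not a local file header")
    # The name and extra-field lengths are the last two header fields; they
    # can differ from the central directory, so take them from here.
    name_len, extra_len = struct.unpack("<2H", header[26:30])
    fileobj.seek(name_len + extra_len, io.SEEK_CUR)
    return fileobj.read(info.compress_size)


payload = b"raw access " * 100
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("data.txt", payload)

with zipfile.ZipFile(buf) as z:
    raw = read_raw(buf, z.getinfo("data.txt"))

# Deflated ZIP members are bare deflate streams with no zlib header,
# hence the negative wbits value.
restored = zlib.decompress(raw, -15)
```

Wrapping the result in io.BytesIO(raw) would give the seek()/read() behaviour used in the example above, at the cost of holding the whole member in memory.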

    > Define a call that locks the ZipFile object and returns a write handle for a single new file.


    I'd use a file-like object here, too, for probably obvious
    reasons--you can pass it to anything expecting a file object to write
    data to (e.g. shutil.copyfileobj).

    > Only on successful close of the "write handle" is the new directory written.


    Rather, when the new file is closed, its directory entry is saved to
    ZipFile.filelist. The new directory on disk should be written when
    the zip's own close() method is called, just as when writing files
    with the other methods. Otherwise, writing lots of files in this way
    would write and overwrite the central directory repeatedly.

    Any thoughts about this rough API outline:

    ZipFile.rawopen(zinfo_or_arcname)
        Same definition as open(), but returns the raw data. No mode (no
        newline translation for raw files); no pwd (raw files aren't
        decrypted).

    ZipFile.writefile(zinfo[, raw])
        Definition like ZipFile.writestr. Relax writestr()'s "at least the
        filename, date, and time must be given" rule: if not specified, use
        the current date and time. Returns a file-like object (ZipWriteFile)
        to which file data is written. If raw is True, no actual compression
        is performed, and the file data should already be compressed with the
        specified compression type (no checking is performed). If raw is
        False (the default), the data will be compressed before being written.
        When finished writing data, the file must be closed. Only one
        ZipWriteFile may be open for each ZipFile at a time. Calls to
        ZipFile.writefile while a ZipWriteFile is already open will result in
        ValueError[1].
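
For comparison, later Python versions (3.6 and up) grew something close to the non-raw half of this outline: ZipFile.open(zinfo, "w"). The sketch below, with invented file names, shows the single-writer rule and the point at which the entry is recorded; there is still no raw flag.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    info = zipfile.ZipInfo("log.txt")  # date_time defaults to 1980-01-01
    first = z.open(info, "w")
    first.write(b"line 1\n")
    try:
        z.open("other.txt", "w")  # second write handle while first is open
    except ValueError as exc:
        error = str(exc)
    else:
        error = None
    first.write(b"line 2\n")
    first.close()  # the entry is appended to z.filelist at this point
    names_before_close = [i.filename for i in z.filelist]

with zipfile.ZipFile(buf) as z:
    contents = z.read("log.txt")
```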

    Another detail: is the CRC recomputed when writing in raw mode? No.
    If I delete a file from a ZIP (causing me to rewrite the ZIP) and
    another file in the ZIP is corrupt, it should just move the file
    as-is, invalid CRC and all; it should not rewrite the file with a new
    CRC (masking the corruption) or throw an error (I should not get
    errors about file X being corrupt if I'm deleting file Y). When
    writing in raw mode, if zinfo.CRC is already specified (not None), it
    should be used as-is.
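
The failure mode being discussed can be reproduced in a few lines. This sketch (member names and payloads are made up) flips one byte of a stored member and lets zipfile report the CRC mismatch; a raw copy that kept zinfo.CRC as-is would preserve exactly this evidence.

```python
import io
import zipfile
import zlib

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as z:
    z.writestr("ok.txt", b"untouched contents")
    z.writestr("bad.txt", b"PAYLOAD-TO-CORRUPT")

original_crc = zlib.crc32(b"PAYLOAD-TO-CORRUPT") & 0xFFFFFFFF

# Stored members are written verbatim, so the payload can be found and
# damaged directly in the archive bytes.
damaged = bytearray(buf.getvalue())
damaged[damaged.find(b"PAYLOAD-TO-CORRUPT")] ^= 0xFF

with zipfile.ZipFile(io.BytesIO(bytes(damaged))) as z:
    first_bad = z.testzip()  # name of the first member failing its CRC check
    recorded_crc = z.getinfo("bad.txt").CRC  # still the original value
    intact = z.read("ok.txt")
```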

    I don't like how this results in three different APIs for adding data
    (write, writestr, writefile), but trying to squeeze the APIs together
    feels unnatural--the parameters don't really line up too well. I'd
    expect the other two to become thin wrappers around
    ZipFile.writefile(). This never opens files directly like
    ZipFile.write, so it only takes a zinfo and not a filename (set the
    filename through the ZipInfo).

    Now you can stream data into a ZIP, specify all metadata for the file,
    and you can stream in compressed data from another ZIP (for deleting
    files and other cases) without recompressing. This also means you can
    do all of these things to encrypted files without the password, and to
    files compressed with unknown methods, which is currently impossible.

    > and I realize that the big flaw in this design is that from the moment
    > you start overwriting the existing master directory until you write
    > a new master at the end, you do not have a valid zip file.

    The same is true when appending to a ZIP with ZipFile.write(); until
    it finishes, the file on disk isn't a valid ZIP. That's unavoidable.
    Files in the ZIP can still be opened by the existing ZipFile object,
    since it keeps the central directory in memory.

    For what it's worth, I've written ZIP parsing code several times over
    the years (https://svn.stepmania.com/svn/trunk/stepmania/src/RageFileDriverZip.cpp),
    so I'm familiar with the more widely-used parts of the file format,
    but I haven't dealt with ZIP writing very much. I'm not sure if I'll
    have time to get to this soon, but I'll keep thinking about it.

    [1] This seems odd, but it mimics the behavior described at
    http://docs.python.org/library/stdtypes.html#file.close

    --
    Glenn Maynard
    Glenn Maynard, Nov 18, 2009
    #4
