ZipFile - file adding API incomplete?


Glenn Maynard

I want to do something fairly simple: read files from one ZIP and add
them to another, so I can remove and replace files. This led me to a
couple things that seem to be missing from the API.

The simple approach would be to open each file in the source ZIP, and
hand it off to newzip.write(). There's a missing piece, though:
there's no API that lets me pass in a file-like object and a ZipInfo,
to preserve metadata. zip.write() only takes the filename and
compression method, not a ZipInfo; writestr takes a ZipInfo but only
accepts a string, not a file. Is there an API call I'm missing?
(This seems like the fundamental API for adding files, that write and
writestr should be calling.)
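
For reference, the best I've come up with using the current API is
roughly this (it buffers each member in memory and recompresses it;
the filenames are just placeholders):

# Copy everything except one member, recompressing along the way.
import zipfile

src = zipfile.ZipFile("old.zip", "r")
dst = zipfile.ZipFile("new.zip", "w")
for info in src.infolist():
    if info.filename == "remove_me.txt":
        continue  # the member being deleted
    data = src.read(info.filename)   # decompresses the whole member
    dst.writestr(info, data)         # recompresses; the ZipInfo keeps the metadata
dst.close()
src.close()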

The correct approach is to copy the data directly, so it's not
recompressed. This would need two new API calls: rawopen(), acting
like open() but returning a direct file slice and not decompressing
data; and rawwrite(zinfo, file), to pass in pre-compressed data, where
the compression method in zinfo matches the compression type used.
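
Something like this, in other words (rawopen() and rawwrite() don't
exist; this is just how I'd picture using them):

# Copy members between ZIPs without decompressing or recompressing.
import zipfile

src = zipfile.ZipFile("old.zip", "r")
dst = zipfile.ZipFile("new.zip", "w")
for info in src.infolist():
    raw = src.rawopen(info)     # compressed bytes, straight off the disk
    dst.rawwrite(info, raw)     # written as-is; info.compress_type already
                                # describes how the data is compressed
dst.close()
src.close()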

I was surprised that I couldn't find the former. The latter is an
advanced one, important for implementing any tool that modifies large
ZIPs. Short-term, at least, I'll probably implement these externally.
 

Diez B. Roggisch

Glenn said:
<snip>
there's no API that lets me pass in a file-like object and a ZipInfo,
to preserve metadata. zip.write() only takes the filename and
compression method, not a ZipInfo; writestr takes a ZipInfo but only
accepts a string, not a file. Is there an API call I'm missing?
<snip>
The correct approach is to copy the data directly, so it's not
recompressed. This would need two new API calls: rawopen(), acting
like open() but returning a direct file slice and not decompressing
data; and rawwrite(zinfo, file), to pass in pre-compressed data, where
the compression method in zinfo matches the compression type used.
<snip>

No idea why the write API doesn't accept an open file - OTOH, as passing a
string is just


writestr(info, in_file.read())


I don't think that's *that* much of an inconvenience.

And regarding your second idea: can that really work? Intuitively, I
would have thought that compression is adaptive, and based on prior
additions to the file. I might be wrong about this, though.

Diez
 

Dave Angel

Diez said:
<snip>

And regarding your second idea: can that really work? Intuitively, I
would have thought that compression is adaptive, and based on prior
additions to the file. I might be wrong about this, though.

I'm pretty sure that the ZIP format uses independent compression for
each contained file (member). You can add and remove members from an
existing ZIP, and use several different compression methods within the
same file. So the adaptive tables start over for each new member.
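
You can see this with the zipfile module itself: mixing compression
methods member-by-member in one archive works fine (rough sketch):

import zipfile

z = zipfile.ZipFile("mixed.zip", "w")
info = zipfile.ZipInfo("a.txt")
info.compress_type = zipfile.ZIP_STORED       # this member is stored as-is
z.writestr(info, "stored as-is")
info = zipfile.ZipInfo("b.txt")
info.compress_type = zipfile.ZIP_DEFLATED     # this one is deflated independently
z.writestr(info, "deflated independently")
z.close()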

What isn't so convenient is that the sizes are apparently at the end.
So if you're trying to unzip "over the wire" you can't readily do it
without somehow seeking to the end. That same feature is a good thing
when it comes to spanning zip files across multiple disks.

The zip file format is documented on the net, but I haven't read the
spec in at least 15 years.

DaveA
 

Glenn Maynard

I'm pretty sure that the ZIP format uses independent compression for each
contained file (member).  You can add and remove members from an existing
ZIP, and use several different compression methods within the same file.  So
the adaptive tables start over for each new member.

This is correct. It doesn't do solid compression, which is what you
get with .tar.gz (and RARs, optionally).

What isn't so convenient is that the sizes are apparently at the end.  So if
you're trying to unzip "over the wire" you can't readily do it without
somehow seeking to the end.  That same feature is a good thing when it comes
to spanning zip files across multiple disks.

Actually, there are two copies of the headers: one immediately before
the file data (the local file header), and one at the end (the central
directory); both contain copies of the compressed and uncompressed
file size. Very few programs actually use the local file headers, but
it's very nice to have the option. It also helps make ZIPs very
recoverable. If you've ever run a ZIP recovery tool, they're usually
just reconstructing the central directory from the local file headers
(and probably recomputing the CRCs).

(This is no longer true if bit 3 of the bitflags is set, which puts
the CRC and filesizes after the data. In that case, it's not possible
to stream data--largely defeating the benefit of the local headers.)
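
If you're curious, the local file header is simple enough to pick apart
by hand. A rough sketch (field layout per the published APPNOTE spec;
assumes bit 3 isn't set, so the sizes really are up front):

import struct

f = open("file.zip", "rb")
fixed = f.read(30)                      # fixed-size part of the local header
(sig, version, flags, method, mtime, mdate,
 crc, csize, usize, name_len, extra_len) = struct.unpack("<IHHHHHIIIHH", fixed)
assert sig == 0x04034b50                # "PK\x03\x04", local file header signature
name = f.read(name_len)                 # followed by the extra field, then the data
# csize and usize are the compressed/uncompressed sizes, available here
# before the member data ever starts (unless bit 3 of flags is set).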

Define a call to read _portions_ of the raw (compressed, encrypted, whatever) data.

I think the clean way is to return a file-like object for a specified file, e.g.:

# Read raw bytes 1024-1152 from each file in the ZIP:
zip = ZipFile("file.zip", "r")
for info in zip.infolist():
    f = zip.rawopen(info)  # or a filename
    f.seek(1024)
    f.read(128)

Define a call that locks the ZipFile object and returns a write handle for a single new file.

I'd use a file-like object here, too, for probably obvious
reasons--you can pass it to anything expecting a file object to write
data to (e.g. shutil.copyfileobj).

Only on successful close of the "write handle" is the new directory written.

Rather, when the new file is closed, its directory entry is saved to
ZipFile.filelist. The new directory on disk should be written when
the zip's own close() method is called, just as when writing files
with the other methods. Otherwise, writing lots of files in this way
would write and overwrite the central directory repeatedly.

Any thoughts about this rough API outline:

ZipFile.rawopen(zinfo_or_arcname)
Same definition as open(), but returns the raw data. No mode (no
newline translation for raw files); no pwd (raw files aren't
decrypted).

ZipFile.writefile(zinfo[, raw])
Definition like ZipFile.writestr. Relax writestr()'s "at least the
filename, date, and time must be given" rule: if not specified, use
the current date and time. Returns a file-like object (ZipWriteFile)
to which file data is written. If raw is True, no actual compression
is performed, and the file data should already be compressed with the
specified compression type (no checking is performed). If raw is
False (the default), the data will be compressed before being written.
When finished writing data, the file must be closed. Only one
ZipWriteFile may be open for each ZipFile at a time. Calls to
ZipFile.writefile while a ZipWriteFile is already open will result in
ValueError[1].
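
To make that concrete, usage might look something like this
(writefile() and ZipWriteFile are just the proposed names, nothing
that exists today):

import shutil, zipfile

z = zipfile.ZipFile("out.zip", "w")
info = zipfile.ZipInfo("big.dat")        # date/time would default to "now"
info.compress_type = zipfile.ZIP_DEFLATED
out = z.writefile(info)                  # proposed call, returns a ZipWriteFile
in_file = open("big.dat", "rb")
shutil.copyfileobj(in_file, out)         # stream; nothing held fully in memory
in_file.close()
out.close()                              # directory entry recorded on close
z.close()                                # central directory written here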

Another detail: is the CRC recomputed when writing in raw mode? No.
If I delete a file from a ZIP (causing me to rewrite the ZIP) and
another file in the ZIP is corrupt, it should just move the file
as-is, invalid CRC and all; it should not rewrite the file with a new
CRC (masking the corruption) or throw an error (I should not get
errors about file X being corrupt if I'm deleting file Y). When
writing in raw mode, if zinfo.CRC is already specified (not None), it
should be used as-is.

I don't like how this results in three different APIs for adding data
(write, writestr, writefile), but trying to squeeze the APIs together
feels unnatural--the parameters don't really line up too well. I'd
expect the other two to become thin wrappers around
ZipFile.writefile(). This never opens files directly like
ZipFile.write, so it only takes a zinfo and not a filename (set the
filename through the ZipInfo).

Now you can stream data into a ZIP, specify all metadata for the file,
and you can stream in compressed data from another ZIP (for deleting
files and other cases) without recompressing. This also means you can
do all of these things to encrypted files without the password, and to
files compressed with unknown methods, which is currently impossible.

and I realize that the big flaw in this design is that from the moment
you start overwriting the existing master directory until you write
a new master at the end, you do not have a valid zip file.

The same is true when appending to a ZIP with ZipFile.write(); until
it finishes, the file on disk isn't a valid ZIP. That's unavoidable.
Files in the ZIP can still be opened by the existing ZipFile object,
since it keeps the central directory in memory.

For what it's worth, I've written ZIP parsing code several times over
the years (https://svn.stepmania.com/svn/trunk/stepmania/src/RageFileDriverZip.cpp),
so I'm familiar with the more widely-used parts of the file format,
but I haven't dealt with ZIP writing very much. I'm not sure if I'll
have time to get to this soon, but I'll keep thinking about it.

[1] seems odd, but mimicking
http://docs.python.org/library/stdtypes.html#file.close
 
