pruning a Zip

Roedy Green

I compress a large number of files to create a ZIP backup on DVD and
USB flash drive. I do it from scratch each day to avoid accumulating
deadwood.

I have repeatedly asked the WinZip people to provide a way to prune
ZIP files of members that no longer exist. They have always refused
saying I was the only one asking. If they would comply, I could do an
update/prune rather than compressing the entire set from scratch, a
very time-consuming process.

I thought to myself: I need to write a utility to do this. I started
thinking about how to do it efficiently. Here are some of the ideas I
had:

1. Create a list of files that need to be pruned and feed it to WinZip
as an @list and trust it to prune efficiently.

2. Use the Java zip API to delete, and trust it to wait until a batch
of deletes has been specified before writing a new, more compact file
and renaming. (I have not experimented with it to see how bright it
is about multiple deletes from a very large zip file.)

3. Study up on the Zip file format, create a new file with just
the kept members and a freshly composed index at the end. This would
be guaranteed to be efficient, would be the most fun, but would take
the longest to code.

Any thoughts?
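
Whichever of the three ideas is used, the first step is finding the members that no longer exist on disk. A minimal Java sketch of that step follows. It assumes member names are stored relative to a single source root; StaleMembers, findStale and sourceRoot are hypothetical names for illustration, not part of any tool mentioned in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class StaleMembers {
    /**
     * Returns the names of archive members that have no matching file
     * under sourceRoot. Assumes member names are relative to sourceRoot.
     */
    static List<String> findStale(Path zip, Path sourceRoot) throws IOException {
        List<String> stale = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // only regular members are checked here
                }
                if (!Files.exists(sourceRoot.resolve(entry.getName()))) {
                    stale.add(entry.getName());
                }
            }
        }
        return stale;
    }
}
```

The returned names could be written out one per line to serve as the list of members to prune.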

--
Roedy Green Canadian Mind Products
http://mindprod.com

"Species evolve exactly as if they were adapting as best they could to a changing world, and not at all as if they were moving toward a set goal."
~ George Gaylord Simpson
 
Andrew Thompson

Roedy Green wrote:
I compress a large number of files to create a ZIP backup on DVD and
USB flash drive. I do it from scratch each day to avoid accumulating
deadwood.

With the price of bytes on permanent/removable media
so low, I would tend to go for an Ant task to check
the file updates/deletes required, then copy them
entirely uncompressed (says he who has a 320 Gig IDE
drive as a 'portable backup tool').

YMMV
 
Roedy Green


TrueZip looks very promising. The ZIP just looks like a directory. You
can do your normal things to the "files" in it, and when you are done
you call update, which then constructs a new zip with all your
changes.

The thing I wanted to strenuously avoid was creating a new zip file
on every delete, since it would require copying the entire archive
every time.
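
For comparison, the JDK's own zip file system provider (Java 7 and later) offers a similar "archive as a directory" view. The sketch below is not TrueZip and does not use its API. It assumes the provider defers rewriting the archive until the file system is closed, which is worth measuring on a large file before relying on it. ZipPrune and prune are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class ZipPrune {
    /**
     * Deletes the named members from the archive. All deletes are issued
     * against one open zip file system, so any rewrite of the archive is
     * left to the provider when the file system is closed.
     */
    static void prune(Path zip, List<String> staleNames) throws IOException {
        // The cast selects the (Path, ClassLoader) overload on newer JDKs.
        try (FileSystem zfs = FileSystems.newFileSystem(zip, (ClassLoader) null)) {
            for (String name : staleNames) {
                Files.deleteIfExists(zfs.getPath(name));
            }
        } // closing the file system saves the changes
    }
}
```

Calling prune once with the whole list of stale members, rather than once per member, keeps the number of open/close cycles on the archive to one.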
 
Andreas Leitgeb

Roedy Green said:
The thing I wanted to strenuously avoid was creating a new zip file
on every delete, since it would require copying the entire archive
every time.

I don't think you'll get around *copying* the contents of the zip file,
or otherwise you'll accumulate holes in the file.

I guess it's already a big win, if those files already compressed do
not need to be re-compressed, and the copying takes place only once
for a whole bunch of removed files, rather than for each one.
 
Roedy Green

Andreas Leitgeb wrote:
I don't think you'll get around *copying* the contents of the zip file,
or otherwise you'll accumulate holes in the file.

Doing it all at once at the end the way TrueZip does is a big win for
multiple deletes. You copy approximately N bytes where N is the size
of the final Zip.

If you did it the flat-footed way and collapsed the zip on every
delete, you would end up copying roughly N * d bytes where d is the
number of deletes.

I don't know if WinZip or Java is flat-footed. N for one of my files
is 260 MB, so it makes a big difference.
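
To put numbers on the estimate above, here is a small Java sketch. The 260 MB figure is from this post; the count of 100 deletes is a made-up example value, and CopyCost is a hypothetical name. N * d is a rough upper estimate, since the archive shrinks a little with each delete.

```java
final class CopyCost {
    /** Bytes copied when the archive is compacted once, after all deletes. */
    static long batched(long archiveBytes) {
        return archiveBytes;
    }

    /** Rough estimate of bytes copied when the archive is compacted after every delete. */
    static long perDelete(long archiveBytes, int deletes) {
        return archiveBytes * deletes;
    }

    public static void main(String[] args) {
        long n = 260L * 1024 * 1024; // 260 MB archive
        int d = 100;                 // hypothetical number of deletes
        System.out.println("batched:    " + batched(n) + " bytes");
        System.out.println("per delete: " + perDelete(n, d) + " bytes");
    }
}
```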
 
Arne Vajhøj

Roedy said:
Doing it all at once at the end the way TrueZip does is a big win for
multiple deletes. You copy approximately N bytes where N is the size
of the final Zip.

If you did it the flat-footed way and collapsed the zip on every
delete, you would end up copying roughly N * d bytes where d is the
number of deletes.

I don't know if WinZip or Java is flat-footed. N for one of my files
is 260 MB, so it makes a big difference.

I don't even think java.util.zip supports delete.

Arne
 
Wojtek

Roedy Green wrote:
I compress a large number of files to create a ZIP backup on DVD and
USB flash drive. I do it from scratch each day to avoid accumulating
deadwood.

I have repeatedly asked the WinZip people to provide a way to prune
ZIP files of members that no longer exist. They have always refused
saying I was the only one asking. If they would comply, I could do an
update/prune rather than compressing the entire set from scratch, a
very time-consuming process.

Do you need a single file which is zipped? Or is a bunch of files which
are zipped ok?

Scan through your source files and the zipped files (same dir
structure), and delete any zip files which do not have a corresponding
source file. At the same time, zip up any source files which do not
have a file in the zip structure.
 
Roedy Green

Arne Vajhøj wrote:
I don't even think java.util.zip supports delete.

I could not find anything. There was ZipFile.OPEN_DELETE, but the
documentation made no sense.
--
Roedy Green Canadian Mind Products
http://mindprod.com

"It wasn’t the Exxon Valdez captain’s driving that caused the Alaskan oil spill. It was yours."
~ Greenpeace advertisement New York Times 1990-02-25
 
Roedy Green

Wojtek wrote:
Scan through your source files and the zipped files (same dir
structure), and delete any zip files which do not have a corresponding
source file. At the same time, zip up any source files which do not
have a file in the zip structure.

This is what I have done using TrueZip which supports bulk deletes
efficiently. The code is about 80% done. So far it maintains a
directory tree. Now I need to insert the TrueZip API which simulates
the directory tree with an Archive file.

It works by producing two sorted arrays of Files:

1. what should be in the file

2. what is in the archive already

It does a merge doing an add, update, delete or nothing depending on
the matching.
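
The merge described above can be sketched in Java roughly as follows. This is not the code of the posted utility. Item is a hypothetical stand-in for a File, and deciding "update" by comparing last-modified times is an assumption; the post does not say how a changed file is detected.

```java
import java.util.ArrayList;
import java.util.List;

final class MirrorMerge {
    /** A name plus last-modified time; a hypothetical stand-in for java.io.File. */
    static final class Item {
        final String name;
        final long modified;

        Item(String name, long modified) {
            this.name = name;
            this.modified = modified;
        }
    }

    /** Walks two lists sorted by name and returns the actions needed to sync the archive. */
    static List<String> plan(List<Item> wanted, List<Item> archived) {
        List<String> actions = new ArrayList<>();
        int w = 0;
        int a = 0;
        while (w < wanted.size() || a < archived.size()) {
            int cmp;
            if (w == wanted.size()) {
                cmp = 1;   // only archive members remain
            } else if (a == archived.size()) {
                cmp = -1;  // only source files remain
            } else {
                cmp = wanted.get(w).name.compareTo(archived.get(a).name);
            }

            if (cmp < 0) {
                actions.add("ADD " + wanted.get(w++).name);
            } else if (cmp > 0) {
                actions.add("DELETE " + archived.get(a++).name);
            } else {
                Item src = wanted.get(w++);
                Item old = archived.get(a++);
                if (src.modified > old.modified) {
                    actions.add("UPDATE " + src.name);
                }
                // otherwise the member is current and nothing is done
            }
        }
        return actions;
    }
}
```

Because both lists are sorted, each is traversed once, so the planning pass is linear in the total number of names.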
 
Wojtek

Roedy Green wrote:
This is what I have done using TrueZip which supports bulk deletes
efficiently. The code is about 80% done. So far it maintains a
directory tree. Now I need to insert the TrueZip API which simulates
the directory tree with an Archive file.

It works by producing two sorted arrays of Files:

1. what should be in the file

2. what is in the archive already

It does a merge doing an add, update, delete or nothing depending on
the matching.

So effectively you are doing a mirror, except with a ZIP as the target.

I know that a ZIP file has some overhead associated with its internal
tables, but is the added space worth the effort of keeping a single
file?

So a single ZIP file contains X + ZIP table bytes, whereas a directory
tree of ZIP files contains file count * (X + ZIP table) bytes. Plus you
have the added overhead of creating/deleting file allocation entries.

Then you have the processing time of creating a single file (add/delete
entries) vs creating a number of smaller files.

I back up my source code each night. The total size is around 500 MB with
thousands of small files. I use a Windows program named ROBOCOPY which
does a mirror copy (it deletes deleted files, copies newer/new files).
Usually about 5 MB of data is copied, but all of it is scanned for file
date/time. This takes around 2 minutes to complete over a 100 Mbit LAN to a
server.
 
Roedy Green

Wojtek wrote:

So effectively you are doing a mirror, except with a ZIP as the target.

I know that a ZIP file has some overhead associated with its internal
tables, but is the added space worth the effort of keeping a single
file?

The alternative is the overhead of the directory structure, and the
cluster tips on each file. If your cluster size is 4096, you waste
approximately 2048 bytes per file. This can really add up when you
have a large number of small files. Then the archive pays off, even if
you don't compress.
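
The cluster-tip waste can be worked out per file, as in the Java sketch below. The figure of about 2048 bytes per file is the average over many files when the cluster size is 4096 bytes. ClusterSlack and its method names are hypothetical, and the sketch assumes every non-empty file occupies whole clusters.

```java
final class ClusterSlack {
    /** Bytes wasted in the last, partly filled cluster of one file. */
    static long slack(long fileSize, long clusterSize) {
        long remainder = fileSize % clusterSize;
        return remainder == 0 ? 0 : clusterSize - remainder;
    }

    /** Total bytes wasted across a set of files stored individually. */
    static long totalSlack(long[] fileSizes, long clusterSize) {
        long total = 0;
        for (long size : fileSizes) {
            total += slack(size, clusterSize);
        }
        return total;
    }
}
```

For example, a 5000-byte file on 4096-byte clusters occupies two clusters and wastes 3192 bytes.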

The key for me is squeezing the backup down to fit on one 4.7 GB
DVD. Then the process can run unattended, kicking off a defrag and
an index afterwards.

I like to get backups away from the computer so anyone trashing my
machine or stealing it won't get all the backups too.

 
Wojtek

Roedy Green wrote:
The alternative is the overhead of the directory structure, and the
cluster tips on each file. If your cluster size is 4096, you waste
approximately 2048 bytes per file. This can really add up when you
have a large number of small files. Then the archive pays off, even if
you don't compress.

"ISO 9660 - CD-ROM Specifications

The smallest entity in the CD format is called a frame, and holds 24
bytes. Data in a CD-ROM are organized in both frames and sectors. A
CD-ROM sector contains 98 frames, and holds 2352 bytes."

http://www.experiencefestival.com/a/ISO_9660_-_Specifications/id/5151842

Mind you, you still get wasted space, just not as much...

If you are using Windows, then right-click the root directory of your
archive and choose "Properties". The display will tell you "Size" and
"Size on Disk", the difference being sector waste.
 
Roedy Green

Roedy Green wrote:
I compress a large number of files to create a ZIP backup on DVD and
USB flash drive. I do it from scratch each day to avoid accumulating
deadwood.

The backup utility is now posted. It is considerably faster than
zipping the whole thing from scratch the way I used to do it. See
http://mindprod.com/products1.html#BACKUPTOZIP

As usual, it is free and comes with complete source.

 
