Creating identical ZIP archives with Java and zip tools

m.niinimaki

Hi,

a simple problem: we have a Java EE based server that accepts ZIP files,
and a Java EE based client that creates and uploads them. We verify
that the ZIP file is identical at client and server by MD5.

I'd like to write a client in some other language; in fact even a
shell/wget script would do. But here's the problem: if I create a ZIP
archive with java.util.zip, it is not identical to an archive
created by command line tools. Here's a simple example:

cat hello.txt
hello
zip hello1.zip hello.txt
java makezip
[creates hello2.zip, code below]
ls -l hello1.zip hello2.zip

-rw-r--r-- 1 x x 174 2010-08-24 12:23 hello1.zip
-rw-r--r-- 1 x x 140 2010-08-24 12:24 hello2.zip

Is there a way of forcing compatibility on either the zip tool (on
Linux) or Java?

TIA,
Mark

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class makezip {
    public static void main(String[] args) {
        String fileToAdd = "hello.txt";
        String outFilename = "hello2.zip";
        byte[] buf = new byte[1024];
        // try-with-resources closes both streams even if an exception is thrown
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(outFilename));
             FileInputStream in = new FileInputStream(fileToAdd)) {
            // Add ZIP entry to output stream.
            out.putNextEntry(new ZipEntry(fileToAdd));
            int len;
            while ((len = in.read(buf)) > 0) {
                out.write(buf, 0, len);
            }
            out.closeEntry();
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}
 
Lew

m.niinimaki said:
a simple problem: we have a Java EE based server that accepts ZIP files,
and a Java EE based client that creates and uploads them. We verify
that the ZIP file is identical at client and server by MD5.

I'd like to write a client in some other language; in fact even a
shell/wget script would do. But here's the problem: if I create a ZIP
archive with java.util.zip, it is not identical to an archive
created by command line tools. Here's a simple example:

cat hello.txt
hello
zip hello1.zip hello.txt
java makezip
[creates hello2.zip, code below]
ls -l hello1.zip hello2.zip

-rw-r--r-- 1 x x 174 2010-08-24 12:23 hello1.zip
-rw-r--r-- 1 x x 140 2010-08-24 12:24 hello2.zip

Is there a way of forcing compatibility on either the zip tool (on
Linux) or Java?

You have to set the compression to be the same for both. You can't even
guarantee that client and server get the same result if they use the same tool
otherwise.
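
A minimal sketch of what "setting the compression to be the same" could look like on the Java side, assuming the archive is built in memory with java.util.zip. The class name FixedSettingsZip and the constant FIXED_TIME are made up for illustration. Fixing the level and the entry timestamp removes the two most obvious sources of run-to-run variation within one library version; it does not make the output match a different tool or a different deflate implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class FixedSettingsZip {
    // Hypothetical fixed timestamp (milliseconds since the epoch, an arbitrary
    // instant in August 2010). Any constant after 1980 would do.
    static final long FIXED_TIME = 1282651380000L;

    /** Builds a one-entry archive with an explicit level and timestamp. */
    static byte[] build(String name, byte[] data, int level) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.setLevel(level);        // same compression level on every run
            ZipEntry entry = new ZipEntry(name);
            entry.setTime(FIXED_TIME);  // otherwise the current time is stored
            out.putNextEntry(entry);
            out.write(data);
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "hello\n".getBytes(StandardCharsets.UTF_8);
        byte[] first = build("hello.txt", data, Deflater.BEST_COMPRESSION);
        byte[] second = build("hello.txt", data, Deflater.BEST_COMPRESSION);
        System.out.println("identical: " + Arrays.equals(first, second));
    }
}
```

Without the setTime call the entry carries the time at which it was written, so two otherwise identical runs a few seconds apart can already produce different bytes.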

Regardless, you're doing it wrong. You don't repeat the zip process on the
server, you repeat the MD5 calculation. Then you don't care if they use your
custom Java, WinZIP, jar, arc or what-have-you. The server compares the MD5
hash (or whatever hash you choose) that it calculates to the one provided by
the client.

Think about how you verify the hash of downloads. You don't recompress
things, you compare the hash value they provide to one you calculate.

You don't need compatible zip implementations. You don't even need to use the
same compressor. You don't even need to compress the files at all. Whichever
route you take, you can successfully compare well-defined hash values like MD5.
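
As a rough sketch of that check, assuming the upload arrives as a byte array together with a hex MD5 string computed by the client (the names UploadHashCheck, md5Hex and accept are illustrative, not part of any existing API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class UploadHashCheck {
    /** Returns the lowercase hex MD5 digest of the given bytes. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Server side: accept the upload only if its hash matches the client's claim. */
    static boolean accept(byte[] receivedBytes, String claimedMd5Hex) {
        return md5Hex(receivedBytes).equalsIgnoreCase(claimedMd5Hex);
    }

    public static void main(String[] args) {
        byte[] upload = "any bytes at all, zipped or not".getBytes(StandardCharsets.UTF_8);
        String claimed = md5Hex(upload);  // computed by the client before sending
        System.out.println("accepted: " + accept(upload, claimed));
    }
}
```

The server never looks inside the upload here. It hashes the bytes it received, so a shell client could produce the claimed value with any MD5 tool.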
 
Screamin Lord Byron

Lew wrote:
Regardless, you're doing it wrong. You don't repeat the zip process on
the server, you repeat the MD5 calculation.

Maybe the server can change the zip contents. In that case the zip files
themselves must be the same on both client and server if their contents are
the same, of course, or hashing must be done on the contents themselves.
 
BGB / cr88192

m.niinimaki said:
Hi,

a simple problem: we have a Java EE based server that accepts ZIP files,
and a Java EE based client that creates and uploads them. We verify
that the ZIP file is identical at client and server by MD5.

I'd like to write a client in some other language; in fact even a
shell/wget script would do. But here's the problem: if I create a ZIP
archive with java.util.zip, it is not identical to an archive
created by command line tools. Here's a simple example:

ZIP+Deflate is internally non-trivial, and apart from validating the
decompressed data, what is required can't actually be done in a general
sense (unless, of course, all parties involved use the same
implementation and version of both the deflater and the zip code, ...).

basically, it is analogous to running the same source code through several
different compilers, and expecting the exact same results in each case.


part of the reason is that deflate is not just some simple transform of the
input data to the output compressed data, but actually involves a fair
amount of internal pattern matching and heuristics, and between
implementations there will typically be many minor variations in
pattern-matching and heuristic behaviors.

examples of variations:
greedy vs non-greedy string matching (always match the longest run up-front, or
allow a shorter match if the compressor guesses this will lead to a longer
match later, ...);
how far to search backwards, and the exact string lengths to check for
along the way, ... (such as due to performance tradeoffs, where always doing
max depth at max length will tend to be a little slow, especially in
non-greedy implementations which may do a lot of extra matches in searching
for the "best" strings to match, ...).

within ZIP, there is also the matter of exact field settings, ...

the result then is that there tends to be some amount of internal variation
between compressed files.
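
A small sketch of that variation, assuming java.util.zip's Deflater and Inflater and using the compression level as a stand-in for "a different implementation" (the class name DeflateVariation is made up). The same input produces different compressed bytes at different levels, yet both streams inflate back to the identical original.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DeflateVariation {
    /** Compresses the input at the given level (0..9). */
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses a stream produced by deflate(). */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt deflate stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] input = new byte[2000];
        Arrays.fill(input, (byte) 'a');
        byte[] stored = deflate(input, Deflater.NO_COMPRESSION);
        byte[] best = deflate(input, Deflater.BEST_COMPRESSION);
        System.out.println("compressed sizes: " + stored.length + " vs " + best.length);
        System.out.println("same content: "
                + Arrays.equals(inflate(stored), inflate(best)));
    }
}
```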
 
Lew

Screamin said:
Maybe the server can change the zip contents. In that case the zip files
themselves must be the same on both client and server if their contents are
the same, of course, or hashing must be done on the contents themselves.

I can't parse your remarks. If the server changes the zip contents,
then the files will differ between client and server, no? The case
where the server changes the contents is the case where their contents
are not the same, of course.

In any case, as BGB and I pointed out, even with the same zip engine
you can get different results, much less between different products.
The only reliable way is to use a consistent algorithm like MD5 to
check the results. The check is to compare the hash of the received
file with the other end's calculation of that hash, not to redo the
file and recompute an original hash. Where's the confirmation in
that?
 
Screamin Lord Byron

I can't parse your remarks. If the server changes the zip contents,
then the files will differ between client and server, no?

I suppose I wasn't clear enough.
The case
where the server changes the contents is the case where their contents
are not the same, of course.

Consider this case:

Client sends A.ZIP to server.
Client adds file X.TXT to A.ZIP.

Server adds the file X.TXT (same data) to the received A.ZIP (so it must
recompress it).

Client sends the new A.ZIP again (or its hash).

Server sees hashes are different (because compression differs from
client to server) and concludes the files must be different, which is
true for zip files, but the files inside those zips are in fact the same.

So, in this case the solution would be the hashing of the uncompressed
contents of a zip in a reproducible fashion, and sending that hash
instead of the zip file hash.
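
One way such a reproducible content hash might look, as a sketch only: read every file entry, sort by entry name, and digest the names and uncompressed bytes. The class ZipContentHash, the length-prefixed layout and the zipOf demo helper are assumptions made for illustration, not an established format. Both sides would have to agree on exactly this scheme.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipContentHash {
    /** MD5 over the (name, data) pairs of all file entries, sorted by entry name. */
    static String contentMd5Hex(byte[] zipBytes) {
        Map<String, byte[]> files = new TreeMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buf = new byte[4096];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream data = new ByteArrayOutputStream();
                int n;
                while ((n = in.read(buf)) > 0) {
                    data.write(buf, 0, n);
                }
                files.put(entry.getName(), data.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        for (Map.Entry<String, byte[]> file : files.entrySet()) {
            byte[] name = file.getKey().getBytes(StandardCharsets.UTF_8);
            // Length prefixes keep the name/data boundaries unambiguous.
            md5.update(ByteBuffer.allocate(4).putInt(name.length).array());
            md5.update(name);
            md5.update(ByteBuffer.allocate(4).putInt(file.getValue().length).array());
            md5.update(file.getValue());
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Demo helper: zips the given name-to-text map, in map order, at the given level. */
    static byte[] zipOf(Map<String, String> files, int level) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.setLevel(level);
            for (Map.Entry<String, String> file : files.entrySet()) {
                out.putNextEntry(new ZipEntry(file.getKey()));
                out.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("X.TXT", "same data\n");
        byte[] clientZip = zipOf(files, Deflater.BEST_COMPRESSION);
        byte[] serverZip = zipOf(files, Deflater.NO_COMPRESSION);
        System.out.println("content hashes match: "
                + contentMd5Hex(clientZip).equals(contentMd5Hex(serverZip)));
    }
}
```

This hash deliberately ignores timestamps, entry order and compression settings, so it answers "are the files inside the same" rather than "are the archives the same".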
 
Lew

Screamin said:
Consider this case:

Client sends A.ZIP to server.
Client adds file X.TXT to A.ZIP.

Server adds the file X.TXT (same data) to the received A.ZIP (so it must
recompress it).

Client sends the new A.ZIP again (or its hash).

Server sees hashes are different (because compression differs from
client to server) and concludes the files must be different, which is
true for zip files, but the files inside those zips are in fact the same.

So, in this case the solution would be the hashing of the uncompressed
contents of a zip in a reproducible fashion, and sending that hash
instead of the zip file hash.

No, the solution is to send the hash of the new zip file (with X.TXT
included) along with the new zip file and have the other end confirm
that the hash of its received file calculates to the same value.

If the "Client sends new A.ZIP again" it needs to send the hash with
it. You don't duplicate the zip on both sides! You duplicate the
calculation of the hash!

This nonsense about creating the same changes on two sides is rococo
to the extreme.
 
Screamin Lord Byron

No, the solution is to send the hash of the new zip file (with X.TXT
included) along with the new zip file and have the other end confirm
that the hash of its received file calculates to the same value.

Which it won't be in the presented case. In any other case I would first
send only the hash (with some id of the file) and let the server decide
whether it already has that unchanged file. If it does, there is no need
to send it again.

If the "Client sends new A.ZIP again" it needs to send the hash with
it. You don't duplicate the zip on both sides! You duplicate the
calculation of the hash!

This nonsense about creating the same changes on two sides is rococo
to the extreme.

OP did say that he needs to check whether the files are the same on both
the client and the server. Why he would want to do that, I don't know. I
just provided one case in which the exact result of the compression
algorithm matters: same contents, different hashes (which is what
bothered him in the first place).

Your suggestion (which is the usual and quite obvious way you would do
it if you didn't have compressed files whose contents might change on
both sides) doesn't work in that case, however much rococo nonsense that
case might be.
 
Roedy Green

I'd like to write a client in some other language; in fact even a
shell/wget script would do. But here's the problem: if I create a ZIP
archive with java.util.zip, it is not identical to an archive
created by command line tools. Here's a simple example:

This is true of ANY two command line or library tools. The best you
can hope for is you get the same contents when you fluff them back up.

Each utility is using its proprietary tweaks to the compression.

Further, Java fails to fill in all the indexing fields.

If you need binary identicality, you will need to run your zipper via
the commandline/exec interface. See
http://mindprod.com/jgloss/exec.html
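
A rough sketch of the exec route, assuming an Info-ZIP style `zip` executable is on the PATH (that, and the class name ExternalZip, are assumptions; nothing here is taken from the linked page). It only shows how Java could delegate to the same command line tool a shell client would use.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExternalZip {
    /** Builds the argument list for: zip <archive> <files...>. */
    static List<String> command(String archive, String... files) {
        List<String> cmd = new ArrayList<>();
        cmd.add("zip");  // assumption: a zip executable is on the PATH
        cmd.add(archive);
        cmd.addAll(Arrays.asList(files));
        return cmd;
    }

    /** Runs the command, forwarding its output, and returns the exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java ExternalZip <archive> <file>...");
            return;
        }
        String[] files = Arrays.copyOfRange(args, 1, args.length);
        int exit = run(command(args[0], files));
        System.out.println("zip exited with " + exit);
    }
}
```

Even then, the archive records each file's modification time, so byte-identical output also depends on the input files carrying the same timestamps on both machines.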

See http://mindprod.com/jgloss/truezip.html for a multiplatform Zip a
bit fancier than the one that comes bundled.
 
steph

Which it won't be in the presented case. In any other case I would first
send only the hash (with some id of the file) and let the server decide
whether it already has that unchanged file. If it does, there is no need
to send it again.



OP did say that he needs to check whether the files are the same on both
the client and the server. Why he would want to do that, I don't know. I
just provided one case in which the exact result of the compression
algorithm matters: same contents, different hashes (which is what
bothered him in the first place).

Your suggestion (which is the usual and quite obvious way you would do
it if you didn't have compressed files whose contents might change on
both sides) doesn't work in that case, however much rococo nonsense that
case might be.

The zip format already contains a checksum (CRC-32) for each file in the
archive. These checksums are verified by unzip tools.
So if the server gets the hash (such as SHA-1) of the zip and verifies it,
and then, if that is OK, unzips the content, it can be sure the files are
the same as on the client.
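
A sketch of that per-entry check with java.util.zip, assuming the archive is available as a byte array (the names ZipCrcCheck, allEntriesIntact and zipOf are illustrative). It recomputes the CRC-32 of each entry's uncompressed data and compares it with the value stored in the archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipCrcCheck {
    /** True if every entry's recomputed CRC-32 equals the one stored in the archive. */
    static boolean allEntriesIntact(byte[] zipBytes) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buf = new byte[4096];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                CRC32 crc = new CRC32();
                int n;
                while ((n = in.read(buf)) > 0) {
                    crc.update(buf, 0, n);
                }
                // The stored value is known once the entry has been read to its end.
                if (entry.getCrc() != crc.getValue()) {
                    return false;
                }
            }
            return true;
        } catch (IOException e) {
            // java.util.zip reports corrupt data as an exception (ZipException
            // is an IOException), so treat that as "not intact" as well.
            return false;
        }
    }

    /** Demo helper: builds a one-entry archive in memory at the given level. */
    static byte[] zipOf(String name, byte[] data, int level) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.setLevel(level);
            out.putNextEntry(new ZipEntry(name));
            out.write(data);
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "hello\n".getBytes(StandardCharsets.UTF_8);
        byte[] zip = zipOf("hello.txt", data, 9);
        System.out.println("intact: " + allEntriesIntact(zip));
    }
}
```

CRC-32 guards against accidental corruption only. It says nothing about whether the archive is the one the client meant to send, which is what the hash of the whole upload is for.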
 
