java.util.zip Limitations

Andrew Thompson

....
Well, it's more than 40 lines,

100 lines for the entire post is pretty good.
----------- sample code here --------------

package kg4nbb.emwin;

Hey, ..what's that in it for?

Did you read the SSCCE link _carefully_? ;-)

(Some of those lines are pretty long too)

I'll try to have a look at your code
tomorrow, but am secretly hoping somebody
else will do it for you before then..
 
Joseph Dionne

Andrew said:
100 lines for the entire post is pretty good.




Hey, ..what's that in it for?

Did you read the SSCCE link _carefully_? ;-)

(Some of those lines are pretty long too)

I'll try to have a look at your code
tomorrow, but am secretly hoping somebody
else will do it for you before then..

I'll just throw this into the mix. jar tvf file.zis works fine.
Perhaps there is a bug in the java.util.zip package.
 
Chris Uppal

Gerry said:
The NWS data stream is sent as packets of information. But the people who
wrote it were not good at packet stuff, so all the packets are exactly
1024 bytes, padded with 0's if necessary. And there's no indicator in the
packet header of the length of the actual data. So, the last packet of the
file (including zip files) has a bunch of extra 0's on the end. My code
removes the 0's if it recognizes (from the file name) that the data is a
text file. But I didn't want to start messing with the various binary file
types, so I leave the 0's in place for all of them.

That's where your problem lies. The Zip structure ends with a record that
itself finishes with a comment encoded as a 2-byte size (in Intel byte order)
followed by that many bytes of comment. So if the Zip file has no comment (the
typical case) the last two bytes will both be 0.

There are two ways of reading Zip files. One is to start at the end, where
there is a "table of contents" which allows random access to the elements of
the Zip. That's what a java.util.zip.ZipFile does, and it reads the table of
contents somewhere as part of its constructor. The 0 padding added by the
weather people is buggering that up, and your attempts to remove the padding
aren't quite right either. You could fix that by analysing the end of the
data yourself, to find how many 0s it *should* have on the end, but that's
messy. And fortunately you don't need to do it.
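
For anyone who does want to analyse the end of the data, here is a minimal sketch of that approach. It relies on the end record described above: a fixed 22-byte structure starting with the signature bytes 50 4B 05 06 and finishing with the 2-byte comment length. The class and method names (ZipEndFinder, trueLength) are made up for illustration, and the backward scan could in principle be fooled by a comment or padding that happens to contain the signature bytes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: works out where the Zip data really ends.
class ZipEndFinder {
    // Size of the fixed part of the end record (everything except the comment).
    static final int END_RECORD_MIN = 22;

    /** Returns the length of the real Zip data, or -1 if no end record is found. */
    static int trueLength(byte[] data) {
        // Scan backwards so that the last end record in the data wins.
        for (int p = data.length - END_RECORD_MIN; p >= 0; p--) {
            if (data[p] == 0x50 && data[p + 1] == 0x4B
                    && data[p + 2] == 0x05 && data[p + 3] == 0x06) {
                // 2-byte comment length in Intel (little-endian) order at offset 20.
                int commentLen = (data[p + 20] & 0xFF) | ((data[p + 21] & 0xFF) << 8);
                int end = p + END_RECORD_MIN + commentLen;
                if (end <= data.length) {
                    return end;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int end = trueLength(data);
        System.out.println(end < 0
                ? "no end record found"
                : (data.length - end) + " bytes of padding after the Zip data");
    }
}
```

Copying the first trueLength(data) bytes to a new file would give something without the trailing padding, which could then be handed to ZipFile.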

The other way of reading a Zip file is to start at the beginning and iterate
over each element, ignoring the table of contents. To do that in Java you use
a java.util.zip.ZipInputStream. ZipInputStream is a rather weird class
(because of the weird nature of the Zip file format). Here's an example of how
to use it (very hacky but I hope it's clear enough) that I've just tested with
the file that Andrew posted a link to (which has 124 bytes of padding on the
end), and it seems to work fine.

-- chris

========== Dezip.java =============
import java.util.zip.*;
import java.io.*;

public class Dezip
{
    public static void
    main(String[] args)
    throws Exception
    {
        // Read the Zip sequentially from the start; the table of
        // contents at the end of the file is never consulted.
        ZipInputStream zis
            = new ZipInputStream(
                new BufferedInputStream(
                    new FileInputStream(
                        args[0])));

        ZipEntry entry;
        byte[] buffer = new byte[1024];

        // getNextEntry() returns null when there are no more entries.
        while ((entry = zis.getNextEntry()) != null)
        {
            int got;
            System.out.println(entry);
            // read() returns -1 at the end of the current entry.
            while ((got = zis.read(buffer)) >= 0)
                System.out.write(buffer, 0, got);
            System.out.println("========");
            zis.closeEntry();
        }
        zis.close();
    }
}
=======================
 
Andrew Thompson

Here's an example of how
to use it (very hacky but I hope it's clear enough) that I've just tested with
the file that Andrew posted a link to (which has 124 bytes of padding on the
end), and it seems to work fine.

Ahhh. Good. After you reported problems
with that link I went back to double check
it. At first it was not working, (even more
oddly, the applet loaded and also failed
to get the Zip).

Next minute it was fine.

The only thing I can put it down to, is my
flakey server being a bit sleepy (shrugs).

If I remember I might remove those files
tonight, no need of them any longer, and
I am sure the world does not need a web-page
in which one can read the weather in ..Samoa
was it? on some particular date!
========== Dezip.java =============
import java.util.zip.*;

Nice example.

I was about to point the OP towards the code
for Ziplet, then realised that not only is
it spread across multiple classes, and contains
a lot of floss he does not need, but ultimately
it _entirely_ relies on the JEditorPane to
actually get the content..

No help at all! ;-)
 
Gerry Wheeler

The other way of reading a Zip file is to start at the beginning and
iterate over each element, ignoring the table of contents. To do that in
Java you use a java.util.zip.ZipInputStream.

Bingo! That works great!

OK, now I can get back to my original program and get these things
unzipped.

Many, many thanks to everyone who has contributed!
 
Joseph Dionne

Gerry said:
Bingo! That works great!

OK, now I can get back to my original program and get these things
unzipped.

Many, many thanks to everyone who has contributed!

I am very glad to see your application is now working, but I believe
it to be only a workaround. Modifying Mr. Wheeler's sample code to use
ZipInputStream instead of ZipFile avoids the exception from the native
ZipFile.open(), and ZipInputStream seems to be able to deal with the
missing comment correctly. (Modified version follows.)

My concern is that there is a bug in ZipFile.open(); I believe
it should be able to handle the same data that ZipInputStream does.

However, I might be being too picky.


import java.io.File;
import java.io.FileInputStream;
import java.util.Enumeration;
import java.util.Date;
import java.util.zip.*;
import java.util.jar.*;

/**
 * This is a test program to attempt to unzip a file from the
 * National Weather Service's EMWIN data stream.
 *
 * Under normal use, the file would arrive in packets within
 * the EMWIN stream. This simplified test assumes the file has
 * already been saved and is ready to be unzipped.
 *
 * @author gwheeler
 */
public class TestZipFile {

    /** Creates a new instance of TestZipFile */
    public TestZipFile() {
    }

    /**
     * The main code, where everything starts.
     *
     * @param args args[0] is the name of the zip file
     */
    public static void main(String[] args) {
        if (args.length == 1) {
            new TestZipFile().test(args[0]);
        }
        else {
            System.err.println("usage: TestZipFile <filename>");
        }
    }


    /**
     * Attempts to unzip the specified file.
     */
    private void test(String filename) {
        try {
            File f = new File(filename);
            if (f.exists()) {
                // Here's where the problem usually occurs.

                System.out.println("attempting to open " + f);
                /*
                // Won't work (sometimes)!
                ZipFile z = new ZipFile(f);
                */
                ZipInputStream z = new ZipInputStream(
                        new FileInputStream(f));

                System.out.println("zip file " + f + " opened");

                // Just show what's in it.
                /*
                Enumeration entries = z.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = (ZipEntry)entries.nextElement();
                    System.out.println(" found entry " +
                            entry.getName());
                }
                */
                // Works (always)!
                ZipEntry ze ;

                while (null != (ze = z.getNextEntry())) {
                    Date dt = new Date(ze.getTime());
                    System.out.println(
                            ze.getSize() + " "
                            + dt.toString() + " "
                            + ze.getName());
                }
                z.close();
            }
            else {
                System.err.println("can't find " + f);
            }
        }
        catch (Exception e) {
            // For testing purposes, just catch and report all exceptions.

            e.printStackTrace();
        }
    }
}
 
Chris Uppal

Joseph said:
I am very glad to see your application is now working, but I believe
it to be only a workaround. Modifying Mr. Wheeler's sample code to use
ZipInputStream instead of ZipFile avoids the exception from the native
ZipFile.open(), and ZipInputStream seems to be able to deal with the
missing comment correctly. (Modified version follows.)

My concern is that there is a bug in ZipFile.open(); I believe
it should be able to handle the same data that ZipInputStream does.

I don't think so. As I tried to explain in my earlier post, ZipFile uses the
table of contents at the end of the zip file, and hence is unhappy to find that
the table is corrupt. The ZipInputStream, OTOH, makes no use of the table (in
fact it never even reads that far in the input), and so is unfazed.

The point is that there are two *completely* different ways of accessing the
Zip file structure -- it's designed to work for either sequential access *or*
random access, and has two different internal data-structures to support the
two different patterns. The padding (and stripping off the padding) damaged
one of these structures, but left the other untouched. By a happy chance, the
undamaged structure was the one that was best suited to Gerry's application.

java.util.zip.ZipFile and java.util.zip.ZipInputStream are the APIs corresponding
to the two different access methods. It would be much better if the documentation
*explained* that and also explained what the algorithms were (in this case it
is *not* an "implementation detail") and the tradeoffs between them, but...

-- chris
 
Joseph Dionne

Chris said:
Joseph Dionne wrote:

[snip]
My concern is that there exists a bug in the ZipFile.open(), believing
it should be able to handle the same data the ZipInputStream does.


I don't think so. As I tried to explain in my earlier post, ZipFile uses the
table of contents at the end of the zip file, and hence is unhappy to find that
the table is corrupt. The ZipInputStream, OTOH, makes no use of the table (in
fact it never even reads that far in the input), and so is unfazed.

And because of this "fact," one can work around the differences in
behavior between ZipFile and ZipInputStream. This is a good thing.

The point is that there are two *completely* different ways of accessing the
Zip file structure -- it's designed to work for either sequential access *or*
random access, and has two different internal data-structures to support the
two different patterns. The padding (and stripping off the padding) damaged
one of these structures, but left the other untouched. By a happy chance, the
undamaged structure was the one that was best suited to Gerry's application.

Agreed, while the manner in which ZipFile and ZipInputStream approach the
compressed data is different, is it not true that a bad zip file is
always a bad file? If unzip, zipinfo, jar, and other zip file commands
deal with the NWS zip file/stream, would it not be consistent for
ZipFile and ZipInputStream to both do the same, accept or reject it?

java.util.zip.ZipFile and java.util.zip.ZipInputStream are the APIs corresponding
to the two different access methods. It would be much better if the documentation
*explained* that and also explained what the algorithms were (in this case it
is *not* an "implementation detail") and the tradeoffs between them, but...

The fact that the behavior of ZipFile and ZipInputStream differs causes
the dilemma of knowing when to use one over the other. Since I know
this behavior exists, I for one will never again use ZipFile, creating
instead a MyZipFile class that incorporates the logic I know to work
(see the working example in previous post). If developers regularly
using zip files from outside sources do the same, has not the usefulness
of ZipFile been diminished to the point of irrelevancy and, even worse,
continued confusion? IMHO.

I guess the real question is "when is a bug a bug?" M$ has answered
this question many times by saying "when we say so."
 
Andrew Thompson

So, to clarify (for my own understanding)

ZipFile, uses TOC at end, allows random access - fast

ZipInputStream, goes through entries sequentially,
not use TOC - slower

Is that right?
[ OK an oversimplification,
but on the right track? ]

.....
Agreed, while the manner in which ZipFile and ZipInputStream approach the
compressed data is different, is it not true that a bad zip file is
always a bad file? If unzip, zipinfo, jar, and other zip file commands
deal with the NWS zip file/stream, would it not be consistent for
ZipFile and ZipInputStream to both do the same, accept or reject it?

No. It may be that those tools use
the sequential method, or that they
try the TOC first, and fall back to
doing a sequential read if need be.

Yep, I agree with that.

Sometimes I think the high level classes
offered by Sun obscure too much of what
is actually happening.

Not that I'd want to have to code all
that stuff myself.. ;-)
 
Joseph Dionne

Andrew said:
So, to clarify (for my own understanding)

ZipFile, uses TOC at end, allows random access - fast

ZipInputStream, goes through entries sequentially,
not use TOC - slower

Is that right?
[ OK an oversimplification,
but on the right track? ]

....
Agreed, while the manner in which ZipFile and ZipInputStream approach the
compressed data is different, is it not true that a bad zip file is
always a bad file? If unzip, zipinfo, jar, and other zip file commands
deal with the NWS zip file/stream, would it not be consistent for
ZipFile and ZipInputStream to both do the same, accept or reject it?


No. It may be that those tools use
the sequential method, or that they
try the TOC first, and fall back to
doing a sequential read if need be.

And here is where ZipFile should be "fixed," if one agrees that the
inability to open a (debatably) "good" zipfile is a bug. If the TOC is
invalid, or corrupt, the random access method should throw an exception,
e.g. "ZipException: TOC not available" or better yet, the TOC should be
created from the file, since it is an existing file and not a stream.
When ZipFile fails to merely open a zipfile, such as the one provided by NWS,
the implication is that the entire file is bad, i.e. not a zip file, not just
a lesser part of the zipfile.

[snip]
 
Andrew Thompson

Andrew Thompson wrote: ....

And here is where ZipFile should be "fixed," if one agrees that the
inability to open a (debatably) "good" zipfile is a bug. If the TOC is
invalid, or corrupt, the random access method should throw an exception,
e.g. "ZipException: TOC not available"

A better description in the exception
would be a good idea, combined with
better documentation of the methods.

...or better yet, the TOC should be
created from the file, since it is an existing file and not a stream.

I disagree.

I have an 11 Meg file on the net
at the moment. Let's say I want to
pull any _particular_ single file out
for inspection (as I do right here
<http://www.physci.org/source.jsp>).

IIUC, the random access method (ZipFile)
would be far better for that. In the event
that the TOC becomes corrupted, I would prefer
to have the program throw exceptions, which
can then allow me to catch them and proceed in
the best manner.

For a file on the local filesystem, I might
be tempted to simply revert to the ZIS method,
it would not matter if it required the entire
11 Meg of the file be read.
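
A rough sketch of that "catch the exception, then revert to the ZIS method" idea for a local file follows. The names ZipLister and listNames are invented for the example, and it only lists entry names; whether a particular damaged file actually makes the ZipFile constructor throw is up to the JDK in use.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

// Hypothetical helper: try the TOC first, fall back to a sequential read.
class ZipLister {
    static List<String> listNames(File f) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(f)) {
            // Fast path: the table of contents is readable.
            Enumeration<? extends ZipEntry> e = zf.entries();
            while (e.hasMoreElements()) {
                names.add(e.nextElement().getName());
            }
            return names;
        } catch (ZipException badToc) {
            // The TOC could not be used; the caller could also choose to
            // rethrow here instead of falling back.
        }
        names.clear();
        // Slow path: walk the entries from the start of the file.
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream(f))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : listNames(new File(args[0]))) {
            System.out.println(name);
        }
    }
}
```

The fallback reads the whole file, which is the cost mentioned above, so it suits a local file far better than an 11 Meg file on a server.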

The server is a different matter, and I would
probably remove the page that accesses the zip
until I could get a fresh zip with a valid
TOC uploaded.

Now.. you might reason..

OK, so they should allow fall-back to sequential
_unless_ the programmer specifically requests
FAIL_ON_BAD_TOC.

Yes they _could_ do that, but then that would
be just as, if not more, reliant on effective
documentation, which brings us straight back to
the concern expressed by chris.

The existing docs are not good enough, if
they were, this problem would simply be
'options'. If you need to add further
back-up methods etc., that would seem to
complicate matters, as _those_ methods
would then require effective documentation.

By ZipFile failing to merely open a zipfile, such as provided by NWS,
the implication is the entire file is bad, i.e. not a zip file, not just
a lesser part of the zipfile.

I am not sure what you mean by 'implication'
here, but feel that any vagueness could be
removed with better documentation and error
information.
 
Joseph Dionne

Andrew said:
A better description in the exception
would be a good idea, combined with
better documentation of the methods




I disagree.

I have an 11 Meg file on the net
at the moment. Let's say I want to
pull any _particular_ single file out
for inspection (as I do right here
<http://www.physci.org/source.jsp>).

Forgive my ignorance, but I believe my http client app cannot open a zip
file via the Internet with ZipFile, and thus would need to access it via a
ZipInputStream. So, this reason does not refute my premise, that on a
local filesystem, ZipFile and ZipInputStream should behave similarly.

Upon further review, ZipInputStream cannot even identify whether a stream
is a zip file; for a non-zip stream it simply returns no entries. So I, as
an application developer, cannot even tell whether I have a valid zip file
that is simply empty.

The server is a different matter, and I would
probably remove the page that accesses the zip
until I could get a fresh zip with a valid
TOC uploaded.

As the webmaster, that is something you would do for any failing link,
and rightly so. Yes, the TOC is mandatory, and I assume the zip files
on your website were either created by you, or at least validated by you
prior to being placed on your servers. So, again, this argument does not
address the case at hand, i.e. downloading a zip file created by another
entity, and using pure Java to read what the app expected as a valid zip.

Now.. you might reason..

OK, so they should allow fall-back to sequential
_unless_ the programmer specifically requests
FAIL_ON_BAD_TOC.

Yes they _could_ do that, but then that would
be just as, if not more, reliant on effective
documentation, which brings us straight back to
the concern expressed by chris.

The existing docs are not good enough, if
they were, this problem would simply be
'options'. If you need to add further
back-up methods etc., that would seem to
complicate matters, as _those_ methods
would then require effective documentation.

Perhaps it is just my feeling, but I find most Java docs provide only
the bare minimum of information. Thank goodness for newsgroups and the
community sharing our learned experiences. Sometimes, using Java is
very much like using M$ VC API -- keep trying things until one of them
gets the desired effect.

I am not sure what you mean by 'implication'
here, but feel that any vagueness could be
removed with better documentation and error
information.

Obviously a zip file/stream has certain "milestones" that must meet
certain reasonable data checks. Using the modified version of the
TestZipFile sample in a previous post, I get the same behavior reading
an empty .zip file as I would get reading a non-.zip file, e.g. a .java
file. ZipFile should do the milestone checks, best guess
checks, that the file is indeed a zip file, returning an open error if
the file is obviously not a .zip file, or returning a ZipFile object if
it is valid. If the TOC is not available, throw an exception from the
random access method, allowing me, the application developer, the choice
of whether to continue or not.

As it stands now I have no way to "know" if the .zip file is valid or
not. As I said, an invalid .zip file and a non .zip file behave the same
using ZipInputStream.

Sir, I don't wish to belabour this point, but I believe the "hard work"
should be on the language developers, not those using their language. I
apply this same philosophy to my applications -- I get paid the big
bucks to ease the burdens of my users, not to add to their miseries.
 
Roedy Green

The other way of reading a Zip file is to start at the beginning and iterate
over each element, ignoring the table of contents. To do that in Java you use
a java.util.zip.ZipInputStream. ZipInputStream is a rather weird class
(because of the weird nature of the Zip file format).

One thing you have to watch out for is Java created Zip files are
incomplete. Not all the fields are filled in. This is because it does
no back patching to the embedded index in its strictly sequential
creation pass.

see http://mindprod.com/jgloss/zip.html for details.
 
Chris Uppal

Andrew said:
So, to clarify (for my own understanding)

ZipFile, uses TOC at end, allows random access - fast

ZipInputStream, goes through entries sequentially,
not use TOC - slower

Is that right?
[ OK an oversimplification,
but on the right track? ]

That's right.

It's worth adding that the thing you are trading off for speed is the
ability to read or write the Zip format without arbitrarily seeking in
the file. As a case in point, the OP's application was reading the data
off the network, so it's natural to consume it as a single forward-only
scan.
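
To make the two access patterns concrete, here is a small side-by-side sketch that fetches one named entry each way. ZipAccess, readRandom, readSequential and drain are names invented for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

// Hypothetical helper contrasting random access with a forward-only scan.
class ZipAccess {
    /** Random access: needs a seekable file, jumps to the entry via the TOC. */
    static byte[] readRandom(File zip, String name) throws IOException {
        try (ZipFile zf = new ZipFile(zip)) {
            ZipEntry e = zf.getEntry(name);
            if (e == null) {
                return null;
            }
            try (InputStream in = zf.getInputStream(e)) {
                return drain(in);
            }
        }
    }

    /** Sequential: works on any stream, but passes over every earlier entry. */
    static byte[] readSequential(InputStream raw, String name) throws IOException {
        ZipInputStream zis = new ZipInputStream(raw);   // caller closes raw
        ZipEntry e;
        while ((e = zis.getNextEntry()) != null) {
            if (e.getName().equals(name)) {
                return drain(zis);   // reads up to the end of this entry only
            }
        }
        return null;
    }

    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int got;
        while ((got = in.read(buffer)) >= 0) {
            out.write(buffer, 0, got);
        }
        return out.toByteArray();
    }
}
```

readSequential takes a plain InputStream, so it would work equally well on a socket or on data reassembled from packets, whereas readRandom needs a file it can seek in.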

-- chris
 
Chris Uppal

Roedy said:
One thing you have to watch out for is Java created Zip files are
incomplete. Not all the fields are filled in. This is because it does
no back patching to the embedded index in its strictly sequential
creation pass.

Agreed about the facts.

I'm not sure that I agree with the phrasing though. The files it creates are
not "incomplete"; all the data is there, and it is in full conformance with the
spec.

However the *way* that it's written can confuse a badly-written decoder since
the information on a file's size, compressed size, and CRC is placed *after*
the file. A sufficiently badly-written (or deliberately incomplete) decoder
might have difficulty coping with that. I.e. the spec is *designed* to allow
Zip format data to be written without back-patching, but some broken readers
may not be able to cope.
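
A small experiment can show the effect from the reading side. In the sketch below (DescriptorDemo and its methods are made-up names), a Zip is written with ZipOutputStream without setting the size or CRC in advance, then read back with ZipInputStream. The expectation is that getSize() reports -1 (unknown) straight after getNextEntry() and the real size only once the entry's data has been read to its end, because the figures follow the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo of sizes that are only known after the entry data.
class DescriptorDemo {
    /** Writes a one-entry Zip in a single forward pass (no back-patching). */
    static byte[] makeZip(String name, byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry(name));   // size and CRC not set here
            zos.write(payload);
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    /** Returns { size before reading the data, size after reading it }. */
    static long[] sizesBeforeAndAfter(byte[] zip) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry e = zis.getNextEntry();
            long before = e.getSize();
            byte[] buffer = new byte[1024];
            while (zis.read(buffer) >= 0) {
                // discard the data; we only want to reach the end of the entry
            }
            long after = e.getSize();
            return new long[] { before, after };
        }
    }

    public static void main(String[] args) throws IOException {
        long[] sizes = sizesBeforeAndAfter(makeZip("a.txt", "hello".getBytes("UTF-8")));
        System.out.println("before: " + sizes[0] + ", after: " + sizes[1]);
    }
}
```

This would also explain odd sizes printed by a loop that calls getSize() immediately after getNextEntry() without reading the entry.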

AFAIK, the only reader software that *does* suffer from that inability is Sun's
own implementation. At least, I've seen comments to that effect on this
NG -- I haven't actually tested it myself.

-- chris
 
Chris Uppal

Joseph said:
And here is where ZipFile should be "fixed," if one agrees that the
inability to open a (debatably) "good" zipfile is a bug. If the TOC is
invalid, or corrupt, the random access method should throw an exception,
e.g. "ZipException: TOC not available"

That's exactly what does happen (though the exception could be clearer). If
you try to use the ToC you get an exception. If you don't try to use the ToC
(i.e., if you use ZipInputStream) then you don't. What could be simpler?

or better yet, the TOC should be
created from the file, since it is an existing file and not a stream.

You can't. The ToC contains data that is not included in the body of the file
(which is a fairly major design error IMO -- presumably something to do with
maintaining compatibility as the format has evolved).
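
One example of ToC-only data is the per-entry comment. The sketch below (TocOnlyData and its method names are invented for illustration) writes an entry with a comment and then looks for it both ways; the sequential reader has no way to see it, while ZipFile, which reads the ToC, does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo: an entry comment is visible via the ToC only.
class TocOnlyData {
    static byte[] zipWithComment(String name, String comment) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            ZipEntry e = new ZipEntry(name);
            e.setComment(comment);
            zos.putNextEntry(e);
            zos.write("data".getBytes("UTF-8"));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    /** Comment of the first entry as seen by a forward-only read. */
    static String commentViaStream(byte[] zip) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry e = zis.getNextEntry();
            return e == null ? null : e.getComment();
        }
    }

    /** Comment of the first entry as recorded in the ToC. */
    static String commentViaToc(File f) throws IOException {
        try (ZipFile zf = new ZipFile(f)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            return entries.hasMoreElements() ? entries.nextElement().getComment() : null;
        }
    }
}
```

So a ToC rebuilt purely from the body of the file would silently lose such fields.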

It might help if you thought of the class java.util.zip.ZipFile as being named
java.util.zip.ZipFileIndex -- in many ways that would be a better name, since it
is really the index that is represented, not the whole file.

Of course, there could be an option to "fix" (heuristically) broken ZIP files.
Some of the ZIP applications do have options to do such things. There's no
obvious reason why the Java library should provide such options, though,
anymore than it provides an option to try to recover compressed data that has
been corrupted by a CR <-> CR/LF conversion.

-- chris
 
Chris Uppal

Joseph said:
As it stands now I have no way to "know" if the .zip file is valid or
not. As I said, an invalid .zip file and a non .zip file behave the same
using ZipInputStream.

I haven't tested this myself, but I'll take your word for it. That is indeed
a deficiency (or perhaps a bug -- fairly major either way) in the
implementation of ZipInputStream.

It should either read until it finds a correctly formatted entry representing
the start of the ToC (and hence the end of the real data), and then return
null, or throw an exception indicating that the data is corrupt
beyond some point (from the very start of the data if it isn't in ZIP format).
It should *not* fail silently.

-- chris
 
Chris Uppal

Joseph said:
And because of this "fact," one can work around the differences in
behavior between ZipFile and ZipInputStream. This is a good thing.

The way you've put "fact" in quotes suggests to me that you don't understand
the way the ZIP file format is designed. It is not an accident of
implementation, it's a feature of the format's design that is very properly
reflected in the Java classes that implement that design.

Agreed, while the manner in which ZipFile and ZipInputStream approach the
compressed data is different, is it not true that a bad zip file is
always a bad file?

No.

The ZIP format is a *data format* not a file format. It is designed to support
a "stream like" approach, where the only data that an application has is what
it has read *so far* (off the network, or from a tape, or whatever), and so it
*has* to be able to interpret what it has seen without waiting for the end of
the input. Hence "errors" in the file that occur after any given entry are
*irrelevant* to that entry, and hence irrelevant to any application that is
doing a forward-only pass over the data.

Other applications don't restrict themselves to forwards-only, but allow
themselves to "know" that the data is held in a seekable format (such as a normal
disk file). Such applications, and such applications *only*, will use the
table-of-contents and will be sensitive to errors in it.

If unzip, zipinfo, jar, and other zip file commands
deal with the NWS zip file/stream, would it not be consistent for
ZipFile and ZipInputStream to both do the same, accept or reject it?

No.

Those programs are only some of the applications of the ZIP format. Any
individual one may use only a fraction of the power in the format, or they may
try to expose all the power. I don't know which do and which don't. One way
to see is to ask which of them can read/write ZIP-encoded data to/from a tape.
Any that can't must be relying on the random-access features of hard-disk
files. Equally you can check to see how efficient they are at retrieving data
from an entry near the end of a big ZIP file. Any that are slow are presumably
doing a forward-only scan and failing to make use of the random-access
features.

The Java classes expose both feature-sets.

If you want life to be simple, and for there just to be an easy-to-use and
easy-to-understand class library, then either you will have to drop some of the
features (a bad thing) or you will have to design something that is better than
the Sun-supplied stuff.

Take it from me, that's hard. I've recently been going through the exercise
of designing a class library for manipulating ZIP formatted data (in a
different language) and it is *not easy* to find a workable compromise, let
alone something that is both simple and comprehensive.

I don't particularly admire the design Sun have come up with (but then I don't
think much of the design of the ZIP format either). But it is genuinely
difficult to do better (my own attempt uses more layering).

I think that the real problems are that:

a) not everyone realises what the ZIP format is *for* (they've only ever used
it for files).

b) the documentation is *terrible*.

The fact that the behavior of ZipFile and ZipInputStream differ causes
the dilemma of knowing when to use one over the other. Since I know
this behavior exists, I for one will never again use ZipFile, creating
instead a MyZipFile class that incorporates the logic I know to work
(see the working example in previous post).

No, no, no, no, no. NO ! You have it all wrong.

The two ZIP classes in the Java library are intended for different purposes.
In some cases there's an overlap and either could be used (albeit with
different performance tradeoffs). In most cases one is at least clearly
preferable to the other; and in some cases only one or the other can possibly
be used. It's *your* job, as a programmer, to understand the issues and make
an intelligent choice based on your understanding. To the extent that you
don't understand the issues then you are not doing your job properly (and to
the extent that that is the fault of the Sun documentation -- 100% I suspect --
the Sun programmers weren't doing their job properly either).

I guess the real question is "when is a bug a bug?" M$ has answered
this question many times by saying "when we say so."

<grin>

Either of the two classes may have bugs or deficiencies, of course, in any
given implementation. But bugs are not the issue here. For the data under
discussion, both classes (in the JDK 1.4.2 implementation) are working
perfectly.

-- chris
 
Gerry Wheeler

As a case in point, the OP's application was reading the data off the
network, so it's natural to consume it as a single forward-only scan.

It's more complicated than that, but you're basically correct. In my
particular application, the data arrives in formatted packets, possibly
intermingled with packets of other NWS products. So I have to save all the
packets to a file before starting the decompression.

Nevertheless, your point is well-made. An application could unzip a file
as it arrived in a stream if necessary, and ZipInputStream facilitates
that.
 
