java.util.zip Limitations

Andrew Thompson

ZipFile uses the TOC at the end of the file and allows random access - fast.

ZipInputStream goes through the entries sequentially and does
not use the TOC - slower...
...on the right track?
...
It's worth adding that the thing you are trading off for speed is the
ability to read or write the Zip format without arbitrarily seeking in
the file. As a case in point, the OP's application was reading the data
off the network, so it's natural to consume it as a single forward-only
scan.

A very good point Chris. When I was doing
one computer course long ago the person running
it asked whether HD or _magnetic tape_ was the
quickest to 'read a file'.

Of course, we all incorrectly guessed HD,
the instructor then proceeded to demonstrate
that the huge file, sequential read from the
tape drive, was actually a tad quicker than
reading off HD.

The moral of that story: choose the data storage medium
appropriate to the circumstance and intended use.
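
The trade-off discussed above can be sketched with the two standard classes. This is a minimal illustration, not code from the thread: the class name ZipAccessDemo, its method names, and the sample entry names are all made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; only the java.util.zip calls are real API.
class ZipAccessDemo {

    // Random access: ZipFile reads the TOC (central directory) and then
    // goes straight to the requested entry. Needs a seekable file.
    static String readOneEntry(Path zipPath, String name) throws IOException {
        try (ZipFile zf = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zf.getEntry(name);
            if (entry == null) {
                return null; // no such entry in the TOC
            }
            try (InputStream in = zf.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    // Sequential access: ZipInputStream walks the local headers in stream
    // order and never consults the TOC. Works on any InputStream.
    static List<String> listEntriesSequentially(InputStream source) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(source)) {
            for (ZipEntry e; (e = zin.getNextEntry()) != null; ) {
                names.add(e.getName());
            }
        }
        return names;
    }

    // Builds a small two-entry archive in memory for experimenting.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry("a.txt"));
            zout.write("first".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry("b.txt"));
            zout.write("second".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        }
        return bytes.toByteArray();
    }
}
```

The first method only makes sense when the whole archive is already on disk; the second is the one that fits data arriving off a network connection.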
 
Chris Uppal

Gerry said:
In my
particular application, the data arrives in formatted packets, possibly
intermingled with packets of other NWS products. So I have to save all the
packets to a file before starting the decompression.

Well, you don't /have/ to. It sounds easy enough to create a custom
InputStream that pulls the "packets" off the network (passing intermingled data
off to other handling) and provides the zipped data to the ZipInputStream on
demand. Not that it's necessarily /worth/ doing that -- my guess is that it
wouldn't be -- it depends on the rest of your app (and on your "style" as a
programmer).
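
A rough sketch of such a custom InputStream follows. The packet layout is entirely an assumption made for illustration (one type byte, a two-byte big-endian length, then the payload); the real NWS packet format is not described in this thread. The class name PacketDemuxInputStream and the ZIP_PACKET constant are likewise hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Consumer;

// Hypothetical demultiplexer. ASSUMED packet layout (not from the thread):
//   1 byte type, 2 bytes big-endian payload length, then the payload.
class PacketDemuxInputStream extends InputStream {
    static final int ZIP_PACKET = 1; // assumed type code for zipped data

    private final DataInputStream packets;
    private final Consumer<byte[]> otherHandler; // receives non-zip payloads
    private byte[] current = new byte[0];
    private int pos = 0;
    private boolean eof = false;

    PacketDemuxInputStream(InputStream raw, Consumer<byte[]> otherHandler) {
        this.packets = new DataInputStream(raw);
        this.otherHandler = otherHandler;
    }

    // Reads packets until zip bytes are available; returns false at end of input.
    private boolean fill() throws IOException {
        while (!eof && pos >= current.length) {
            int type = packets.read();
            if (type < 0) {
                eof = true;
                break;
            }
            byte[] payload = new byte[packets.readUnsignedShort()];
            packets.readFully(payload);
            if (type == ZIP_PACKET) {
                current = payload;
                pos = 0;
            } else {
                otherHandler.accept(payload); // pass intermingled data elsewhere
            }
        }
        return pos < current.length;
    }

    @Override
    public int read() throws IOException {
        return fill() ? (current[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        int n = Math.min(len, current.length - pos);
        System.arraycopy(current, pos, buf, off, n);
        pos += n;
        return n;
    }
}
```

Wrapping it as new ZipInputStream(new PacketDemuxInputStream(socketStream, handler)) would then let the entries be read as the packets arrive, with no intermediate file.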

-- chris
 
Joseph Dionne

Chris said:
Joseph Dionne wrote:




That's exactly what does happen (though the exception could be clearer). If
you try to use the ToC you get an exception. If you don't try to use the ToC
(i.e., if you use ZipInputStream) then you don't. What could be simpler?

I will avoid the temptation to write "no" repeatedly <grin>. Actually,
what happens when a zip file of concern to the OP, one without a valid
TOC, is "opened" by ZipFile is that it fails in the native open method
of ZipFile, preventing one from getting the exception you mentioned.
Perhaps a corruption in the zip file TOC, a few bad bytes, might
generate the exception mentioned. I will attempt to verify this
possibility.
You can't. The ToC contains data that is not included in the body of the file
(which is a fairly major design error IMO -- presumably something to do with
maintaining compatibility as the format has evolved).

Perhaps "you can't," however I can. Admittedly, some data will need to
be supplied, and as Mr. Thompson has previously noted, this should not
be an automatic feature applied to all files, like his large zip
repository driving his website. For some, however, like the OP,
this would be a useful feature.
It might help if you thought of the class java.util.zip.ZipFile as being named
java.util.zip.ZipFileIndex -- in many ways that would be a better name, since it is
really the index that is represented, not the whole file.

Of course, there could be an option to "fix" (heuristically) broken ZIP files.
Some of the ZIP applications do have options to do such things. There's no
obvious reason why the Java library should provide such options, though,
any more than it provides an option to try to recover compressed data that has
been corrupted by a CR <-> CR/LF conversion.

-- chris

Zip file technology has been evolving for some twenty years. I have had
the (dis)pleasure of using every evolution over that same period, at
times a very painful experience. My intent in this discussion is to
head off a repeat of history. Jar is the only zip utility that cannot
read a stream, i.e. "jar tvf - < file.ZIS" is invalid syntax.

Again, reading a stream, one would use ZipInputStream and not ZipFile,
and not take advantage of the TOC. However, identifying a non-zip file
is not based on the presence or lack of a valid TOC. Run unzip on one of
your .java files, and it will report that the file "might not" be a zip
file. I am sure there is no TOC in your .java file. Yet run unzip on
the OP's .ZIS file, and you will find it is accepted as a
valid zip file. This means that unzip uses other criteria before giving
up. I am only suggesting that the technology learned over the last
twenty years regarding zip files should be part of Java's zip classes,
at a minimum.

As a final note, my thanks to all who have posted. It is only my hope
that in continuing this conversation we as developers ask, nay
demand, better from the developers of the languages we use, just like
our customers ask/demand more from us. I for one want to spend time
creating new applications, not finding the flaws of the languages I
selected to create those applications.
 
Chris Uppal

Joseph said:
I will avoid the temptation to write "no" repeatedly <grin>.

I'm glad you took all those "no"s as I had intended. I realised afterwards
that I should at least have scattered a few smilies around -- it could have been
read as hostile when I only meant it jokingly.

Actually,
what happens when a zip file of concern to the OP, one without a valid
TOC, is "opened" by ZipFile is that it fails in the native open method
of ZipFile, preventing one from getting the exception you mentioned.
Perhaps a corruption in the zip file TOC, a few bad bytes, might
generate the exception mentioned. I will attempt to verify this
possibility.

I think it's quite likely that it doesn't realise that there /is/ a ToC at
all. I know that if I use my own code to try to open a Zip file with extra
stuff added on at the end, or truncated, then that's what happens. The reason
sheds an interesting side-light on the quality of design in the ZIP format
;-) The problem is that the "starting point" for finding the ToC is to
find the "central directory [=ToC] end record". That entry is at the end of
the file, but it is of variable size. And you can only work out how big it is
by finding its start. Stupid, eh? So, what you have to do is scan backwards
through the file looking for a special marker-sequence of four bytes. If you
find one (in the last 64K or so), then you have to verify that it isn't just an
accidental occurrence of that pattern. There are various tests you can make to
see if the apparent "end record" is plausible, and one of the easiest and most
obvious is that it does indeed extend to exactly the end of the file. If the
file has been added-to or truncated, then the "record" will be deemed invalid,
and -- this is the important point -- the search will continue looking further
backwards in the file.

Since the search never finds an acceptable "end record", the software can only
conclude that it's not reading a Zip file at all, rather than thinking it's got
a corrupt one. (I'm considering modifying my own implementation so that it
can optionally be told to use the "best available candidate" if none is
perfect -- though I'm not yet convinced that's a feature I'll need).
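
The backward scan described above can be sketched roughly as follows. This is an illustration, not Chris's implementation (which was not posted) and not what the JDK does internally; the class name EndRecordFinder is hypothetical. The constants come from the published ZIP format: the end record starts with the bytes "PK\005\006", has a fixed 22-byte part, and ends with a comment whose 2-byte length is stored at offset 20. Only the "extends exactly to the end of the file" plausibility test is applied here.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Path;

// Hypothetical helper illustrating the search for the end record.
class EndRecordFinder {
    static final int EOCD_SIGNATURE = 0x06054b50; // "PK\005\006", little-endian
    static final int EOCD_MIN_SIZE = 22;          // fixed part of the record
    static final int MAX_COMMENT = 0xFFFF;        // comment length is 16 bits

    // Returns the file offset of a plausible end record, or -1 if none is found.
    static long findEndRecord(Path path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            long fileLen = raf.length();
            if (fileLen < EOCD_MIN_SIZE) {
                return -1;
            }
            // Only the last 64K or so can contain the start of the record.
            int tailLen = (int) Math.min(fileLen, EOCD_MIN_SIZE + MAX_COMMENT);
            byte[] tail = new byte[tailLen];
            raf.seek(fileLen - tailLen);
            raf.readFully(tail);
            ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);

            // Scan backwards for the four-byte marker.
            for (int i = tailLen - EOCD_MIN_SIZE; i >= 0; i--) {
                if (buf.getInt(i) != EOCD_SIGNATURE) {
                    continue;
                }
                int commentLen = buf.getShort(i + 20) & 0xFFFF;
                // Plausibility test: the record must end exactly at end of file.
                if (i + EOCD_MIN_SIZE + commentLen == tailLen) {
                    return fileLen - tailLen + i;
                }
                // Otherwise treat it as an accidental match and keep looking.
            }
            return -1;
        }
    }
}
```

With this test, a file that has been appended to or truncated yields -1, which is why a strict reader reports "not a Zip file" rather than "corrupt Zip file".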

Perhaps "you can't," however I can. Admittedly, some data will need to
be supplied, and as Mr. Thompson has previously noted, this should not
be an automatic feature applied to all files, like his large zip
repository driving his website. For some, however, like the OP,
this would be a useful feature.

As above, I'm not really convinced that the ability to repair damaged files
should be part of the standard API for manipulating those files. It seems to
me that that is better handled by a separate tool. Or at least a separate API.
For one reason, it's less complicated for the developer/user to learn. For
another the "forgiving" parser probably won't share a lot of code with the
"correct" parser.

Zip file technology has been evolving for some twenty years. I have had
the (dis)pleasure of using every evolution over that same period, at
times a very painful experience.

Displeasure is right. I've not been using ZIP stuff for as long as you, but I
remember when I first started programming on DOS / early Windows and being
gob-struck at how utterly brain-dead the ZIP utilities were after I'd been used
to *NIX tar and compress.

I am only suggesting that the technology learned over the last
twenty years regarding zip files should be part of Java's zip classes,
at a minimum.

I think I understand your point. But I don't really agree with it.

If an API is defined with abilities roughly like the Zip utilities (i.e. making
best efforts to retrieve data, even if it means falling back to another access
method; pretending that it is possible to replace one file in a Zip file as a
single operation, and so on) then I think that should be a higher-level API,
and separate from the relatively low-level APIs that exist. As far as I can
see, the current low-level APIs are just about sufficient to allow Sun to
create a higher level one on top of them[*]. Perhaps you are right and some
such API should be added to the existing stuff (I can certainly think of uses
for it), but it should be separate from, and in addition to, the current stuff.
I would hate to see it all rolled-up into one big ball where I didn't have
control of (and perhaps didn't even know) what the code is doing.

-- chris

[*] One big nuisance is that the natural name for such a high level class would
be java.util.zip.ZipFile, which is already in use :-(
 
Joseph Dionne

Chris Uppal wrote:

{snip}

Sorry my response is so delayed, I've been heads down in a project without
time to play. At times the theoretical must make room for the practical.

All your points are well taken, and in keeping with OOP and Java
principles. Perhaps this is where my issue derives -- I have
fundamental disagreements with OOP and the direction Java is taking. As
a C language programmer from K&R forward, I love the simplicity of that
language, believing less is more.

With Java, the "API" is so large, and growing, it is approaching the
point where effective use will be diminished because of its immense
size. So with that said, I shall address your following points.
If an API is defined with abilities roughly like the Zip utilities (i.e. making
best efforts to retrieve data, even if it means falling back to another access
method; pretending that it is possible to replace one file in a Zip file as a
single operation, and so on) then I think that should be a higher-level API,
and separate from the relatively low-level APIs that exist.

While I agree in principle with this, utilities build up, from the
lower, simpler components to higher, more advanced functionality.
Since opening a file is, IMHO, the simplest of operations, a higher
level utility to "repair" a zip file would still need to open that same
zip file. (Remember that the zip file in question is still valid, just
without one valid structure, the TOC.) However, ZipFile cannot be used
to open this file, so a complete replacement of ZipFile is needed, not
an extension of ZipFile.
As far as I can
see, the current low-level APIs are just about sufficient to allow Sun to
create a higher level one on top of them[*].

The functionality I added in my myZipFile class, using InputStream and
ZipInputStream, has the ability to 1) differentiate between an invalid zip
file, for example myZipFile.java, and one such as the OP's zip file
missing the TOC, and 2) simulate the functionality of ZipFile, less
direct access of course (which I intend to add in the future). The
replacement is functional, and demonstrates my point that the native
method ZipFile.open() could do the same, thus making ZipFile more
useful, not less. (Look mom, no utilities!)
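
The kind of distinction described in point 1) can be sketched with the existing classes. This is not the myZipFile code, which was not posted; the class ZipProbe, the Kind enum and the classify method are hypothetical names. It relies on two behaviours of the standard library: ZipFile throws a ZipException when it cannot locate the TOC, and ZipInputStream.getNextEntry() returns null when the data does not begin with a local entry header.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

// Hypothetical probe that sorts files into three groups.
class ZipProbe {
    enum Kind {
        NOT_ZIP,     // no TOC and no readable first entry
        STREAM_ONLY, // entries readable sequentially, but ZipFile rejects it
        INDEXED      // ZipFile can open it via the TOC
    }

    static Kind classify(Path path) throws IOException {
        // First try the TOC-based reader.
        try (ZipFile zf = new ZipFile(path.toFile())) {
            return Kind.INDEXED;
        } catch (ZipException e) {
            // No usable TOC; fall through and try the sequential reader.
        }
        // Then see whether at least one entry header can be read in sequence.
        try (ZipInputStream zin = new ZipInputStream(Files.newInputStream(path))) {
            return zin.getNextEntry() != null ? Kind.STREAM_ONLY : Kind.NOT_ZIP;
        } catch (ZipException e) {
            return Kind.NOT_ZIP; // header present but malformed
        }
    }
}
```

Only the first entry header is checked here, so STREAM_ONLY means "looks like a zip stream", not "every entry is intact".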
Perhaps you are right and some
such API should be added to the existing stuff (I can certainly think of uses
for it), but it should be separate from, and in addition to, the current stuff.
I would hate to see it all rolled-up into one big ball where I didn't have
control of (and perhaps didn't even know) what the code is doing.

I see no loss of control because ZipFile.open() can detect a truly bad
zip file, or open the OP's limited zip file. And, by extending ZipFile,
the OOP/Java way, to create your higher level utility, control is once
again restored -- you could fail to open a zip file that does not
contain a valid TOC, for instance, by throwing an exception.

It remains my opinion that as a "user" of the Java language, I have the
right to "demand" reasonable behavior from the language developers, just
like my customers/users demand reasonable behavior from my work.
Perhaps I am just getting too picky in my old age, but when I began in
software development, the concept of open source, community developed
OSes and languages was just that -- a concept many of us desired. Now
that these concepts are being realized, I wish to push the envelope
further, by raising the awareness of language developers that their
"customers" are we who use the fruits of their labor.

joseph
 
Chris Uppal

Joseph said:
Sorry my response is so delayed, I've been heads down in a project without
time to play. At times the theoretical must make room for the practical.

It's Good to eat...

All your points are well taken, and in keeping with OOP and Java
principles. Perhaps this is where my issue derives -- I have
fundamental disagreements with OOP and the direction Java is taking. As
a C language programmer from K&R forward, I love the simplicity of that
language, believing less is more.

I agree that Java is over-complex, but I suspect that we are thinking of
different things...

With Java, the "API" is so large, and growing, it is approaching the
point where effective use will be diminished because of its immense
size.

Yes, we are. The class library is only an application of Java -- albeit one
that is standardised. I prefer to keep the concepts of the language and the
library separate.

The API is huge, and growing. It has some nasty kludges due to early design
flaws. But I don't see how to avoid that and still have a lot of pre-packaged
functionality. It's nothing to do with OO really, just the sheer number of
application domains that Sun have added to the mix.

Perhaps it'd be better if there were a clearer separation between a small
"core" library that could be thought of as part of the Java language (akin to
the C standard library), and a much larger set of code that is only part of the
Java "platform".

As far as I can
see, the current low-level APIs are just about sufficient to allow Sun
to create a higher level one on top of them[*].

The functionality I added in my myZipFile class, using InputStream and
ZipInputStream, has the ability to 1) differentiate between an invalid zip
file, for example myZipFile.java, and one such as the OP's zip file
missing the TOC, and 2) simulate the functionality of ZipFile, less
direct access of course (which I intend to add in the future). The
replacement is functional, and demonstrates my point that the native
method ZipFile.open() could do the same, thus making ZipFile more
useful, not less. (Look mom, no utilities!)

Well, you can look at this in another way, and say that the fact that you have
done this shows that it's not necessary to package this functionality as part
of the /standard/ class library. (Which, after all, we have just agreed is
dauntingly big).

It remains my opinion that as a "user" of the Java language, I have the
right to "demand" reasonable behavior from the language developers, just
like my customers/users demand reasonable behavior from my work.

I very much agree with this in general. There is a specific point though: in
my experience, users tend to want solutions that are "too" simple -- that is,
the solution they ask for (first) is actually simpler than the task itself,
which means that they wouldn't always be able to do their work with it. Part
of our job as designers/programmers is to ensure that we know /all/ the
/necessary/ complexities of their task, and ensure that we don't forget that
the users will have to deal with "odd" cases, even though the users themselves
tend to forget them.

I'm of the opinion that there's a similar "pressure" from consumers of a class
library for X, for it to be simplified to reflect the limited concept of X that
is most common. It's the job of the class library designer to ensure that the
process of dropping/hiding "inconvenient" aspects of X doesn't go too far.

Of course, it's a matter of opinion whether Sun's Zip library has gone too far
in the opposite direction. I think you've made a fair case that it has. (But,
I don't agree so strongly that I've changed the API of my own Zip library to
include the features you are seeking ;-)

-- chris
 
Michael Borgwardt

Joseph said:
With Java, the "API" is so large, and growing, it is approaching the
point where effective use will be diminished because of its immense
size.

Nobody forces you to use it. You can always roll your own. I just
don't see how that is more effective.
 
