possible to read self-extracting zip file?

Bomb Diggy

Hi,

Is it possible to use java.util.zip to read and decompress/inflate a
self-extracting zip file? I've been able to unzip regular zip files
(produced by WinZip), but not self-extracting zip files (produced by
Winzip and PKZip).

Thanks.
 
Roedy Green

Is it possible to use java.util.zip to read and decompress/inflate a
self-extracting zip file? I've been able to unzip regular zip files
(produced by WinZip), but not self-extracting zip files (produced by
Winzip and PKZip).

Just a guess. Try renaming the extension and see if that is sufficient
to fool it.
 
Marco Schmidt

Bomb Diggy:
Is it possible to use java.util.zip to read and decompress/inflate a
self-extracting zip file? I've been able to unzip regular zip files
(produced by WinZip), but not self-extracting zip files (produced by
Winzip and PKZip).

A zip file looks like this (simplified):

HEADER1 FILE1 HEADER2 FILE2 [... other pairs] HEADER1 HEADER2 [...all
headers]

So the headers are repeated at the end of the archive.

A self-extracting zip file looks almost the same; the only difference
is a CODE prefix:

CODE HEADER1 FILE1 [...rest as above]

where CODE is some executable code with unzip functionality that
searches the file it is in for headers and then unzips the data.

The problem with java.util.zip is that it never reads the header
directory at the end, which would be much quicker than going over the
complete file. It always expects a stream to start with a header.

So your only chance is to write code that searches for the first
header in an InputStream (skipping any potential CODE section), then
wrap a ZipInputStream around that. Then you can work with that
self-extracting zip file.

Here are pointers to the zip file format specs and related
information:
<http://www.geocities.com/SiliconValley/Lakes/6686/zip-archive-file-format.html>.
IIRC a header starts with P K \003 \004 (or similar numbers, it's in
the specs). That's what you must search for.
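
A minimal sketch of the search described above, assuming the local header signature is the four bytes 'P' 'K' 0x03 0x04 and that these bytes do not also occur inside the executable stub (no sanity checks are made here). The class and method names are hypothetical, chosen only for illustration.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class SfxZipReader {
    // Local file header signature: 'P' 'K' 0x03 0x04
    private static final byte[] SIG = {0x50, 0x4B, 0x03, 0x04};

    /**
     * Skips bytes until the first signature is seen, pushes the signature
     * back, and wraps the remainder of the stream in a ZipInputStream.
     */
    static ZipInputStream openAfterStub(InputStream raw) throws IOException {
        PushbackInputStream in =
                new PushbackInputStream(new BufferedInputStream(raw), SIG.length);
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == SIG[matched]) {
                matched++;
                if (matched == SIG.length) {
                    in.unread(SIG); // ZipInputStream must see the signature again
                    return new ZipInputStream(in);
                }
            } else {
                // All four signature bytes differ, so only 'P' can restart a match.
                matched = (b == SIG[0]) ? 1 : 0;
            }
        }
        throw new IOException("no local file header signature found");
    }

    public static void main(String[] args) throws IOException {
        try (ZipInputStream zin = openAfterStub(new FileInputStream(args[0]))) {
            for (ZipEntry e; (e = zin.getNextEntry()) != null; ) {
                System.out.println(e.getName());
            }
        }
    }
}
```

Pass the path of the self-extracting archive as the only argument to list its entries. If the stub happens to contain the signature bytes, this sketch will stop too early.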

Regards,
Marco
 
Marco Schmidt

Roedy Green:
Structure of a zip file is that there are embedded headers on each
element, then a complete summary set of headers at the end.

That's what I tried to explain.

You have to scan backwards, doing a bit of fancy footwork, to find the
last header, then you can chase backwards through the summaries to the
first one.

You don't have to do that. But it's quicker than reading the complete
file, collecting local headers on the way.

Now chop the head off and you should have something Java can eat. But
in the meantime, you have almost written your own ruddy zip package!

No, as I pointed out, all you have to do is search for the first local
header. Then you can wrap that stream into a ZipInputStream. The
searching probably has to be done with some unread capability like
PushbackInputStream.

If you "chop off" the CODE section, the offsets in the "header
summary" (central directory) at the end of the ZIP archive are
incorrect. Java can still read those files because it never touches
that central directory. But the ZIP file is corrupted.

I could write you a beast to do this in Java, C++, C or MASM as a
separate prestep for $50 US. It would convert an exe to a zip.

If you want to create a valid ZIP, that's a not-so-trivial task (see
above). 50 USD is little for that functionality.

Phil Katz of Pkzip.com put the 8.3 PKZip format into the public
domain. It must have been mildly changed to deal with long file names.

The idea of the duplication was you could recover some of the elements
from a corrupted file.

I think it was more about speed. With a normal archive it's much more
likely that file data got corrupted. Headers are relatively small.

But it's been a while since I read the ZIP specs.

Regards,
Marco
 
Roedy Green

You don't have to do that. But it's quicker than reading the complete
file, collecting local headers on the way.

If you buffer appropriately, you will do less i/o scanning backwards
through the pure headers at the end than if you scan forward through
the embedded headers. Also if you have an exe file, I don't know how
you would find the first header if you scan forward. It may be there
is a quick way to find the first header. That would be ideal. If Phil
were clever he would have hidden it at offset 0 of the program proper,
right after the exe header. I don't know if these are DOS, Win16 or
Win32 exe headers.
 
Marco Schmidt

Roedy Green:
If you buffer appropriately, you will do less i/o scanning backwards
through the pure headers at the end than if you scan forward through
the embedded headers.

Yes, that's what I was saying - getting the central directory is
faster.

Also if you have an exe file, I don't know how
you would find the first header if you scan forward. It may be there
is a quick way to find the first header. That would be ideal. If Phil
were clever he would have hidden it at offset 0 of the program proper,
right after the exe header. I don't know if these are DOS, Win16 or
Win32 exe headers.

As I said earlier:
IIRC a header starts with P K \003 \004 (or similar numbers, it's in
the specs). That's what you must search for.

Just search for that local header signature. It doesn't matter what
platform the native stub was written for. Make some sanity checks so
that you don't get that signature stored in the native code.
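
One possible form of such sanity checks, sketched under the assumption that the local file header has the standard 30-byte fixed layout (compression method at offset 8, file name length at offset 26, extra field length at offset 28, all little-endian). The checks are heuristics and the names are hypothetical; a stricter version could also cross-check against the central directory.

```java
final class LocalHeaderCheck {
    // Fixed part of a local file header, before the file name and extra field.
    static final int HEADER_LEN = 30;

    // Reads an unsigned little-endian 16-bit value.
    static int u16(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    /** Heuristic: does a plausible local file header start at data[off]? */
    static boolean looksLikeLocalHeader(byte[] data, int off) {
        if (off < 0 || off + HEADER_LEN > data.length) {
            return false;
        }
        if (data[off] != 'P' || data[off + 1] != 'K'
                || data[off + 2] != 3 || data[off + 3] != 4) {
            return false;
        }
        int method = u16(data, off + 8);
        int nameLen = u16(data, off + 26);
        int extraLen = u16(data, off + 28);
        // Assumption: only stored (0) and deflated (8) entries are of interest.
        if (method != 0 && method != 8) {
            return false;
        }
        // An entry needs a name, and name plus extra field must fit in the data.
        if (nameLen == 0) {
            return false;
        }
        return off + HEADER_LEN + nameLen + extraLen <= data.length;
    }

    /** Returns the offset of the first plausible local header, or -1. */
    static int findFirstLocalHeader(byte[] data) {
        for (int i = 0; i + HEADER_LEN <= data.length; i++) {
            if (looksLikeLocalHeader(data, i)) {
                return i;
            }
        }
        return -1;
    }
}
```

The offset returned by findFirstLocalHeader is where a ZipInputStream could start reading. A stray "PK\003\004" inside the stub is rejected unless the bytes after it also happen to pass the checks.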

Regards,
Marco
 
Roedy Green

Just search for that local header signature. It doesn't matter what
platform the native stub was written for. Make some sanity checks so
that you don't get that signature stored in the native code.


That strikes me as a tad dangerous. The string could appear in the
code itself.

For DOS headers it is fairly simple to jump over the relocation
header, and the exe portion. For Win16 it is more complex. For Win32
I'd guess it gets yucky.

Do we know which style of header he is dealing with? What program
created them?

I wrote a little utility years ago that could at least tell DOS and
Win16 style apart. It is on my website at
http://mindprod.com/products2.html#ISWIN
 
Marco Schmidt

Roedy Green:
That strikes me as a tad dangerous. The string could appear in the
code itself.

But that's what I meant when I wrote "Make some sanity checks so
that you don't get that signature stored in the native code." It
should be relatively easy to identify a real local header.

For DOS headers it is fairly simple to jump over the relocation
header, and the exe portion. For Win16 it is more complex. For Win32
I'd guess it gets yucky.

Do we know which style of header he is dealing with? What program
created them?

I wouldn't want to try to interpret the native executable part. Then
again, I have no experience with those, maybe it's easier than I
think.

I wrote a little utility years ago that could at least tell DOS and
Win16 style apart. It is on my website at
http://mindprod.com/products2.html#ISWIN

At <http://www.wotsit.org> or in any Unix magic file there probably
are the signatures to identify types of executables. But that's only
part of finding out where the actual data starts. I'd rather implement
my variant, checking for P K \003 \004 (or whatever the numbers after
PK are).

Regards,
Marco
 
Tor Iver Wilhelmsen

Marco Schmidt said:
A self-extracting zip file looks almost the same; the only difference
is a CODE prefix:

CODE HEADER1 FILE1 [...rest as above]

where CODE is some executable code with unzip functionality that
searches the file it is in for headers and then unzips the data.

Actually just look at the PE spec, it will tell you the EXE header
format and there you can find out where the file's data segment (the
zip stream) begins.
 
Marco Schmidt

Roedy Green:
I noticed the new PKZip format now allows files > 2 GIG.
see http://mindprod.com/jgloss/zip.html

Unfortunately, both PKWare and WinZip Computing added proprietary,
undocumented extensions (compression and encryption types) to ZIP.
Once these take off, the ZIP file format will be much less valuable.
There is a discussion on comp.compression on the topic, see
<http://www.geocities.com/SiliconValley/Lakes/6686/zip-archive-file-format.html>
for links to that discussion and information on the format.

Regards,
Marco
 
Babu Kalakrishnan

Bomb Diggy:


The problem with java.util.zip is that it never reads the header
directory at the end, which would be much quicker than going over the
complete file. It always expects a stream to start with a header.

So your only chance is to write code that searches for the first
header in an InputStream (skipping any potential CODE section), then
wrap a ZipInputStream around that. Then you can work with that
self-extracting zip file.

I'm certain that some classes in java.util.zip or java.util.jar can
handle zip files with code prepended at the start. The java -jar command
(which uses the same class libraries to load classes) works fine on
executable jar files even when the jar file has extra code up front. So
it must be using the central directory to get at the files.

I think using the ZipFile class instead of a ZipInputStream would most
probably work. (Not tested - just an educated guess).
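
The guess above can be tried with a few lines. This is only a sketch: it assumes the ZipFile implementation in the JDK at hand locates entries through the central directory and tolerates prepended code, which should be verified against the actual self-extracting file. The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class SfxZipFileLister {
    /** Opens the archive with ZipFile and returns the names of its entries. */
    static List<String> listEntries(File archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> en = zf.entries();
            while (en.hasMoreElements()) {
                names.add(en.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : listEntries(new File(args[0]))) {
            System.out.println(name);
        }
    }
}
```

If ZipFile throws a ZipException on a particular archive, searching for the first local header and using ZipInputStream remains the fallback.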

BK
 
Roedy Green

I've been able to unzip regular zip files

The other thing to consider is that WinZip and PKZip have a large variety
of compression algorithms. I don't know which ones Java supports. If
you are preparing zip files for Java, you have to control which algorithms
it uses.
 
Babu Kalakrishnan

The other thing to consider is that WinZip and PKZip have a large variety
of compression algorithms. I don't know which ones Java supports. If
you are preparing zip files for Java, you have to control which algorithms
it uses.

True. Java supports only the "deflate" compression scheme (the one used
by gzip; see RFC 1950 to 1952), since the compression/decompression engine
it uses is the open source "zlib" library.

BK
 
Marco Schmidt

Babu Kalakrishnan:
True. Java supports only the "deflate" compression scheme (the one used
by gzip; see RFC 1950 to 1952), since the compression/decompression engine
it uses is the open source "zlib" library.

I think Java also supports uncompressed entries.
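
A small sketch for checking which methods an archive actually uses. java.util.zip.ZipEntry defines the constants STORED and DEFLATED; the sketch assumes any other value reported by getMethod() marks an entry Java cannot decompress. The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipMethodReport {
    /** Human-readable name for a ZIP compression method code. */
    static String methodName(int method) {
        switch (method) {
            case ZipEntry.STORED:
                return "stored";
            case ZipEntry.DEFLATED:
                return "deflated";
            default:
                return "other (" + method + ")";
        }
    }

    /** Lists entries whose method is neither STORED nor DEFLATED. */
    static List<String> entriesJavaCannotRead(File archive) throws IOException {
        List<String> bad = new ArrayList<>();
        try (ZipFile zf = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> en = zf.entries();
            while (en.hasMoreElements()) {
                ZipEntry e = en.nextElement();
                int m = e.getMethod();
                if (m != ZipEntry.STORED && m != ZipEntry.DEFLATED) {
                    bad.add(e.getName() + ": " + methodName(m));
                }
            }
        }
        return bad;
    }
}
```

An empty list from entriesJavaCannotRead suggests the archive uses only methods Java can handle.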

Regards,
Marco
 
Roedy Green

I think Java also supports uncompressed entries.

I was just looking at WZZIP, the command-line part of WinZip. It gives
you no way to control the algorithms. Winzip32 also has a command-line
interface where you can suggest whether you want fast or thorough, but
not the particular algorithm.

I am working on a project called the Replicator which simply
distributes a set of files and keeps them up to date. For now I will
use jar.exe to create the zips and later do it with the zip classes.

I think PKZIP may give the control needed plus speed for the zip
creation.

The WinZip people have announced that the new version uses a form of
compression the old versions can't read. This screws up the zip format
for interchange. The PKZip people seem to have a proprietary
algorithm now too. These formats need to be open and, if not
universally supported, at least suppressible.
 
Roedy Green

I think PKZIP may give the control needed plus speed for the zip
creation.

I downloaded the evaluation copy of PKZip. The information on how you
control it from the command line is hidden in a file called pkzipc.pdf,
which is not indexed anywhere. pkzipc.exe is the command-line
version. pkzipw.exe is the Windows GUI version.

It lets you specify the precise algorithm on the command line.
However, that is not really what you usually want.

You want compatibility with something else, so you want it to use a
SELECTION of algorithms, or to avoid some algorithm, not to use one
particular one which may be inappropriate.

I think though for Java use, forcing it to deflate only is what you
want.
 
