ZipOutputStream

Erik

ZipOutputStream: if I put an uncompressed file named "mimetype" into
the zip and the content of the file is "application/html+zip", there
are 5 bytes between the filename and the content:

01 14 00 EB FF

Normally, as WinZip shows, uncompressed content immediately follows the
filename. What are those 5 bytes for, and how can I do things the
WinZip way?
In other words, I want to get rid of those 5 bytes when building my
ZIP file.
Is there some setting for this in the class?
 
Erik

Some additional info from WinZIP about the Java-generated zip file:


Central directory entry PK0102 (4+42): #1
======================================
part number in which file begins (0000): 1
relative offset of local header: 0 (0x00000000) bytes
version made by operating system (00): MS-DOS, OS/2, NT FAT
version made by zip software (20): 2.0
operat. system version needed to extract (00): MS-DOS, OS/2, NT FAT
unzip software version needed to extract (20): 2.0
general purpose bit flag (0x0008) (bit 15..0): 0000.0000 0000.1000
file security status (bit 0): not encrypted
extended local header (bit 3): yes
compression method (08): deflated
compression sub-type (deflation): normal
file last modified on (0x00003c39 0x0000b52a): 2010-01-25 22:41:20
32-bit CRC value: 0x2cab616f
compressed size: 25 bytes <<====
uncompressed size: 20 bytes
length of filename: 8 characters
length of extra field: 0 bytes
length of file comment: 0 characters
internal file attributes: 0x0000
apparent file type: binary
external file attributes: 0x00000000
non-MSDOS external file attributes: 0x000000
MS-DOS file attributes (0x00): none
Current Location part 1 offset 1369563
filename:mimetype
 
Erik

Problem solved. The "mimetype" entry has to be written with the STORED
method instead of DEFLATED. This is how it must be done:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public void create(ArrayList<FileItem> list, String path, String fileName) {
    // Buffer for reading the files
    byte[] buf = new byte[1024];
    System.out.print("Create " + fileName + ": ");
    try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(path + fileName))) {
        // Default method and level for all entries except "mimetype"
        out.setMethod(ZipOutputStream.DEFLATED);
        out.setLevel(Deflater.DEFAULT_COMPRESSION);

        for (FileItem fi : list) {
            String fn = fi.dir + fi.fileName;

            // Add ZIP entry to output stream.
            ZipEntry z = new ZipEntry(fn);
            if (fn.equals("mimetype")) {
                // The mimetype file must be uncompressed.
                z.setMethod(ZipOutputStream.STORED);
                // These three MUST be set before putNextEntry() for a STORED
                // entry, and they must match the data actually written:
                // otherwise closeEntry() throws a ZipException.
                z.setSize(20); // length of data
                z.setCompressedSize(20);
                z.setCrc(0x2cab616f);
            } else {
                z.setMethod(ZipOutputStream.DEFLATED);
            }

            try (FileInputStream in = new FileInputStream(root.getPath() + "\\" + fn)) {
                out.putNextEntry(z);

                // Transfer bytes from the file to the ZIP file
                int len;
                while ((len = in.read(buf)) > 0) {
                    out.write(buf, 0, len);
                }

                // Complete the entry
                out.closeEntry();
            }
        }
        // The ZIP file is completed when the try block closes "out".
        System.out.println("OK.");
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}
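
For anyone who wants to avoid hard-coding the size and CRC, here is a minimal self-contained sketch of the same idea. It computes the CRC with java.util.zip.CRC32, writes the entry to memory, and locates where the data starts in the local header. The class name StoredMimetype and its helper methods are made up for illustration; the content string is the one quoted earlier in the thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class StoredMimetype {

    /** Builds an in-memory ZIP whose only entry is an uncompressed "mimetype" file. */
    static byte[] buildZip(byte[] content) {
        // A STORED entry needs size, compressed size and CRC before putNextEntry().
        CRC32 crc = new CRC32();
        crc.update(content);

        ZipEntry entry = new ZipEntry("mimetype");
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(content.length);
        entry.setCompressedSize(content.length);
        entry.setCrc(crc.getValue());

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(entry);
            out.write(content);
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Offset of the first entry's data: 30-byte local header, then name, then extra field. */
    static int firstDataOffset(byte[] zip) {
        int nameLen = (zip[26] & 0xFF) | ((zip[27] & 0xFF) << 8);
        int extraLen = (zip[28] & 0xFF) | ((zip[29] & 0xFF) << 8);
        return 30 + nameLen + extraLen;
    }

    public static void main(String[] args) {
        byte[] content = "application/html+zip".getBytes(StandardCharsets.US_ASCII);
        byte[] zip = buildZip(content);
        int off = firstDataOffset(zip);
        // With STORED, the raw content sits directly after the header fields.
        System.out.println(new String(zip, off, content.length, StandardCharsets.US_ASCII));
    }
}
```

Computing the CRC from the bytes you are about to write keeps the three values consistent with the data, which is what closeEntry() checks for a STORED entry.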
 
Roedy Green

Erik said:
ZipOutputStream: if I put an uncompressed file named "mimetype" into
the zip and the content of the file is "application/html+zip", there
are 5 bytes between the filename and the content:

ZIP format goes back to the DOS days. Originally file names were
ASCII. I suspect there is some kludge in effect for filenames with a
"weird" character like +, rather than simple UTF-8.

The ZIP format is documented at PkZip.com. See
http://mindprod.com/jgloss/zip.html

You might learn something looking at Sun's code. It may be native at
the low level though.

WinZip people relatively recently started supporting Unicode
filenames. I don't know how they are encoded internally.
 
Roedy Green

Erik said:
file last modified on (0x00003c39 0x0000b52a): 2010-01-25 22:41:20

One problem with ZIP format that bedevils me is that when you put a
file into a zip, then restore it, the timestamp can be out by up to 2
seconds! The restored file looks like a DIFFERENT version of the file.

Further the timestamps are in local timezone rather than GMT, and the
timezone is not recorded. Arrgh. I have been bugging the Winzip and
the Truezip people to fix this.

Vendors are reluctant, I think, primarily because an upward compatible
solution would make files fatter. Archivers compete ferociously.

http://mindprod.com/jgloss/compressionutilities.html.

The evil that men do lives after them. The good is oft interred with
their bones.
~ William Shakespeare (born: 1564-04-23 died: 1616-04-23 at age: 52)
Julius Caesar Act III scene ii
 
Robert Kochem

Erik said:
Normally (in Winzip), uncompressed content immediately follows the
filename, shows WinZIP. What are those 5 bytes for and how can I do
things the WinZIP way ?

There are two ways to create an uncompressed ZIP entry:

1. Method DEFLATED with compression level 0 (Deflater.NO_COMPRESSION)
2. Method STORED, which requires you to compute the CRC yourself and set
the sizes in the ZipEntry:

byte[] data = "some text".getBytes(StandardCharsets.UTF_8); // example data to store
ZipEntry ze = new ZipEntry("entryname.txt");
ze.setMethod(ZipEntry.STORED);
ze.setCompressedSize(data.length);
ze.setSize(data.length);
CRC32 crc = new CRC32(); // java.util.zip.CRC32
crc.update(data);
ze.setCrc(crc.getValue());
zipOutputStream.putNextEntry(ze);
zipOutputStream.write(data);
zipOutputStream.closeEntry();

Robert
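
On the 5 bytes themselves: 01 14 00 EB FF reads as the header of a deflate "stored" block. The first byte marks a final block of type 00 (no compression), followed by the length 0x0014 (20) and its one's complement 0xFFEB, both little-endian. That would also account for the compressed size of 25 bytes against 20 uncompressed in the WinZip listing. The sketch below reproduces it, assuming the entry was written with method DEFLATED at level 0; the class name StoredBlockDemo is made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class StoredBlockDemo {

    /** Raw-deflates the input at level 0, as a DEFLATED zip entry with no compression would be. */
    static byte[] deflateLevelZero(byte[] input) {
        // nowrap = true: raw deflate stream without the zlib header, as used inside ZIP files
        Deflater deflater = new Deflater(Deflater.NO_COMPRESSION, true);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Formats the first count bytes as space-separated upper-case hex. */
    static String hex(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count && i < data.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] content = "application/html+zip".getBytes(StandardCharsets.US_ASCII);
        byte[] deflated = deflateLevelZero(content);
        System.out.println(deflated.length + " bytes, header: " + hex(deflated, 5));
    }
}
```

So "DEFLATED with level 0" does leave the data uncompressed, but it still wraps it in a deflate block. Only STORED puts the raw bytes directly after the filename.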
 
Arne Vajhøj

Roedy Green said:
One problem with ZIP format that bedevils me is that when you put a
file into a zip, then restore it, the timestamp can be out by up to 2
seconds! The restored file looks like a DIFFERENT version of the file.

The format only has 5 bits for seconds.

No surprise that it can be off.
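
To illustrate, here is a small sketch of the MS-DOS date/time packing used in the ZIP headers, checked against the values from Erik's WinZip listing (date 0x3c39, time 0xb52a). The class name DosTime is made up for illustration. Seconds are stored divided by two, so an odd second cannot be represented.

```java
import java.time.LocalDateTime;

final class DosTime {

    /** Decodes the 16-bit DOS date and 16-bit DOS time fields of a ZIP header. */
    static LocalDateTime decode(int dosDate, int dosTime) {
        int year = ((dosDate >> 9) & 0x7F) + 1980; // 7 bits, years since 1980
        int month = (dosDate >> 5) & 0x0F;         // 4 bits
        int day = dosDate & 0x1F;                  // 5 bits
        int hour = (dosTime >> 11) & 0x1F;         // 5 bits
        int minute = (dosTime >> 5) & 0x3F;        // 6 bits
        int second = (dosTime & 0x1F) * 2;         // 5 bits, in units of 2 seconds
        return LocalDateTime.of(year, month, day, hour, minute, second);
    }

    /** Encodes a time of day; the lowest bit of the seconds is dropped. */
    static int encodeTime(int hour, int minute, int second) {
        return (hour << 11) | (minute << 5) | (second >> 1);
    }

    public static void main(String[] args) {
        System.out.println(decode(0x3c39, 0xb52a));
        // 22:41:20 and 22:41:21 encode to the same value
        System.out.println(encodeTime(22, 41, 20) == encodeTime(22, 41, 21));
    }
}
```

Note that the result is a LocalDateTime on purpose: the field carries no time zone, which is the other half of the complaint.
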

Roedy Green said:
Further the timestamps are in local timezone rather than GMT, and the
timezone is not recorded. Arrgh. I have been bugging the Winzip and
the Truezip people to fix this.

Vendors are reluctant, I think, primarily because an upward compatible
solution would make files fatter. Archivers compete ferociously.

The ZIP format is a well-defined format (defined in APPNOTE).

Picking a new time format would make it not zip.

And would make it unreadable by all other zip tools out there.

Arne
 
Arne Vajhøj

Roedy Green said:
ZIP format goes back to the DOS days. Originally file names were
ASCII. I suspect there is some kludge in effect for filenames with a
"weird" character like +, rather than simple UTF-8.

The ZIP format is documented at PkZip.com. See
http://mindprod.com/jgloss/zip.html

Then why don't you read it instead of speculating?

Unicode filename support was added in version 6.3.0 of the
specification.

It uses a flag and stores the filename as UTF-8.

Whether that is a kludge or not is rather subjective.
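
For reference, the flag in question is bit 11 of the general purpose bit flag (the "language encoding flag" in APPNOTE). The same 16-bit field holds bit 3, which the WinZip listing above calls "extended local header". A minimal sketch for decoding those bits follows; the class and constant names are made up for illustration.

```java
final class ZipFlags {
    static final int ENCRYPTED = 1 << 0;        // bit 0: entry is encrypted
    static final int DATA_DESCRIPTOR = 1 << 3;  // bit 3: sizes and CRC follow the data
    static final int UTF8_NAMES = 1 << 11;      // bit 11: filename and comment are UTF-8

    static boolean isSet(int flags, int mask) {
        return (flags & mask) != 0;
    }

    static String describe(int flags) {
        return "encrypted=" + isSet(flags, ENCRYPTED)
                + " dataDescriptor=" + isSet(flags, DATA_DESCRIPTOR)
                + " utf8Names=" + isSet(flags, UTF8_NAMES);
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0008)); // the value from Erik's listing
        System.out.println(describe(0x0800)); // an entry with UTF-8 names
    }
}
```

With the value 0x0008 from Erik's file, only the data descriptor bit is set, so the name "mimetype" is not being treated as Unicode at all.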

Arne
 
Mike Schilling

Arne said:
Then why don't you read it instead of speculating.

Unicode filename support was added in version
6.3.0.

It uses a flag and stores the filename as UTF-8.

Whether that is a kludge or not is rather subjective.


I'm confused. How is "+" a weird character that can't be stored as
ASCII?
 
Roedy Green

Mike Schilling said:
I'm confused. How is "+" a weird character that can't be stored as
ASCII?

+ is an odd character for filenames. It usually means concatenation.
Perhaps Phil Katz originally used some simple compression on ASCII
filenames. It has been a long time since I studied the file format.

Remember that PkZip started out with the DOS 8.3, entirely
case-insensitive file system.

The way to answer these questions:

1. read spec at PkZip.com
2. read docs at WinZip.com
3. create some sample zip files and look at them with a hex editor.
4. compress and fluff some sample files and compare
attributes/timestamps.

See http://mindprod.com/jgloss/zip.html
http://mindprod.com/jgloss/pkzip.html
http://mindprod.com/jgloss/winzip.html
http://mindprod.com/jgloss/hex.html
 
Tom Anderson

Arne Vajhøj said:
The format only has 5 bits for seconds.

No surprise that it can be off.


The ZIP format is a well-defined format (defined in APPNOTE).

Picking a new time format would make it not zip.

And would make it unreadable by all other zip tools out there.

There is an 'extra field' in the file header record. It's structured into
tag-length-value chunks which can hold arbitrary extra metadata. Tag
0x5455 is not formally standardised, but is one of the listed "third party
mappings commonly used", and is described as "extended timestamp". You
will note that taken as a two-character ASCII string, 0x5455 is "UT". It
seems to be defined and quasi-standardised by InfoZIP; see this file from
InfoZIP hosted by your new favourite microchip manufacturer:

http://www.opensource.apple.com/source/zip/zip-6/unzip/unzip/proginfo/extra.fld

Which explains that it can contain any combination of modification,
access, and creation times, described by a bitfield, and that:

The time values are in standard Unix signed-long format, indicating the
number of seconds since 1 January 1970 00:00:00. The times are relative
to Coordinated Universal Time (UTC), also sometimes referred to as
Greenwich Mean Time (GMT).

Although looking at the InfoZIP source code, there seems to be a lot of
special-casing which suggests to me that not all tools follow those rules
to the letter.

There are also a variety of more formally standardised OS-specific
metainfo blocks, which can contain timestamps. A polyglot tool which could
read all these could provide better timestamps on extracted files even in
the absence of a 0x5455 header.
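
A rough sketch of reading the modification time out of such a chunk, going by the layout described in that extra.fld file: a 2-byte tag, a 2-byte length, a flags byte, then 4-byte little-endian times for each flag bit that is set. The class and method names are made up for illustration, and only the modification time (flag bit 0) is handled.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.OptionalLong;

final class ExtendedTimestamp {
    static final int TAG_UT = 0x5455; // bytes 'U' 'T' on disk (little-endian)

    /** Returns the modification time in seconds since the Unix epoch (UTC), if present. */
    static OptionalLong modificationTime(byte[] extra) {
        if (extra == null) {
            return OptionalLong.empty();
        }
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        // Walk the tag-length-value chunks until the 0x5455 chunk is found.
        while (buf.remaining() >= 4) {
            int tag = buf.getShort() & 0xFFFF;
            int len = buf.getShort() & 0xFFFF;
            if (len > buf.remaining()) {
                break; // truncated chunk, give up
            }
            if (tag == TAG_UT) {
                if (len >= 5 && (buf.get(buf.position()) & 0x01) != 0) {
                    // Bit 0 of the flags byte: a signed 32-bit modification time follows.
                    return OptionalLong.of(buf.getInt(buf.position() + 1));
                }
                return OptionalLong.empty();
            }
            buf.position(buf.position() + len); // skip a chunk we do not understand
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        // One unrelated chunk (tag 0x0001), then a UT chunk with mtime = 16 seconds.
        byte[] extra = {
            0x01, 0x00, 0x02, 0x00, (byte) 0xAA, (byte) 0xBB,
            0x55, 0x54, 0x05, 0x00, 0x01, 0x10, 0x00, 0x00, 0x00
        };
        System.out.println(modificationTime(extra));
    }
}
```

In Java, the raw bytes to feed into this come from ZipEntry.getExtra(). Given the special-casing mentioned above, a real tool would want to be more defensive than this.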

tom
 
Arne Vajhøj

Tom Anderson said:
There is an 'extra field' in the file header record. It's structured
into tag-length-value chunks which can hold arbitrary extra metadata.
Tag 0x5455 is not formally standardised, but is one of the listed "third
party mappings commonly used", and is described as "extended timestamp".
You will note that taken as a two-character ASCII string, 0x5455 is
"UT". It seems to be defined and quasi-standardised by InfoZIP; see this
file from InfoZIP hosted by your new favourite microchip manufacturer:

http://www.opensource.apple.com/source/zip/zip-6/unzip/unzip/proginfo/extra.fld


Which explains that it can contain any combination of modification,
access, and creation times, described by a bitfield, and that:

The time values are in standard Unix signed-long format, indicating the
number of seconds since 1 January 1970 00:00:00. The times are relative
to Coordinated Universal Time (UTC), also sometimes referred to as
Greenwich Mean Time (GMT).

Although looking at the InfoZIP source code, there seems to be a lot of
special-casing which suggests to me that not all tools follow those
rules to the letter.

There are also a variety of more formally standardised OS-specific
metainfo blocks, which can contain timestamps. A polyglot tool which
could read all these could provide better timestamps on extracted files
even in the absence of a 0x5455 header.

You are correct.

An extension would not break anything.

And if implementations could actually start agreeing on
using it, then it could become very useful.

Arne
 
Roedy Green

Tom Anderson said:
The time values are in standard Unix signed-long format, indicating the
number of seconds since 1 January 1970 00:00:00. The times are relative
to Coordinated Universal Time (UTC), also sometimes referred to as
Greenwich Mean Time (GMT).

Finally, some progress. The thing that is so funny about these
problems is any one solution is trivial. The difficulty is introducing
it in a way that does not trip up other users of the files, and
persuading people to converge on a common solution. The precise
details of how it works are almost irrelevant since only a very few
programmers ever have to deal with it. Everyone else will deal with
it via a simple API.

The other problem is trying to persuade some vendor to pioneer the
feature. Vendors are reluctant to do so, even if they see the need,
because soon after a slightly different consensus scheme may be
introduced leaving them with an incompatible legacy.

I hope someone does a thesis on these sorts of problem, researching
the politics involved and how successful consensuses are reached
quickly.

Maybe the game theorists could explain the behaviours.
 
