Andrew said:
What if they contain information for (what
was it you called it) 'pre-training' of the
compression algorithm?
Normally an (LZ-style) compressor starts in a "blank" state, and it only begins
to achieve compression once it has seen enough input data to recognise
exploitable repetitions. So it can't provide any compression until at least
the second time it sees, say, the string "java/lang/Object". The idea of
pre-training is that if you warm up the compressor by giving it some typical
text to process before the real data, then it will hit the ground running as
far as its real work is concerned. The problem is that the training data has
to be known to both the compressor and decompressor in advance; if not then
you'll have to transmit that data /as well/ and there'll be a net increase in
the size of the transmission. The exception to that is if you are sending
several (independently) compressed files which have "similar" contents. In
that case, it might be worthwhile to send the training data /once/, followed by
the compressed (with identical training) data for the N real files.
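For what it's worth, the standard java.util.zip classes expose exactly this idea
as a "preset dictionary", so you can try it from Java directly. A minimal sketch
(the training and sample strings here are made-up illustrations, nothing more):

    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class PresetDictDemo {
        public static void main(String[] args) throws Exception {
            // "Training" text: strings we expect to recur in the real data.
            byte[] dict = "java/lang/Object;java/lang/String;()V".getBytes("ISO-8859-1");
            byte[] data = "class Foo extends java/lang/Object".getBytes("ISO-8859-1");

            // Compress with the preset dictionary: even the /first/
            // occurrence of "java/lang/Object" can be coded as a
            // back-reference into the dictionary.
            Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
            def.setDictionary(dict);
            def.setInput(data);
            def.finish();
            byte[] packed = new byte[256];
            int packedLen = def.deflate(packed);
            def.end();

            // Decompress: the inflater stalls until it is handed the
            // /same/ dictionary -- which is the "known to both sides
            // in advance" requirement described above.
            Inflater inf = new Inflater();
            inf.setInput(packed, 0, packedLen);
            byte[] out = new byte[256];
            int n = inf.inflate(out);
            if (n == 0 && inf.needsDictionary()) {
                inf.setDictionary(dict);
                n = inf.inflate(out);
            }
            inf.end();
            System.out.println(packedLen + " bytes packed; unpacked: "
                               + new String(out, 0, n, "ISO-8859-1"));
        }
    }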
E.g. a modified JAR file format might allow a "TRAINING.DAT" file somewhere at
the start of the JAR (just after the manifest, say) which would be used to
pre-train all the [de]compressors used for the contained .class files. Of
course, that would not then be a valid ZIP format file, since it would be using
non-standard compression. (And, as I pointed out in the post you are
referring to, it doesn't seem to be a worthwhile technique for .class files --
the gains are too small to justify the effort.)
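(If anyone did want to experiment with that, the per-entry compression might
look something like the sketch below. This is purely hypothetical -- the
TRAINING.DAT container and the compressEntry method are inventions for
illustration; as noted, the result would no longer be a valid ZIP, since
standard ZIP entries have no slot for a preset dictionary.)

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    public class TrainedJarSketch {
        // Compress one .class entry against the shared TRAINING.DAT
        // bytes. The decompressor would read TRAINING.DAT first and
        // prime its Inflater with the identical dictionary.
        static byte[] compressEntry(byte[] training, byte[] classBytes) {
            Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
            def.setDictionary(training);  // identical "warm-up" per entry
            def.setInput(classBytes);
            def.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!def.finished()) {
                out.write(buf, 0, def.deflate(buf));
            }
            def.end();
            return out.toByteArray();
        }
    }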
Anyway, the point of this is that it wouldn't help to add data to each of the
classfiles. Of course, you /could/ split up the training data and share it out
amongst several of them, but that seems a little, um, perverse ;-)
Mickey Segal swears the MS .cab files are
smaller than the equivalent .zips; I thought
(maybe) they were doing some clever things
for compression. As people point out, though,
zip format is not that wasteful compared with
some of the best algorithms, so how could MS
pull it off in a .cab file?
I know /very/ little about how CAB files work. The only description I've seen
suggests that MS are using a compression scheme (LZX) that is algorithmically
similar to the one used in ZIP files, but which has a significantly tighter
(and /much/ more intricate) way of encoding the repetitions that it has
discovered. I get the impression too that the encoding can be tuned
(statically) to include something analogous to training (CAB format is not
intended to be general-purpose, so MS can use knowledge of what the typical CAB
file will contain, and can choose the details of the encoding appropriately).
But, to be honest, I don't really understand the description, so I could easily
be wrong...
The bottom line seems to be that CAB just has a "better" compressor than ZIP
(at least for some sorts of files). Which is not unexpected -- it was designed
about a decade later, and can assume larger memories and faster machines, so it
/ought/ to be better even without any extra cleverness.
Pre-training using information stored in those
extraneous bytes might be a way to explain it..
+ (and this is an important point) if the
standard 'jar' format is used on the *same*
class files, with the relatively incompressible
extraneous bytes that do not do anything useful
for the Sun VMs.. those together might add up
to the difference..
So Sun designed their format so that M$ could compress it better than they
themselves could? Unlikely, I fear ;-)
OK.. I think I've finished musing on obsolete
VMs.. time to get back to writing applets
that will work in all 1.1+ VMs...
An unusual and challenging vocation you have chosen, Mr Thompson. You might
find digging ditches more pleasant...
-- chris