java.util.zip not handling Unicode filenames

Chris · Oct 28, 2005

I have a zip file that contains files with Asian filenames.
java.util.zip.ZipFile opens it, but the filenames are garbled. Is there any
way to handle filenames that contain unicode characters?

Chris Uppal · Oct 28, 2005

Chris said:
I have a zip file that contains files with Asian filenames.
java.util.zip.ZipFile opens it, but the filenames are garbled. Is there
any way to handle filenames that contain unicode characters?

The ZIP file format is not Unicode-aware. Filenames are just 8-bit strings
internally, with (afaik) no defined charset or Unicode encoding.

That means that the writer and reader of the file have to agree on a mutually
comprehensible format.

If you are in control of both ends (and if you weren't using java.util.zip.*)
then it would be sensible to standardise on UTF-8. If not then you'll have to
find out what character encoding the application that created the archive was
using (or more likely assuming without thinking about it), and attempt to
decode that.
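To make the "writer and reader have to agree" point concrete, here is a minimal sketch in plain Java. It does not touch java.util.zip at all; it only shows what happens to a name when the bytes are written in one charset and read back in another. The class name FilenameCharsetDemo and the helper writeThenRead are made up for illustration, and the sample filename is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FilenameCharsetDemo {

    /** Encodes a name the way an archive writer would, then decodes it the way a reader would. */
    static String writeThenRead(String name, Charset writerCharset, Charset readerCharset) {
        byte[] rawNameBytes = name.getBytes(writerCharset); // what ends up stored in the archive
        return new String(rawNameBytes, readerCharset);     // what the reader reconstructs
    }

    public static void main(String[] args) {
        String name = "\u65E5\u672C\u8A9E.txt"; // hypothetical sample name: 日本語.txt

        // Both ends agree on UTF-8: the name survives.
        System.out.println(writeThenRead(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Writer used UTF-8, reader assumed ISO-8859-1: every byte becomes its own character.
        System.out.println(writeThenRead(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

The second call returns one character per stored byte, which is the usual "garbled filename" symptom. Re-encoding that garbled string with the charset the reader wrongly assumed, then decoding with the right one, recovers the name only if the wrong charset maps every byte value losslessly, as ISO-8859-1 does.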

But now comes the /really really stupid/ bit. What encoding do you think
java.util.zip.* assumes? One sensible choice would be to treat the filenames
as "raw" character data, so you would at least be able to see the character
values that were used in the ZIP file names and could reverse-encode them.
Another choice, even more sensible (but more work), would be to allow us to
specify the charset used when decoding the binary form of the names. A rather
stupid idea would be to hard-code a charset. An especially stupid idea would
be to hard-code the choice of a variable-width charset since that could easily
make it impossible to read a valid file. A really, really, stupid idea would
be to hard-code a variable-width charset that nobody else on Earth ever uses.

Guess what Sun have done...

The java.util.zip.* stuff is hardwired with the assumption that filenames in
ZIP files (and probably comments too, but I haven't checked) are encoded in the
weird, non-standard, incompatible, modified version of UTF-8 that is used
internally by some low-level JVM stuff (principally the JNI interface and the
JVM classfile format)! Nobody else on Earth uses it for anything.

Why? I dunno. I just can't get over how stupid it is...
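For anyone wondering how the modified form differs from standard UTF-8, the sketch below compares the two byte-for-byte. It uses DataOutputStream.writeUTF, which is documented to emit modified UTF-8 behind a two-byte length prefix, as a stand-in for the JVM-internal encoding. The class and method names are made up for illustration. The two encodings differ in exactly two places: U+0000, and characters outside the 16-bit range.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Demo {

    /** Standard UTF-8, as every other tool would write it. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8 as produced by DataOutputStream.writeUTF, minus its 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(s);
            out.flush();
            byte[] withPrefix = buffer.toByteArray();
            return Arrays.copyOfRange(withPrefix, 2, withPrefix.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u65E5",       // U+65E5, inside the 16-bit range
            "\u0000",       // U+0000
            "\uD83D\uDE00"  // U+1F600, outside the 16-bit range (a surrogate pair in Java)
        };
        for (String s : samples) {
            System.out.println(hex(standardUtf8(s)) + "  vs  " + hex(modifiedUtf8(s)));
        }
    }
}
```

U+65E5 comes out as E6 97 A5 either way. U+0000 is 00 in standard UTF-8 but C0 80 in the modified form. U+1F600 is the four bytes F0 9F 98 80 in standard UTF-8, whereas the modified form encodes each surrogate separately as ED A0 BD ED B8 80, which a strict UTF-8 decoder rejects.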

Anyway, the upshot of that is that you will only be able to read ZIP filenames
correctly if one of the following applies:

1) The filenames are all 7-bit ASCII.

2) The filenames are Unicode, AND all the characters are in the 16-bit Unicode
range, AND the application that generated the ZIP file used UTF-8 for filename
encoding. (That /should/ work over the limited range since nul-bytes are
either illegal or unlikely in file names.)

3) The filenames are Unicode, AND the OS is Windows (which uses UTF-16 for
filenames), AND the application that wrote it was under the impression that
Unicode is 16-bit and therefore blindly converted each 16-bit quantity into 1,
2, or 3 bytes in the same way as UTF-8 would.

4) The file was created with Sun's bloody idiotic java.util.zip.*.

In point of fact, neither (2) nor (3) seem particularly unlikely to me, but
apparently neither case applies to your problem.

If the filenames are /not/ Unicode but are in the user's local charset (8-bit
or variable-width) then you are stuffed, I'm afraid. There's not even any
guarantee that the java.util.zip.* will not throw errors when it tries to read
the filenames, since they may not be valid byte-sequences according to the
encoding that Sun has hard-wired. The same goes if the filenames are Unicode,
use characters outside the 16-bit range, and were written in correct UTF-8.
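For readers on a newer JDK: Java 7 and later (released well after this post was written) added constructors to ZipFile, ZipInputStream and ZipOutputStream that take a java.nio.charset.Charset, which is the "specify the charset" option wished for above. The sketch below is only an illustration under that assumption; the class and helper names are made up. It uses ISO-8859-1 and UTF-8 because every Java runtime is required to support them. For an archive written on, say, a Japanese Windows system you would pass something like Charset.forName("Shift_JIS") instead, provided your runtime ships that charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipCharsetDemo {

    /** Builds a one-entry in-memory ZIP whose entry name is encoded with the given charset. */
    static byte[] zipWithName(String entryName, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer, nameCharset)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(new byte[] {1, 2, 3}); // dummy payload
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Lists entry names, decoding them with the charset the caller believes the writer used. */
    static List<String> entryNames(byte[] zipBytes, Charset nameCharset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Writer and reader agree on a legacy 8-bit charset.
        byte[] legacy = zipWithName("caf\u00E9.txt", StandardCharsets.ISO_8859_1);
        System.out.println(entryNames(legacy, StandardCharsets.ISO_8859_1));

        // Writer and reader agree on UTF-8 (hypothetical sample name: 日本語.txt).
        byte[] utf8 = zipWithName("\u65E5\u672C\u8A9E.txt", StandardCharsets.UTF_8);
        System.out.println(entryNames(utf8, StandardCharsets.UTF_8));
    }
}
```

If you do not know which charset the writer used, you still have to find out (or guess), exactly as described earlier in this post. The constructor only lets you act on the answer.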

-- chris

Roedy Green · Oct 31, 2005

modified version of UTF8

Does this encoding have a name in the nio Charset sense?

What are the differences between the variants?

