not handling Unicode filenames



I have a zip file that contains files with Asian filenames. opens it, but the filenames are garbled. Is there any
way to handle filenames that contain unicode characters?



Chris Uppal

Chris said:
I have a zip file that contains files with Asian filenames. opens it, but the filenames are garbled. Is there
any way to handle filenames that contain unicode characters?

The ZIP file format is not Unicode aware. Filenames are just 8-bit strings
internally, with (afaik) no defined charset, or Unicode encoding.

That means that the writer and reader of the file have to agree on a mutually
comprehensible format.

If you are in control of both ends (and if you weren't using*)
then it would be sensible to standardise on UTF-8. If not then you'll have to
find out what character encoding the application that created the archive was
using (or more likely assuming without thinking about it), and attempt to
decode that.

But now comes the /really really stupid/ bit. What encoding do you think
the* assumes ? One sensible choice would be to treat the filenames
as "raw" character data, so you would at least be able to see the character
values that were used in the ZIP file names and could reverse-encode them.
Another choice, even more sensible (but more work), would be to allow us to
specify the charset used when decoding the binary form of the names. A rather
stupid idea would be to hard-code a charset. An especially stupid idea would
be to hard-code the choice of a variable-width charset since that could easily
make it impossible to read a valid file. A really, really, stupid idea would
be to hard-code a variable-width charset that nobody else on Earth ever uses.
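To make the "raw character data" option concrete, here is a rough sketch. If a reader handed back each name byte as one char (which is what decoding as ISO-8859-1 amounts to), the original bytes could be recovered and decoded again with the right charset. The class and method names are made up for illustration, and this is not what the ZIP library itself does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RawNameRecovery {
    // Hypothetical helper: view the raw name bytes as chars, one char per byte.
    // ISO-8859-1 maps every byte value 0x00..0xFF to U+0000..U+00FF, so nothing is lost.
    static String asRawChars(byte[] nameBytes) {
        return new String(nameBytes, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: turn the raw chars back into the original bytes,
    // then decode them with the charset the archive writer actually used.
    static String recover(String rawName, Charset actualCharset) {
        byte[] original = rawName.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actualCharset);
    }

    public static void main(String[] args) {
        // Pretend these are the name bytes stored in an archive by a UTF-8 writer.
        byte[] stored = "\u65E5\u672C\u8A9E.txt".getBytes(StandardCharsets.UTF_8);

        String raw = asRawChars(stored);                      // garbled, but lossless
        String fixed = recover(raw, StandardCharsets.UTF_8);  // readable again

        System.out.println("raw length:   " + raw.length());
        System.out.println("fixed equals: " + fixed.equals("\u65E5\u672C\u8A9E.txt"));
    }
}
```

The recovery only works because the intermediate form keeps every byte intact. A decoder that rejects or replaces byte sequences it doesn't like destroys the information needed to undo the damage.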

Guess what Sun have done...

The* stuff is hardwired with the assumption that filenames in
ZIP files (and probably comments too, but I haven't checked) are encoded in the
weird, non-standard, incompatible, modified version of UTF-8 that is used
internally by some low-level JVM stuff (principally the JNI interface and the
JVM classfile format)! Nobody else on Earth uses it for anything.
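To see how the JVM's modified UTF-8 differs from the real thing, here is a small sketch that encodes a few strings both ways. It uses DataOutputStream.writeUTF, which is documented to write modified UTF-8 behind a two-byte length prefix. It only illustrates the two encodings; it is not the ZIP library's own code path, and the class and helper names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Standard UTF-8, as everybody else writes it.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // The JVM's modified UTF-8, obtained via DataOutputStream.writeUTF.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            // writeUTF prepends a 2-byte length; drop it to leave just the encoded chars.
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",             // 7-bit ASCII
            "\u65E5",        // a CJK character inside the 16-bit range
            "\u0000",        // NUL
            "\uD83D\uDE00"   // U+1F600, outside the 16-bit range (a surrogate pair in Java)
        };
        for (String s : samples) {
            System.out.println(hex(standardUtf8(s)) + "  vs  " + hex(modifiedUtf8(s)));
        }
    }
}
```

The first two samples come out identical in both encodings. NUL becomes the two bytes C0 80 in the modified form, and the character outside the 16-bit range becomes six bytes (each surrogate encoded separately) instead of the four that UTF-8 uses.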

Why ? I dunno. I just can't get over how stupid it is...

Anyway, the upshot of that is that you will only be able to read ZIP filenames
correctly if one of the following applies:

1) The filenames are all 7-bit ASCII.

2) The filenames are Unicode, AND all the characters are in the 16-bit Unicode
range, AND the application that generated the ZIP file used UTF-8 for filename
encoding. (That /should/ work over the limited range since nul-bytes are
either illegal or unlikely in file names).

3) The filenames are Unicode, AND the OS is Windows (which uses UTF-16 for
filenames), AND the application that wrote it was under the impression that
Unicode is 16-bit and therefore blindly converted each 16-bit quantity into 1,
2, or 3 bytes in the same way as UTF-8 would.

4) The file was created with Sun's bloody idiotic*.

In point of fact, neither (2) nor (3) seems particularly unlikely to me, but
apparently neither case applies to your problem.

If the filenames are /not/ Unicode but are in the user's local charset (8-bit
or variable-width) then you are stuffed, I'm afraid. There's not even any
guarantee that the* will not throw errors when it tries to read
the filenames, since they may not be valid byte-sequences according to the
encoding that Sun has hard-wired. The same goes if the filenames are Unicode,
use characters outside the 16-bit range, and were written in correct UTF-8.
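For what it's worth, later Java releases (7 onwards) added constructors on java.util.zip.ZipInputStream, ZipOutputStream and ZipFile that take a java.nio.charset.Charset for entry names. Assuming you are on such a JDK, and assuming java.util.zip is the library in question, a minimal sketch looks like the following. The class and method names are made up, and ISO-8859-1 is used only because every JDK is guaranteed to have it; an archive with Asian names would more likely need something like Shift_JIS or GBK, if your JDK ships them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameCharsetDemo {
    // Build a one-entry archive in memory, encoding the entry name with nameCharset.
    static byte[] writeZip(String entryName, Charset nameCharset) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(buf, nameCharset)) {
                zos.putNextEntry(new ZipEntry(entryName));
                zos.write(new byte[] {1, 2, 3});
                zos.closeEntry();
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // List the entry names, decoding them with the charset the reader was told to use.
    static List<String> listNames(byte[] zipBytes, Charset nameCharset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Writer and reader agree on the charset, so the name survives the round trip.
        byte[] zip = writeZip("r\u00E9sum\u00E9.txt", StandardCharsets.ISO_8859_1);
        System.out.println(listNames(zip, StandardCharsets.ISO_8859_1));
    }
}
```

This still leaves you having to find out which charset the writer used, since the archive itself generally won't tell you.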

-- chris

