BadZipfile "file is not a zip file"


John Machin

and in another message webcomm wrote:
 > I ran the diagnostic gadget...
 >
 > archive size is 69888
 > FileHeader at 0
 > CentralDir at 43796
 > EndArchive at 43846

> This is telling you that the archive ends at 43846,

Not quite. In a "normal" uncommented archive, EndArchive_pos + 22 ==
archive_size.

> but the file
> is 69888 bytes long (69888 - 43846 = 26042 post-archive bytes).
> Have you tried calling getzip(filename, ignoreable=30000)?
> The whole point of the function is to ignore the nasty stuff at the
> end, but if _I_ had a file with more than 25K of post-archive bytes,
> I'd certainly try to figure out if the archive was mis-handled
> somewhere along the way.

Me too. Further, if I wasn't "ever diplomatic" :), I wouldn't be
calling software (or people!) that blithely ignored 25kb of
unexplained data "forgiving" ... some other f-words, perhaps.
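
For reference, the rule quoted above can be checked mechanically. Here is a minimal sketch of my own (not John's gadget; it assumes a single-disk archive with no Zip64 records) that locates the EndArchive record and reports how many unexplained bytes trail it:

# eocd_check.py -- how many bytes trail the EndArchive (EOCD) record?
# Sketch only: assumes a single-disk archive and no Zip64 records.
import struct
import sys

buff = open(sys.argv[1], 'rb').read()
pos = buff.rfind("PK\x05\x06")            # EndArchive signature
if pos < 0:
    sys.exit("no EndArchive record found")
comment_len = struct.unpack("<H", buff[pos + 20:pos + 22])[0]
expected_end = pos + 22 + comment_len     # fixed 22-byte record + comment
print "archive should end at", expected_end
print "file size is", len(buff)
print "unexplained trailing bytes:", len(buff) - expected_end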
 

webcomm

> Send us a sample of this file in question...

Here's a sample with some dummy data from the web service:
http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...
f = open('data.zip', 'wb')

If I open the file it contains as Unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it. It looks
like valid XML. But if I return it to my browser with python+django,
there are bad characters every other character.

If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

If I unzip it like this...
getzip('data.zip', ignoreable=30000)
...using Scott's function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.
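
That "bad characters every other character" symptom is what UTF-16 text looks like when it is treated as an 8-bit encoding: in UTF-16LE every ASCII character is followed by a NUL byte. A tiny illustration (my own example, not data from the web service):

# UTF-16LE mistaken for an 8-bit encoding: a NUL appears
# after every ASCII character.
text = u'<data>hello</data>'
raw = text.encode('utf-16-le')
print repr(raw)                      # '<\x00d\x00a\x00t\x00a\x00>\x00...'
print repr(raw.decode('latin-1'))    # u'<\x00d\x00...' -- a "bad" char every other position
print repr(raw.decode('utf-16-le')) # correct round trip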
 

John Machin

> I ran the diagnostic gadget...
>
> archive size is 69888
> FileHeader at 0
> CentralDir at 43796
> EndArchive at 43846

Thanks. Would you mind spending a few minutes more on this so that we
can see if it's a problem that can be fixed easily, like the one that
Chris Mellon reported?

The above output says that there are 43868 (43846 + 22) bytes of
usable data. That leaves 69888 - 43868 = 26020 bytes of "comment" ...
rather large for a comment. Have you run a virus scanner over this
file?

At the end is an updated version of the diagnostic gadget. It explores
the "EndArchive" structure and the comment at the end, with a special
check for all '\0' (as per Chris's bug report) and another for all
blank. Please run it over your file and show us the results. Note: you
may want to suppress the display of the first 100 bytes of comment if
it turns out to be private data.

Cheers,
John

8<---
# zip_susser_v2.py
import struct
import sys

grimoire = [
    ("FileHeader",          "PK\003\004"),  # magic number for file header
    ("DataDescriptor",      "PK\x07\x08"),  # see PKZIP APPNOTE (V) (C)
    ("CentralDir",          "PK\001\002"),  # magic number for central directory
    ("EndArchive",          "PK\005\006"),  # magic number for end of archive record
    ("EndArchive64",        "PK\x06\x06"),  # magic token for Zip64 header
    ("EndArchive64Locator", "PK\x06\x07"),  # magic token for locator header
    ("ArchiveExtraData",    "PK\x06\x08"),  # APPNOTE (V) (E)
    ("DigitalSignature",    "PK\x05\x05"),  # APPNOTE (V) (F)
    ]
f = open(sys.argv[1], 'rb')
buff = f.read()
f.close()
blen = len(buff)
print "archive size is", blen
for magic_name, magic in grimoire:
    pos = 0
    while pos < blen:
        pos = buff.find(magic, pos)
        if pos < 0:
            break
        print "%s at %d" % (magic_name, pos)
        pos += 4
#
# find what is in the EndArchive struct
#
structEndArchive = "<4s4H2LH"  # 9 [sic] items, end of archive, 22 bytes
posEndArchive = buff.find("PK\005\006")
print "using posEndArchive =", posEndArchive
assert 0 < posEndArchive < blen
endArchive = struct.unpack(structEndArchive,
                           buff[posEndArchive:posEndArchive + 22])
print "endArchive:", repr(endArchive)
endArchiveFieldNames = """
    signature
    this_disk_num
    central_dir_disk_num
    central_dir_this_disk_num_entries
    central_dir_overall_num_entries
    central_dir_size
    central_dir_offset
    comment_size
    """.split()
for name, value in zip(endArchiveFieldNames, endArchive):
    print "%33s : %r" % (name, value)
#
# inspect the comment
#
actual_comment_size = blen - 22 - posEndArchive
expected_comment_size = endArchive[7]
comment = buff[posEndArchive + 22:]
print
print "expected_comment_size:", expected_comment_size
print "actual_comment_size:", actual_comment_size
print "comment is all spaces:", comment == ' ' * actual_comment_size
print "comment is all '\\0':", comment == '\0' * actual_comment_size
print "comment (first 100 bytes):", repr(comment[:100])
8<---
 

webcomm

> Thanks. Would you mind spending a few minutes more on this so that we
> can see if it's a problem that can be fixed easily, like the one that
> Chris Mellon reported?

Don't mind at all. I'm now working with a zip file with some dummy
data I downloaded from the web service. You'll notice it's a smaller
archive than the one I was working with when I ran zip_susser.py, but
it has the same problem (whatever the problem is). It's the one I
uploaded to http://webcomm.webfactional.com/htdocs/data.zip

Here's what I get when I run zip_susser_v2.py...

archive size is 1092
FileHeader at 0
CentralDir at 844
EndArchive at 894
using posEndArchive = 894
endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
                        signature : 'PK\x05\x06'
                    this_disk_num : 0
             central_dir_disk_num : 0
central_dir_this_disk_num_entries : 1
  central_dir_overall_num_entries : 1
                 central_dir_size : 50
               central_dir_offset : 844
                     comment_size : 0

expected_comment_size: 0
actual_comment_size: 176
comment is all spaces: False
comment is all '\0': True
comment (first 100 bytes):
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00'

Not sure if you've seen this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864?hl=en#

Thanks,
Ryan
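
Given that output (comment_size says 0, yet 176 NUL bytes follow the EndArchive record), one plausible repair, sketched here under the same single-disk/non-Zip64 assumptions as the gadget, is to truncate the data where the record says the archive ends before handing it to zipfile:

# Sketch: cut the archive off where the EndArchive record says it ends,
# then let zipfile read the result.
import struct
import zipfile
import cStringIO

buff = open('data.zip', 'rb').read()
pos = buff.rfind("PK\x05\x06")
comment_len = struct.unpack("<H", buff[pos + 20:pos + 22])[0]
clean = buff[:pos + 22 + comment_len]    # drops the 176 trailing NULs
zf = zipfile.ZipFile(cStringIO.StringIO(clean), 'r')
print zf.namelist()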
 

MRAB

webcomm said:
> Here's a sample with some dummy data from the web service:
> http://webcomm.webfactional.com/htdocs/data.zip
>
> That's the zip created in this line of my code...
> f = open('data.zip', 'wb')
>
> If I open the file it contains as Unicode in my text editor (EditPlus)
> on Windows XP, there is ostensibly nothing wrong with it. It looks
> like valid XML. But if I return it to my browser with python+django,
> there are bad characters every other character.
>
> If I unzip it like this...
> popen("unzip data.zip")
> ...then the bad characters are 'FFFD' characters as described and
> pictured here...
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
>
> If I unzip it like this...
> getzip('data.zip', ignoreable=30000)
> ...using Scott's function at...
> http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
> ...then the bad characters are \x00 characters.
I can unzip it in Windows XP. The file within it (called "data") is XML
encoded as UTF-16LE (2 bytes per character, low byte first), but without
the initial byte order mark. Python's zipfile module says "BadZipfile:
File is not a zip file".
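
Because the byte order mark is missing, the self-detecting 'utf-16' codec has nothing to go on; the byte order has to be named explicitly when decoding, e.g.:

# No BOM, so spell out the byte order yourself when decoding.
raw = open('data', 'rb').read()
text = raw.decode('utf-16-le')    # low byte first, as described above
# With a BOM present, raw.decode('utf-16') could detect the order itself.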
 

MRAB

MRAB said:
> I can unzip it in Windows XP. The file within it (called "data") is XML
> encoded as UTF-16LE (2 bytes per character, low byte first), but without
> the initial byte order mark. Python's zipfile module says "BadZipfile:
> File is not a zip file".
If I strip off all but the last 4 zero-bytes then the zipfile module can
open it:

import base64, zipfile

decoded = base64.b64decode(datum)   # datum: the base64 payload from the web service
five_zeros = chr(0) * 5
while decoded.endswith(five_zeros):
    decoded = decoded[ : -1]
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')
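
The last four zero bytes have to survive because they belong to the archive itself: this particular end-of-central-directory record legitimately ends in four 0x00 bytes (the high half of the 844 central-directory offset plus the two-byte comment length), so a bare rstrip('\0') would eat into the record. Assuming the padding is at least five NULs long, the loop above amounts to:

# Shortcut equivalent of the while-loop (same assumption: >= 5 trailing NULs):
if decoded.endswith('\0' * 5):
    decoded = decoded.rstrip('\0') + '\0' * 4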
 

John Machin

> Don't mind at all. I'm now working with a zip file with some dummy
> data I downloaded from the web service. You'll notice it's a smaller
> archive than the one I was working with when I ran zip_susser.py, but
> it has the same problem (whatever the problem is).

You mean it produces the same symptom. zipfile.py has several paths
to the symptom, i.e. the uninformative "bad zipfile" exception; we
don't know which path yet. That's why Martin was suggesting that you
debug the sucker; that's why I'm trying to do it for you by remote
control. It is not impossible for a file with dummy data to have been
handcrafted or otherwise produced by a process different from that
used for a real-data file. Please run v2 of the gadget on the
real-data zip and report the results.
> It's the one I
> uploaded to http://webcomm.webfactional.com/htdocs/data.zip

> Here's what I get when I run zip_susser_v2.py...
>
> archive size is 1092
> [snip: same diagnostic output as shown above]
>
> Not sure if you've seen this thread...
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

Yeah, I've seen it ... (sigh) ... pax Steve Holden, but *please* stick
with one thread ...
 

webcomm

> It is not impossible for a file with dummy data to have been
> handcrafted or otherwise produced by a process different from that
> used for a real-data file.

I knew it was produced by the same process, or I wouldn't have shared
it. : )
But you couldn't have known that.

> Yeah, I've seen it ... (sigh) ... pax Steve Holden, but *please* stick
> with one thread ...

Thanks... I thought I was posting about separate issues and would
annoy people who were only interested in one of the issues if I put
them both in the same thread. I guess all posts re: the same script
should go in one thread, even if the questions posed may be unrelated
and may be separate issues. There are grey areas.

Problem solved in John Machin's post at
http://groups.google.com/group/comp...8341539d87989?hl=en&lnk=raot#03b8341539d87989

I'll post the final code when it's prettier.

-Ryan
 

webcomm

If anyone's interested, here are my django views...


from django.shortcuts import render_to_response
from django.http import HttpResponse
from xml.etree.ElementTree import ElementTree
import urllib, base64, subprocess

def get_data(request):
    service_url = 'http://www.something.com/webservices/someservice/etc?user=etc&pass=etc'
    xml = urllib.urlopen(service_url)
    # the base64-encoded string is in a one-element xml doc...
    tree = ElementTree()
    xml_doc = tree.parse(xml)
    datum = ""
    for node in xml_doc.getiterator():
        datum = "%s" % (node.text)
    decoded = base64.b64decode(datum)

    dir = '/path/to/data/'
    f = open(dir + 'data.zip', 'wb')
    f.write(decoded)
    f.close()

    # unzip returns an exit status; the extracted file is read separately
    subprocess.call('unzip ' + dir + 'data.zip -d ' + dir, shell=True)
    data = open(dir + 'data', 'rb').read()
    txt = data.decode('utf_16_le')

    return render_to_response('output.html', {
        'output': txt
    })

def read_xml(request):
    # page using the get_data view
    xml = urllib.urlopen('http://www.something.org/get_data/')
    xml = xml.read()
    xml = unicode(xml)
    xml = '<?xml version="1.0" encoding="UTF-8"?>\n<stuff>' + xml + '</stuff>'

    f = open('/path/to/temp.txt', 'w')
    f.write(xml)
    f.close()

    tree = ElementTree()
    xml_doc = tree.parse('/path/to/temp.txt')
    datum = ""
    for node in xml_doc.getiterator():
        datum = "%s<br />%s - %s" % (datum, node.tag, node.text)

    return render_to_response('output.html', {
        'output': datum
    })
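
With the trailing padding accounted for, the subprocess/unzip step could arguably be replaced by the zipfile module. A hedged sketch of the middle of get_data (names as above; the archive member is assumed to be called "data", as MRAB found):

import base64, zipfile, cStringIO

decoded = base64.b64decode(datum)                # datum as built in get_data above
if decoded.endswith('\0' * 5):
    decoded = decoded.rstrip('\0') + '\0' * 4    # keep the EOCD's own four NULs
zf = zipfile.ZipFile(cStringIO.StringIO(decoded), 'r')
txt = zf.read('data').decode('utf_16_le')        # member name assumed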
 

Chris Mellon

> I knew it was produced by the same process, or I wouldn't have shared
> it. : )
> But you couldn't have known that.
>
> Thanks... I thought I was posting about separate issues and would
> annoy people who were only interested in one of the issues if I put
> them both in the same thread. I guess all posts re: the same script
> should go in one thread, even if the questions posed may be unrelated
> and may be separate issues. There are grey areas.
>
> Problem solved in John Machin's post at
> http://groups.google.com/group/comp...8341539d87989?hl=en&lnk=raot#03b8341539d87989


It's worth pointing out (although the provider probably doesn't care)
that this isn't really an XML document and this was a bad way for them
to distribute the data. If they'd used a correctly formatted XML
document (with the prelude and everything) with the correct encoding
information, existing XML parsers should have just Done The Right
Thing with the data, instead of you needing to know the encoding a
priori to extract an XML fragment.
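
For comparison, a payload that is a complete XML document, with a declaration and (for UTF-16) a byte order mark, parses without the client knowing the encoding beforehand. An illustrative example, not the provider's actual format:

# A self-describing UTF-16 document: the parser picks up the encoding
# from the BOM and declaration instead of the client hard-coding it.
from xml.etree.ElementTree import fromstring

doc = u'<?xml version="1.0" encoding="UTF-16"?>\n<data>hello</data>'
raw = doc.encode('utf-16')    # prepends the byte order mark
root = fromstring(raw)
print root.tag, root.text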
 

webcomm

> It's worth pointing out (although the provider probably doesn't care)
> that this isn't really an XML document and this was a bad way for them
> to distribute the data. If they'd used a correctly formatted XML
> document (with the prelude and everything) with the correct encoding
> information, existing XML parsers should have just Done The Right
> Thing with the data, instead of you needing to know the encoding a
> priori to extract an XML fragment.

Agreed. I can't say I understand their rationale for doing it this way.
 

Steve Holden

webcomm said:
> On Jan 12, 11:53 am, "Chris Mellon" <[email protected]> wrote:
> [file distribution horror story ...]
> Agreed. I can't say I understand their rationale for doing it this way.

Sadly their rationale is irrelevant to the business of making sense of
the data, which we all hope you have eventually managed to do.

regards
Steve
 
