Unzip: Memory Error

mcl

I am trying to unzip an 18 MB zip containing just a single 200 MB file
and I get a MemoryError. When I run the code on a smaller file (1 MB
zip, 11 MB file), it works fine.

I am running on a hosted Apache web server

I am using some code I found on the web somewhere.


def unzip_file_into_dir(file, dir):
    #os.mkdir(dir, 0777)
    zfobj = zipfile.ZipFile(file)
    for name in zfobj.namelist():
        if name.endswith('/'):
            os.mkdir(os.path.join(dir, name))
        else:
            outfile = open(os.path.join(dir, name), 'wb')
            outfile.write(zfobj.read(name))
            outfile.close()


Error Traceback: line 357 gives the MemoryError.
I have removed paths from file references.
==============================================

MemoryError    Python 2.3.4: /usr/bin/python
Wed Aug 29 19:38:22 2007

A problem occurred in a Python script. Here is the sequence of
function calls leading up to the error, in the order they occurred.

/qlunzip.py
   58
   59
   60 if __name__ == "__main__" :
   61     unzipMain()
   62
unzipMain = <function unzipMain>

/qlunzip.py in unzipMain()
   53     destdir = getDestDir()
   54     print destdir, gv.nl
   55     unzip_file_into_dir(zips, destdir)
   56
   57
global unzip_file_into_dir = <function unzip_file_into_dir>,
zips = '/pcodes.zip', destdir = '/pcodes/'

/qlunzip.py in unzip_file_into_dir(file='pcodes.zip', dir='pcodes/')
   34     else:
   35         outfile = open(os.path.join(dir, name), 'wb')
   36         outfile.write(zfobj.read(name))
   37         outfile.close()
   38
outfile = <open file 'pcodes/pcodes.lst', mode 'wb'>,
outfile.write = <built-in method write of file object>,
zfobj = <zipfile.ZipFile instance>,
zfobj.read = <bound method ZipFile.read of <zipfile.ZipFile instance>>,
name = 'pcodes.lst'

/usr/lib/python2.3/zipfile.py in read(self=<zipfile.ZipFile instance>, name='pcodes.lst')
  355     # zlib compress/decompress code by Jeremy Hylton of CNRI
  356     dc = zlib.decompressobj(-15)
  357     bytes = dc.decompress(bytes)
  358     # need to feed in unused pad byte so that zlib won't choke
  359     ex = dc.decompress('Z') + dc.flush()
bytes = '\xc4\x9d]\x93\xab8\x92\x86\xef7b\xff\x83\xa3/\xf6f\xba\xa7\xe7\xa2g#vwf6\x8a\x02\xc3\xd04\x8d\r\x1e\x7f\xdclP\xb6\xcav\x1c\xca\xd4`\xfbx...; \xb7jp\x06V{\xaf\xc3\xa5\xa7;\xdd\xd2\xaaD\x7f)c\xc6\x9d\x0f\xf2\xff-\xc9\x92\xc3\x1d\xa4`\xe0\xb8\x06)\x188\x9cA\n\x06\x8e\x1bPc\xf8\xf0\x1f',
dc = <zlib.Decompress object>,
dc.decompress = <built-in method decompress of zlib.Decompress object>

MemoryError:
args = ()
============================================================

Any help much appreciated

Richard
 
David Bolen

mcl said:
I am trying to unzip an 18 MB zip containing just a single 200 MB file
and I get a MemoryError. When I run the code on a smaller file (1 MB
zip, 11 MB file), it works fine. (...)
def unzip_file_into_dir(file, dir):
    #os.mkdir(dir, 0777)
    zfobj = zipfile.ZipFile(file)
    for name in zfobj.namelist():
        if name.endswith('/'):
            os.mkdir(os.path.join(dir, name))
        else:
            outfile = open(os.path.join(dir, name), 'wb')
            outfile.write(zfobj.read(name))
            outfile.close()

The "zfobj.read(name)" call is reading the entire file out of the zip
into a string in memory. It sounds like it's exceeding the resources
you have available (whether overall or because the Apache runtime
environment has stricter limits).

You may want to peek at a recent message from me in the "Unable to
read large files from zip" thread, as the suggestion there may also be
suitable for your purposes.

http://groups.google.com/group/comp.lang.python/msg/de04105c170fc805?dmode=source
-- David
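On any modern Python (2.6 and later), the incremental approach David describes is built into the standard library: ZipFile.open() returns a file-like object on a member, so the archive can be extracted in fixed-size chunks with shutil.copyfileobj() and never holds the whole file in memory. A sketch, assuming Python 3 (the function name is illustrative, not part of the original code):

```python
import os
import shutil
import zipfile

def unzip_file_into_dir_streaming(zip_path, dest_dir, chunk=64 * 1024):
    """Extract every member of a zip archive without loading any
    single member fully into memory: copyfileobj() streams the
    decompressed data to disk in fixed-size chunks."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            target = os.path.join(dest_dir, name)
            if name.endswith('/'):
                os.makedirs(target, exist_ok=True)
            else:
                with zf.open(name) as src, open(target, 'wb') as dst:
                    shutil.copyfileobj(src, dst, chunk)
```

This was not an option on the original poster's Python 2.3, which is why the hand-rolled generator approach below was needed at the time.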
 
mcl

The "zfobj.read(name)" call is reading the entire file out of the zip
into a string in memory. It sounds like it's exceeding the resources
you have available (whether overall or because the Apache runtime
environment has stricter limits).

You may want to peek at a recent message from me in the "Unable to
read large files from zip" thread, as the suggestion there may also be
suitable for your purposes.

http://groups.google.com/group/comp.lang.python/msg/de04105c170fc805?...
-- David

David,

Thank you. I read your post and I basically understood the concept,
but I could not get my head around the code I need to write for my
solution. (Newbie, and a bit long in the tooth.)

To solve my problem, I think my best approach would be to read the
zipped file or files from the zip archive only when I need them. Max
three users, occasional use, so no big overloading of the host's server.

pseudo code

zfhdl = zopen(zip, filename)   # open file in zip archive for reading

while True:
    ln = zfhdl.readline()      # get next line of file
    if not ln:                 # EOF
        break
    dealwithline(ln)           # do whatever is necessary with the line
zfhdl.close()

That is probably oversimplified, and probably wrong, but you may get
the idea of what I am trying to achieve.

Richard
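On Python 2.6+ this pseudocode maps almost directly onto the standard library: ZipFile.open() returns a file-like object that can be wrapped and iterated line by line, decompressing incrementally. A sketch, assuming Python 3 (the helper name is illustrative):

```python
import io
import zipfile

def zip_readlines(zip_path, member):
    """Yield lines from one member of a zip archive, decompressing
    incrementally instead of reading the whole file into memory.
    Line endings are stripped, like the pseudocode intends."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as raw:  # binary file-like object
            for line in io.TextIOWrapper(raw, encoding='utf-8'):
                yield line.rstrip('\r\n')
```

Usage then matches the pseudocode's loop: `for ln in zip_readlines('pcodes.zip', 'pcodes.lst'): dealwithline(ln)`.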
 
David Bolen

mcl said:
pseudo code

zfhdl = zopen(zip, filename)   # open file in zip archive for reading

while True:
    ln = zfhdl.readline()      # get next line of file
    if not ln:                 # EOF
        break
    dealwithline(ln)           # do whatever is necessary with the line
zfhdl.close()

That is probably oversimplified, and probably wrong, but you may get
the idea of what I am trying to achieve.

Do you have to process the file as a textual line-by-line file? Your
original post showed code that just dumped the file to the filesystem.
If you could back up one step further and describe the final operation
you need to perform it might be helpful.

If you are going to read the file data incrementally from the zip file
(which is what my other post provided) you'll prevent the huge memory
allocations and risk of running out of resource, but would have to
implement your own line ending support if you then needed to process
that data in a line-by-line mode. Not terribly hard, but more
complicated than my prior sample which just returned raw data chunks.

Depending on your application need, it may still be simpler to just
perform an extraction of the file to temporary filesystem space (using
my prior code for example) and then open it normally.

-- David
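The "extract to temporary filesystem space, then open normally" route David mentions is also only a few lines on a modern Python, since ZipFile.extract() (added in 2.6) streams the member to disk with bounded memory. A sketch under that assumption (the helper name is illustrative; the caller is responsible for cleaning up the temporary directory):

```python
import tempfile
import zipfile

def open_member_via_tempfile(zip_path, member):
    """Extract one member into a fresh temporary directory and
    return an open file object on the extracted copy, so it can
    be processed with the ordinary line-by-line file API."""
    tmpdir = tempfile.mkdtemp()
    with zipfile.ZipFile(zip_path) as zf:
        extracted = zf.extract(member, path=tmpdir)
    return open(extracted, 'r')
```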
 
David Bolen

David Bolen said:
If you are going to read the file data incrementally from the zip file
(which is what my other post provided) you'll prevent the huge memory
allocations and risk of running out of resource, but would have to
implement your own line ending support if you then needed to process
that data in a line-by-line mode. Not terribly hard, but more
complicated than my prior sample which just returned raw data chunks.

Here's a small example of a ZipFile subclass (tested a bit this time)
that implements two generator methods:

read_generator Yields raw data from the file
readline_generator Yields "lines" from the file (per splitlines)

It also corrects my prior code posting which didn't really skip over
the file header properly (due to the variable sized name/extra
fields). Needs Python 2.3+ for generator support (or 2.2 with
__future__ import)

Peak memory use is set "roughly" by the optional chunk parameter.
It's rough because chunk measures the compressed data read from the
archive, so each block will grow in memory during decompression. And
the readline generator adds further copies for the data split into
lines.

For your file processing by line, it could be used as in:

zipf = ZipFileGen('somefile.zip')

g = zipf.readline_generator('somefilename.txt')
for line in g:
    dealwithline(line)

zipf.close()

Even if not a perfect match, it should point you further in the right
direction.

-- David

- - - - - - - - - - - - - - - - - - - - - - - - -

import zipfile
import zlib
import struct

class ZipFileGen(zipfile.ZipFile):

    def read_generator(self, name, chunk=65536):
        """Return a generator that yields file bytes for name incrementally.
        The optional chunk parameter controls the chunk size read from the
        underlying zip file.  For compressed files, the data length returned
        by the generator will be larger, since each yielded block is the
        decompressed version of a chunk.

        Note that unlike read(), this method does not preserve the internal
        file pointer and should not be mixed with write operations.  Nor does
        it verify that the ZipFile is still opened and for reading.

        Multiple generators returned by this function are not designed to be
        used simultaneously (they do not re-seek the underlying file for
        each request)."""

        zinfo = self.getinfo(name)
        compressed = (zinfo.compress_type == zipfile.ZIP_DEFLATED)
        if compressed:
            dc = zlib.decompressobj(-15)

        self.fp.seek(zinfo.header_offset)

        # Skip the file header (from zipfile.ZipFile.read())
        fheader = self.fp.read(30)
        if fheader[0:4] != zipfile.stringFileHeader:
            raise zipfile.BadZipfile, "Bad magic number for file header"

        fheader = struct.unpack(zipfile.structFileHeader, fheader)
        fname = self.fp.read(fheader[zipfile._FH_FILENAME_LENGTH])
        if fheader[zipfile._FH_EXTRA_FIELD_LENGTH]:
            self.fp.read(fheader[zipfile._FH_EXTRA_FIELD_LENGTH])

        # Process the file incrementally
        remain = zinfo.compress_size
        while remain:
            bytes = self.fp.read(min(remain, chunk))
            remain -= len(bytes)
            if compressed:
                bytes = dc.decompress(bytes)
            yield bytes

        if compressed:
            bytes = dc.decompress('Z') + dc.flush()
            if bytes:
                yield bytes

    def readline_generator(self, name, chunk=65536):
        """Return a generator that yields lines from a file within the zip
        incrementally.  Line ending detection is based on splitlines(), and
        like file.readline(), the returned line does not include the line
        ending.  Efficiency not guaranteed if used with non-textual files.

        Uses a read_generator() generator to retrieve file data incrementally,
        so it inherits the limitations of that method as well, and the
        optional chunk parameter is passed to read_generator unchanged."""

        partial = ''
        g = self.read_generator(name, chunk=chunk)

        for bytes in g:
            # Break current chunk into lines
            lines = bytes.splitlines()

            # Add any prior partial line to first line
            if partial:
                lines[0] = partial + lines[0]

            # If the current chunk didn't happen to break on a line ending,
            # save the partial line for next time
            if bytes[-1] not in ('\n', '\r'):
                partial = lines.pop()

            # Then yield the lines we've identified so far
            for curline in lines:
                yield curline

        # Return any trailing data (if file didn't end in a line ending)
        if partial:
            yield partial
 
David Bolen

I said:
Here's a small example of a ZipFile subclass (tested a bit this time)
that implements two generator methods:

Argh, not quite tested enough - one fix needed; change:

    if bytes[-1] not in ('\n', '\r'):
        partial = lines.pop()

to:

    if bytes[-1] not in ('\n', '\r'):
        partial = lines.pop()
    else:
        partial = ''

(add the extra two lines)

-- David
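The failure mode behind this fix is easy to see in isolation: if a chunk ends exactly on a line ending, `partial` still holds text that was already glued onto the chunk's first line, and without the reset it gets replayed at the start of the next chunk as well. The corrected line-reassembly logic, lifted out as a standalone helper for illustration (string-based, and assuming every chunk is non-empty; the function name is not from the thread):

```python
def lines_from_chunks(chunks):
    """Reassemble complete lines from an iterable of text chunks,
    using the corrected logic: once a saved partial line has been
    consumed, it must be cleared, or a chunk ending exactly on a
    line ending would replay it into the next chunk."""
    partial = ''
    for chunk in chunks:
        lines = chunk.splitlines()
        if partial:
            lines[0] = partial + lines[0]
        if chunk[-1] not in ('\n', '\r'):
            partial = lines.pop()  # chunk ended mid-line: save the tail
        else:
            partial = ''           # the two added lines from the fix
        for line in lines:
            yield line
    if partial:                    # data after the final line ending
        yield partial
```

Without the `else` branch, feeding chunks like `['ab', 'cd\n', 'ef\n']` would yield `'abcd'` followed by `'abef'` instead of `'ef'`.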
 
