really slow gzip decompress, why?


redbaron

I have one big (6.9 GB) .gz file with text inside it.
zcat bigfile.gz > /dev/null does the job in 4 minutes 50 seconds.

Python code doing the same job has been running for 25 minutes and still
hasn't finished =( The code is the simplest I could imagine:

import gzip
import sys

def main():
    fh = gzip.open(sys.argv[1])
    all(fh)  # iterates over every line and discards it

As far as I understand, most of the time is spent executing C code, so
Python's overhead should not be noticeable. Why is it so slow?
 

Diez B. Roggisch

I'm guessing here, but if gzip streams (and AFAIK it does), the command-line
tool will simply stream its output to /dev/null.

OTOH, if Python is not streaming and instead allocates buffers for the
whole file, that might take a while for a *zipped* 6.9 GB file.

Diez
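Whether buffering is really the cause is only a guess in the post above. One way to take memory out of the picture is to decompress incrementally by hand. The following is a minimal sketch, not anything posted in the thread: the function name count_decompressed_bytes and the 64 KB chunk size are made up for illustration, and it assumes a single-member gzip file.

```python
import zlib


def count_decompressed_bytes(path, chunk_size=64 * 1024):
    """Stream-decompress a gzip file and return the uncompressed size.

    Only one compressed chunk and its decompressed output are held in
    memory at a time. Assumes a single-member gzip file; data after the
    first member would be ignored.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    total = 0
    with open(path, "rb") as raw:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            total += len(decompressor.decompress(chunk))
        # Pick up anything still buffered inside the decompressor.
        total += len(decompressor.flush())
    return total
```

If a loop like this runs in roughly zcat's time, the slowdown is more likely in how the data is consumed than in decompression or buffering.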
 

Jeff McNeil

Look at what's happening in both operations. The zcat operation simply
uncompresses your data and dumps it directly to /dev/null. Nothing is
done with the data as it's uncompressed.

On the other hand, when you call 'all(fh)', you're iterating through
every line in bigfile.gz. In other words, you're reading the file and
scanning it for newlines, rather than simply running the
decompression operation.
 

Jeff McNeil

The File:
----------------------------------------------------
[jeff@marvin ~]$ ls -alh junk.gz
-rw-rw-r-- 1 jeff jeff 113M 2009-01-26 10:42 junk.gz
[jeff@marvin ~]$

The 'zcat' time:
----------------------------------------------------
[jeff@marvin ~]$ time zcat junk.gz > /dev/null

real 0m2.390s
user 0m2.296s
sys 0m0.093s
[jeff@marvin ~]$


Test Script #1:
----------------------------------------------------
import sys
import gzip

fs = gzip.open('junk.gz')
data = fs.read(8192)
while data:
    sys.stdout.write(data)
    data = fs.read(8192)


Test Script #1 Time:
----------------------------------------------------
[jeff@marvin ~]$ time python test9.py >/dev/null

real 0m3.681s
user 0m3.201s
sys 0m0.478s
[jeff@marvin ~]$


Test Script #2:
----------------------------------------------------
import sys
import gzip

fs = gzip.open('junk.gz')
all(fs)


Test Script #2 Time:
----------------------------------------------------
[jeff@marvin ~]$ time python test10.py

real 1m51.764s
user 1m51.475s
sys 0m0.245s
[jeff@marvin ~]$
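If the lines themselves are needed, one possible workaround is to combine the two approaches: read large chunks as in Test Script #1 and split them on newlines in Python. The following is a rough sketch under assumptions of my own, not something measured in this thread: the helper name iter_lines and the 1 MB chunk size are hypothetical, and it yields bytes with the trailing newline kept.

```python
import gzip


def iter_lines(path, chunk_size=1 << 20):
    """Yield lines (bytes, newline included) from a gzip file.

    Reads the decompressed stream in large chunks and splits on b"\n",
    carrying any incomplete trailing line over to the next chunk.
    """
    tail = b""
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            parts = (tail + chunk).split(b"\n")
            # The last piece has no newline yet; keep it for the next round.
            tail = parts.pop()
            for part in parts:
                yield part + b"\n"
    # A final line without a trailing newline is still a line.
    if tail:
        yield tail
```

Usage would look like "for line in iter_lines('junk.gz'): ...". Whether this actually beats plain iteration depends on the Python version, so it is worth timing both on your own data.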
 
