Python vs. Java gzip performance


Bill

I've written a small program that, in part, reads in a file and parses
it. Sometimes, the file is gzipped. The code that I use to get the
file object is like so:

if filename.endswith(".gz"):
    file = GzipFile(filename)
else:
    file = open(filename)

Then I parse the contents of the file in the usual way (for line in
file:...)

The equivalent Java code goes like this:

if (isZipped(aFile)) {
    input = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(aFile))));
} else {
    input = new BufferedReader(new FileReader(aFile));
}

Then I parse the contents similarly to the Python version (while
nextLine = input.readLine...)

The Java version of this code is roughly 2x-3x faster than the Python
version. I can get around this problem by replacing the Python
GzipFile object with an os.popen call to gzcat, but then I sacrifice
portability. Is there something that can be improved in the Python
version?

Thanks -- Bill.
 

Martin v. Löwis

Bill said:
The Java version of this code is roughly 2x-3x faster than the Python
version. I can get around this problem by replacing the Python
GzipFile object with an os.popen call to gzcat, but then I sacrifice
portability. Is there something that can be improved in the Python
version?

Don't use readline/readlines. Instead, read in larger chunks, and break
it into lines yourself. For example, if you think the entire file should
fit into memory, read it at once.
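
For instance, something like this (an illustrative sketch; the filename is
made up):

from gzip import GzipFile

# Decompress everything in one go, then let splitlines() do the line
# splitting in C instead of line by line in Python.
data = GzipFile("big.txt.gz").read()
for line in data.splitlines(True):  # True keeps the line endings
    pass  # process line here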

If that helps, try editing gzip.py to incorporate that approach.

Regards,
Martin
 

Caleb Hattingh

I tried this:

from timeit import *

# Try readlines
print Timer('import gzip; lines = gzip.GzipFile("gztest.txt.gz").readlines(); [i+"1" for i in lines]').timeit(200)

# Try iterating the file object - uses buffering?
print Timer('import gzip; [i+"1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200)

Produces:

3.90938591957
3.98982691765

Doesn't seem much difference, probably because the test file easily
gets into memory, and so disk buffering has no effect. The file
"gztest.txt.gz" is a gzipped file with 1000 lines, each being "This is
a test file".
 

Peter Otten

Caleb said:
I tried this:

from timeit import *

# Try readlines
print Timer('import gzip; lines = gzip.GzipFile("gztest.txt.gz").readlines(); [i+"1" for i in lines]').timeit(200)

# Try iterating the file object - uses buffering?
print Timer('import gzip; [i+"1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200)

Produces:

3.90938591957
3.98982691765

Doesn't seem much difference, probably because the test file easily
gets into memory, and so disk buffering has no effect. The file
"gztest.txt.gz" is a gzipped file with 1000 lines, each being "This is
a test file".

$ python -c "file('tmp.txt', 'w').writelines('%d This is a test\n' % n for n in range(1000))"
$ gzip tmp.txt

Now, if you follow Martin's advice:

$ python -m timeit -s "from gzip import GzipFile" "GzipFile('tmp.txt.gz').readlines()"
10 loops, best of 3: 20.4 msec per loop

$ python -m timeit -s "from gzip import GzipFile" "GzipFile('tmp.txt.gz').read().splitlines(True)"
1000 loops, best of 3: 534 usec per loop

Factor 38. Not bad, I'd say :)

Peter
 

Andrew MacIntyre

Bill said:
I've written a small program that, in part, reads in a file and parses
it. Sometimes, the file is gzipped. The code that I use to get the
file object is like so:

if filename.endswith(".gz"):
    file = GzipFile(filename)
else:
    file = open(filename)

Then I parse the contents of the file in the usual way (for line in
file:...)

The equivalent Java code goes like this:

if (isZipped(aFile)) {
    input = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(aFile))));
} else {
    input = new BufferedReader(new FileReader(aFile));
}

Then I parse the contents similarly to the Python version (while
nextLine = input.readLine...)

The Java version of this code is roughly 2x-3x faster than the Python
version. I can get around this problem by replacing the Python
GzipFile object with an os.popen call to gzcat, but then I sacrifice
portability. Is there something that can be improved in the Python
version?

The gzip module is implemented in Python on top of the zlib module. If
you peruse its source (particularly the readline() method of the GzipFile
class) you might get an idea of what's going on.

popen()ing a gzcat source achieves better performance by shifting the
decompression to an asynchronous execution stream (separate process)
while allowing the standard Python file object's optimised readline()
implementation (in C) to do the line splitting (which is done in Python
code in GzipFile).

I suspect the Java library probably does something similar under the
covers, perhaps using threads.

Short of rewriting the gzip module in C, you may get some better
throughput by using a slightly lower level approach to parsing the file:

z = GzipFile(filename)  # the gzip.GzipFile object from the original code
while 1:
    line = z.readline(size=4096)
    if not line:
        break
    ...  # process line here

This is probably only likely to be of use for files (such as log files)
with lines longer than the 100 character default in the readline()
method. More intricate approaches using z.readlines(sizehint=<size>)
might also work.
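
Something along these lines, perhaps (an untested sketch; the filename and
size hint are arbitrary):

from gzip import GzipFile

z = GzipFile("data.txt.gz")
while 1:
    # readlines() with a size hint returns roughly that many bytes' worth
    # of complete lines per call, and an empty list at end of file.
    lines = z.readlines(64 * 1024)
    if not lines:
        break
    for line in lines:
        pass  # process line here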

If you can afford the memory, approaches that read large chunks from the
gzipped stream and then split the lines in one low-level operation (so that
the line splitting is mostly done in C code) are the only way to lift
performance.
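
A rough sketch of that idea (illustrative only; the chunk size and helper
name are made up, and it assumes Python 2 byte strings as in the rest of
this thread):

from gzip import GzipFile

def gzip_lines(filename, chunksize=256 * 1024):
    # Read the gzipped stream in large chunks and let splitlines() (C code)
    # do the line splitting; carry partial lines over to the next chunk.
    z = GzipFile(filename)
    pending = ""
    while 1:
        chunk = z.read(chunksize)
        if not chunk:
            break
        pending += chunk
        lines = pending.splitlines(True)
        if lines and not lines[-1].endswith("\n"):
            pending = lines.pop()  # incomplete last line, keep for next round
        else:
            pending = ""
        for line in lines:
            yield line
    if pending:
        yield pending  # final line without a trailing newline

The parsing loop then stays as simple as "for line in gzip_lines(filename): ...".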

To me, if the performance matters, using popen() (or better: the
subprocess module) isn't so bad; it is actually quite portable
except for the dependency on gzip (probably better to use "gzip -dc"
rather than "gzcat" to maximise portability though). gzip is available
for most systems, and the approach is easily modified to use bzip2 as
well (though Python's bz2 module is implemented totally in C, and so
probably doesn't have the performance issues that gzip has).
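
For what it's worth, a minimal subprocess-based sketch might look like this
(untested; the helper name is made up and error handling is omitted):

import subprocess

def open_maybe_gzipped(filename):
    if filename.endswith(".gz"):
        # gzip -dc decompresses to stdout in a separate process; the pipe is
        # an ordinary file object, so its C-level line iteration is used.
        return subprocess.Popen(["gzip", "-dc", filename],
                                stdout=subprocess.PIPE).stdout
    return open(filename)

for line in open_maybe_gzipped("data.txt.gz"):
    pass  # process line here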
 

Serge Orlov

Bill said:
Is there something that can be improved in the Python version?

Seems like GzipFile.readlines is not optimized; the plain file object's
readlines works much better:

C:\py>python -c "file('tmp.txt', 'w').writelines('%d This is a test\n' % n for n in range(10000))"

C:\py>python -m timeit "open('tmp.txt').readlines()"
100 loops, best of 3: 2.72 msec per loop

C:\py>python -m timeit "open('tmp.txt').readlines(1000000)"
100 loops, best of 3: 2.74 msec per loop

C:\py>python -m timeit "open('tmp.txt').read().splitlines(True)"
100 loops, best of 3: 2.79 msec per loop

A workaround has been posted already.

-- Serge.
 

Caleb Hattingh

Hi Peter

Clearly I misunderstood what Martin was saying :) I was comparing
operations on lines via the file generator against first loading the
file's lines into memory, and then performing the concatenation.

What does ".readlines()" do differently that makes it so much slower
than ".read().splitlines(True)"? To me, the "one obvious way to do it"
is ".readlines()".

Caleb
 

Martin v. Löwis

Caleb said:
What does ".readlines()" do differently that makes it so much slower
than ".read().splitlines(True)"? To me, the "one obvious way to do it"
is ".readlines()".

readlines reads 100 bytes (at most) at a time. I'm not sure why it
does that (probably in order to not read further ahead than necessary
to get a line (*)), but for gzip, that is terribly inefficient. I
believe the gzip algorithms use a window size much larger than that -
not sure how the gzip library deals with small reads.

One interpretation would be that gzip decompresses the current block
over and over again if the caller only requests 100 bytes each time.
This is a pure guess - you would need to read the zlib source code
to find out.

Anyway, decompressing the entire file at once lets zlib operate at the
highest efficiency.

Regards,
Martin

(*) Guessing further, it might be that "read a lot" fails to work well
on a socket, as you would have to wait for the complete data before
even returning the first line.

P.S. Contributions to improve this are welcome.
 

Fulvio

Hello,

I'm very new to Python programming; I've just written a few hundred lines of
a program.
Now I'd like to go a step further and make a disk cataloger. There are plenty
for Windows, but few for Linux, so I'd like to write one that works on both.
I'm actually a bit stuck on how to collect information regarding disk names
(CD-ROMs or USB HDs).
The matter is rather difficult if the program is supposed to run on Linux as
well as it does on MS Windows.

Suggestions are very welcome.

Fulvio
 

Felipe Almeida Lessa

On Wed, 2006-03-22 at 00:47 +0100, "Martin v. Löwis" wrote:
Caleb said:
What does ".readlines()" do differently that makes it so much slower
than ".read().splitlines(True)"? To me, the "one obvious way to do it"
is ".readlines()".
[snip]
Anyway, decompressing the entire file at once lets zlib operate at the
highest efficiency.

Then there should be a fast-path on readlines like this:

def readlines(self, sizehint=None):
    if sizehint is None:
        return self.read().splitlines(True)
    # ...

Is it okay? Or is there any embedded problem I couldn't see?
 

Sybren Stuvel

Fulvio enlightened us with:
Now I'd like to go a step further and make a disk cataloger.

What kind of disk? Harddisks? DVDs? Audio CDs?

Fulvio also said:
I'm actually a bit stuck on how to collect information regarding disk
names (CD-ROMs or USB HDs).

Depends on what names you want. Filenames? Track names? Artist names?
Filesystem labels?

Sybren
 

Fulvio

At 21:22 on Wednesday, 22 March 2006, Sybren Stuvel wrote:
Depends on what names you want.

It seems clear that I meant _disk_ names. If it isn't too much, it would also
be useful to know the serial number, to avoid recording a disk twice. On
Windows we can call the Win32 API, but this won't work for Linux :-(
In my opinion, Python should read the partition table or disk header to find
this information.

For the remaining information, an os.walk(path) will suffice. OK, I'm not 100%
sure what that function does, but I remember it exists for both Windows and
Linux. Later I'll dig deeper for info according to the MIME file type.
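
(For the record, a minimal os.walk() sketch, with a made-up mount point:)

import os

# Walk a mounted disk and record every file path.
for dirpath, dirnames, filenames in os.walk("/media/cdrom"):
    for name in filenames:
        print(os.path.join(dirpath, name))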

Fulvio
 

Sybren Stuvel

Fulvio enlightened us with:
It seems clear that I meant _disk_ names.

What's a disk name? The filesystem label works as a disk name for
ISO-9660 CDROMs, but entire harddisks have no disk name - the
different partitions might have, though. Then again, it all depends on
the filesystems in use.

Fulvio also said:
Later I'll dig deeper for info according to the MIME file type.

See the mimetypes module for that.
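
A quick illustration (the filename is made up):

import mimetypes

# Guess the MIME type from the filename extension.
print(mimetypes.guess_type("holiday_photo.jpg"))  # ('image/jpeg', None)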

Sybren
 

Guest

Felipe said:
def readlines(self, sizehint=None):
    if sizehint is None:
        return self.read().splitlines(True)
    # ...

Is it okay? Or is there any embedded problem I couldn't see?

It's dangerous if the file is really large - it might exhaust
your memory. Such a setting shouldn't be the default.

Somebody should research what block size works best for gzip files,
and then compare that in performance to "read it all at once".

It would be good if the rationale for using at most 100 bytes at
a time could be discovered.
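
A rough way to compare block sizes might look like this (an untested sketch;
the file name and sizes are arbitrary, and it only times the raw reads,
ignoring lines that span block boundaries):

from gzip import GzipFile
from timeit import Timer

def read_in_blocks(filename, blocksize):
    # Read a gzipped file in fixed-size blocks and split each block into lines.
    z = GzipFile(filename)
    while 1:
        chunk = z.read(blocksize)
        if not chunk:
            break
        chunk.splitlines(True)

for size in (100, 1024, 16 * 1024, 256 * 1024):
    t = Timer("read_in_blocks('tmp.txt.gz', %d)" % size,
              "from __main__ import read_in_blocks")
    print("%8d bytes: %.4f s" % (size, min(t.repeat(3, 10))))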

Regards,
Martin
 

Fulvio

At 22:14 on Wednesday, 22 March 2006, Sybren Stuvel wrote:
different partitions might have, though. Then again, it all depends on
the filesystems in use.
Then I'll have to do some extra programming to gather this info, depending on
which OS it's running on :-(
Regarding the names, CD-ROMs and DVDs have a label, which can be written
during the burning process. Partitions also have a name, which can be set
with fdisk or in the MS Windows properties dialog.
Just for comparison: imagine something like WhereIsIt (Windows) or GWhere
(Linux), but I'd like mine to be free and less limited than the
before-mentioned programs.

F
 
