Compress::Zlib vs. external gzip call


odigity

I'm writing a script that needs to run as fast as possible.
Every minute counts. The script crawls a tree of gzipped files
totalling about 30GB. Originally I was calling open() with "gzip
-cd $file |", but I hate making external calls - it requires a fork, and
you have very limited communication with the process for catching
errors and such. I always prefer using Perl functions and modules
over external calls when possible. However, I wanted to make sure I
wouldn't take a performance hit before switching to Compress::Zlib.

I picked one of the bigger files (75MB) and ran some benchmarking on
it, comparing Compress::Zlib to an external call to the gzip utility.
Here's the code:

#!/usr/bin/perl -w
use strict;
use Benchmark qw( cmpthese );
use Compress::Zlib;
use IO::File;

my $file = 'sample.gz';

print "warming up the file...\n";
system( "zcat $file > /dev/null" );

print "starting comparison...\n";
cmpthese( 3, {
    'ext_gzip'      => \&ext_gzip,
    'compress_zlib' => \&compress_zlib,
});


sub ext_gzip
{
    my $fh = IO::File->new( "gzip -cd $file |" )
        or die( "could not gzip -cd '$file' for reading: $!" );
    my $lines = 0;
    while ( defined( my $line = $fh->getline() ) ) {
        $lines++;
    }
    $fh->close();
    print "ext_gzip: $lines lines\n";
}


sub compress_zlib
{
    my $gz = gzopen( $file, 'rb' )
        or die( "could not gzopen '$file' for reading: $!" );
    my $line;
    my $lines = 0;
    while ( ( my $bytes = $gz->gzreadline( $line ) ) > 0 ) {
        die( $gz->gzerror ) if ( $bytes == -1 );
        $lines++;
    }
    $gz->gzclose();
    print "compress_zlib: $lines lines\n";
}

Here's the output:

warming up the file...
starting comparison...
compress_zlib: 15185003 lines
compress_zlib: 15185003 lines
compress_zlib: 15185003 lines
(warning: too few iterations for a reliable count)
ext_gzip: 15185003 lines
ext_gzip: 15185003 lines
ext_gzip: 15185003 lines
(warning: too few iterations for a reliable count)
               s/iter compress_zlib ext_gzip
compress_zlib    68.6            --     -23%
ext_gzip         52.8           30%       --

Now, this wasn't the best possible benchmarking test, but I still
think I am justified in being concerned.

Any help - a) interpreting these results, b) suggesting better
benchmarking methods, c) explaining why Compress::Zlib is slower than
gzip, and, most importantly, d) improving performance - would be
appreciated.

-ofer
 

Stuart Moore

odigity said:
I'm writing a script that needs to run as fast as possible.
Every minute counts. The script crawls a tree of gzipped files
totalling about 30GB. Originally I was calling open() with "gzip
-cd $file |", but I hate making external calls - it requires a fork, and
you have very limited communication with the process for catching
errors and such. I always prefer using Perl functions and modules
over external calls when possible. However, I wanted to make sure I
wouldn't take a performance hit before switching to Compress::Zlib.

Just thinking out loud here:
- Would the time measured by "Benchmark" include the time to start gzip?
Does it measure total time, or just time when the perl process is using
the CPU? Do the times mentioned match what you'd get with a stopwatch?

- Might it be worth looking at some of the smaller files as well?
Possibly the time taken to start gzip is less significant on the large
ones than on the small ones.

- Is there any way you can keep the gzip process open and only call it
once to decompress multiple files? One fork is better than many.
 

Sisyphus

odigity said:
sub ext_gzip
{
    my $fh = IO::File->new( "gzip -cd $file |" )
        or die( "could not gzip -cd '$file' for reading: $!" );
    my $lines = 0;
    while ( defined( my $line = $fh->getline() ) ) {
        $lines++;
    }
    $fh->close();
    print "ext_gzip: $lines lines\n";
}


sub compress_zlib
{
    my $gz = gzopen( $file, 'rb' )
        or die( "could not gzopen '$file' for reading: $!" );
    my $line;
    my $lines = 0;
    while ( ( my $bytes = $gz->gzreadline( $line ) ) > 0 ) {

The die( $gz->gzerror ) if ( $bytes == -1 ) line in compress_zlib is a
waste of time: if $bytes is -1 then the code inside the loop will not
be executed, so the test can never fire. It's also one test that the
other subroutine doesn't have to do. I don't think, however, that it
will account for the entire time difference .... remove it and see.

I also wonder whether there is more overhead in determining whether
$bytes > 0 than there is in determining whether $line is defined.

And I don't know how 'getline()' and 'gzreadline()' compare - both in
terms of what they actually do, and in terms of how fast they do it.
        die( $gz->gzerror ) if ( $bytes == -1 );
        $lines++;
    }
    $gz->gzclose();
    print "compress_zlib: $lines lines\n";
}

Cheers,
Rob
 

Anno Siegel

odigity said:
Now, this wasn't the best possible benchmarking test, but I still
think I am justified in being concerned.

I'm afraid the benchmark is useless. Benchmark doesn't count the
CPU time spent in children, so you're not catching the interesting
part in ext_gzip.

Anno
 

odigity

Stuart Moore said:
Just thinking out loud here:
- Would the time measured by "Benchmark" include the time to start gzip?
Does it measure total time, or just time when the perl process is using
the CPU? Do the times mentioned match what you'd get with a stopwatch?

Off the top of my head, I'm not sure whether Benchmark is capable of
measuring child processes. I probably need to take that into account
and just use straight clock time and enough iterations to smooth out
system behaviour.
- Might it be worth looking at some of the smaller files as well?
Possibly the time taken to start gzip is less significant on the large
ones than on the small ones.

Perhaps... most of the files are small, but I think most of the time
is spent on the few big files. And I also simply wanted to determine
which was faster at actual decompression. Still, a valid point.
- Is there any way you can keep the gzip process open and only call it
once to decompress multiple files? One fork is better than many.

Hmm... I suppose I could use open2 to connect to both STDIN and STDOUT
and keep feeding it, but then I'd have to read the compressed files
myself in Perl and print them to the gzip process, which I'd bet money
will be slower. And there are too many files to build a list and shove
them all onto a single command line - and man gzip reveals no option
for reading a list of filenames from a file.
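
Just to make the idea concrete, something like this is what I'd have to
write - one long-lived gzip -cd with a forked writer shovelling the
compressed files into its stdin. The file list and buffer size here are
made up, and all the files come back out as one undifferentiated
stream, so I'd lose track of file boundaries:

use strict;
use IPC::Open2;

my @files = glob( 'tree/*.gz' );    # stand-in for the real file list

# one gzip process for everything
my $pid = open2( my $gz_out, my $gz_in, 'gzip', '-cd' );

# writer child: feed every compressed file into gzip's stdin, then close
# it so gzip can finish; a separate process avoids the deadlock you'd get
# trying to write and read from the same loop
my $writer = fork();
die( "fork failed: $!" ) unless defined $writer;
if ( $writer == 0 ) {
    close( $gz_out );
    for my $file ( @files ) {
        open( my $fh, '<', $file ) or die( "could not open '$file': $!" );
        binmode( $fh );
        my $buf;
        print {$gz_in} $buf while read( $fh, $buf, 65536 );
        close( $fh );
    }
    close( $gz_in );
    exit( 0 );
}

# parent: read the decompressed output of all the files as a single stream
close( $gz_in );
my $lines = 0;
$lines++ while <$gz_out>;
close( $gz_out );
waitpid( $_, 0 ) for ( $writer, $pid );
print "total: $lines lines\n";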
 

odigity

Sisyphus said:
The die( $gz->gzerror ) if ( $bytes == -1 ) line in compress_zlib is a
waste of time: if $bytes is -1 then the code inside the loop will not
be executed, so the test can never fire. It's also one test that the
other subroutine doesn't have to do. I don't think, however, that it
will account for the entire time difference .... remove it and see.

Yes; I rearranged my code a few times before settling on a pattern I
like, and that bug remained as a consequence. I don't think a scalar
comparison operation is going to have a noticeable effect relative to
the cost of reading from disk, decompressing data, and copying it
around in memory.
I also wonder whether there is more overhead in determining whether
$bytes > 0 than there is in determining whether $line is defined.

Benchmark it! :) But I don't think it matters here.
And I don't know how 'getline()' and 'gzreadline()' compare - both in
terms of what they actually do, and in terms of how fast they do it.

Yes, well... that's half the question.

-ofer
 

odigity

Anno Siegel said:
I'm afraid the benchmark is useless. Benchmark doesn't count the
CPU time spent in children, so you're not catching the interesting
part in ext_gzip.

You're probably right. I need to redo this and just use straight clock time.
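
Something along these lines, reusing the ext_gzip and compress_zlib
subs from the original script and just timing wall-clock seconds with
Time::HiRes (three runs each is still arbitrary):

use Time::HiRes qw( gettimeofday tv_interval );

for my $pair ( [ ext_gzip => \&ext_gzip ], [ compress_zlib => \&compress_zlib ] ) {
    my ( $name, $sub ) = @{$pair};
    for my $run ( 1 .. 3 ) {
        my $t0 = [ gettimeofday() ];
        $sub->();    # runs the decompression and prints the line count
        printf( "%s run %d: %.1f wall-clock seconds\n",
                $name, $run, tv_interval( $t0 ) );
    }
}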

-ofer
 
