Speed gap between zcat and zlib's GzipReader

Discussion in 'Ruby' started by David G. Andersen, Oct 19, 2004.

  1. I'm still in 1.8.1-land, so this may be old news, but
    GzipReader is (painfully) slow compared to using zcat
    to accomplish the same thing:

    The code:

    #!/scratch/ruby/bin/ruby

    require 'zlib'

    f = ARGV[0]

    s = Time.new
    infile = Zlib::GzipReader.new(File.new(f, "r"))
    #infile = IO.popen("zcat #{f}", "r")
    linecount = 0
    infile.each_line { |l|
      linecount += 1
    }
    e = Time.new
    print "Read #{linecount} lines in #{e - s} seconds\n"

    ------------------------------

    Tested on:
    FreeBSD port-installed ruby 1.8.1
    Freshly compiled 1.8.1
    Freshly compiled 1.8.1 with CFLAGS=-O2
    CVS version, CFLAGS=-O2

                 FBSD 1.8.1   1.8.1, O0   1.8.1 -O2   CVS, -O2
    popen zcat:  2.3          2.3         2.3         2.3
    GzipReader:  5.8          9.2         5.8         5.9

    Yowza. Before I poke more, is this expected, or a known
    slowness issue?

    -Dave

    --
    work: MIT Laboratory for Computer Science
    me:   http://www.angio.net/
     
    David G. Andersen, Oct 19, 2004
    #1

  2. On Fri, Oct 22, 2004 at 11:30:34AM +0900, Clifford Heath scribed:
    > David G. Andersen wrote:
    >
    > > popen("zcat foo.gz", "r") faster than GzipReader.each_line

    >
    > I had a similar problem which was discussed here at length a year or
    > so ago. If you avoid the block setup and use a fixed-length read, it's
    > quite a bit quicker. Still nowhere near as fast as Perl though :-(.


    Ahh, thanks. So the problem is really in GzipReader's each_line
    handling. It's actually pretty close to as fast as it could go
    when doing a fixed-length read: counting bytes only, with
    fixed-length reads, popen and GzipReader both take 1.4 seconds
    on my test file. A zcat to /dev/null takes 1.18 seconds, and
    piping to 'wc' takes 1.83 seconds. No complaints.
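
    (The byte-counting script itself wasn't posted; the sketch below is one
    plausible reconstruction of it, with the 64 KB chunk size and variable
    names being assumptions:)

    #!/scratch/ruby/bin/ruby

    require 'zlib'

    f = ARGV[0]
    CHUNK = 65536                               # assumed read size

    s = Time.new
    infile = Zlib::GzipReader.new(File.new(f, "r"))
    #infile = IO.popen("zcat #{f}", "r")        # swap in to compare against zcat
    bytecount = 0
    # Fixed-length reads, no per-line block: just count the bytes.
    while buf = infile.read(CHUNK)
      bytecount += buf.length
    end
    e = Time.new
    print "Read #{bytecount} bytes in #{e - s} seconds\n"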

    gzfile_read is fast.
    gzfile_read_more is fast (used by gzfile_read).
    But gzreader_gets... is a dog. It does a memcmp()
    on each byte of the input string to test it against
    the delimiter - yow! So, it looks like zlib's "gets"
    needs the equivalent of rb_io_getline_fast. Would
    be nice if that were easily re-used, but the FILE *
    access is buried pretty deep inside of it.
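
    (As a user-level stopgap along the lines of Clifford's suggestion, and
    not part of any patch in this thread, one could do the line splitting in
    Ruby on top of fixed-length reads; the helper name and chunk size below
    are made up:)

    require 'zlib'

    # Yield one line at a time, reading the gzip stream in fixed-size chunks
    # and splitting on newlines in Ruby instead of using GzipReader#each_line.
    def each_line_chunked(gz, chunk = 65536)
      buf = ""
      while data = gz.read(chunk)
        buf << data
        while nl = buf.index("\n")
          yield buf.slice!(0..nl)        # hand back one line, delimiter included
        end
      end
      yield buf unless buf.empty?        # trailing line with no newline
    end

    gz = Zlib::GzipReader.new(File.new(ARGV[0], "r"))
    linecount = 0
    each_line_chunked(gz) { |line| linecount += 1 }
    print "Read #{linecount} lines\n"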

    Guess I'll have to dig up some spare time next week. :)

    -Dave

    --
    work: MIT Laboratory for Computer Science
    me:   http://www.angio.net/
     
    David G. Andersen, Oct 26, 2004
    #2

  3. On Tue, Oct 26, 2004 at 10:06:55AM +0900, David G. Andersen scribed:
    >
    > Ahh, thanks. So the problem is really in GzipReader's each_line
    > handling.
    > [...]


    > But gzreader_gets... is a dog. It does a memcmp()
    > on each byte of the input string to test it against


    I've attached a patch that reduces some of the overhead
    for files with longer lines (but doesn't fix all of the
    slowdowns). Some benchmarks, w/1.8.1 on FreeBSD,
    grabbing data out of the gzipped file with file.gets():

    "tarfile" - compressed JDK. Line length is long (random data...)
    "words" - /usr/share/dict/words gzipped. Lines are very short.
    "logfile" - logfile from one of my experiments. Lines are
    between 15 and 120 bytes long.

                popen   GzReader-orig   GzReader-patched
                -----   -------------   ----------------
    tarfile     2.06        5.65             2.95
    words       0.914       2.4              2.22
    logfile     1.18        3.65             2.27

    The patch is tiny and non-intrusive, which is a bonus, though its
    performance improvement is not spectacular for short lines. Helps
    with gzipped logfiles, at least, but someone with more {time,
    knowledge of ruby's internals} might want to go in and overhaul
    things for real.

    -Dave


    --- orig-zlib.c    Mon Oct 25 22:01:18 2004
    +++ zlib.c         Mon Oct 25 22:33:26 2004
    @@ -2470,7 +2470,7 @@
     {
         struct gzfile *gz = get_gzfile(obj);
         VALUE rs, dst;
    -    char *rsptr, *p;
    +    char *rsptr, *p, *res;
         long rslen, n;
         int rspara;

    @@ -2520,8 +2520,15 @@
                 gzfile_read_more(gz);
                 p = RSTRING(gz->z.buf)->ptr + n - rslen;
             }
    -        if (memcmp(p, rsptr, rslen) == 0) break;
    -        p++, n++;
    +        res = memchr(p, rsptr[0], (gz->z.buf_filled - n + 1));
    +        if (!res) {
    +            n = gz->z.buf_filled + 1;
    +        } else {
    +            n += (long)(res - p);
    +            p = res;
    +            if (rslen == 1 || memcmp(p, rsptr, rslen) == 0) break;
    +            p++, n++;
    +        }
         }

         gz->lineno++;
     
    David G. Andersen, Oct 26, 2004
    #3
  4. Hi,

    In message "Re: Speed gap between zcat and zlib's GzipReader"
    on Tue, 26 Oct 2004 11:37:50 +0900, "David G. Andersen" <> writes:

    |I've attached a patch that reduces some of the overhead
    |for files with longer lines (but doesn't fix all of the
    |slowdowns). Some benchmarks, w/1.8.1 on FreeBSD,
    |grabbing data out of the gzipped file with file.gets():

    I'm impressed. I will merge your patch.

    matz.
     
    Yukihiro Matsumoto, Oct 26, 2004
    #4
