Zlib::GzipReader and multiple compressed blobs in a single stream

Discussion in 'Ruby' started by Jos Backus, Jan 28, 2011.

  1. Jos Backus

    Jos Backus Guest

    Hi,

    I'm trying to inflate a set of concatenated gzipped blobs stored in a single
    file. As it stands, Zlib::GzipReader only inflates the first blob. It
    appears that the #unused instance method would return the remaining data,
    ready to be passed into Zlib::GzipReader, but calling it yields an error:

    method `method_missing' called on hidden T_STRING object

    What could be going on here?

    On a related note, Zlib::GzipReader#{pos,tell} returns the position in the
    output stream (zstream.total_out) whereas I am looking for the position in
    the input stream. I tried making zstream.total_in available but the value
    appears to be 18 bytes short in my test file, that is, the next header is
    found 18 bytes beyond what zstream.total_in reports.

    Does anybody know how to make the library return the correct offset into the
    input stream so multiple compressed blobs can be handled?

    Thanks,
    Jos

    --
    Peace cannot be achieved through violence, it can only be attained through
    understanding.
     
    Jos Backus, Jan 28, 2011
    #1

  2. Jeremy Bopp

    Jeremy Bopp Guest

    Re: Zlib::GzipReader and multiple compressed blobs in a single stream

    On 01/28/2011 05:09 PM, Jos Backus wrote:
    > Hi,
    >
    > I'm trying to inflate a set of concatenated gzipped blobs stored in a single
    > file. As it stands, Zlib::GzipReader only inflates the first blob. It
    > appears that the unused instance method would return the remaining data,
    > ready to be passed into Zlib::GzipReader, but it yields an error:
    >
    > method `method_missing' called on hidden T_STRING object
    >
    > What could be going on here?


    I'm not sure what's going on, but I was hoping you could solve your
    problem by running something like this:

    File.open('gzipped.blobs') do |f|
      begin
        loop do
          Zlib::GzipReader.open(f) do |gz|
            puts gz.read
          end
        end
      rescue Zlib::GzipFile::Error
        # End of file reached.
      end
    end

    Unfortunately, Ruby 1.8 doesn't appear to support passing anything other
    than a file name to Zlib::GzipReader.open, and Ruby 1.9 seems to always
    reset the file position to the beginning of the file prior to starting
    extraction when you really need it to just start working from the
    current position. So it doesn't appear that you can do this with the
    standard library.

    As part of a ZIP library I wrote, there is a more general implementation
    of a Zlib stream filter. Install the archive-zip gem and then try the
    following:

    gem 'archive-zip'
    require 'archive/support/zlib'

    File.open('gzipped.blobs') do |f|
      until f.eof? do
        Zlib::ZReader.open(f, 15 + 16) do |gz|
          gz.delegate_read_size = 1
          puts gz.read
        end
      end
    end


    This isn't super efficient because we have to hack the
    delegate_read_size to be 1 byte in order to ensure that the trailing
    gzip data isn't sucked into the read buffer of the current ZReader
    instance and hence lost between iterations. It shouldn't be too bad
    though since the File object should be handling its own buffering.

    BTW, I wrote some pretty detailed documentation for Zlib::ZReader. It
    should explain what the 15 + 16 is all about in the open method in case
    you need to tweak things for your own streams.
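
    The short version: 15 is the usual maximum window size in bits, and
    adding 16 tells zlib to expect a gzip header and trailer around the raw
    deflate data (adding 32 instead auto-detects gzip or zlib framing). The
    plain Zlib::Inflate class from the standard library follows the same
    convention, so purely as an illustration:

    require 'zlib'

    # windowBits conventions (illustrative only, nothing ZReader-specific):
    gzip_only  = Zlib::Inflate.new(15 + 16)  # expect a gzip wrapper
    autodetect = Zlib::Inflate.new(15 + 32)  # accept gzip or zlib framing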

    > On a related note, Zlib::GzipReader#{pos,tell} returns the position in the
    > output stream (zstream.total_out) whereas I am looking for the position in
    > the input stream. I tried making zstream.total_in available but the value
    > appears to be 18 bytes short in my test file, that is, the next header is
    > found 18 bytes beyond what zstream.total_in reports.


    I think total_in is counting only the compressed data; however,
    following the compressed data is a trailer as required for gzip blobs.
    You could probably always add 18 to whatever you get, but as I noted
    earlier, the implementation of GzipReader seems to always reset any file
    object back to the beginning of the stream rather than start processing
    it from an existing position. I can't find any documentation listing a
    way to force GzipReader to jump to any other file position after
    initialization either.
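
    For the arithmetic, assuming no optional header fields: a gzip member is
    a fixed 10-byte header, the raw deflate data, and an 8-byte trailer
    holding the CRC-32 and the uncompressed length. If total_in only counts
    the deflate bytes, that would account for the 18-byte gap you're seeing,
    though I haven't verified exactly what total_in excludes. Roughly:

    # Back-of-the-envelope only; a member that sets FNAME, FEXTRA, etc. in
    # its header flags has a longer header than this.
    GZIP_HEADER_SIZE  = 10  # magic(2) + CM(1) + FLG(1) + MTIME(4) + XFL(1) + OS(1)
    GZIP_TRAILER_SIZE = 8   # CRC-32(4) + ISIZE(4)
    puts GZIP_HEADER_SIZE + GZIP_TRAILER_SIZE  # => 18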

    > Does anybody know how to make the library return the correct offset into the
    > input stream so multiple compressed blobs can be handled?


    Hopefully, my solution will work for you because I don't think the
    current implementation in the standard library will do what you need.

    -Jeremy
     
    Jeremy Bopp, Jan 30, 2011
    #2

  3. Jos Backus

    Jos Backus Guest

    Re: Zlib::GzipReader and multiple compressed blobs in a single stream

    Hi Jeremy,

    Thanks for your reply.

    On Mon, Jan 31, 2011 at 02:28:30AM +0900, Jeremy Bopp wrote:
    > On 01/28/2011 05:09 PM, Jos Backus wrote:

    [snip]
    > > Hi,
    > >
    > > I'm trying to inflate a set of concatenated gzipped blobs stored in a single
    > > file. As it stands, Zlib::GzipReader only inflates the first blob. It
    > > appears that the unused instance method would return the remaining data,
    > > ready to be passed into Zlib::GzipReader, but it yields an error:
    > >
    > > method `method_missing' called on hidden T_STRING object
    > >
    > > What could be going on here?

    >
    > I'm not sure what's going on, but I was hoping you could solve your
    > problem by running something like this:
    >
    > File.open('gzipped.blobs') do |f|
    >   begin
    >     loop do
    >       Zlib::GzipReader.open(f) do |gz|
    >         puts gz.read
    >       end
    >     end
    >   rescue Zlib::GzipFile::Error
    >     # End of file reached.
    >   end
    > end


    I tried something like this but as you point out, it doesn't work.

    > Unfortunately, Ruby 1.8 doesn't appear to support passing anything other
    > than a file name to Zlib::GzipReader.open, and Ruby 1.9 seems to always
    > reset the file position to the beginning of the file prior to starting
    > extraction when you really need it to just start working from the
    > current position. So it doesn't appear that you can do this with the
    > standard library.


    That's what it looks like, yes. Bummer.

    > As part of a ZIP library I wrote, there is a more general implementation
    > of a Zlib stream filter. Install the archive-zip gem and then try the
    > following:
    >
    > gem 'archive-zip'
    > require 'archive/support/zlib'
    >
    > File.open('gzipped.blobs') do |f|
    >   until f.eof? do
    >     Zlib::ZReader.open(f, 15 + 16) do |gz|
    >       gz.delegate_read_size = 1
    >       puts gz.read
    >     end
    >   end
    > end
    >
    >
    > This isn't super efficient because we have to hack the
    > delegate_read_size to be 1 byte in order to ensure that the trailing
    > gzip data isn't sucked into the read buffer of the current ZReader
    > instance and hence lost between iterations. It shouldn't be too bad
    > though since the File object should be handling its own buffering.


    This works, but sadly it is very slow. Whereas zcat takes under a second on my
    test file, this code takes about 17 seconds.

    > BTW, I wrote some pretty detailed documentation for Zlib::ZReader. It
    > should explain what the 15 + 16 is all about in the open method in case
    > you need to tweak things for your own streams.


    Great. But I didn't have to tweak anything, it just worked :)

    > > On a related note, Zlib::GzipReader#{pos,tell} returns the position in the
    > > output stream (zstream.total_out) whereas I am looking for the position in
    > > the input stream. I tried making zstream.total_in available but the value
    > > appears to be 18 bytes short in my test file, that is, the next header is
    > > found 18 bytes beyond what zstream.total_in reports.

    >
    > I think total_in is counting only the compressed data; however,
    > following the compressed data is a trailer as required for gzip blobs.
    > You could probably always add 18 to whatever you get, but as I noted
    > earlier, the implementation of GzipReader seems to always reset any file
    > object back to the beginning of the stream rather than start processing
    > it from an existing position. I can't find any documentation listing a
    > way to force GzipReader to jump to any other file position after
    > initialization either.


    Yeah, you'd have to feed GZipReader the right part of the input stream
    yourself and figure out how much it processed. Something tells me it's not
    always 18 but depends on internal buffering, which would invalidate the
    assumption of a fixed offset.

    > > Does anybody know how to make the library return the correct offset into the
    > > input stream so multiple compressed blobs can be handled?

    >
    > Hopefully, my solution will work for you because I don't think the
    > current implementation in the standard library will do what you need.


    It does, but it's very slow. Sigh.

    Thanks again, Jeremy.

    Cheers,
    Jos
    --
    Jos Backus
    jos at catnook.com
     
    Jos Backus, Feb 2, 2011
    #3
  4. Jeremy Bopp

    Jeremy Bopp Guest

    Re: Zlib::GzipReader and multiple compressed blobs in a single stream

    On 2/2/2011 1:37 PM, Jos Backus wrote:
    > It does, but it's very slow. Sigh.


    While I don't think you'll be able to make it as fast as zcat, given
    that zcat is 100% native code, you might be able to take the
    implementation of Zlib::ZReader and tweak it to avoid the need to read
    only 1 byte at a time from the delegate stream. Doing so should speed
    things up quite a bit. The existing code really isn't very involved.
    Most of the logic you would need to tweak is in the
    Zlib::ZReader#unbuffered_read method, which is actually fairly short.

    When @inflater reports that it has finished, it looks like you should be
    able to get whatever is left in its input buffer using
    @inflater.flush_next_in (from Zlib::ZStream). Then you can initialize a
    new Zlib::Inflate instance and pass that remaining data as the first
    input buffer to process. You would repeat this process every time the
    inflater reports it has finished until the end of the delegate is
    reached and there is no further data returned by flush_next_in.
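
    Off the top of my head, the rough shape would be something like the
    following, using only the standard zlib bindings. Treat it as an
    untested sketch: the file name and the 16 KiB chunk size are made up,
    and I haven't run it against real data.

    require 'zlib'

    File.open('gzipped.blobs', 'rb') do |f|
      inflater = Zlib::Inflate.new(15 + 16)  # expect a gzip wrapper
      loop do
        if inflater.finished?
          # Whatever the finished inflater did not consume is the start of
          # the next gzip member (or empty at the very end of the file).
          leftover = inflater.flush_next_in
          inflater.close
          break if leftover.empty? && f.eof?
          inflater = Zlib::Inflate.new(15 + 16)
          print inflater.inflate(leftover) unless leftover.empty?
        else
          break if f.eof?  # truncated final member
          print inflater.inflate(f.read(16 * 1024))
        end
      end
      inflater.close unless inflater.closed?
    end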

    If I get some time this evening, I'll look into creating a sample
    implementation. No promises though. :)

    -Jeremy
     
    Jeremy Bopp, Feb 2, 2011
    #4
  5. Eric Hodel

    Eric Hodel Guest

    Re: Zlib::GzipReader and multiple compressed blobs in a single stream

    On Jan 28, 2011, at 15:09, Jos Backus wrote:
    > I'm trying to inflate a set of concatenated gzipped blobs stored in a single
    > file. As it stands, Zlib::GzipReader only inflates the first blob. It
    > appears that the unused instance method would return the remaining data,
    > ready to be passed into Zlib::GzipReader, but it yields an error:
    >
    > method `method_missing' called on hidden T_STRING object
    >
    > What could be going on here?


    It's a bug, the internal buffer that libz uses is dup'd, but this is not
    enough to make it safe for use by ruby. I have filed a ticket and
    attached a stupid patch:

    http://redmine.ruby-lang.org/issues/show/4360
     
    Eric Hodel, Feb 3, 2011
    #5
  6. Jeremy Bopp

    Jeremy Bopp Guest

    Re: Zlib::GzipReader and multiple compressed blobs in a single stream

    On 02/02/2011 07:33 PM, Eric Hodel wrote:
    > On Jan 28, 2011, at 15:09, Jos Backus wrote:
    >> I'm trying to inflate a set of concatenated gzipped blobs stored in a single
    >> file. As it stands, Zlib::GzipReader only inflates the first blob. It
    >> appears that the unused instance method would return the remaining data,
    >> ready to be passed into Zlib::GzipReader, but it yields an error:
    >>
    >> method `method_missing' called on hidden T_STRING object
    >>
    >> What could be going on here?

    >
    > It's a bug, the internal buffer that libz uses is dup'd, but this is not enough to make it safe for use by ruby. I have filed a ticket and attached a stupid patch:
    >
    > http://redmine.ruby-lang.org/issues/show/4360


    Once your fix is in place and GZipReader#unused works correctly, is
    there any convenient way to take the returned string and continue
    processing it along with the remaining file contents with an instance of
    GzipReader?

    From my testing, it appears that GzipReader.open in Ruby 1.9 always
    rewinds any IO object you give it before inflating any data, so you
    can't use that method to create your instance if you need to start
    reading from anywhere other than the beginning of the stream.
    GzipReader.new doesn't have that problem, but there isn't any easy way
    to make use of that unused data from the earlier processing along with
    the remaining file contents. According to the documentation, you could
    create an IO-like wrapper that will first feed in that unused data
    followed by the real file data, and GzipReader.new should be able to use
    that, but that's a bit of a mess.
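
    By way of illustration, such a wrapper could look something like the
    sketch below. The class name is made up, and it assumes GzipReader only
    ever calls #read with a length on the object it is given, so it is
    untested rather than something I'd rely on as-is.

    # Hypothetical wrapper: serve the previously returned #unused bytes
    # first, then fall through to the underlying file.
    class UnusedPrefixIO
      def initialize(unused, io)
        @unused = unused ? unused.dup : ''
        @io = io
      end

      def read(length)
        return @io.read(length) if @unused.empty?
        chunk = @unused.slice!(0, length)
        shortfall = length - chunk.length
        chunk << (@io.read(shortfall) || '') if shortfall > 0
        chunk
      end
    end

    # e.g. Zlib::GzipReader.new(UnusedPrefixIO.new(gz.unused, file))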

    If all that really is a design limitation of GzipReader, having the
    unused data isn't very useful when attempting to inflate concatenated
    gzip blobs as zcat does. You may be able to make it work with a little
    judicious hacking, but it's certainly more effort than it should be.
    Maybe a ZcatReader is needed to plaster over things?

    BTW, why do GzipReader.open and GzipReader.new behave so differently
    with regard to the IO object you pass into them? They're a little
    closer in operation under Ruby 1.9 than they were under Ruby 1.8, but
    the difference is still surprising given the idiom followed by File.open
    and File.new where File.open is really just a simple wrapper around
    File.new that can help ensure that File#close is called at the end of
    your block.

    -Jeremy
     
    Jeremy Bopp, Feb 3, 2011
    #6
  7. Jos Backus

    Jos Backus Guest

    Re: Zlib::GzipReader and multiple compressed blobs in a single stream

    On Thu, Feb 03, 2011 at 10:33:59AM +0900, Eric Hodel wrote:
    > It's a bug, the internal buffer that libz uses is dup'd, but this is not
    > enough to make it safe for use by ruby. I have filed a ticket and attached
    > a stupid patch:
    >
    > http://redmine.ruby-lang.org/issues/show/4360


    Thanks, Eric!

    --
    Jos Backus
    jos at catnook.com
     
    Jos Backus, Feb 3, 2011
    #7
  8. Jos Backus

    Jos Backus Guest

    Re: Zlib::GzipReader and multiple compressed blobs in a single stream

    On Thu, Feb 03, 2011 at 02:03:49PM +0900, Jeremy Bopp wrote:
    > Once your fix is in place and GZipReader#unused works correctly, is
    > there any convenient way to take the returned string and continue
    > processing it along with the remaining file contents with an instance of
    > GzipReader?


    Fwiw, with the changes just committed to trunk the following code works for me
    on a file with multiple gzipped blobs:

    require 'stringio'
    require 'zlib'

    def inflate(filename)
      File.open(filename) do |file|
        zio = StringIO.new(file.read)
        loop do
          io = Zlib::GzipReader.new zio
          puts io.read
          unused = io.unused
          io.finish
          break if unused.nil?
          zio.pos -= unused.length
        end
      end
    end

    inflate "gz"

    Thanks,
    Jos

    --
    Jos Backus
    jos at catnook.com
     
    Jos Backus, Feb 3, 2011
    #8
  9. Jeremy Bopp

    Jeremy Bopp Guest

    Re: Zlib::GzipReader and multiple compressed blobs in a single stream

    On 2/3/2011 3:57 PM, Jos Backus wrote:
    > On Thu, Feb 03, 2011 at 02:03:49PM +0900, Jeremy Bopp wrote:
    >> Once your fix is in place and GZipReader#unused works correctly, is
    >> there any convenient way to take the returned string and continue
    >> processing it along with the remaining file contents with an instance of
    >> GzipReader?

    >
    > Fwiw, with the changes just committed to trunk the following code works for me
    > on a file with multiple gzipped blobs:
    >
    > require 'stringio'
    > require 'zlib'
    >
    > def inflate(filename)
    >   File.open(filename) do |file|
    >     zio = StringIO.new(file.read)
    >     loop do
    >       io = Zlib::GzipReader.new zio
    >       puts io.read
    >       unused = io.unused
    >       io.finish
    >       break if unused.nil?
    >       zio.pos -= unused.length
    >     end
    >   end
    > end
    >
    > inflate "gz"


    That's great! How does the performance compare to zcat with your data?

    BTW, this implementation does require that you have enough memory to
    hold all of the gzipped file data at once. That will be a problem with
    sufficiently large files or constrained resources.

    -Jeremy
     
    Jeremy Bopp, Feb 3, 2011
    #9
  10. Jos Backus

    Jos Backus Guest

    Re: Zlib::GzipReader and multiple compressed blobs in a single stream

    On Fri, Feb 04, 2011 at 07:38:04AM +0900, Jeremy Bopp wrote:
    > That's great! How does the performance compare to zcat with your data?


    Comparable:

    % time zcat gz > /dev/null
    zcat gz > /dev/null 0.29s user 0.00s system 99% cpu 0.296 total
    % time ./gzr > /dev/null
    ./gzr > /dev/null 0.31s user 0.07s system 99% cpu 0.383 total
    %

    > BTW, this implementation does require that you have enough memory to
    > hold all of the gzipped file data at once. That will be a problem with
    > sufficiently large files or constrained resources.


    Using the file directly should avoid that. Since we have a File, we don't need
    the StringIO object:

    require 'stringio'
    require 'zlib'

    def inflate(filename)
      File.open(filename) do |file|
        zio = file
        loop do
          io = Zlib::GzipReader.new zio
          puts io.read
          unused = io.unused
          io.finish
          break if unused.nil?
          zio.pos -= unused.length
        end
      end
    end

    inflate "gz"

    Cheers,
    Jos

    --
    Jos Backus
    jos at catnook.com
     
    Jos Backus, Feb 4, 2011
    #10
  11. Jeremy Bopp

    Jeremy Bopp Guest

    Re: Zlib::GzipReader and multiple compressed blobs in a single stream

    On 02/03/2011 06:12 PM, Jos Backus wrote:
    > On Fri, Feb 04, 2011 at 07:38:04AM +0900, Jeremy Bopp wrote:
    >> That's great! How does the performance compare to zcat with your data?

    >
    > Comparable:
    >
    > % time zcat gz > /dev/null
    > zcat gz > /dev/null 0.29s user 0.00s system 99% cpu 0.296 total
    > % time ./gzr > /dev/null
    > ./gzr > /dev/null 0.31s user 0.07s system 99% cpu 0.383 total
    > %


    Excellent.

    >> BTW, this implementation does require that you have enough memory to
    >> hold all of the gzipped file data at once. That will be a problem with
    >> sufficiently large files or constrained resources.

    >
    > Using the file directly should avoid that. Since we have a File, we don't need
    > the StringIO object:
    >
    > require 'stringio'
    > require 'zlib'
    >
    > def inflate(filename)
    >   File.open(filename) do |file|
    >     zio = file
    >     loop do
    >       io = Zlib::GzipReader.new zio
    >       puts io.read
    >       unused = io.unused
    >       io.finish
    >       break if unused.nil?
    >       zio.pos -= unused.length
    >     end
    >   end
    > end
    >
    > inflate "gz"


    The only case where I could see this failing now is if you were given a
    non-seekable IO such as a socket or a pipe from which to read. Of
    course, I apparently haven't been thinking of solutions to these
    problems myself very well, but you'll probably figure out something
    pretty quick. ;-)

    -Jeremy
     
    Jeremy Bopp, Feb 4, 2011
    #11