How to stream or write data into a tar.gz file as if the data were from files?

bwv549

I have a gazillion little files in memory (each is really just a chunk
of data, but it represents what needs to be a single file) and I need
to throw them all into a .tar.gz archive. In this case, it must be
in .tar.gz format and it must unzip into actual files--although I pity
the fellow that actually has to unzip this monstrosity.

Here are the solutions I've come up with so far:

1. Not portable, *extremely* slow:
write out all these "files" into a directory and make a system
call to tar (tar -czf ...)

2. Portable but still just as slow:
write out all these "files" into a directory and use
archive-tar-minitar to make the archive

3. Not portable, but fast:
stream information into tar/gzip to create the archive (without
ever first writing out files)

I've been looking around on this and the closest I've come is this:
tar cvf - some_directory | gzip - > some_directory.tar.gz

Note that this would still require me to write the files to a
directory (which must be avoided at all costs), but at least the
problem now is how to write data into a tar file. I've been googling
and still haven't turned up anything yet.

4. Hack archive-tar-minitar to enable me to write my data directly
into the format. Looking at the source code, this doesn't seem
terribly hard, but not terribly easy either. Am I missing a method
already written for this kind of thing?

Others?

Right now, anything resembling #3 or #4 would work for me.

My feeling is that it shouldn't be that hard to write data into
a .tar.gz format in either linux or ruby without actually having any
files (i.e., everything in memory or streamed in).

Thanks a lot for any suggestions or ideas!
 
Robert Klemme

I have a gazillion little files in memory (each is really just a chunk
of data, but it represents what needs to be a single file) and I need
to throw them all into a .tar.gz archive. In this case, it must be
in .tar.gz format and it must unzip into actual files--although I pity
the fellow that actually has to unzip this monstrosity.
3. Not portable, but fast:
stream information into tar/gzip to create the archive (without
ever first writing out files)

I've been looking around on this and the closest I've come is this:
tar cvf - some_directory | gzip - > some_directory.tar.gz

Note that this would still require me to write the files to a
directory (which must be avoided at all costs), but at least the
problem now is how to write data into a tar file. I've been googling
and still haven't turned up anything yet.

So why then do you say "without ever first writing out files"?

I'd say #3 (the original formulation) is the one to go. Googling for
"ruby tar" quickly turned up this:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/32588

And there is zlib, which can read and write gzip streams. So, if
ruby-tar can write into any stream, you've got your solution.
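
To illustrate the zlib half of that idea, here is a minimal sketch that assumes only Ruby's standard library. The helper names gzip_to_string and gunzip_string are made up for this example. It shows that Zlib::GzipWriter will write into any object that responds to #write, here an in-memory StringIO, so no file is needed:

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: gzip a string into memory, never touching the disk.
def gzip_to_string(data)
  buffer = StringIO.new(String.new)   # empty binary string as the sink
  gz = Zlib::GzipWriter.new(buffer)   # accepts any object that responds to #write
  gz.write(data)
  gz.finish                           # write the gzip trailer, leave buffer open
  buffer.string
end

# Hypothetical helper: the reverse direction, used to check the round trip.
def gunzip_string(bytes)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

compressed = gzip_to_string("my data")
puts gunzip_string(compressed)
```

Whatever produces the tar bytes only has to write them into the GzipWriter instead of into a File.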

Kind regards

robert
 
Brian Candler

Others?

Although it's not what you're asking for, as you mention "zipping" maybe
you could consider rubyzip:

  require 'zip/zipfilesystem'
  Zip::ZipFile.open("foo.zip") { |zfs|
    zfs.file.open("member.txt") { |f| f << data }
    zfs.commit
  }

zip is not tar, but it does have some advantages - in particular the
ability to get random-access to any particular member without having to
read through the whole thing from the start.
My feeling is that it shouldn't be that hard to write data into
a .tar.gz format in either linux or ruby without actually having any
files (i.e., everything in memory or streamed in).

When reading, rubyzip lets you spool directly out of the zip. When
writing, I think that behind the scenes it spools to a tempfile, and
when you commit it then packs this into the archive.
 
bwv549

So why then do you say "without ever first writing out files"?

I'm just trying to show that if I can stream out a tar file, then I
can at least pipe it into gzip (on many OS's). So, I'm really stuck
at making a tar file without actually having to write files to disk
first.
And there is zlib, which can read and write gzip streams. So, if
ruby-tar can write into any stream, you've got your solution.

I looked at ruby-tar (on your suggestion) but ruby-tar turns out to
not have any write capabilities.

So, I'm still looking deeper into archive-tar-minitar. I also found
'tarruby' (bindings to the C libtar library) in rubyforge but it seems
more difficult to hack into than minitar.

As pointed out, the difficulty here has been narrowed down to writing
tar files without having to write files out to disk first.

Sincere thanks for the suggestions.
 
bwv549

you could consider rubyzip:
  require 'zip/zipfilesystem'
  Zip::ZipFile.open("foo.zip") { |zfs|
    zfs.file.open("member.txt") { |f| f << data }
    zfs.commit
  }

This is *exactly* what I need to be able to do, except with .tar.gz
files. I will use this solution for now, even while still searching
for (or maybe writing) the .tar.gz equivalent. Short term, this will
get me by... [even though a .tar.gz equivalent would be really nice].

Thanks!!
 
Brian Candler

This is *exactly* what I need to be able to do, except with .tar.gz
files. I will use this solution for now

Do test it though. I tested it streaming large files in (100MB), and
found that it created a tempfile behind the scenes. If it does this for
*all* files, then it may not be any more efficient than using
archive-tar-minitar.

But it does have a simple API, which is essentially the same as File and
Dir. (Although unfortunately you can't use it to open a zipfile which is
within a zipfile :)
 
bwv549

Do test it though. I tested it streaming large files in (100MB), and

Yes, upon testing I saw that it was creating a bunch of temp files,
too. It's too bad since the API is so clean! Perhaps it will be
reimplemented someday...

********************************************************************
********************** A solution using Minitar *******************

So, I hacked on archive-tar-minitar for a while and came up with a
solution. Right now I add a class method that fits with the style of
the pack_file method (indeed, pilfers most of its code) and then I can
access it using the slightly lower level interface than 'pack':

require 'archive/tar/minitar'
require 'stringio'

module Archive::Tar::Minitar

  # entry may be a string (the name), or it may be a hash specifying the
  # following:
  #   :name   (REQUIRED)
  #   :mode   33188 (rw-r--r--, 0o100644) for files,
  #           16877 (rwxr-xr-x, 0o40755) for dirs
  #   :uid    nil
  #   :gid    nil
  #   :mtime  Time.now
  #
  # if data == nil, then this is considered a directory!
  # (use an empty string for a normal empty file)
  # data should be something that can be opened by StringIO
  def self.pack_as_file(entry, data, outputter) #:yields action, name, stats:
    outputter = outputter.tar if outputter.kind_of?(Archive::Tar::Minitar::Output)

    stats = {}
    stats[:uid] = nil
    stats[:gid] = nil
    stats[:mtime] = Time.now

    if data.nil?
      # a directory
      stats[:size] = 4096 # is this OK???
      stats[:mode] = 16877 # rwxr-xr-x
    else
      # tar sizes are byte counts, so use bytesize rather than size
      stats[:size] = data.bytesize
      stats[:mode] = 33188 # rw-r--r--
    end

    if entry.kind_of?(Hash)
      name = entry[:name]

      # caller-supplied values override the defaults above
      entry.each { |kk, vv| stats[kk] = vv unless vv.nil? }
    else
      name = entry
    end

    if data.nil? # a directory
      yield :dir, name, stats if block_given?
      outputter.mkdir(name, stats)
    else # a file
      outputter.add_file_simple(name, stats) do |os|
        stats[:current] = 0
        yield :file_start, name, stats if block_given?
        # read the in-memory data in 4096-byte chunks, as pack_file does
        StringIO.open(data, "rb") do |ff|
          until ff.eof?
            stats[:currinc] = os.write(ff.read(4096))
            stats[:current] += stats[:currinc]
            yield :file_progress, name, stats if block_given?
          end
        end
        yield :file_done, name, stats if block_given?
      end
    end
  end
end

#####################################
# Then to use it to make a .tgz file:
#####################################

require 'zlib'

file_names = ['a_dir/dorky1', 'dorky2', 'an_empty_dir']
file_data_strings = ['my data', 'my data also', nil]

tgz = Zlib::GzipWriter.new(File.open('my_tar.tgz', 'wb'))

Archive::Tar::Minitar::Output.open(tgz) do |outp|
  file_names.zip(file_data_strings) do |name, data|
    Archive::Tar::Minitar.pack_as_file(name, data, outp)
  end
end

***********************************************************

So, not terribly pretty, but not too terrible either.
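
For comparison, the same idea can be sketched without the minitar gem at all. This is an assumption-laden alternative, not the code above: it swaps in Gem::Package::TarWriter (the tar writer bundled with RubyGems) for minitar, and tgz_from_memory is a made-up helper name. As with pack_as_file, nil data is taken to mean "directory", and the whole .tar.gz is built in memory:

```ruby
require 'rubygems/package'
require 'zlib'
require 'stringio'

# Hypothetical helper: entries is an array of [name, data] pairs.
# nil data means "directory", mirroring pack_as_file.
def tgz_from_memory(entries)
  buffer = StringIO.new(String.new)          # in-memory sink for the .tar.gz bytes
  gz = Zlib::GzipWriter.new(buffer)
  Gem::Package::TarWriter.new(gz) do |tar|   # block form finishes the tar stream
    entries.each do |name, data|
      if data.nil?
        tar.mkdir(name, 0o755)
      else
        # the declared size must be the byte count of what gets written
        tar.add_file_simple(name, 0o644, data.bytesize) { |io| io.write(data) }
      end
    end
  end
  gz.finish                                  # write the gzip trailer
  buffer.string
end

file_names        = ['a_dir/dorky1', 'dorky2', 'an_empty_dir']
file_data_strings = ['my data', 'my data also', nil]

tgz_bytes = tgz_from_memory(file_names.zip(file_data_strings))
puts "archive is #{tgz_bytes.bytesize} bytes"
```

The resulting string can be handed to File.binwrite, or the StringIO can be replaced with an open File to stream straight to disk.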
 
ara.t.howard

This is *exactly* what I need to be able to do, except with .tar.gz
files. I will use this solution for now, even while still searching
for (or maybe writing) the .tar.gz equivalent. Short term, this will
get me by... [even though a .tar.gz equivalent would be really nice].

Thanks!!

IO.popen 'tar cfz -', 'w+' do |pipe|

end

and just send files down the pipe

a @ http://codeforpeople.com/
 
Brian Candler

Ara said:
IO.popen 'tar cfz -', 'w+' do |pipe|

end

and just send files down the pipe

Uh??

"tar cfz -" creates a tarfile called "z" and tries to pack a file called
"-" in it.

"tar czf - file1 file2 file3" reads the named files from disk and sends
the *output* to stdout.

If you don't specify any files, then nothing is created:

$ tar -czf -
tar: Cowardly refusing to create an empty archive
Try `tar --help' or `tar --usage' for more information.

That's for gnu tar, maybe others work differently. However, as far as I
know, you can't get tar to read the *content* of files on stdin - and
even if you could, how would you format them? That is, how would you
delimit the start and end of each file, and assign a name to each one?
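
The tar format itself answers the delimiting question, which is why a library is a better fit than a pipe into tar: each member is preceded by a 512-byte header that carries its name and its size in bytes, and the contents are padded out to a multiple of 512 bytes. A rough sketch for poking at this, assuming the TarWriter that ships with RubyGems (Gem::Package::TarWriter, not minitar) to produce the bytes:

```ruby
require 'rubygems/package'
require 'stringio'

# Build a one-member tar archive in memory so the raw bytes can be inspected.
raw_io = StringIO.new(String.new)
Gem::Package::TarWriter.new(raw_io) do |tar|
  tar.add_file_simple('hello.txt', 0o644, 5) { |io| io.write('hello') }
end
raw = raw_io.string

header     = raw[0, 512]                  # every member starts with a 512-byte header
name       = header[0, 100].delete("\0")  # bytes 0..99: member name, NUL padded
size       = header[124, 12].oct          # bytes 124..135: size as octal text
data_block = raw[512, 512]                # contents, NUL padded to 512 bytes

puts "name=#{name} size=#{size} data=#{data_block[0, size].inspect}"
```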
 
ara.t.howard

Uh??

"tar cfz -" creates a tarfile called "z" and tries to pack a file called
"-" in it.

"tar czf - file1 file2 file3" reads the named files from disk and sends
the *output* to stdout.

If you don't specify any files, then nothing is created:

$ tar -czf -
tar: Cowardly refusing to create an empty archive
Try `tar --help' or `tar --usage' for more information.

That's for gnu tar, maybe others work differently. However, as far as I
know, you can't get tar to read the *content* of files on stdin - and
even if you could, how would you format them? That is, how would you
delimit the start and end of each file, and assign a name to each one?



sorry. i misread the OP's question. tar can write an archive to stdout,
but it can't create one from file contents fed to stdin.

a @ http://codeforpeople.com/
 
Brian Candler

So, I hacked on archive-tar-minitar for a while and came up with a
solution.

You got me interested now.

I just installed the archive-tar-minitar gem and it looks pretty easy to
generate a tar file, without any patching of the library:

require 'rubygems'
require 'archive/tar/minitar'

src = {
  "foo.txt" => "This is file foo",
  "bar.txt" => "This is file bar",
}

# "wb" keeps the archive binary-safe on platforms that distinguish text mode
File.open("test.tar", "wb") do |tarfile|
  Archive::Tar::Minitar::Writer.open(tarfile) do |tar|
    src.each do |name, data|
      # :size is a byte count, hence bytesize rather than size
      tar.add_file_simple(name, :size => data.bytesize, :mode => 0644) { |f| f.write(data) }
    end
  end
end

All I did was a quick poke around the API (gem server --daemon; launch
web browser pointing at http://localhost:8808/) and look for something
called "Writer" :)

HTH,

Brian.
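
One thing worth watching with this style of API: the size declared for each member has to be a byte count. On Ruby 1.9 and later, String#size counts characters, so non-ASCII data needs String#bytesize. A small sketch of the difference, using the standard library only; Gem::Package::TarWriter and TarReader stand in for minitar here, which is an assumption of this example rather than anything from the thread:

```ruby
require 'rubygems/package'
require 'stringio'

data = "na\u00EFve caf\u00E9"   # two of the characters take 2 bytes each in UTF-8
puts "characters: #{data.size}, bytes: #{data.bytesize}"

tar_io = StringIO.new(String.new)
Gem::Package::TarWriter.new(tar_io) do |tar|
  # declare the byte count, then write exactly that many bytes
  tar.add_file_simple('menu.txt', 0o644, data.bytesize) { |io| io.write(data) }
end

# Read the member back to confirm nothing was truncated.
read_back = nil
tar_io.rewind
Gem::Package::TarReader.new(tar_io) do |reader|
  reader.each { |entry| read_back = entry.read }
end
puts "read back #{read_back.bytesize} bytes"
```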
 
raleighr3

I have a gazillion little files in memory (each is really just a chunk
of data, but it represents what needs to be a single file) and I need
to throw them all into a .tar.gz archive.  In this case, it must be
in .tar.gz format and it must unzip into actual files--although I pity
the fellow that actually has to unzip this monstrosity.

This may be a little late, but better late than never.
Have you considered using #1 with a tmpfs and memory mapped files?
This isn't exactly portable, but should be pretty fast since as far as
tar is concerned your in-memory files just look like a regular
filesystem thanks to tmpfs.
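
A rough sketch of what that could look like from Ruby. It rests on several assumptions: a tar binary is on the PATH, /dev/shm is a tmpfs mount (common on Linux, but not guaranteed), and tar_via_tmpfs is a made-up helper name. It stages the in-memory data as ordinary files under tmpfs and lets tar do the packing; it does not attempt the memory-mapped part:

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper. ASSUMPTIONS: `tar` is on the PATH and /dev/shm is a
# tmpfs mount; otherwise this falls back to the ordinary temp directory,
# which may well be disk backed.
def tar_via_tmpfs(files, archive_path)
  archive_path = File.expand_path(archive_path)
  shm  = '/dev/shm'
  base = File.directory?(shm) && File.writable?(shm) ? shm : Dir.tmpdir

  Dir.mktmpdir('tgz-staging', base) do |dir|
    files.each do |name, data|
      path = File.join(dir, name)
      FileUtils.mkdir_p(File.dirname(path))   # create any parent directories
      File.binwrite(path, data)
    end
    # -C makes tar pack the staged files relative to the staging directory
    ok = system('tar', '-czf', archive_path, '-C', dir, '.')
    raise "tar failed" unless ok
  end                                          # staging directory is removed here
  archive_path
end

# Example call (not run here):
# tar_via_tmpfs({ 'a_dir/dorky1' => 'my data', 'dorky2' => 'my data also' }, 'my_tar.tgz')
```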
 
