Add a file to a compressed tarfile

Dennis Hotson

Hi,

I'm trying to write a function that adds a file-like object to a
compressed tarfile... e.g. ".tar.gz" or ".tar.bz2"

I've had a look at the tarfile module but the append mode doesn't support
compressed tarfiles... :(

Any thoughts on what I can do to get around this?

Cheers!
 
Martin Franklin

> Hi,
>
> I'm trying to write a function that adds a file-like object to a
> compressed tarfile... e.g. ".tar.gz" or ".tar.bz2"
>
> I've had a look at the tarfile module but the append mode doesn't support
> compressed tarfiles... :(
>
> Any thoughts on what I can do to get around this?
>
> Cheers!


From the tarfile docs in Python 2.3:

New in version 2.3.

The tarfile module makes it possible to read and create tar archives. Some
facts and figures:

- reads and writes gzip and bzip2 compressed archives.
- creates POSIX 1003.1-1990 compliant or GNU tar compatible archives.
- reads GNU tar extensions longname, longlink and sparse.
- stores pathnames of unlimited length using GNU tar extensions.
- handles directories, regular files, hardlinks, symbolic links, fifos,
  character devices and block devices and is able to acquire and restore
  file information like timestamp, access permissions and owner.
- can handle tape devices.

open([name[, mode[, fileobj[, bufsize]]]])
Return a TarFile object for the pathname name. For detailed information on
TarFile objects, see TarFile Objects (section 7.19.1).

mode has to be a string of the form 'filemode[:compression]', it defaults
to 'r'. Here is a full list of mode combinations:

mode         action
'r'          Open for reading with transparent compression (recommended).
'r:'         Open for reading exclusively without compression.
'r:gz'       Open for reading with gzip compression.
'r:bz2'      Open for reading with bzip2 compression.
'a' or 'a:'  Open for appending with no compression.
'w' or 'w:'  Open for uncompressed writing.
'w:gz'       Open for gzip compressed writing.
'w:bz2'      Open for bzip2 compressed writing.

Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to
open a certain (compressed) file for reading, ReadError is raised. Use
mode 'r' to avoid this. If a compression method is not supported,
CompressionError is raised.

If fileobj is specified, it is used as an alternative to a file object
opened for name.
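
To make the modes above concrete, here is a small made-up sketch. It uses
current Python syntax and io.BytesIO rather than 2.3-era idioms, so treat it
as illustrative only:

import io
import tarfile

# Create a small gzip-compressed archive to play with.
with open("notes.txt", "w") as f:
    f.write("hello\n")
tf = tarfile.open("backup.tar.gz", "w:gz")     # gzip-compressed writing
tf.add("notes.txt")
tf.close()

# Mode 'r' picks the compression transparently, so the same call reads
# .tar, .tar.gz and .tar.bz2 archives alike.
tf = tarfile.open("backup.tar.gz", "r")
print(tf.getnames())
tf.close()

# Appending ('a') only works on uncompressed archives; there is no
# 'a:gz' or 'a:bz2', which is exactly the limitation in this thread.
tf = tarfile.open("backup.tar", "w")
tf.add("notes.txt")
tf.close()
tf = tarfile.open("backup.tar", "a")           # fine: plain tar
tf.add("notes.txt", arcname="notes-again.txt")
tf.close()

# fileobj lets you use a file-like object instead of a file name.
buf = io.BytesIO()
tf = tarfile.open(fileobj=buf, mode="w:gz")
tf.add("notes.txt")
tf.close()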


HTH,
Martin.
 
Martin Franklin

<snip - useless info from myself>

Sorry, I just re-read your message after sending my reply....
 
Dennis Hotson

> <snip - useless info from myself>
>
> Sorry, I just re-read your message after sending my reply....

Ahh ok... Yeah, I've already seen the docs... thanks anyway! :D

I'm currently trying to read all of the files inside the tarfile... then
writing them all back. Bit of a kludge, but it should work..

Cheers!

Dennis
 
Eddie Corns

> I'm currently trying to read all of the files inside the tarfile... then
> writing them all back. Bit of a kludge, but it should work..

There isn't really any other way. A tar file is terminated by two empty
blocks. In order to append to a tar file, you simply append a new tar file two
blocks from the end of the original. If it was uncompressed you just seek
back from the end and write, but if it's compressed you can't find that point
without decompressing[1]. In some cases a more time-efficient but less
space-efficient method would be to just compress individual files in a
directory and then tar them up before the final distribution (or whatever you
do with your tar file).
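
For what it's worth, a minimal sketch of that read-everything-and-rewrite
approach. The names are placeholders and the whole archive is held in memory,
so it is only sensible for smallish archives:

import io
import tarfile

def append_fileobj_to_targz(archive, name, fileobj):
    # Pull every existing member (and its data) out of the old archive...
    old = tarfile.open(archive, "r:gz")
    members = []
    for info in old.getmembers():
        data = old.extractfile(info).read() if info.isfile() else None
        members.append((info, data))
    old.close()

    # ...then rewrite the whole archive, plus the new member at the end.
    new = tarfile.open(archive, "w:gz")
    for info, data in members:
        new.addfile(info, io.BytesIO(data) if data is not None else None)
    payload = fileobj.read()
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    new.addfile(info, io.BytesIO(payload))
    new.close()

# e.g. append_fileobj_to_targz("backup.tar.gz", "new.txt", io.BytesIO(b"hi"))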

Eddie

[1] I think, unless there's a clever way of just decompressing the last few
blocks.
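
And a rough sketch of the other option mentioned above, compressing each file
on its own and keeping the .gz files in a plain uncompressed tar (which does
support mode 'a'); the file names are again just placeholders:

import gzip
import shutil
import tarfile

def add_as_gz(archive, filename):
    # Compress the single file by itself...
    gz_name = filename + ".gz"
    with open(filename, "rb") as src, gzip.open(gz_name, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # ...and append it to an *uncompressed* tar, where appending is allowed.
    # (tarfile creates the archive if it doesn't exist yet.)
    tf = tarfile.open(archive, "a")
    tf.add(gz_name)
    tf.close()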
 
Josiah Carlson

>> I'm currently trying to read all of the files inside the tarfile... then
>> writing them all back. Bit of a kludge, but it should work..

> There isn't really any other way. A tar file is terminated by two empty
> blocks. In order to append to a tar file, you simply append a new tar file two
> blocks from the end of the original. If it was uncompressed you just seek
> back from the end and write, but if it's compressed you can't find that point
> without decompressing[1]. In some cases a more time-efficient but less
> space-efficient method would be to just compress individual files in a
> directory and then tar them up before the final distribution (or whatever you
> do with your tar file).
>
> Eddie
>
> [1] I think, unless there's a clever way of just decompressing the last few
> blocks.

I am not aware of any such method. I am fairly certain gzip (and the
associated zlib) does the following:

while bytes remaining:
    reset/initialize state
    while state is not crappy and bytes remaining:
        compress portion of remaining bytes
        update state

Even if one could discover the last reset/initialization of state, one
would still need to decompress the data from then on in order to
discover the two empty blocks.

A 'resume compression friendly' algorithm would necessarily need to
describe its internal state at the end of the byte stream. In the case
of gzip (or other similar compression algorithms), really the only
reasonable way to do this is to give an offset in the file to the last
reset/initialization. Of course, the internal state must still be
regenerated from the remaining portion of the file (which may be the
entire file), so it isn't really a win over just processing the entire file
again with an algorithm that discovers when and where to pick up where it
left off.

- Josiah
 
Heiko Wundram

On Friday, 5 November 2004 at 19:19, Josiah Carlson wrote:
> I am not aware of any such method. I am fairly certain gzip (and the
> associated zlib) does the following:
>
> while bytes remaining:
>     reset/initialize state
>     while state is not crappy and bytes remaining:
>         compress portion of remaining bytes
>         update state
>
> Even if one could discover the last reset/initialization of state, one
> would still need to decompress the data from then on in order to
> discover the two empty blocks.

This is not entirely true... There is a full flush which is done every n bytes
(n > 100000 bytes, IIRC), and which can also be forced by the programmer. If
you do a full flush, the block you read is complete as is, up to the point
where you did the flush.

From the documentation:

"""flush([mode])

All pending input is processed, and a string containing the remaining
compressed output is returned. mode can be selected from the constants
Z_SYNC_FLUSH, Z_FULL_FLUSH, or Z_FINISH, defaulting to Z_FINISH. Z_SYNC_FLUSH
and Z_FULL_FLUSH allow compressing further strings of data and are used to
allow partial error recovery on decompression, while Z_FINISH finishes the
compressed stream and prevents compressing any more data. After calling
flush() with mode set to Z_FINISH, the compress() method cannot be called
again; the only realistic action is to delete the object."""

Anyway, the state is reset to the initial state after the full flush, so that
the next block of data is independent from the block that was flushed. So,
you might start writing after the full flush, but you'd have to make sure
that the compressed stream was of the same format specification as the one
previously written (see the compression level parameter of
compress/decompress), and you'd also have to make sure that the gzip header
is suppressed, and that the FINISH compression block correctly reflects the
data that was appended (because you basically overwrite the finish block of
the first compress).

Little example:

>>> import zlib
>>> x = zlib.compressobj(6)
>>> a = x.compress("hahahahahaha"*20)
>>> a += x.flush(zlib.Z_FULL_FLUSH)
>>> a
'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
>>> b = x.flush(zlib.Z_FINISH)
>>> b
'\x03\x00^\x84^9'
>>> x = zlib.compressobj(6)    # New compression object with same compression.
>>> c = x.compress("hahahahahaha"*20)
>>> c += x.flush(zlib.Z_FULL_FLUSH)
>>> c
'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
>>> d = x.flush(zlib.Z_FINISH)
>>> d
'\x03\x00^\x84^9'
>>> e = a+c[2:]                # Strip header of second block.
>>> x = zlib.decompressobj()
>>> f = x.decompress(e)
>>> len(f)                     # Two times 240 = 480.
480
>>> f                          # Rest stripped for clarity.
'haha...'

So, as far as this goes, it works. But:

>>> x = zlib.decompressobj()
>>> e = a+c[2:]+d
>>> f = x.decompress(e)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
zlib.error: Error -3 while decompressing: incorrect data check

You see here that if you append the new end-of-stream marker of the second
block (which is written by x.flush(zlib.Z_FINISH)), the data checksum is
broken, since the checksum is always written for the entire data; leaving out
the end-of-stream marker, on the other hand, doesn't cause decompression to
fail.

I know too little about the internal format of a gzip file (which adds more
header data, but is otherwise just a zlib-compressed stream) to tell whether
an approach such as this one would also work on gzip files, but I presume it
should.

Hope this little explanation helps!

Heiko.
 
Dennis Hotson

Thanks Heiko, that's really interesting...

To tell you the truth though, I'm not that familiar with the structure of
tar or gzip files. I've got a much better idea of how it works now though.
:D

I managed to get my function working... although it decompresses
everything and then compresses it back... Not the best, but good enough I
think.

Speed isn't a huge issue in my case anyway because this is for a web app
I'm writing... It's a directory tree which allows people to download and
upload files into/from directories as well as compressed archives.

Anyway.. thanks a lot for your help. I really appreciate it. Cheers mate!
:)
 
Francesc Alted

Dennis said:
> I managed to get my function working... although it decompresses
> everything and then compresses it back... Not the best, but good enough I
> think.

If you want a solution that lets you append files to an archive while still
allowing compression, take a look at FileNode, a module that has been added
to the latest PyTables package (www.pytables.org). You can see the
documentation (and tutorials) for the module here:

http://pytables.sourceforge.net/html-doc/c3616.html

It supports the zlib, ucl and lzo compressors, as well as the shuffle
compression pre-conditioner.
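
For a rough idea of what using it looks like, here is a sketch based on the
present-day filenode API (the names follow the current PyTables documentation
rather than the 2004 release, so take it as an approximation):

import tables
from tables.nodes import filenode

# An HDF5 file whose nodes default to zlib compression.
h5 = tables.open_file("storage.h5", "w",
                      filters=tables.Filters(complevel=5, complib="zlib"))

# Create a file-like node and write to it like an ordinary binary file.
node = filenode.new_node(h5, where="/", name="notes_txt")
node.write(b"hello from a file-like object\n")
node.close()

# The node can be reopened later and appended to, compression and all.
node = filenode.open_node(h5.root.notes_txt, "a+")
node.write(b"appended later\n")
node.close()
h5.close()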

HTH,

Francesc Altet
 
