Add a file to a compressed tarfile

Dennis Hotson

Hi,

I'm trying to write a function that adds a file-like object to a
compressed tarfile... e.g. ".tar.gz" or ".tar.bz2"

I've had a look at the tarfile module but the append mode doesn't support
compressed tarfiles... :(

Any thoughts on what I can do to get around this?

Cheers!
 
Martin Franklin

> Hi,
>
> I'm trying to write a function that adds a file-like object to a
> compressed tarfile... e.g. ".tar.gz" or ".tar.bz2"
>
> I've had a look at the tarfile module but the append mode doesn't support
> compressed tarfiles... :(
>
> Any thoughts on what I can do to get around this?
>
> Cheers!


From the tarfile docs in Python 2.3:

New in version 2.3.

The tarfile module makes it possible to read and create tar archives. Some
facts and figures:

- reads and writes gzip and bzip2 compressed archives.
- creates POSIX 1003.1-1990 compliant or GNU tar compatible archives.
- reads GNU tar extensions longname, longlink and sparse.
- stores pathnames of unlimited length using GNU tar extensions.
- handles directories, regular files, hardlinks, symbolic links, fifos,
  character devices and block devices and is able to acquire and restore
  file information like timestamp, access permissions and owner.
- can handle tape devices.

open([name[, mode[, fileobj[, bufsize]]]])
Return a TarFile object for the pathname name. For detailed information on
TarFile objects, see TarFile Objects (section 7.19.1).

mode has to be a string of the form 'filemode[:compression]', it defaults
to 'r'. Here is a full list of mode combinations:

mode         action
'r'          Open for reading with transparent compression (recommended).
'r:'         Open for reading exclusively without compression.
'r:gz'       Open for reading with gzip compression.
'r:bz2'      Open for reading with bzip2 compression.
'a' or 'a:'  Open for appending with no compression.
'w' or 'w:'  Open for uncompressed writing.
'w:gz'       Open for gzip compressed writing.
'w:bz2'      Open for bzip2 compressed writing.

Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to
open a certain (compressed) file for reading, ReadError is raised. Use
mode 'r' to avoid this. If a compression method is not supported,
CompressionError is raised.

If fileobj is specified, it is used as an alternative to a file object
opened for name.
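
To make the modes above concrete, here is a small made-up sketch. It uses
current Python syntax and io.BytesIO rather than 2.3-era idioms, so treat it
as illustrative only:

import io
import tarfile

# Create a small gzip-compressed archive to play with.
with open("notes.txt", "w") as f:
    f.write("hello\n")
tf = tarfile.open("backup.tar.gz", "w:gz")     # gzip-compressed writing
tf.add("notes.txt")
tf.close()

# Mode 'r' picks the compression transparently, so the same call reads
# .tar, .tar.gz and .tar.bz2 archives alike.
tf = tarfile.open("backup.tar.gz", "r")
print(tf.getnames())
tf.close()

# Appending ('a') only works on uncompressed archives; there is no
# 'a:gz' or 'a:bz2', which is exactly the limitation in this thread.
tf = tarfile.open("backup.tar", "w")
tf.add("notes.txt")
tf.close()
tf = tarfile.open("backup.tar", "a")           # fine: plain tar
tf.add("notes.txt", arcname="notes-again.txt")
tf.close()

# fileobj lets you use a file-like object instead of a file name.
buf = io.BytesIO()
tf = tarfile.open(fileobj=buf, mode="w:gz")
tf.add("notes.txt")
tf.close()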


HTH,
Martin.
 
Martin Franklin

<snip - useless info from myself>

Sorry, I just re-read your message after sending my reply....
 
Dennis Hotson

> <snip - useless info from myself>
>
> Sorry, I just re-read your message after sending my reply....

Ahh ok... Yeah, I've already seen the docs... thanks anyway! :D

I'm currently trying to read all of the files inside the tarfile... then
writing them all back. Bit of a kludge, but it should work..

Cheers!

Dennis
 
Eddie Corns

> I'm currently trying to read all of the files inside the tarfile... then
> writing them all back. Bit of a kludge, but it should work..

There isn't really any other way. A tar file is terminated by two empty
blocks. In order to append to a tar file, you simply append a new tar file two
blocks from the end of the original. If it was uncompressed you just seek
back from the end and write, but if it's compressed you can't find that point
without decompressing[1]. In some cases a more time-efficient but less
space-efficient method would be to just compress individual files in a
directory and then tar them up before the final distribution (or whatever you
do with your tar file).
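
For what it's worth, a minimal sketch of that read-everything-and-rewrite
approach. The names are placeholders and the whole archive is held in memory,
so it is only sensible for smallish archives:

import io
import tarfile

def append_fileobj_to_targz(archive, name, fileobj):
    # Pull every existing member (and its data) out of the old archive...
    old = tarfile.open(archive, "r:gz")
    members = []
    for info in old.getmembers():
        data = old.extractfile(info).read() if info.isfile() else None
        members.append((info, data))
    old.close()

    # ...then rewrite the whole archive, plus the new member at the end.
    new = tarfile.open(archive, "w:gz")
    for info, data in members:
        new.addfile(info, io.BytesIO(data) if data is not None else None)
    payload = fileobj.read()
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    new.addfile(info, io.BytesIO(payload))
    new.close()

# e.g. append_fileobj_to_targz("backup.tar.gz", "new.txt", io.BytesIO(b"hi"))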

Eddie

[1] I think, unless there's a clever way of just decompressing the last few
blocks.
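
And a rough sketch of the other option mentioned above, compressing each file
on its own and keeping the .gz files in a plain uncompressed tar (which does
support mode 'a'); the file names are again just placeholders:

import gzip
import shutil
import tarfile

def add_as_gz(archive, filename):
    # Compress the single file by itself...
    gz_name = filename + ".gz"
    with open(filename, "rb") as src, gzip.open(gz_name, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # ...and append it to an *uncompressed* tar, where appending is allowed.
    # (tarfile creates the archive if it doesn't exist yet.)
    tf = tarfile.open(archive, "a")
    tf.add(gz_name)
    tf.close()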
 
Josiah Carlson

>> I'm currently trying to read all of the files inside the tarfile... then
>> writing them all back. Bit of a kludge, but it should work..

> There isn't really any other way. A tar file is terminated by two empty
> blocks. In order to append to a tar file, you simply append a new tar file two
> blocks from the end of the original. If it was uncompressed you just seek
> back from the end and write, but if it's compressed you can't find that point
> without decompressing[1]. In some cases a more time-efficient but less
> space-efficient method would be to just compress individual files in a
> directory and then tar them up before the final distribution (or whatever you
> do with your tar file).
>
> Eddie
>
> [1] I think, unless there's a clever way of just decompressing the last few
> blocks.

I am not aware of any such method. I am fairly certain gzip (and the
associated zlib) does the following:

while bytes remaining:
    reset/initialize state
    while state is not crappy and bytes remaining:
        compress portion of remaining bytes
        update state

Even if one could discover the last reset/initialization of state, one
would still need to decompress the data from then on in order to
discover the two empty blocks.

A 'resume compression friendly' algorithm would necessarily need to
describe its internal state at the end of the byte stream. In the case
of gzip (or other similar compression algorithms), really the only
reasonable way to do this is to give an offset in the file to the last
reset/initialization. Of course, the internal state must still be
regenerated from the remaining portion of the file (which may be the
entire file), so it isn't really a win over just processing the entire file
again with an algorithm that discovers when and where to pick up where it
left off.

- Josiah
 
Heiko Wundram

On Friday, 5 November 2004 at 19:19, Josiah Carlson wrote:
> I am not aware of any such method. I am fairly certain gzip (and the
> associated zlib) does the following:
>
> while bytes remaining:
>     reset/initialize state
>     while state is not crappy and bytes remaining:
>         compress portion of remaining bytes
>         update state
>
> Even if one could discover the last reset/initialization of state, one
> would still need to decompress the data from then on in order to
> discover the two empty blocks.

This is not entirely true... There is a full flush which is done every n bytes
(n > 100000 bytes, IIRC), and which can also be forced by the programmer. If
you do a full flush, the block you read is complete as is, up to the point
where you did the flush.

From the documentation:

"""flush([mode])

All pending input is processed, and a string containing the remaining
compressed output is returned. mode can be selected from the constants
Z_SYNC_FLUSH, Z_FULL_FLUSH, or Z_FINISH, defaulting to Z_FINISH. Z_SYNC_FLUSH
and Z_FULL_FLUSH allow compressing further strings of data and are used to
allow partial error recovery on decompression, while Z_FINISH finishes the
compressed stream and prevents compressing any more data. After calling
flush() with mode set to Z_FINISH, the compress() method cannot be called
again; the only realistic action is to delete the object."""

Anyway, the state is reset to the initial state after the full flush, so that
the next block of data is independent from the block that was flushed. So,
you might start writing after the full flush, but you'd have to make sure
that the compressed stream was of the same format specification as the one
previously written (see the compression level parameter of
compress/decompress), and you'd also have to make sure that the gzip header
is suppressed, and that the FINISH compression block correctly reflects the
data that was appended (because you basically overwrite the finish block of
the first compress).

Little example:

>>> import zlib
>>> x = zlib.compressobj(6)
>>> a = x.compress("hahahahahaha"*20)
>>> a += x.flush(zlib.Z_FULL_FLUSH)
>>> a
'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
>>> b = x.flush(zlib.Z_FINISH)
>>> b
'\x03\x00^\x84^9'
>>> x = zlib.compressobj(6)    # New compression object with same compression.
>>> c = x.compress("hahahahahaha"*20)
>>> c += x.flush(zlib.Z_FULL_FLUSH)
>>> c
'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
>>> d = x.flush(zlib.Z_FINISH)
>>> d
'\x03\x00^\x84^9'
>>> e = a+c[2:]                # Strip header of second block.
>>> x = zlib.decompressobj()
>>> f = x.decompress(e)
>>> len(f)                     # Two times 240 = 480.
480
>>> f                          # Rest stripped for clarity.
'haha...'

So, as far as this goes, it works. But:

>>> x = zlib.decompressobj()
>>> e = a+c[2:]+d
>>> f = x.decompress(e)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
zlib.error: Error -3 while decompressing: incorrect data check

You see here that if you append the new end-of-stream marker of the second
block (which is written by x.flush(zlib.Z_FINISH)), the data checksum is
broken, since the checksum is always written for the entire data; leaving out
the end-of-stream marker, on the other hand, doesn't cause decompression to
fail.

I know too little about the internal format of a gzip file (which adds more
header data, but is otherwise just a zlib-compressed stream) to tell whether
an approach such as this one would also work on gzip files, but I presume it
should.

Hope this little explanation helps!

Heiko.
 
Dennis Hotson

Thanks Heiko, that's really interesting...

To tell you the truth though, I'm not that familiar with the structure of
tar or gzip files. I've got a much better idea of how it works now though.
:D

I managed to get my function working... although it decompresses
everything and then compresses it back... Not the best, but good enough I
think.

Speed isn't a huge issue in my case anyway because this is for a web app
I'm writing... It's a directory tree which allows people to download and
upload files into/from directories as well as compressed archives.

Anyway.. thanks a lot for your help. I really appreciate it. Cheers mate!
:)
 
Francesc Alted

Dennis said:
> I managed to get my function working... although it decompresses
> everything and then compresses it back... Not the best, but good enough I
> think.

If you want a solution that lets you append files to an archive while still
allowing compression, take a look at FileNode, a module that has been added
to the latest PyTables package (www.pytables.org). You can see the
documentation (and tutorials) for the module here:

http://pytables.sourceforge.net/html-doc/c3616.html

It supports the zlib, ucl and lzo compressors, as well as the shuffle
compression pre-conditioner.
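
For a rough idea of what using it looks like, here is a sketch based on the
present-day filenode API (the names follow the current PyTables documentation
rather than the 2004 release, so take it as an approximation):

import tables
from tables.nodes import filenode

# An HDF5 file whose nodes default to zlib compression.
h5 = tables.open_file("storage.h5", "w",
                      filters=tables.Filters(complevel=5, complib="zlib"))

# Create a file-like node and write to it like an ordinary binary file.
node = filenode.new_node(h5, where="/", name="notes_txt")
node.write(b"hello from a file-like object\n")
node.close()

# The node can be reopened later and appended to, compression and all.
node = filenode.open_node(h5.root.notes_txt, "a+")
node.write(b"appended later\n")
node.close()
h5.close()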

HTH,

Francesc Altet
 
