creating size-limited tar files

Discussion in 'Python' started by andrea crotti, Nov 7, 2012.

  1. Simple problem: given a lot of data in many files/directories, I
    should create a tar file split into chunks <= a given size.

    The simplest way would be to compress the whole thing and then split.

    At the moment the script I'm replacing does a "system('split..')",
    which is not that great, so I would like to do the splitting while
    compressing.

    So I thought about something like this (in pseudocode):


    while remaining_files:
        tar_file.addfile(remaining_files.pop())
        if size(tar_file) >= limit:
            close(tar_file)
            tar_file = new_tar_file()

    which might work, but how do I get the current size? There should be
    tarinfo.size, but it doesn't exist on a TarFile opened in write mode,
    so should I do a stat after each flush?

    Any other, better ideas?
    thanks
    andrea crotti, Nov 7, 2012
    #1

  2. Neil Cerutti Guest

    On 2012-11-07, andrea crotti <> wrote:
    > Simple problem, given a lot of data in many files/directories, I
    > should create a tar file splitted in chunks <= a given size.
    >
    > The simplest way would be to compress the whole thing and then split.
    >
    > At the moment the actual script which I'm replacing is doing a
    > "system('split..')", which is not that great, so I would like to do it
    > while compressing.
    >
    > So I thought about (in pseudocode)
    >
    > while remaining_files:
    >     tar_file.addfile(remaining_files.pop())
    >     if size(tar_file) >= limit:
    >         close(tar_file)
    >         tar_file = new_tar_file()
    >


    I have not used this module before, but what you seem to be
    asking about is:

    TarFile.gettarinfo().size

    But your algorithm stops after the file is already too big.

    --
    Neil Cerutti
    Neil Cerutti, Nov 7, 2012
    #2

  3. I don't know the best way to find the current size, I only have a
    general remark.
    This solution is not so good if you have to impose a hard limit on the
    resulting file size: in the worst case, if the tar is at limit - 1 and
    the next file is the biggest one, you could end up with a tar file of
    size "limit + size of biggest file - 1 + overhead". That may be
    acceptable in many cases, or it may be possible to compensate for it
    by adjusting the limit.

    My idea:
    Assuming tar_file works on some object with a file-like interface, one
    could implement a "transparent splitting file" class which would have
    to use some kind of buffering mechanism. It would represent a virtual
    big file that is stored in many pieces of fixed size (except the last)
    and would allow you to just add all files to one tar_file and have it
    split up transparently by the underlying file object, something like
    this:

    tar_file = TarFile(SplittingFile(names='archiv.tar-%03d',
                                     chunksize=chunksize, mode='wb'))
    while remaining_files:
        tar_file.addfile(remaining_files.pop())

    and the splitting_file would automatically create chunks with size
    chunksize and filenames archiv.tar-001, archiv.tar-002, ...

    The same class could be used to put it back together, it may even
    implement transparent seeking over a set of pieces of a big file. I
    would like to have such a class around for general usage.
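
    A rough sketch of what I mean (just an illustration, not tested code;
    the class name, the filename pattern and the chunk size are made up,
    and only the write()/close() methods that archive writing needs are
    implemented):

    class SplittingFile(object):
        """Write-only file-like object that spreads everything written to
        it over chunk files of at most `chunksize` bytes each."""

        def __init__(self, pattern, chunksize):
            self.pattern = pattern        # e.g. 'archiv.tar.gz-%03d'
            self.chunksize = chunksize
            self.index = 0
            self.written = 0              # bytes already in the current chunk
            self.current = open(pattern % self.index, 'wb')

        def write(self, data):
            while data:
                room = self.chunksize - self.written
                if room == 0:             # chunk is full, start the next one
                    self.current.close()
                    self.index += 1
                    self.written = 0
                    self.current = open(self.pattern % self.index, 'wb')
                    room = self.chunksize
                piece = data[:room]
                self.current.write(piece)
                self.written += len(piece)
                data = data[room:]

        def close(self):
            self.current.close()

    Since every chunk except the last is exactly chunksize bytes, putting
    the pieces back together is a plain byte-wise concatenation.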

    greetings
    Alexander Blinne, Nov 7, 2012
    #3
  4. Roy Smith Guest

    In article <509ab0fa$0$6636$-online.net>,
    Alexander Blinne <> wrote:

    > I don't know the best way to find the current size, I only have a
    > general remark.
    > This solution is not so good if you have to impose a hard limit on the
    > resulting file size. You could end up having a tar file of size "limit +
    > size of biggest file - 1 + overhead" in the worst case if the tar is at
    > limit - 1 and the next file is the biggest file. Of course that may be
    > acceptable in many cases or it may be acceptable to do something about
    > it by adjusting the limit.


    If you truly have a hard limit, one possible solution would be to use
    tell() to checkpoint the growing archive after each addition. If adding
    a new file unexpectedly causes you exceed your hard limit, you can
    seek() back to the previous spot and truncate the file there.

    Whether this is worth the effort is an exercise left for the reader.
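
    For an uncompressed archive sitting in a plain file it is not much
    code. A rough sketch (untested, names made up; it skips
    TarFile.close() on purpose, because the object's internal offset is
    stale after a rollback, and writes the end-of-archive marker by hand):

    import tarfile

    def pack_with_hard_limit(paths, archive_name, limit):
        """Add files to an uncompressed tar, rolling back any addition
        that would push the archive past `limit` bytes. Returns the files
        that did not fit and belong in the next archive."""
        f = open(archive_name, 'wb')
        tar = tarfile.open(fileobj=f, mode='w')
        rejected = []
        for path in paths:
            checkpoint = f.tell()        # member boundary before this file
            tar.add(path)
            if f.tell() > limit - 1024:  # keep room for the end marker
                f.seek(checkpoint)       # undo the last member
                f.truncate()
                rejected.append(path)
        f.write(b'\0' * 1024)            # two zero blocks end the archive
        f.close()
        return rejected

    With a compressed archive ('w:gz') you can't roll back like this,
    since you would be truncating inside a gzip stream.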
    Roy Smith, Nov 7, 2012
    #4
  5. On 11/07/2012 08:32 PM, Roy Smith wrote:
    > In article <509ab0fa$0$6636$-online.net>,
    > Alexander Blinne <> wrote:
    >
    >> I don't know the best way to find the current size, I only have a
    >> general remark.
    >> This solution is not so good if you have to impose a hard limit on the
    >> resulting file size. You could end up having a tar file of size "limit +
    >> size of biggest file - 1 + overhead" in the worst case if the tar is at
    >> limit - 1 and the next file is the biggest file. Of course that may be
    >> acceptable in many cases or it may be acceptable to do something about
    >> it by adjusting the limit.

    > If you truly have a hard limit, one possible solution would be to use
    > tell() to checkpoint the growing archive after each addition. If adding
    > a new file unexpectedly causes you exceed your hard limit, you can
    > seek() back to the previous spot and truncate the file there.
    >
    > Whether this is worth the effort is an exercise left for the reader.


    So I'm not sure if it's a hard limit or not, but I'll check tomorrow.
    But in general I could also take the sizes of the files and simply
    estimate the total, pushing in as many as should fit in a tarfile.
    With compression I might end up with a much smaller file, but this
    approach would be much easier..

    But the other problem is that at the moment the people that get our
    chunks reassemble the file with a simple:

    cat file1.tar.gz file2.tar.gz > file.tar.gz

    which I suppose is not going to work if I create 2 different tar files,
    since each of them would get its own header, right?
    So either I also provide a script to reassemble everything, or I have
    to split in a more "brutal" way..

    Maybe doing the final split was not so bad after all; I'll first check
    whether it's actually more expensive for the filesystem (which is very,
    very slow) or not a big deal...
    Andrea Crotti, Nov 7, 2012
    #5
  6. On 7 November 2012 21:52, Andrea Crotti <> wrote:
    > On 11/07/2012 08:32 PM, Roy Smith wrote:
    >>
    >> In article <509ab0fa$0$6636$-online.net>,
    >> Alexander Blinne <> wrote:
    >>
    >>> I don't know the best way to find the current size, I only have a
    >>> general remark.
    >>> This solution is not so good if you have to impose a hard limit on the
    >>> resulting file size. You could end up having a tar file of size "limit +
    >>> size of biggest file - 1 + overhead" in the worst case if the tar is at
    >>> limit - 1 and the next file is the biggest file. Of course that may be
    >>> acceptable in many cases or it may be acceptable to do something about
    >>> it by adjusting the limit.

    >
    > But the other problem is that at the moment the people that get our chunks
    > reassemble the file with a simple:
    >
    > cat file1.tar.gz file2.tar.gz > file.tar.gz
    >
    > which I suppose is not going to work if I create 2 different tar files,
    > since it would recreate the header in all of the them, right?


    Correct. But if you read the rest of Alexander's post you'll find a
    suggestion that would work in this case and that is guaranteed to give
    files of the desired size.

    You just need to define your own class that implements a write()
    method and then distributes any data it receives to separate files.
    You can then pass this as the fileobj argument to the tarfile.open
    function:
    http://docs.python.org/2/library/tarfile.html#tarfile.open
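
    With a splitting file object along the lines of the SplittingFile
    sketched earlier in the thread (a made-up name, not a stdlib class),
    the write side might look roughly like this:

    import tarfile

    splitter = SplittingFile('archiv.tar.gz-%03d', 100 * 1024 * 1024)  # 100 MiB pieces
    tar = tarfile.open(fileobj=splitter, mode='w|gz')  # stream mode, only write() is used
    for name in files_to_pack:    # files_to_pack: your list of paths
        tar.add(name)
    tar.close()
    splitter.close()

    And because the chunks are just a byte-split of a single gzip stream,
    the people on the other end can keep reassembling them with a plain
    cat.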


    Oscar
    Oscar Benjamin, Nov 7, 2012
    #6
  7. 2012/11/7 Oscar Benjamin <>:
    >
    > Correct. But if you read the rest of Alexander's post you'll find a
    > suggestion that would work in this case and that can guarantee to give
    > files of the desired size.
    >
    > You just need to define your own class that implements a write()
    > method and then distributes any data it receives to separate files.
    > You can then pass this as the fileobj argument to the tarfile.open
    > function:
    > http://docs.python.org/2/library/tarfile.html#tarfile.open
    >
    >
    > Oscar




    Yes yes I saw the answer, but now I was thinking that what I need is
    simply this:
    tar czpvf - /path/to/archive | split -d -b 100M - tardisk

    Since it only has to run on Linux it's probably way easier; my script
    will then only need to create the list of files to tar..

    The only doubt is whether this is more or less reliable than doing it
    in Python, and when this can fail with a broken pipe (the filesystem
    is not very good, as I said, and it's mounted over NFS).
    andrea crotti, Nov 8, 2012
    #7
  8. 2012/11/8 andrea crotti <>:
    >
    >
    >
    > Yes yes I saw the answer, but now I was thinking that what I need is
    > simply this:
    > tar czpvf - /path/to/archive | split -d -b 100M - tardisk
    >
    > since it should run only on Linux it's probably way easier, my script
    > will then only need to create the list of files to tar..
    >
    > The only doubt is if this is more or less reliably then doing it in
    > Python, when can this fail with some bad broken pipe?
    > (the filesystem is not very good as I said and it's mounted with NFS)


    In the meantime I tried a couple of things, and using the pipe on
    Linux actually works very nicely; it's even faster than a plain tar,
    for some reason..

    [andrea@andreacrotti isos]$ time tar czpvf - file1.avi file2.avi | split -d -b 1000M - inchunks
    file1.avi
    file2.avi

    real 1m39.242s
    user 1m14.415s
    sys 0m7.140s

    [andrea@andreacrotti isos]$ time tar czpvf total.tar.gz file1.avi file2.avi
    file1.avi
    file2.avi

    real 1m41.190s
    user 1m13.849s
    sys 0m5.723s

    [andrea@andreacrotti isos]$ time split -d -b 1000M total.tar.gz inchunks

    real 0m55.282s
    user 0m0.020s
    sys 0m3.553s
    andrea crotti, Nov 8, 2012
    #8
    Anyway, in the meantime I implemented the tar-and-split as shown
    below. It works very well and is probably much faster, but the
    downside is that I hand control over to tar and split..

    from glob import glob
    from os import remove
    import logging
    import subprocess

    logger = logging.getLogger(__name__)  # module-level logger (assumed)


    def tar_and_split(inputfile, output, bytes_size=None):
        """Take the file listing the files to compress, the bytes desired
        for each split and the base name of the output files.
        """
        # cleanup first
        for fname in glob(output + "*"):
            logger.debug("Removing old file %s" % fname)
            remove(fname)

        out = '-' if bytes_size else (output + '.tar.gz')
        cmd = "tar czpf {} $(cat {})".format(out, inputfile)
        if bytes_size:
            cmd += " | split -b {} -d - {}".format(bytes_size, output)

        logger.info("Running command %s" % cmd)

        proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out, err = proc.communicate()
        if err:
            logger.error("Got error messages %s" % err)

        logger.info("Output %s" % out)

        if proc.returncode != 0:
            logger.error("Something failed running %s, need to re-run" % cmd)
            return False
    andrea crotti, Nov 9, 2012
    #9
  10. 2012/11/9 andrea crotti <>:
    > Anyway, in the meantime I implemented the tar-and-split as shown
    > below. It works very well and is probably much faster, but the
    > downside is that I hand control over to tar and split..
    >
    > def tar_and_split(inputfile, output, bytes_size=None):
    >     """Take the file listing the files to compress, the bytes desired
    >     for each split and the base name of the output files.
    >     """
    >     # cleanup first
    >     for fname in glob(output + "*"):
    >         logger.debug("Removing old file %s" % fname)
    >         remove(fname)
    >
    >     out = '-' if bytes_size else (output + '.tar.gz')
    >     cmd = "tar czpf {} $(cat {})".format(out, inputfile)
    >     if bytes_size:
    >         cmd += " | split -b {} -d - {}".format(bytes_size, output)
    >
    >     logger.info("Running command %s" % cmd)
    >
    >     proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
    >                             stderr=subprocess.PIPE)
    >     out, err = proc.communicate()
    >     if err:
    >         logger.error("Got error messages %s" % err)
    >
    >     logger.info("Output %s" % out)
    >
    >     if proc.returncode != 0:
    >         logger.error("Something failed running %s, need to re-run" % cmd)
    >         return False



    There is another problem with this solution. If I run something like
    this with Popen:

    cmd = "tar {bigc} -czpf - --files-from {inputfile} | split -b {bytes_size} -d - {output}"

    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    the proc.returncode will only be the one from "split", so I lose the
    ability to check if tar failed..

    A solution would be something like this:

    { ls -dlkfjdsl; echo $? > tar.status; } | split

    but it's a bit ugly. I wonder if I can use the subprocess PIPEs to do
    the same thing; will it be as fast and work in the same way?
    andrea crotti, Nov 13, 2012
    #10
  11. Ian Kelly Guest

    On Tue, Nov 13, 2012 at 3:31 AM, andrea crotti
    <> wrote:
    > but it's a bit ugly. I wonder if I can use the subprocess PIPEs to do
    > the same thing, is it going to be as fast and work in the same way??


    It'll look something like this:

    >>> p1 = subprocess.Popen(cmd1, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    >>> p2 = subprocess.Popen(cmd2, shell=True, stdin=p1.stdout, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    >>> p1.communicate()
    ('', '')
    >>> p2.communicate()
    ('', '')
    >>> p1.wait()
    0
    >>> p2.wait()
    0

    Note that there's a subtle potential for deadlock here. During the
    p1.communicate() call, if the p2 output buffer fills up, then it will
    stop accepting input from p1 until p2.communicate() can be called, and
    then if that buffer also fills up, p1 will hang. Additionally, if p2
    needs to wait on the parent process for some reason, then you end up
    effectively serializing the two processes.

    A solution would be to poll all the open-ended pipes in a select()
    loop instead of using communicate(), or perhaps to make the two
    communicate() calls simultaneously in separate threads.
    Ian Kelly, Nov 13, 2012
    #11
  12. Ian Kelly Guest

    On Tue, Nov 13, 2012 at 9:07 AM, Ian Kelly <> wrote:
    > It'll look something like this:
    >
    >>>> p1 = subprocess.Popen(cmd1, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    >>>> p2 = subprocess.Popen(cmd2, shell=True, stdin=p1.stdout, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    >>>> p1.communicate()
    > ('', '')
    >>>> p2.communicate()
    > ('', '')
    >>>> p1.wait()
    > 0
    >>>> p2.wait()
    > 0
    >
    > Note that there's a subtle potential for deadlock here. During the
    > p1.communicate() call, if the p2 output buffer fills up, then it will
    > stop accepting input from p1 until p2.communicate() can be called, and
    > then if that buffer also fills up, p1 will hang. Additionally, if p2
    > needs to wait on the parent process for some reason, then you end up
    > effectively serializing the two processes.
    >
    > Solution would be to poll all the open-ended pipes in a select() loop
    > instead of using communicate(), or perhaps make the two communicate
    > calls simultaneously in separate threads.


    Sorry, the example I gave above is wrong. If you're calling
    p1.communicate(), then you need to first remove the p1.stdout pipe
    from the Popen object. Otherwise, the communicate() call will try to
    read data from it and may "steal" input from p2. It should look more
    like this:

    >>> p1 = subprocess.Popen(cmd1, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    >>> p2 = subprocess.Popen(cmd2, shell=True, stdin=p1.stdout, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    >>> p1.stdout = None
    Ian Kelly, Nov 13, 2012
    #12
  13. Ian Kelly Guest

    On Tue, Nov 13, 2012 at 9:25 AM, Ian Kelly <> wrote:
    > Sorry, the example I gave above is wrong. If you're calling
    > p1.communicate(), then you need to first remove the p1.stdout pipe
    > from the Popen object. Otherwise, the communicate() call will try to
    > read data from it and may "steal" input from p2. It should look more
    > like this:
    >
    >>>> p1 = subprocess.Popen(cmd1, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    >>>> p2 = subprocess.Popen(cmd2, shell=True, stdin=p1.stdout, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    >>>> p1.stdout = None


    Per the docs, that third line should be "p1.stdout.close()". :p
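
    Putting the corrections together, a rough sketch of the whole
    tar | split pipeline with two Popen objects (the function name, paths
    and chunk size are illustrative, not from the earlier posts). Nothing
    except the stream between the two processes goes through a PIPE, so
    the deadlock above doesn't apply, and both return codes are available,
    which also answers the earlier worry about losing tar's exit status:

    import subprocess

    def run_tar_split(filelist, output, chunk='100M'):
        # tar streams the compressed archive to stdout, split consumes it
        tar = subprocess.Popen(
            ['tar', 'czpf', '-', '--files-from', filelist],
            stdout=subprocess.PIPE)
        split = subprocess.Popen(
            ['split', '-d', '-b', chunk, '-', output],
            stdin=tar.stdout)
        tar.stdout.close()   # so tar gets SIGPIPE if split dies early
        split.wait()
        tar.wait()
        return tar.returncode == 0 and split.returncode == 0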
    Ian Kelly, Nov 13, 2012
    #13
  14. Ian Kelly <> writes:

    > On Tue, Nov 13, 2012 at 3:31 AM, andrea crotti
    > <> wrote:
    >> but it's a bit ugly. I wonder if I can use the subprocess PIPEs to do
    >> the same thing, is it going to be as fast and work in the same way??

    >
    > It'll look something like this:
    >
    >>>> p1 = subprocess.Popen(cmd1, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    >>>> p2 = subprocess.Popen(cmd2, shell=True, stdin=p1.stdout, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    >>>> p1.communicate()
    > ('', '')
    >>>> p2.communicate()
    > ('', '')
    >>>> p1.wait()
    > 0
    >>>> p2.wait()
    > 0
    >
    > Note that there's a subtle potential for deadlock here. During the
    > p1.communicate() call, if the p2 output buffer fills up, then it will
    > stop accepting input from p1 until p2.communicate() can be called, and
    > then if that buffer also fills up, p1 will hang. Additionally, if p2
    > needs to wait on the parent process for some reason, then you end up
    > effectively serializing the two processes.
    >
    > Solution would be to poll all the open-ended pipes in a select() loop
    > instead of using communicate(), or perhaps make the two communicate
    > calls simultaneously in separate threads.


    Or, you could just change p1's stderr to an io.BytesIO instance.
    Then call p2.communicate() *first*.

    --
    regards,
    kushal
    Kushal Kumaran, Nov 14, 2012
    #14
  15. Ian Kelly Guest

    On Tue, Nov 13, 2012 at 11:05 PM, Kushal Kumaran
    <> wrote:
    > Or, you could just change the p1's stderr to an io.BytesIO instance.
    > Then call p2.communicate *first*.


    This doesn't seem to work.

    >>> b = io.BytesIO()
    >>> p = subprocess.Popen(["ls", "-l"], stdout=b)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python3.2/subprocess.py", line 711, in __init__
        errread, errwrite) = self._get_handles(stdin, stdout, stderr)
      File "/usr/lib64/python3.2/subprocess.py", line 1112, in _get_handles
        c2pwrite = stdout.fileno()
    io.UnsupportedOperation: fileno

    I think stdout and stderr need to be actual file objects, not just
    file-like objects.
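
    If the goal is just to collect a child's stderr without a pipe, one
    workaround (an assumption on my part, just something I'd try) is a
    real temporary file, which does have a fileno():

    import subprocess
    import tempfile

    with tempfile.TemporaryFile() as errbuf:    # a real OS-level file
        p = subprocess.Popen(['ls', '-l'], stdout=subprocess.PIPE, stderr=errbuf)
        out, _ = p.communicate()                # only stdout is a pipe here
        errbuf.seek(0)
        err = errbuf.read()                     # whatever the child wrote to stderr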
    Ian Kelly, Nov 14, 2012
    #15
  16. Ian Kelly <> writes:

    > On Tue, Nov 13, 2012 at 11:05 PM, Kushal Kumaran
    > <> wrote:
    >> Or, you could just change the p1's stderr to an io.BytesIO instance.
    >> Then call p2.communicate *first*.

    >
    > This doesn't seem to work.
    >
    >>>> b = io.BytesIO()
    >>>> p = subprocess.Popen(["ls", "-l"], stdout=b)
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in <module>
    >   File "/usr/lib64/python3.2/subprocess.py", line 711, in __init__
    >     errread, errwrite) = self._get_handles(stdin, stdout, stderr)
    >   File "/usr/lib64/python3.2/subprocess.py", line 1112, in _get_handles
    >     c2pwrite = stdout.fileno()
    > io.UnsupportedOperation: fileno
    >
    > I think stdout and stderr need to be actual file objects, not just
    > file-like objects.


    Well, well, I was wrong, clearly. I wonder if this is fixable.

    --
    regards,
    kushal
    Kushal Kumaran, Nov 14, 2012
    #16
  17. 2012/11/14 Kushal Kumaran <>:
    >
    > Well, well, I was wrong, clearly. I wonder if this is fixable.
    >
    > --
    > regards,
    > kushal


    But wouldn't it be possible to use the pipe in memory, in theory?
    That would be way faster, and since I have (in theory) enough RAM it
    might be a great improvement..
    andrea crotti, Nov 14, 2012
    #17
  18. Ok this is all very nice, but:

    [andrea@andreacrotti tar_baller]$ time python2 test_pipe.py > /dev/null

    real 0m21.215s
    user 0m0.750s
    sys 0m1.703s

    [andrea@andreacrotti tar_baller]$ time ls -lR /home/andrea | cat > /dev/null

    real 0m0.986s
    user 0m0.413s
    sys 0m0.600s


    where test_pipe.py is:
    from subprocess import PIPE, Popen

    # check if doing the pipe with subprocess and with the | is the same or not

    pipe_file = open('pipefile', 'w')


    p1 = Popen('ls -lR /home/andrea', shell=True, stdout=PIPE, stderr=PIPE)
    p2 = Popen('cat', shell=True, stdin=p1.stdout, stdout=PIPE, stderr=PIPE)
    p1.stdout.close()

    print(p2.stdout.read())


    So apparently it's way slower than the plain shell pipeline; is this normal?
    andrea crotti, Nov 14, 2012
    #18
  19. Dave Angel Guest

    On 11/14/2012 10:56 AM, andrea crotti wrote:
    > Ok this is all very nice, but:
    >
    > [andrea@andreacrotti tar_baller]$ time python2 test_pipe.py > /dev/null
    >
    > real 0m21.215s
    > user 0m0.750s
    > sys 0m1.703s
    >
    > [andrea@andreacrotti tar_baller]$ time ls -lR /home/andrea | cat > /dev/null
    >
    > real 0m0.986s
    > user 0m0.413s
    > sys 0m0.600s
    >
    > <snip>
    >
    >
    > So apparently it's way slower than using this system, is this normal?


    I'm not sure how this timing relates to the thread, but what it mainly
    shows is that starting up the Python interpreter takes quite a while,
    compared to not starting it up.


    --

    DaveA
    Dave Angel, Nov 14, 2012
    #19
  20. 2012/11/14 Dave Angel <>:
    > On 11/14/2012 10:56 AM, andrea crotti wrote:
    >> Ok this is all very nice, but:
    >>
    >> [andrea@andreacrotti tar_baller]$ time python2 test_pipe.py > /dev/null
    >>
    >> real 0m21.215s
    >> user 0m0.750s
    >> sys 0m1.703s
    >>
    >> [andrea@andreacrotti tar_baller]$ time ls -lR /home/andrea | cat > /dev/null
    >>
    >> real 0m0.986s
    >> user 0m0.413s
    >> sys 0m0.600s
    >>
    >> <snip>
    >>
    >>
    >> So apparently it's way slower than using this system, is this normal?

    >
    > I'm not sure how this timing relates to the thread, but what it mainly
    > shows is that starting up the Python interpreter takes quite a while,
    > compared to not starting it up.
    >
    >
    > --
    >
    > DaveA
    >



    Well, it's related because my program has to be as fast as possible,
    so I thought that using Python pipes would be better, because I can
    easily get the PID of the first process.

    But if it's that slow then it's not worth it, and I don't think it's
    the Python interpreter, because it's consistently many times slower
    even when I change the size of the input..
    andrea crotti, Nov 14, 2012
    #20
