creating size-limited tar files

Dave Angel · Nov 14, 2012

2012/11/14 Dave Angel said:

Ok this is all very nice, but:

[andrea@andreacrotti tar_baller]$ time python2 test_pipe.py > /dev/null

real 0m21.215s
user 0m0.750s
sys 0m1.703s

[andrea@andreacrotti tar_baller]$ time ls -lR /home/andrea | cat > /dev/null

real 0m0.986s
user 0m0.413s
sys 0m0.600s

<snip>

So apparently the Python version is way slower than the plain shell pipeline. Is this normal?

I'm not sure how this timing relates to the thread, but what it mainly
shows is that starting up the Python interpreter takes quite a while,
compared to not starting it up.

Well, it's related because my program has to be as fast as possible, so
in theory I thought that using Python pipes would be better because I
can easily get the PID of the first process.

But if it's so slow then it's not worth it, and I don't think it's the
Python interpreter, because it's more or less constantly many times
slower even when changing the size of the input.

Well, as I said, I don't see how the particular timing has anything to
do with the rest of the thread. If you want to do an ls within a Python
program, go ahead. But if all you need can be done with ls itself, then
it'll be slower to launch python just to run it.

Your first timing runs python, which runs two new shells, ls, and cat.
Your second timing runs ls and cat.

So the difference is starting up python, plus starting the shell two
extra times.

I'd also be curious if you flushed the system buffers before each
timing, as the second test could be running entirely in system memory.
And no, I don't know offhand how to flush them in Linux, just that
without it, your timings are not at all repeatable. Note the two
identical runs here.

davea@think:~/temppython$ time ls -lR ~ | cat > /dev/null

real 0m0.164s
user 0m0.020s
sys 0m0.000s
davea@think:~/temppython$ time ls -lR ~ | cat > /dev/null

real 0m0.018s
user 0m0.000s
sys 0m0.010s

Real time goes down by about 90%, while user time drops to zero.
And on the third and subsequent runs, sys time goes to zero as well.
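
To see the caching effect from inside Python, one option is to time the same command several times in a row. The sketch below is only an illustration: `time_runs` is a hypothetical helper, not something from this thread, and it assumes Python 3.5 or later (for `subprocess.run`). The first run may be slower if the data is not yet in the page cache. For a cold-cache measurement on Linux, the cache can be dropped as root by running `sync` and then writing `3` to `/proc/sys/vm/drop_caches`; the sketch does not do that.

```python
import subprocess
import time


def time_runs(cmd, repeats=3):
    """Run cmd `repeats` times and return the wall-clock seconds of each run.

    Hypothetical helper for illustration. Output is discarded so that
    terminal speed does not influence the measurement.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # a non-zero exit status should not abort the timing
        )
        timings.append(time.perf_counter() - start)
    return timings
```

Calling `time_runs(["ls", "-lR", "/home/andrea"])` and comparing the first entry with the later ones gives a rough idea of how much of a measurement is cache warm-up rather than the work itself.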

Andrea Crotti · Nov 14, 2012

Right, I didn't think about that.
Anyway, the only thing I wanted to understand is whether using the pipes in
subprocess is exactly the same as using a Linux pipe.

And any idea on how to run it in RAM?
Maybe if I create a pipe in tmpfs it might already work, what do you think?

Dave Angel · Nov 14, 2012

<SNIP>
Anyway, the only thing I wanted to understand is whether using the pipes in
subprocess is exactly the same as using a Linux pipe.

It's not the same thing, but you can usually assume it's close. Other
effects will probably dominate any differences.
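
As a rough illustration of how close the two are: on POSIX systems `subprocess.PIPE` is backed by an ordinary kernel pipe, the same kind of object the shell creates for `|`. The sketch below connects two commands without going through a shell, returns the PID of the first process, and checks that the connecting file descriptor is a FIFO. `run_pipeline` is a hypothetical name chosen for this example, and the code assumes Python 3 on a POSIX system.

```python
import os
import stat
import subprocess


def run_pipeline(first_cmd, second_cmd):
    """Run the equivalent of `first_cmd | second_cmd` without a shell.

    Hypothetical helper for illustration. Returns a tuple of
    (PID of the first process, whether its stdout is a kernel pipe,
    bytes written by the second process).
    """
    first = subprocess.Popen(first_cmd, stdout=subprocess.PIPE)

    # Inspect the descriptor that links the two processes.
    is_fifo = stat.S_ISFIFO(os.fstat(first.stdout.fileno()).st_mode)

    second = subprocess.Popen(
        second_cmd, stdin=first.stdout, stdout=subprocess.PIPE
    )
    # Close the parent's copy of the read end so the first process
    # can receive SIGPIPE if the second one exits early.
    first.stdout.close()

    output, _ = second.communicate()
    first.wait()
    return first.pid, is_fifo, output
```

For example, `run_pipeline(["ls", "-lR", "/home/andrea"], ["cat"])` starts only `ls` and `cat`, with no intermediate shells, and the data flows directly between the two children rather than through Python.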

And any idea on how to run it in RAM?
Maybe if I create a pipe in tmpfs it might already work, what do you think?

In a good virtual-memory OS, such as Linux, there's very little predictable
difference between running in RAM (which is to say reading and writing
to the swap file) and reading and writing to a file you specify. In
fact, writing to a file can frequently be quicker, if it's sequential.

Why? Linux can use any given piece of physical RAM to map a file, or
an allocated buffer, or shared memory, or nearly anything. About the
only special cases are the kinds of memory that have to be locked into RAM
for hardware reasons.

Linux decides which pieces to keep in memory, whether it calls it
caching, swapping, memory mapping, or whatever. And frequently,
attempts to "beat the system" produce counterintuitive results.

If in doubt, measure. But choose your measures carefully, because lots
more things will change the measurement than you might expect.
