Strange system() slowdown when using Inline::C

rbg · Mar 5, 2009

Before reading the code below, I want to make it clear that this code
is just a *simple reproduction* of a bug I'm seeing in some
computational biology code. So please don't point out things like,
"why are you using system() to run date?!?".

The problem I'm having is that when I make the array foo[] large, the
calls to system() slow down enormously. I'm measuring this problem
using `perl -d

Prof ...`

Test case 1 below is slow, test case 2 is faster. The only difference
is LEN = (1<<26) in test case 1 and LEN = (1<<10) in test case 2. I
welcome any ideas on how to debug this bizarre problem.

Thanks!
Ross

Test case 1:
-------------------------
use Inline C;
use strict;
hello();

sub foo {
for my $i (1..1000) {
system('date > /dev/null');
}
}
foo();
__END__
__C__
#define LEN (1<<26)
static char foo[LEN];

void hello() {
int i;
printf("hello world\n");
for (i = 0; i < LEN; i++) {
foo = 0;
}
}
-------------------------

Test case 2
-------------------------
use Inline C;
use strict;
hello();

sub foo {
for my $i (1..1000) {
system('date > /dev/null');
}
}
foo();
__END__
__C__
#define LEN (1<<10)
static char foo[LEN];

void hello() {
int i;
printf("hello world\n");
for (i = 0; i < LEN; i++) {
foo = 0;
}
}
-------------------------

rbg · Mar 5, 2009

I should include the DProf output for these cases.

Test 1:

Total Elapsed Time = 12.65265 Seconds
User+System Time = 3.202649 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
90.5 2.900 2.900 1 2.9000 2.9000 main::foo

Test 2:

Total Elapsed Time = 4.697635 Seconds
User+System Time = 0.337635 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
77.0 0.260 0.260 1 0.2600 0.2600 main::foo

You can see that in test 2, foo() is roughly 10x faster.

sisyphus · Mar 5, 2009

You can see that in test 2, foo() is roughly 10x faster.

What happens if hello() is written as a perl sub (and Inline::C is not
used at all) - something like (untested):

use constant LEN => 1<<26;

sub hello {
my $i;
my @foo;
print "hello world\n";
for($i = 0; $i < LEN; $i++) {
$foo[$i] = chr(0);
}
}

Does the speed with which foo() executes still vary in the same way,
depending upon the value of LEN ?
Or does the variation occur *only* when hello() is written as an
Inline::C sub ?

Cheers,
Rob

rbg · Mar 5, 2009

Continuing the discussion with myself, it seems that the problem is
really much more basic.

This script:
--------
use strict;
my $bf = 'a' x (1<<26);

sub foo {
for my $i (1..1000) {
system('date > /dev/null');
}
}
foo();
--------

is much slower than this one than this one:

--------
use strict;
my $bf = 'a' x (1<<10);

sub foo {
for my $i (1..1000) {
system('date > /dev/null');
}
}
foo();
--------.

I can reproduce this issue ubuntu with perl 5.10.0 (and debian with
perl 5.8.8), but *not* on OS X with perl 5.8.8. It seems like it has
something to do with how the perl interpreter is cloned when system()
is called (seems that it isn't actually doing copy-on-write forking).
Anyone have a deeper understanding of this issue?

A. Sinan Unur · Mar 5, 2009

Continuing the discussion with myself, it seems that the problem is
really much more basic.

This script:
--------
use strict;
my $bf = 'a' x (1<<26);

sub foo {
for my $i (1..1000) {
system('date > /dev/null');
}
}
foo();
--------

is much slower than this one than this one:

--------
use strict;
my $bf = 'a' x (1<<10);

sub foo {
for my $i (1..1000) {
system('date > /dev/null');
}
}
foo();
--------.

I don't see that.

This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)

Binary build 1004 [287188] provided by ActiveState

C:\DOCUME~1\asu1\LOCALS~1\Temp\q> cat t.pl
use strict;
my $bf = 'a' x (1<<26);

sub foo {
for my $i (1..1000) {
system('date /t > NUL');
}
}
foo();

TimeThis : Command Line : t.pl
TimeThis : Start Time : Thu Mar 05 18:39:11 2009
TimeThis : End Time : Thu Mar 05 18:40:11 2009
TimeThis : Elapsed Time : 00:01:00.078

C:\DOCUME~1\asu1\LOCALS~1\Temp\q> cat s.pl
use strict;
my $bf = 'a' x (1<<10);

sub foo {
for my $i (1..1000) {
system('date /t > NUL');
}
}
foo();

TimeThis : Command Line : s.pl
TimeThis : Start Time : Thu Mar 05 18:41:35 2009
TimeThis : End Time : Thu Mar 05 18:42:34 2009
TimeThis : Elapsed Time : 00:00:59.640

C:\DOCUME~1\asu1\LOCALS~1\Temp\q> cat z.pl
use strict;

sub foo {
for my $i (1..1000) {
system('date /t > NUL');
}
}
foo();

TimeThis : Command Line : z.pl
TimeThis : Start Time : Thu Mar 05 18:43:11 2009
TimeThis : End Time : Thu Mar 05 18:44:10 2009
TimeThis : Elapsed Time : 00:00:59.015

C:\DOCUME~1\asu1\LOCALS~1\Temp\q> cat u.pl
use strict;
my $bf = 'a' x (1<<26);

TimeThis : Command Line : u.pl
TimeThis : Start Time : Thu Mar 05 18:45:05 2009
TimeThis : End Time : Thu Mar 05 18:45:06 2009
TimeThis : Elapsed Time : 00:00:00.390

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

A. Sinan Unur · Mar 6, 2009

Well, no. Since this is probably a problem with fork(2) of a large
address space, and Win32 doesn't use fork(2), you wouldn't.

OK then ... I tried it on my Eee PC 701 with ArchLinux using the stock
perl v5.10.0 while downloading system updates.

time shows:

real 0m37.892s
user 0m17.959s
sys 0m17.882s

For control, omitting the large assignment to $bf yields

real 0m24.965s
user 0m17.586s
sys 0m6.416s

On the other hand, using

my $bf = 'a' x (1 << 10);

yields

real 0m24.748s
user 0m17.322s
sys 0m6.526s

Clearly, there is a slowdown with the larger $bf, but there is not a 10x
difference.

PS: Apologies for not copying and pasting.

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

rbg · Mar 6, 2009

Thanks for your responses.

I guess my real question is, can this performance penalty be
eliminated? My understanding is that linux uses copy-on-write for the
fork() system call, so forking a process is very cheap, and
independent of memory usage (modulo page table copying). My
understanding at this point is that, somehow, perl has found a way to
defeat the copy-on-write gains. This seems rather unfortunate. Is
there some way to tell perl, "when I call system() don't copy the
state of the interpreter"?

thanks
-ross

tuser1 · Mar 6, 2009

Well, no. Since this is probably a problem with fork(2) of a large
address space, and Win32 doesn't use fork(2), you wouldn't.

I have looked up perldoc -f system where it says:
"[...] Does exactly the same thing as exec LIST, except that a fork is
done first, and the parent process waits for the child process to
complete [...]"

I see that system() involves a fork, but a naïve question here: Why
does system() need to fork, can it not just simply run whatever needs
running (a shell program presumably) without forking and return to the
perl program to continue the program ? -- this would avoid the
unneccessary large address space.

rbg · Mar 7, 2009

Ben -- right, you are. On the OS X system I get d_vfork='define' and
on the linux systems I get d_vfork='undef'. Is this a compile time
configuration option or is there a way to change it at runtime?

Thanks,
ross

rbg · Mar 7, 2009

Quoth rbg <[email protected]>:

It's set at perl Configure time. You could try building perl with
-Dusevfork on the Configure line, but there may be a good reason for it
being disabled by default. Make sure you run all the tests.

Ben

The linux man pages caution strongly against using vfork() -- saying
that its deprecated and should not be used because fork() is
implemented efficiently using COW. This brings me back to the same
question: if fork() is behaving efficiently, what else is perl doing
on linux that is causing the execution time of system() to grow
(significantly) with process size? Maybe I need to dig into the perl
source to answer this question...

-ross

rbg · Mar 7, 2009

And on my linux system it's implemented with clone, not fork

strace perl -e'system "true"'
<snip>
brk(0x8b9c000) = 0x8b9c000
close(3) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|
SIGCHLD, child_tidptr=0xb7e9a908) = 16784
--- SIGCHLD (Child exited) @ 0 (0) ---
close(4) = 0
rt_sigaction(SIGINT, {SIG_IGN}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN}, {SIG_DFL}, 8) = 0
waitpid(16784, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0) = 16784
rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, NULL, 8) = 0
</snip>

Odd.

Peter J. Holzer · Mar 7, 2009

Even with COW, fork(2) still has to copy all the page tables, which
makes it linear in the size of the process address space, albeit with a
much smaller constant than without COW. vfork(2) should be more-or-less
constant time (it will be linear in the number for open file
descriptors, since those are still cloned).

This is correct. A simple C program which just calls (v)fork/waitpid in a
tight loop (the child process exits immediately) shows that the
execution time increases with the process size for fork, but is roughly
constant for vfork.

See http://www.hjp.at/os/fork-vs-vfork/ for a graph.

hp

Peter J. Holzer · Mar 7, 2009

And on my linux system it's implemented with clone, not fork

strace perl -e'system "true"'
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|
SIGCHLD, child_tidptr=0xb7e9a908) = 16784

Odd.

fork() and vfork() are just wrapper functions for clone() on Linux (the
"real" fork and vfork syscalls still exist for binary compatibility but
have been deprecated for a very long time and aren't used by any current
libc (AFAIK)).

hp

Ilya Zakharevich · Mar 8, 2009

The linux man pages caution strongly against using vfork() -- saying
that its deprecated and should not be used because fork() is
implemented efficiently using COW. This brings me back to the same
question: if fork() is behaving efficiently, what else is perl doing

As you know now, using fork() for startup of a child process was one
of the most brain-damaged decisions of designers of a certain one-user
non-multiprocessing OS for micro-computers.

When people say that "fork() is efficient" they mean just that it is
"more effecient than it was when one had no COW optimization".

Hope this helps,
Ilya

P.S. I wonder why people do not enable Perl to use the POSIX variant
of spawn(). All the necessary code is already there in DOSISH
branches of the codebase...

How to multiply two matrices of size in using inline assembly in C++	2	Mar 3, 2024
Inline::C + -T problem	3	Jan 21, 2010
Win32 Perl slowdown	0	Feb 2, 2007
Lexical Analysis on C++	1	Oct 31, 2023
Inline C	1	Aug 6, 2004
Boomer trying to learn coding in C and C++	6	Dec 16, 2022
Curious benchmark results with Inline::C	3	Dec 16, 2009
Error when combining threads and system()	10	Jan 8, 2010

Strange system() slowdown when using Inline::C

rbg

rbg

sisyphus

rbg

A. Sinan Unur

A. Sinan Unur

rbg

tuser1

rbg

rbg

rbg

Peter J. Holzer

Peter J. Holzer

Ilya Zakharevich

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads