object references/memory access

Martin v. Löwis

I have found the stop-and-go between two processes on the same machine
leads to very poor throughput. On a single-core CPU, where only one
process can run at a time, the producer must get off the CPU so that
the consumer may start draining the pipe.

It's still not clear why you say that the producer can run "at its top
speed". You seem to be suggesting that in such a setup, the CPU would
be idle, i.e. not 100% loaded. Assuming that the consumer won't block
for something else, then both processes will run at their "top speed".
Of course, for two processes running on a single CPU, the top speed
won't be the MIPS of a single processor, as they have to share the CPU.

So when you say it leads to very poor throughput, I ask: compared
to what alternative?

Regards,
Martin
 

John Nagle

Well, I was using the regular pickle at first, but then I switched to
just using repr() / eval(), because the resulting string doesn't have
all the extra 's1=' markup and all that, so it cuts down on the amount
of data I have to send for large returns. The speed of the above method
is pretty high even for really large returns, and it works fine for a
list of dictionaries.

OK, that's where the time is going. It's not the interprocess
communication cost; it's the marshalling cost. repr/eval is not
an efficient way to marshal data. Try using "pack" and "unpack", if
you control both ends of the connection.
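
As a sketch of what that can look like with Python's struct module --
assuming both ends agree on a fixed record layout; the (int, float)
record below is purely illustrative:

    import struct

    # Hypothetical record layout both ends must agree on:
    # a 4-byte int and a 4-byte float, network byte order.
    RECORD = struct.Struct("!if")

    def marshal(records):
        # Length-prefix the payload so the receiver knows how much to read.
        payload = b"".join(RECORD.pack(i, v) for i, v in records)
        return struct.pack("!I", len(payload)) + payload

    def unmarshal(data):
        (length,) = struct.unpack_from("!I", data)
        body = data[4:4 + length]
        return [RECORD.unpack_from(body, off)
                for off in range(0, length, RECORD.size)]

    print(unmarshal(marshal([(1, 2.5), (2, 3.25)])))  # [(1, 2.5), (2, 3.25)]

Fixed-layout packing avoids both repr's text formatting and pickle's
framing overhead, at the cost of flexibility.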

John Nagle
 

John Nagle

Steve said:
Karthik said:
[...]


I have found the stop-and-go between two processes on the same machine
leads to very poor throughput. By stop-and-go, I mean the producer and
consumer are constantly getting on and off the CPU, since the pipe
gets full (or, for the consumer, empty). Note that a producer can't run
at its top speed, as the scheduler will pull it out once its output
pipe gets filled up.

This is in fact true, but the overheads of CPython are so large
that you don't usually see it. If message passing in CPython is slow,
it's usually because the marshalling cost is too high. As I mentioned
previously, try "pack" instead of "pickle" or "repr" if you control
the interface on both ends of the connection.

I've used QNX, the message-passing real time operating system,
extensively. QNX has the proper mechanisms to handle interprocess
communication efficiently; we could pipe uncompressed video through
the message passing system and only use 3% of the CPU per stream.
QNX deals with the "stop and go" problem properly; interprocess
communication via MsgSend and MsgReceive works more like a subroutine
call than a queued I/O operation. In the normal case, you pay for
a byte copy and a context switch for a message pass, but it's
not necessary to take a trip through the scheduler.

Not many operating systems get this right. Minix 3 does,
and there are infrequently used facilities in NT that almost do.
But pipes, sockets, and System V IPC as in Linux all take you through
the scheduler extra times. This is a killer if there's another compute
bound process running; on each message pass you lose your turn for
the CPU.

(There have been worse operating systems. The classic MacOS allowed
you one (1) message pass per clock tick, or 60 messages per second.)

John Nagle
 

Karthik Gurusamy

It's still not clear why you say that the producer can run "at its top
speed". You seem to be suggesting that in such a setup, the CPU would
be idle, i.e. not 100% loaded. Assuming that the consumer won't block
for something else, then both processes will run at their "top speed".
Of course, for two processes running on a single CPU, the top speed
won't be the MIPS of a single processor, as they have to share the CPU.

So when you say it leads to very poor throughput, I ask: compared
to what alternative?

Let's assume two processes P and C. P is the producer of data; C, the
consumer. To answer your specific question: compared to running P to
completion and then running C to completion. The less optimal way is
p1 --> c1 --> p2 --> c2 --> ... --> p_n --> c_n, where p_i is a
time-slice when P is on the CPU and c_i is a time-slice when C is on
the CPU.

If the problem does not require two-way communication, which is
typical of a producer-consumer setup, it is a lot faster to allow P to
fully run before C is started.

If P and C are tied together by a pipe, then on most Linux-like OSes
(QNX may be doing something really smart, as noted by John Nagle), there
is a big cost in the scheduler constantly swapping P and C on and off
the CPU. You may ask why? Because the data flowing between P and C has
only a small, finite space (the pipe buffer). Once P fills it, P will
block; the scheduler sees C is runnable and puts C on the CPU.

Thus even if the CPU is 100% busy, useful work is not 100% of it; the
process-swap overhead can kill the performance.

When we use an intermediate file to capture the data, we allow P to
run for much bigger time-slices. Assuming plenty of file-system
buffering, it's quite possible that P gets one go on the CPU and
finishes its job of data generation.
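
As a sketch of that run-to-completion strategy -- "producer" and
"consumer" below are hypothetical commands standing in for P and C:

    import subprocess, tempfile

    # Decouple P and C with a temporary file instead of a pipe,
    # so P can run to completion before C ever starts.
    with tempfile.TemporaryFile() as buf:
        subprocess.run(["producer"], stdout=buf, check=True)  # P runs alone
        buf.seek(0)                                           # rewind for C
        subprocess.run(["consumer"], stdin=buf, check=True)   # C drains it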

Note that all of this becomes invalid if you have more than one core
and the scheduler can keep both P and C running on two cores
simultaneously. In that case, we don't incur this process-swap
overhead and we may not see the stop-and-go performance drop.

Thanks,
Karthik
 
Martin v. Löwis

If the problem does not require two-way communication, which is
typical of a producer-consumer setup, it is a lot faster to allow P to
fully run before C is started.

Why do you say it's *a lot* faster? I find that it is a little faster.
The only additional overhead from switching forth and back between
consumer and producer is the overhead for context switching, which
is typically negligible, compared to everything else that is going
on.

Regards,
Martin
 

Karthik Gurusamy

Why do you say it's *a lot* faster? I find that it is a little faster.
The only additional overhead from switching forth and back between
consumer and producer is the overhead for context switching, which
is typically negligible, compared to everything else that is going
on.

True, it needn't be *a lot*. I did observe a 25% gain or more when
there was a chain of processes involved, as in a shell pipeline. Again,
this could be very problem-specific. What I had included something like
4 or 5 processes connected as in p1 | p2 | p3 | p4; here I found the
back-and-forth context switching was slowing things down quite a bit
(something like a 2-minute task completed in under 40 seconds without
the piping).

If all you have is just two processes, P and C, and the amount of data
flowing is small (say on the order of tens of buffer-sizes, e.g. 20
times 4k), *a lot* may not be the right quantifier. But if the data is
large and several processes are involved, I am fairly sure the overhead
of context switching is very significant (not negligible) in the final
throughput.

Thanks,
Karthik
 

Terry Reedy

|If all you have is just two processes, P and C, and the amount of data
|flowing is small (say on the order of tens of buffer-sizes, e.g. 20
|times 4k), *a lot* may not be the right quantifier.

Have pipe buffer sizes really not been scaled up with RAM sizes?
4K on a 4M machine is sensible, but on my 1G machine, up to 1M might be ok.
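
For what it's worth, on Linux the pipe buffer can be inspected and
enlarged per-pipe -- a small sketch, assuming a modern setup (the
F_GETPIPE_SZ/F_SETPIPE_SZ fcntls need Linux 2.6.35+, and Python only
exposes the constants from 3.10 on; the default is usually 64 KiB
today):

    import fcntl, os

    r, w = os.pipe()
    print("default:", fcntl.fcntl(w, fcntl.F_GETPIPE_SZ))

    # Ask for a 1 MiB buffer; the kernel may round the size up.
    fcntl.fcntl(w, fcntl.F_SETPIPE_SZ, 1024 * 1024)
    print("resized:", fcntl.fcntl(w, fcntl.F_GETPIPE_SZ))

    os.close(r)
    os.close(w)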

tjr
 

John Nagle

Karthik said:
If the problem does not require two-way communication, which is
typical of a producer-consumer setup, it is a lot faster to allow P to
fully run before C is started.

If P and C are tied together by a pipe, then on most Linux-like OSes
(QNX may be doing something really smart, as noted by John Nagle), there
is a big cost in the scheduler constantly swapping P and C on and off
the CPU. You may ask why? Because the data flowing between P and C has
only a small, finite space (the pipe buffer). Once P fills it, P will
block; the scheduler sees C is runnable and puts C on the CPU.

The killer case is where there's another thread or process other than C
already ready to run when P blocks. The other thread, not C, usually
gets control, because it was ready to run first, and not until the other
thread runs out its time quantum does C get a turn. Then C gets to
run briefly, drains out the pipe, and blocks. P gets to run,
fills the pipe, and blocks. The compute-bound thread gets to run,
runs for a full time quantum, and loses the CPU to C. Wash,
rinse, repeat.

The effect is that pipe-like producer-consumer systems may get only a small
fraction of the available CPU time on a busy system.

When testing a producer-consumer system, put a busy loop in the
background and see if performance becomes terrible. Throughput ought to
drop by about 50% against an equal-priority compute-bound process; if
it drops by far more than that, you have the problem described here.
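
A sketch of that test in Python -- a pipe benchmark timed with and
without a compute-bound competitor. This is Unix-only, since it relies
on fork-style fd inheritance, and on a multi-core box you would need
one busy process per core to see the full effect:

    import multiprocessing, os, time

    def busy():                      # compute-bound competitor
        while True:
            pass

    def producer(r, w):
        os.close(r)                  # close the end we don't use
        with os.fdopen(w, "wb") as f:
            for _ in range(50_000):
                f.write(b"x" * 4096)

    def consumer(r, w):
        os.close(w)                  # or EOF never arrives
        with os.fdopen(r, "rb") as f:
            while f.read(65536):
                pass

    def timed_run():
        r, w = os.pipe()
        p = multiprocessing.Process(target=producer, args=(r, w))
        c = multiprocessing.Process(target=consumer, args=(r, w))
        t = time.perf_counter()
        p.start(); c.start()
        os.close(r); os.close(w)     # parent drops its copies too
        p.join(); c.join()
        return time.perf_counter() - t

    if __name__ == "__main__":
        multiprocessing.set_start_method("fork")  # children inherit the fds
        base = timed_run()
        hog = multiprocessing.Process(target=busy, daemon=True)
        hog.start()
        loaded = timed_run()
        hog.terminate()
        print(f"idle: {base:.2f}s  loaded: {loaded:.2f}s")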

This problem is sometimes called "What you want is a subroutine call;
what the OS gives you is an I/O operation." When you make a subroutine
call on top of an I/O operation, you get these scheduling problems.

John Nagle
 

greg

John said:
C gets to
run briefly, drains out the pipe, and blocks. P gets to run,
fills the pipe, and blocks. The compute-bound thread gets to run,
runs for a full time quantum, and loses the CPU to C. Wash,
rinse, repeat.

I thought that unix schedulers were usually a bit more
intelligent than that, and would dynamically lower the
priority of processes using CPU heavily.

If it worked purely as you describe, then a CPU-bound
process would make any interactive application running
at the same time very unresponsive. That doesn't seem
to happen on any of today's desktop unix systems.
 

Dennis Lee Bieber

I thought that unix schedulers were usually a bit more
intelligent than that, and would dynamically lower the
priority of processes using CPU heavily.

Think VMS was the most applicable for that behavior... Haven't seen
any dynamic priorities on the UNIX/Linux/Solaris systems I've
encountered...
If it worked purely as you describe, then a CPU-bound
process would make any interactive application running
at the same time very unresponsive. That doesn't seem
to happen on any of today's desktop unix systems.

If a process is known to be CPU-bound, I think it is typical
practice to "nice" the process... lowering its priority by direct
action.
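
For example (a trivial sketch of what a compute-bound job can do for
itself at startup):

    import os

    # Raise our own niceness to 19 (lowest priority on Unix); the shell
    # equivalent is: nice -n 19 long_running_job
    os.nice(19)
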
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/
 

greg

Dennis said:
If a process is known to be CPU bound, I think it is typical
practice to "nice" the process... Lowering its priority by direct
action.

Yes, but one usually only bothers with this for long-running
tasks. It's a nicety, not an absolute requirement.

It seems like this would have been an even more important
issue in the timesharing environments where unix originated.
You wouldn't want everyone's text editors suddenly starting
to take half a second to respond to keystrokes just because
someone launched "cc -O4 foo.c" without nicing it.
 
