Threads vs Processes


Carl J. Van Arsdall

Alright, based on a discussion on this mailing list, I've started to
wonder: why use threads vs. processes? So, if I have a system that has a
large area of shared memory, which would be better? I've been leaning
towards threads, and I'll say why.

Processes seem fairly expensive from my research so far. Each fork
copies the entire contents of memory into the new process. There's also
a more expensive context switch between processes. So if I have a
system that would fork 50+ child processes, my memory usage would be huge
and I'd burn more cycles than I need to. I understand that there
are ways of doing IPC, but aren't these also more expensive?

So threads seem faster and more efficient for this scenario. That
alone makes me want to stay with threads, but I get the feeling from
people on this list that processes are better and that threads are
overused. I don't understand why, so can anyone shed any light on this?


Thanks,

-carl

--

Carl J. Van Arsdall
(e-mail address removed)
Build and Release
MontaVista Software
 

Chance Ginger


Not quite that simple. In most modern OSes today there is something
called COW - copy-on-write. What happens is that when you fork a process
it makes an identical copy, but only when the forked process actually
writes does it copy the memory page. So it isn't quite as bad.

Secondly, with context switching, if the OS is smart it might not
flush the entire TLB. Since most applications are pretty "local" as
far as execution goes, it may very well be the case that the page (or
pages) are already in memory.

As far as Python goes, what you need to determine is how much
real parallelism you want. Since there is a global lock in Python,
you will only execute a few (as in tens of) instructions before
switching to the next thread. In the case of true processes you
have two independent Python virtual machines. That may make things
go much faster.

Another issue is the libraries you use. A lot of them aren't
thread-safe, so you need to watch out.

Chance
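
A minimal sketch (not from the original posts, written in Python 2 style to
match the era of this thread) of the effect Chance describes: for pure-Python,
CPU-bound work, two threads are no faster than doing the same work
sequentially, because the global interpreter lock lets only one thread execute
bytecode at a time.

import threading
import time

def count_down(n):
    # pure-Python loop: holds the GIL while it runs
    while n > 0:
        n -= 1

N = 5000000

start = time.time()
count_down(N)
count_down(N)
print "sequential : %.2f sec" % (time.time() - start)

start = time.time()
t1 = threading.Thread(target=count_down, args=(N,))
t2 = threading.Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print "two threads: %.2f sec" % (time.time() - start)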
 

John Henry


It's all about performance (and sometimes the "perception" of
performance). Even though the thread support (and performance) in
Python is fairly weak (as explained by Chance), it's nonetheless very
useful. My applications thread a lot and it proves to be invaluable,
particularly with GUI-type applications. I am the type of user that
gets annoyed very quickly and easily if the program doesn't respond
when I click something. So, as a rule of thumb, if the code has to
do much of anything that takes, say, a tenth of a second or more, I
thread.

I posted a simple demo program yesterday to the PythonCard list to show
why somebody would want to thread an app. You can probably see it in the
archive.
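
A rough, toolkit-free sketch (not John's PythonCard demo, which isn't
reproduced here) of the pattern he describes: push the slow work onto a
background thread so the main loop stays free to respond.

import threading
import time

result = []

def slow_task():
    # stands in for anything that takes a tenth of a second or more
    time.sleep(2.0)
    result.append("done")

worker = threading.Thread(target=slow_task)
worker.start()

# the "GUI event loop": keeps servicing the user while the worker runs
while worker.isAlive():
    print "still responding to clicks..."
    time.sleep(0.5)

print "background task finished:", result[0]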
 

Paul Rubin

Carl J. Van Arsdall said:
Processes seem fairly expensive from my research so far. Each fork
copies the entire contents of memory into the new process.

No, you get two processes whose address spaces both contain the data.
It's done with the virtual memory hardware. The data isn't copied: the
page tables of both processes are simply set up to point to the same
physical pages. Copying only happens if a process writes to one of
the pages; the OS detects this using a hardware trap from the VM
system.
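
A small Unix-only sketch of what Paul describes: build a sizeable structure,
fork, and let the child read it. The fork itself copies page tables, not data;
physical pages are duplicated only when written. (One caveat, an assumption of
this note rather than something from the post: CPython's reference counting
writes to object headers, so merely touching Python objects can still dirty
pages.)

import os

data = range(5000000)   # a few tens of megabytes of Python objects

pid = os.fork()
if pid == 0:
    # child: the pages holding `data` are shared with the parent until written
    print "child sees %d items" % len(data)
    os._exit(0)
else:
    os.waitpid(pid, 0)
    print "parent done"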
 

Russell Warren

Another issue is the libraries you use. A lot of them aren't
thread-safe, so you need to watch out.

This is something I have a streak of paranoia about (after discovering
that the current xmlrpclib has some thread safety issues). Is there a
list maintained anywhere of the modules that aren't thread-safe?

Russ
 

Russell Warren

Oops - minor correction... xmlrpclib is fine (I think/hope). It is
SimpleXMLRPCServer that currently has issues. It uses
thread-unfriendly sys.exc_value and sys.exc_type... this is being
corrected.
 

Paddy


Carl,
OS writers provide many more tools for debugging, tracing, changing
the priority of, and sandboxing processes than threads (in general). It
*should* be easier to get a process-based solution up and running,
and have it be more robust, when compared to a threaded solution.

- Paddy (who shies away from threads in C and C++ too ;-)
 

John Henry


That mythical "processes are more robust than threads" paradigm again.

No wonder there are so many boring software applications around.

Granted, threaded programs force you to think about and design your
application much more carefully (to avoid race conditions, deadlocks,
...), but there is nothing inherently *non-robust* about threaded
applications.
 

Gerhard Fiedler

Granted, threaded programs force you to think about and design your
application much more carefully (to avoid race conditions, deadlocks,
...), but there is nothing inherently *non-robust* about threaded
applications.

You just need to make sure that every piece of code you're using is
thread-safe. Whereas making sure everything is "process-safe" is the job
of the OS, so to speak :)

Gerhard
 

Joe Knapka

John said:
That mythical "processes are more robust than threads" paradigm again.

No wonder there are so many boring software applications around.

Granted, threaded programs force you to think about and design your
application much more carefully (to avoid race conditions, deadlocks,
...), but there is nothing inherently *non-robust* about threaded
applications.

In this particular case, the OP (in a different thread)
mentioned that his application will be extended by
random individuals who can't necessarily be trusted
to design their extensions correctly. In that case,
segregating the untrusted code, at least, into
separate processes seems prudent.

The OP also mentioned that:
If I have a system that has a large area of shared memory,
which would be better?

IMO, if you're going to be sharing data structures with
code that can't be trusted to clean up after itself,
you're doomed. There's just no way to make that
scenario work reliably. The best you can do is insulate
that data behind an API (rather than giving untrusted
code direct access to the data -- IOW, don't use threads,
because if you do, they can go around your API and screw
things up), and ensure that each API call leaves the
data structures in a consistent state.

-- JK
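
A minimal sketch of the "insulate the data behind an API" idea, using a
hypothetical ResultStore as the shared structure. Extension code only ever
calls add_result()/get_results(); every call takes a lock and leaves the
dictionary in a consistent state.

import threading

class ResultStore(object):
    """Shared data hidden behind a small, lock-protected API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}

    def add_result(self, key, value):
        self._lock.acquire()
        try:
            self._results[key] = value      # always consistent on exit
        finally:
            self._lock.release()

    def get_results(self):
        self._lock.acquire()
        try:
            return dict(self._results)      # hand out a copy, not the real dict
        finally:
            self._lock.release()

Of course, as Joe says, threads can still reach around an API like this; only
putting the store in a separate process truly prevents untrusted code from
touching the underlying dict directly.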
 

bryanjugglercryptographer

Carl said:
Alright, based on a discussion on this mailing list, I've started to
wonder: why use threads vs. processes.

In many cases, you don't have a choice. If your Python program
is to run other programs, the others get their own processes.
There's no threads option on that.

If multiple lines of execution need to share Python objects,
then the standard Python distribution supports threads, while
processes would require some heroic extension. Don't confuse
sharing memory, which is now easy, with sharing Python
objects, which is hard.

So, if I have a system that has a
large area of shared memory, which would be better? I've been leaning
towards threads, and I'll say why.

Processes seem fairly expensive from my research so far. Each fork
copies the entire contents of memory into the new process.

As others have pointed out, not usually true with modern OSes.

There's also
a more expensive context switch between processes. So if I have a
system that would fork 50+ child processes, my memory usage would be huge
and I'd burn more cycles than I need to.

Again, not usually true. Modern OSes share code across
processes. There's no way to tell the size of 100
unspecified processes, but the number is nothing special.

So threads seem faster and more efficient for this scenario. That
alone makes me want to stay with threads, but I get the feeling from
people on this list that processes are better and that threads are
overused. I don't understand why, so can anyone shed any light on this?

Yes, someone can, and that someone might as well be you.
How long does it take to create and clean up 100 trivial
processes on your system? How about 100 threads? What
portion of your user waiting time is that?
 

Dennis Lee Bieber

Processes seem fairly expensive from my research so far. Each fork
copies the entire contents of memory into the new process. There's also

No, it does NOT. A UNIX-type fork() creates a process header: a PID,
perhaps a newly allocated heap, and a new stack. BUT the executable code
itself is considered read-only and SHARED. Normally, one of the first
things a fork()'d process does is confirm that it is the child (the fork() in
the new process returned 0) and invoke a service that tells the OS
to load a new executable; this is when the memory load takes place.

a more expensive context switch between processes. So if I have a
system that would fork 50+ child processes, my memory usage would be huge

Only if they are using pure and unique code... If all the child
processes use the same shared library, say, then no new memory is needed
beyond the stack/heap.

So threads seem faster and more efficient for this scenario. That
alone makes me want to stay with threads, but I get the feeling from
people on this list that processes are better and that threads are
overused. I don't understand why, so can anyone shed any light on this?

Depending on the OS, the only difference between a thread and a
process may be the memory protection and "environment" (i.e.,
stdin/stdout).
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
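
A hedged, Unix-only sketch of the fork-then-exec sequence Dennis describes:
fork() produces a copy-on-write child, and an exec call then replaces the
child's image with a new program; only at that point is the new executable
loaded.

import os

pid = os.fork()
if pid == 0:
    # child: fork() returned 0, so load a new program over ourselves
    os.execvp("echo", ["echo", "hello from the child process"])
    os._exit(1)   # only reached if execvp fails
else:
    # parent: wait for the child to finish
    os.waitpid(pid, 0)
    print "child finished"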
 

sjdevnull

John said:
That mythical "processes are more robust than threads" paradigm again.

No wonder there are so many boring software applications around.

Granted, threaded programs force you to think about and design your
application much more carefully (to avoid race conditions, deadlocks,
...), but there is nothing inherently *non-robust* about threaded
applications.

Indeed. Let's just get rid of all preemptive multitasking while we're
at it; MacOS9's cooperative, non-memory-protected system wasn't
inherently worse as long as every application was written properly.
There was nothing inherently non-robust about it!

The key difference between threads and processes is that threads share
all their memory, while processes have memory protection except with
particular segments of memory they choose to share.

The next most important difference is that certain languages have
different support for threads/procs. If you're writing a Python
application, you need to be aware of the GIL and its implications on
multithreaded performance. If you're writing a Java app, you're
handicapped by the lack of support for multiprocess solutions.

The third most important difference--and it's a very distant
difference--is the performance difference. In practice, most
well-designed systems will be pooling threads/procs and so startup time
is not that critical. For some apps, it may be. Context switching
time may differ, and likewise that is not usually a sticking point but
for particular programs it can be. On some OSes, launching a
copy-on-write process is difficult--that used to be a reason to choose
threads over procs on Windows, but nowadays all modern Windows OSes
offer a CreateProcessEx call that allows full-on COW processes.

In general, though, if you want to share _all_ memory or if you have
measured and context switching sucks on your OS and is a big factor in
your application, use threads. In general, if you don't know exactly
why you're choosing one or the other, or if you want memory protection,
robustness in the face of programming errors, access to more 3rd-party
libraries, etc, then you should choose a multiprocess solution.

(OS designers spent years of hard work writing OSes with protected
memory--why voluntarily throw that out?)
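
A rough sketch (an assumption of this write-up, not something from the post)
of processes sharing only the memory they choose to share: an anonymous shared
mmap created before fork(). Everything else in the two address spaces stays
protected. Note this shares raw bytes, not Python objects.

import mmap
import os

shared = mmap.mmap(-1, 4096)    # anonymous, MAP_SHARED region (Unix)

pid = os.fork()
if pid == 0:
    shared.seek(0)
    shared.write("hello from the child")
    os._exit(0)
else:
    os.waitpid(pid, 0)
    shared.seek(0)
    print "parent read:", shared.read(20)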
 

sjdevnull

Russell said:
This is something I have a streak of paranoia about (after discovering
that the current xmlrpclib has some thread safety issues). Is there a
list maintained anywhere of the modules that aren't thread-safe?


It's much safer to work the other way: assume that libraries are _not_
thread-safe unless they're listed as such. Even things like the
standard C library on mainstream Linux distributions are only about 7
years into being thread-safe by default; anything at all esoteric you
should assume is not thread-safe until you investigate and find
documentation to the contrary.
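
A minimal defensive pattern following that advice: until you have verified (or
found documentation saying) that a module is thread-safe, funnel every call to
it through a single lock. The _unsafe_operation below is just a stand-in for
such a call.

import threading

_lock = threading.Lock()

def _unsafe_operation(x):
    # stands in for a call into code you have not verified as thread-safe
    return x * 2

def safe_operation(x):
    """Serialize all access to the unverified code behind one lock."""
    _lock.acquire()
    try:
        return _unsafe_operation(x)
    finally:
        _lock.release()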
 

sjdevnull

Indeed. Let's just get rid of all preemptive multitasking while we're
at it

Also, race conditions and deadlocks are equally bad in multiprocess
solutions as in multithreaded ones. Any time you're doing parallel
processing you need to consider them.

I'd actually submit that initially writing multiprocess programs
requires more design and forethought, since you need to determine
exactly what you want to share instead of just saying "what the heck,
everything's shared!" The payoff in terms of getting _correct_
behavior more easily, having much easier maintenance down the line, and
being more robust in the face of program failures (or unforeseen
environment issues) is usually well worth it, though there are
certainly some applications where threads are a better choice.
 

Nick Craig-Wood

Yes, someone can, and that someone might as well be you.
How long does it take to create and clean up 100 trivial
processes on your system? How about 100 threads? What
portion of your user waiting time is that?

Here is a test prog...

The results on my 2.6GHz P4 Linux system are:

Forking
1000 loops, best of 3: 546 usec per loop
Threading
10000 loops, best of 3: 199 usec per loop

Indicating that starting up and tearing down new threads is about 2.5 times
quicker than starting new processes under Python.

This is probably irrelevant in the real world though!


"""
Time threads vs fork
"""

import os
import timeit
import threading

def do_child_stuff():
"""Trivial function for children to run"""
# print "hello from child"
pass

def fork_test():
"""Test forking"""
pid = os.fork()
if pid == 0:
# child
do_child_stuff()
os._exit(0)
# parent - wait for child to finish
os.waitpid(pid, os.P_WAIT)

def thread_test():
"""Test threading"""
t = threading.Thread(target=do_child_stuff)
t.start()
# wait for child to finish
t.join()

def main():
print "Forking"
timeit.main(["-s", "from __main__ import fork_test", "fork_test()"])
print "Threading"
timeit.main(["-s", "from __main__ import thread_test", "thread_test()"])

if __name__ == "__main__":
main()
 

Steve Holden

Carl said:
Ah, alright. So if that's the case, why would you use python threads
versus spawning processes? If they both point to the same address space
and python threads can't run concurrently due to the GIL what are they
good for?

Well, of course they can interleave essentially independent
computations, which is what threads (formerly "lightweight processes")
were traditionally designed for.

Further, some thread-safe extension (compiled) libraries will release
the GIL during their work, allowing other threads to execute
simultaneously - and even in parallel on multi-processor hardware.

regards
Steve
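
A small sketch of the point Steve makes: even with the GIL, threads overlap
whenever they block in code that releases it. Here time.sleep() stands in for
file or socket I/O, or a C extension doing its work with the GIL released.

import threading
import time

def wait_a_second(name):
    time.sleep(1.0)             # the GIL is released while blocked
    print "%s finished" % name

start = time.time()
threads = [threading.Thread(target=wait_a_second, args=("worker-%d" % i,))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# five one-second waits complete in about one second, not five
print "elapsed: %.1f sec" % (time.time() - start)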
 

Gerhard Fiedler

Ah, alright. So if that's the case, why would you use python threads
versus spawning processes? If they both point to the same address space
and python threads can't run concurrently due to the GIL what are they
good for?

Nothing runs concurrently on a single-core processor (pipelining aside).
Processes don't run any more concurrently than threads. The scheduling is
different, but they still run sequentially.

Gerhard
 

Carl J. Van Arsdall

In many cases, you don't have a choice. If your Python program
is to run other programs, the others get their own processes.
There's no threads option on that.

If multiple lines of execution need to share Python objects,
then the standard Python distribution supports threads, while
processes would require some heroic extension. Don't confuse
sharing memory, which is now easy, with sharing Python
objects, which is hard.

Ah, alright, I think I understand: so threading works well for sharing
Python objects. Would a scenario for this be something like a job
queue (say Queue.Queue), for example? This is a situation in which each
process/thread needs access to the Queue to get the next task it must
work on. Does that sound right? Would the same apply to multiple
threads needing access to a dictionary? A list?

Now if you are just passing ints and strings around, use processes with
some type of IPC - does that sound right as well? Or does the term
"shared memory" mean something more low-level, like some bits that don't
necessarily mean anything to Python but might mean something to your
application?

Sorry if you guys think I'm beating this to death, just really trying to
get a firm grasp on what you are telling me and again, thanks for taking
the time to explain all of this to me!

-carl


--

Carl J. Van Arsdall
(e-mail address removed)
Build and Release
MontaVista Software
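
A small sketch of the job-queue scenario Carl asks about (Python 2 names,
matching the thread): several worker threads pulling tasks from one shared
Queue.Queue, which does its own locking, so the workers never need to touch
each other's data directly.

import Queue
import threading

tasks = Queue.Queue()

def worker():
    while True:
        item = tasks.get()
        if item is None:        # sentinel: no more work for this worker
            break
        print "processing", item

# a few workers sharing the same queue object
workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for job in range(10):
    tasks.put(job)
for _ in workers:
    tasks.put(None)             # one sentinel per worker
for w in workers:
    w.join()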
 

John Henry

Also, race conditions and deadlocks are equally bad in multiprocess
solutions as in multithreaded ones. Any time you're doing parallel
processing you need to consider them.

Only in the sense that you are far more likely to be dealing with
shared resources in a multi-threaded application. When I start a
sub-process, I know I am doing that to *avoid* resource sharing. So
the chance of a deadlock is less - only because I would do it far
less.

I'd actually submit that initially writing multiprocess programs
requires more design and forethought, since you need to determine
exactly what you want to share instead of just saying "what the heck,
everything's shared!" The payoff in terms of getting _correct_
behavior more easily, having much easier maintenance down the line, and
being more robust in the face of program failures (or unforeseen
environment issues) is usually well worth it, though there are
certainly some applications where threads are a better choice.

If you're sharing things, I would thread. I would not want to pay the
expense of a process.

It's too bad that programmers are not threading more often.
 
