Glenn Linderman
Actually, the GIL doesn't make Python faster; it is a design decision
that reduces the overhead of lock acquisition, while still allowing use
of global variables.
Using finer-grained locks has a higher run-time cost; eliminating the use
of global variables has a higher programmer-time cost, but would
actually run faster and more concurrently than using a GIL, especially
on a multi-core/multi-CPU machine.
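
To make that concrete, here's a rough C++ sketch (all names invented, and it's
only an illustration of the granularity tradeoff, not of CPython's internals):
one big lock covering all the shared state versus one lock per piece of it.
The fine-grained version pays for more lock acquisitions on every operation,
but threads touching different pieces no longer serialize behind a single lock.

```c++
// Granularity tradeoff: one "GIL-style" lock guarding everything, versus one
// lock per structure.  Coarse, Fine, and the counters are invented for the sketch.
#include <array>
#include <mutex>
#include <thread>

struct Coarse {                              // one lock guards all shared state
    std::mutex gil;
    std::array<long, 4> counters{};
    void touch_all() {
        std::lock_guard<std::mutex> g(gil);  // one acquisition per call
        for (auto &c : counters) ++c;
    }
};

struct Fine {                                // one lock per counter
    struct Slot { std::mutex m; long value = 0; };
    std::array<Slot, 4> slots;
    void touch_all() {
        for (auto &s : slots) {                   // four acquisitions per call:
            std::lock_guard<std::mutex> g(s.m);   // higher run-time cost, but
            ++s.value;                            // threads working on different
        }                                         // slots no longer serialize
    }
};

int main() {
    Coarse c; Fine f;
    auto work = [&] { for (int i = 0; i < 100000; ++i) { c.touch_all(); f.touch_all(); } };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
}
```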
> Another peeve I have is his characterization of the observer pattern.
> The generalized form of the problem exists in both single-threaded
> sequential programs, in the form of unexpected reentrancy, and message
> passing, with infinite CPU usage or infinite number of pending
> messages.
So how do you get reentrancy in a single-threaded sequential program? I
think only via recursion, which isn't a serious issue for the observer
pattern. If you add interrupts, then your program is no longer sequential.
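
Here's roughly what I mean by the recursive case (a made-up, single-threaded
example): an observer's callback pokes the subject again, so the notification
loop re-enters itself while it is still running.

```c++
// Single-threaded reentrancy in the observer pattern: an observer's callback
// modifies the subject again, so set()/notify re-enters itself recursively.
// All names here are invented for the sketch.
#include <functional>
#include <iostream>
#include <vector>

struct Subject {
    std::vector<std::function<void(Subject&, int)>> observers;
    int value = 0;
    void set(int v) {
        value = v;
        for (auto &obs : observers)   // if an observer calls set() again,
            obs(*this, v);            // we re-enter this loop recursively
    }
};

int main() {
    Subject s;
    s.observers.push_back([](Subject &subj, int v) {
        std::cout << "saw " << v << "\n";
        if (v < 3) subj.set(v + 1);   // reentrant notification, bounded here;
    });                               // unbounded, it would recurse forever
    s.set(1);                         // prints 1, 2, 3
}
```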
> Try looking at it on another level: when your CPU wants to read from a
> bit of memory controlled by another CPU it sends them a message
> requesting they get it for us. They send back a message containing
> that memory. They also note we have it, in case they want to modify
> it later. We also note where we got it, in case we want to modify it
> (and not wait for them to do modifications for us).
I understand that level... one of my degrees is in EE, and I started
college wanting to design computers (at about the time the first
microprocessor chip came along, and they, of course, have now taken
over). But I was side-lined by the malleability of software, and have
mostly practiced software during my career.
Anyway, that is the level that Herb Sutter was describing in the Dr.
Dobb's articles I mentioned. And the overhead of doing that at the level
of a cache line is high if there is a lot of contention for particular
memory locations between threads running on different cores/CPUs. So to
achieve concurrency, you must not only limit explicit software locks,
but must also avoid memory layouts where data needed by different
cores/CPUs lands in the same cache line.
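
To illustrate the cache-line point (again a made-up sketch, assuming 64-byte
lines): two counters owned by two different threads can land in the same line
and fight over it even though no lock is involved; padding each counter to its
own line removes the contention.

```c++
// False sharing sketch: two threads increment two *different* counters.
// Packed next to each other they share a cache line and ping-pong it between
// cores; aligned to 64 bytes each, they do not.
#include <atomic>
#include <thread>

struct Packed {                        // likely the same cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct Padded {                        // one (assumed 64-byte) line each
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename T>
void hammer(T &t) {
    std::thread ta([&] { for (long i = 0; i < 10'000'000; ++i) t.a.fetch_add(1); });
    std::thread tb([&] { for (long i = 0; i < 10'000'000; ++i) t.b.fetch_add(1); });
    ta.join(); tb.join();
}

int main() {
    Packed p; Padded q;
    hammer(p);   // measurably slower on typical multi-core hardware
    hammer(q);   // same work, no cache-line contention
}
```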
> Message passing vs shared memory isn't really a yes/no question. It's
> about ratios, usage patterns, and tradeoffs. *All* programs will
> share data, but in what way? If it's just the code itself you can
> move the cache validation into software and simplify the CPU, making
> it faster. If the shared data is a lot more than that, and you use it
> to coordinate accesses, then it'll be faster to have it in hardware.
I agree there are tradeoffs... unfortunately, the hardware architectures
vary, and the languages don't generally understand the hardware. So then
it becomes an OS API, which adds the overhead of an OS API call to the
cost of the synchronization... It could instead be (and in clever
applications is) a non-portable assembly-level function that wraps an OS
locking or waiting API.
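
Something like the following is what I have in mind (a rough, hypothetical
sketch, using a bounded spin and a yield to stand in for the real OS wait
call): the uncontended path is a single user-level atomic operation, and the
OS is only involved when there is actual contention.

```c++
// Hybrid lock sketch: an atomic test-and-set fast path in user space, with a
// fall-back to the OS only when contended.  Hypothetical code, not a
// production lock and not any particular library's API.
#include <atomic>
#include <thread>

class HybridLock {
    std::atomic<bool> taken{false};
public:
    void lock() {
        for (int spins = 0; ; ++spins) {
            // Uncontended case: one atomic exchange, no OS involvement.
            if (!taken.exchange(true, std::memory_order_acquire))
                return;
            // Contended: spin a little, then let the OS schedule someone else.
            if (spins > 100)
                std::this_thread::yield();   // real code would futex-wait here
        }
    }
    void unlock() {
        taken.store(false, std::memory_order_release);
    }
};

int main() {
    HybridLock lk;
    long shared = 0;
    auto work = [&] { for (int i = 0; i < 100000; ++i) { lk.lock(); ++shared; lk.unlock(); } };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    return shared == 200000 ? 0 : 1;   // the increments were serialized
}
```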
Nonetheless, while putting the shared data accesses in hardware might be
more efficient per unit operation, there are still tradeoffs: A software
solution can group multiple accesses under a single lock acquisition;
the hardware probably doesn't have enough smarts to do that. So it may
well require many more hardware unit operations for the same overall
concurrently executed function, and the resulting performance may not be
any better.
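
For example (made-up names again): three related updates under one mutex
acquisition, versus the "same" work done as three separate hardware atomic
operations, which also loses the guarantee that the fields are seen as a
consistent group.

```c++
// One lock acquisition covering several related updates, versus three
// separate hardware atomic operations.  Names are invented for the sketch.
#include <atomic>
#include <mutex>

struct StatsLocked {
    std::mutex m;
    long count = 0, sum = 0, max = 0;
    void record(long v) {
        std::lock_guard<std::mutex> g(m);  // one acquisition...
        ++count;                           // ...covering three updates that
        sum += v;                          // stay mutually consistent
        if (v > max) max = v;
    }
};

struct StatsAtomic {
    std::atomic<long> count{0}, sum{0}, max{0};
    void record(long v) {
        count.fetch_add(1);                // three separate hardware RMWs;
        sum.fetch_add(v);                  // each is cheap, but readers can
        long m = max.load();               // observe the fields out of step
        while (v > m && !max.compare_exchange_weak(m, v)) {}
    }
};

int main() {
    StatsLocked a; StatsAtomic b;
    a.record(5); b.record(5);
}
```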
Sidestepping the whole issue, by minimizing shared data in the
application design and thereby avoiding not only software lock calls but
also hardware cache contention, is going to provide the best performance... it isn't
the things you do efficiently that make software fast — it is the things
you don't do at all.
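
In that spirit, one last made-up sketch: give each thread its own accumulator
and merge once at the end, so the hot loop shares nothing, takes no lock, and
contends on no cache line.

```c++
// "Don't do it at all": each thread sums into its own local, and the only
// shared access is the merge at the end.  Invented example.
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const int nthreads = 4;
    const long per_thread = 1'000'000;
    std::vector<long> partial(nthreads, 0);   // one slot per thread, written
                                              // only once, by its owner
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&partial, t, per_thread] {
            long local = 0;                   // hot loop: no sharing, no locks
            for (long i = 0; i < per_thread; ++i) local += 1;
            partial[t] = local;               // the single shared write
        });
    for (auto &w : workers) w.join();
    long total = std::accumulate(partial.begin(), partial.end(), 0L);
    return total == nthreads * per_thread ? 0 : 1;
}
```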