Serious concurrency problems on fast systems

Kevin McMurtrie

I've been assisting in load testing some new high performance servers
running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
suspending threads for time-slicing in very unfortunate locations. For
example, a thread might suspend in Hashtable.get(Object) after a call to
getProperty(String) on the system properties. It's a synchronized
global so a few hundred threads might pile up until the lock holder
resumes. Odds are that those hundreds of threads won't finish before
another one stops to time slice again. The performance hit has a ton of
hysteresis so the server doesn't recover until it has a lower load than
before the backlog started.

The brute force fix is of course to eliminate calls to shared
synchronized objects. All of the easy stuff has been done. Some
operations aren't well suited to simple CAS. Bottlenecks that are part
of well established Java APIs are time consuming to fix/avoid.
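
As an illustration of what "simple CAS" means here - a sketch only, not
code from the actual servers - a trivial shared counter can drop its
lock entirely with java.util.concurrent.atomic:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example: a lock-free counter via compare-and-set.
// This is the easy case; operations spanning several fields or
// invariants don't reduce to a single CAS like this.
class HitCounter {
    private final AtomicLong count = new AtomicLong();

    long increment() {
        long current;
        do {
            current = count.get();
        } while (!count.compareAndSet(current, current + 1));
        return current + 1;
    }
}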

Is there JVM or Linux tuning that will change the behavior of thread
time slicing or preemption? I checked the JDK 6 options page but didn't
find anything that appears to be applicable.
 
Lew

Kevin said:
I've been assisting in load testing some new high performance servers
running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
suspending threads for time-slicing in very unfortunate locations. For
example, a thread might suspend in Hashtable.get(Object) after a call to
getProperty(String) on the system properties. It's a synchronized
global so a few hundred threads might pile up until the lock holder
resumes. Odds are that those hundreds of threads won't finish before
another one stops to time slice again. The performance hit has a ton of
hysteresis so the server doesn't recover until it has a lower load than
before the backlog started.

The brute force fix is of course to eliminate calls to shared
synchronized objects. All of the easy stuff has been done.

You call that "brute force" as if it weren't the actual, correct answer.
Some operations aren't well suited to simple CAS. Bottlenecks that are part
of well established Java APIs are time consuming to fix/avoid.

But necessary. Having repeated calls on system properties that require
synchronization is just plain stupid. System properties are the ones that
don't change during a program run, so they should be (and should have been)
written once into an immutable structure at class-load time and read thence
thereafter. End of synchronization woes for that one.
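
Something like this minimal sketch (class and method names invented for
illustration; it assumes nothing writes System properties after startup):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical: snapshot the System properties once at class-load time
// into an immutable map, then read it with no locking at all.
final class StaticProps {
    private static final Map<String, String> PROPS;
    static {
        Map<String, String> m = new HashMap<String, String>();
        for (String name : System.getProperties().stringPropertyNames()) {
            m.put(name, System.getProperty(name));
        }
        PROPS = Collections.unmodifiableMap(m);
    }

    private StaticProps() {}

    static String get(String name) {
        return PROPS.get(name); // no monitor on the hot path
    }
}
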
Is there JVM or Linux tuning that will change the behavior of thread
time slicing or preemption? I checked the JDK 6 options page but didn't
find anything that appears to be applicable.

The jmap/jhat dump utilities have some of that, IIRC. Otherwise you break
into the add-on diagnostic tools.

But really it sounds like your code needs refactoring in that it did not
handle concurrency correctly from jump.
 
Tom Anderson

I've been assisting in load testing some new high performance servers
running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
suspending threads for time-slicing in very unfortunate locations. For
example, a thread might suspend in Hashtable.get(Object) after a call to
getProperty(String) on the system properties. It's a synchronized
global so a few hundred threads might pile up until the lock holder
resumes. Odds are that those hundreds of threads won't finish before
another one stops to time slice again. The performance hit has a ton of
hysteresis so the server doesn't recover until it has a lower load than
before the backlog started.

The brute force fix is of course to eliminate calls to shared
synchronized objects.

Well, yeah.
All of the easy stuff has been done. Some operations aren't well suited
to simple CAS. Bottlenecks that are part of well established Java APIs
are time consuming to fix/avoid.

Sadly true.
Is there JVM or Linux tuning that will change the behavior of thread
time slicing or preemption? I checked the JDK 6 options page but didn't
find anything that appears to be applicable.

I'm not aware of anything. Somewhere out there on the web is a complete
list of the -XX options - see if you can find that and have a read through
it.

tom
 
Eric Sosman

I've been assisting in load testing some new high performance servers
running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
suspending threads for time-slicing in very unfortunate locations. For
example, a thread might suspend in Hashtable.get(Object) after a call to
getProperty(String) on the system properties. It's a synchronized
global so a few hundred threads might pile up until the lock holder
resumes. Odds are that those hundreds of threads won't finish before
another one stops to time slice again. The performance hit has a ton of
hysteresis so the server doesn't recover until it has a lower load than
before the backlog started.

You've got "a few hundred threads" all calling System.getProperty
over and over again so rapidly that they get in each other's way? Why?
Seems a very odd situation, but ...

Okay: As a first cut, write your own PropertySurrogate class.
Have it slosh all the System properties into a HashMap instead of
System's Hashtable, and change the System.getProperty calls to
PropertySurrogate.getProperty instead. Since HashMap isn't
synchronized, threads calling PropertySurrogate.getProperty won't
block each other.

If this over-and-over reading and re-reading of the System
properties is because you suspect somebody's changing them, you
could have one thread that just keeps iterating over the System
properties, watching for changes. When it sees a change, it would
update the PropertySurrogate HashMap accordingly. This variant
needs synchronization, but you could use a ReadWriteLock to allow
multiple read-only threads to access the map simultaneously; they'd
be locked out (briefly) only when the monitoring thread was in the
middle of an update.
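
A rough sketch of that variant (all names hypothetical and the refresh
interval arbitrary; a volatile reference would also do here, since the
map is replaced wholesale, but the ReadWriteLock shows the pattern):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical PropertySurrogate: many readers share the read lock;
// one daemon thread periodically rebuilds the map under the write lock.
final class PropertySurrogate {
    private static final ReadWriteLock LOCK = new ReentrantReadWriteLock();
    private static Map<String, String> props = snapshot();

    static {
        Thread monitor = new Thread(new Runnable() {
            public void run() {
                for (;;) {
                    try { Thread.sleep(1000); }
                    catch (InterruptedException e) { return; }
                    Map<String, String> fresh = snapshot();
                    LOCK.writeLock().lock();
                    try { props = fresh; }
                    finally { LOCK.writeLock().unlock(); }
                }
            }
        }, "property-monitor");
        monitor.setDaemon(true);
        monitor.start();
    }

    static String getProperty(String name) {
        LOCK.readLock().lock();
        try { return props.get(name); }
        finally { LOCK.readLock().unlock(); }
    }

    private static Map<String, String> snapshot() {
        Map<String, String> m = new HashMap<String, String>();
        for (String n : System.getProperties().stringPropertyNames()) {
            m.put(n, System.getProperty(n));
        }
        return m;
    }
}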

Still, it seems to me that what you should be looking at is
the reason for all these repetitive getProperty calls. It sounds
like reading the clock every microsecond to see if it's 2014 yet.
 
Robert Klemme

Just out of curiosity: How did you find out?
You call that "brute force" as if it weren't the actual, correct answer.


But necessary. Having repeated calls on system properties that require
synchronization is just plain stupid. System properties are the ones
that don't change during a program run, so they should be (and should
have been) written once into an immutable structure at class-load time
and read thence thereafter. End of synchronization woes for that one.


The jmap/jhat dump utilities have some of that, IIRC. Otherwise you
break into the add-on diagnostic tools.

But really it sounds like your code needs refactoring in that it did not
handle concurrency correctly from jump.

I couldn't agree more with what Lew wrote. If all your threads hammer
on a single global resource you've got a design-level issue: the
application is simply not built with scalability in mind - even if the
effect did not show up with other hardware or a different load. This
is nothing you can blame the JVM or the hardware for.

Kind regards

robert
 
Arne Vajhøj

I've been assisting in load testing some new high performance servers
running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
suspending threads for time-slicing in very unfortunate locations.

That should not come as a surprise.

The thread scheduler does not examine the code for convenience.

Correct code must work no matter when threads are switched in and
out of the CPU.

High performance code must work efficiently no matter when threads
are switched in and out of the CPU.
For
example, a thread might suspend in Hashtable.get(Object) after a call to
getProperty(String) on the system properties. It's a synchronized
global so a few hundred threads might pile up until the lock holder
resumes. Odds are that those hundreds of threads won't finish before
another one stops to time slice again. The performance hit has a ton of
hysteresis so the server doesn't recover until it has a lower load than
before the backlog started.

The brute force fix is of course to eliminate calls to shared
synchronized objects. All of the easy stuff has been done. Some
operations aren't well suited to simple CAS. Bottlenecks that are part
of well established Java APIs are time consuming to fix/avoid.

High performance code needs to be designed not to synchronize
extensively.

If the code does and there is a performance problem, then fix
the code.

There are no miracles.
Is there JVM or Linux tuning that will change the behavior of thread
time slicing or preemption? I checked the JDK 6 options page but didn't
find anything that appears to be applicable.

-XXdonotkickthreadoffcpuifitisabadtimeforperformance does not exist.

Arne
 
Lew

Arne said:
High performance code needs to be designed not to synchronize
extensively.

If the code does and there is a performance problem, then fix
the code.

This comes up again and again and again in our profession. People almost
never want to hear it. They will go broke spending on "quick-fix" solutions
that aren't quick, fix nothing and solve nothing.

The answer is always "fix the code". This always causes controversy. It
shouldn't.

Fixing the code is cheaper and works. Not fixing the code is more expensive
and doesn't work.
There are no miracles.

This is both true and false.

It's true because the kind of miracle people are hoping for is one that lets
them not admit there is a problem.

Doomed.

It's false because there is a miracle. Doing it right turns out to be
cheaper, easier, quicker and more satisfying, starting very early in the
process, and gets results. It seems miraculous that doing things the right
way produces correct results astonishingly soon.
 
Arne Vajhøj

This comes up again and again and again in our profession. People almost
never want to hear it. They will go broke spending on "quick-fix"
solutions that aren't quick, fix nothing and solve nothing.

The answer is always "fix the code". This always causes controversy. It
shouldn't.

Fixing the code is cheaper and works. Not fixing the code is more
expensive and doesn't work.


This is both true and false.

It's true because the kind of miracle people are hoping for is one that
lets them not admit there is a problem.

Doomed.

It's false because there is a miracle. Doing it right turns out to be
cheaper, easier, quicker and more satisfying, starting very early in the
process, and gets results. It seems miraculous that doing things the
right way produces correct results astonishingly soon.

Even though doing it right is likely to be less costly in the
long run, it is still hard work initially, so I will not
call it a miracle.

Arne
 
Mike Schilling

Arne Vajhøj said:
That should not come as a surprise.

The thread scheduler does not examine the code for convenience.

Correct code must work no matter when threads are switched in and
out of the CPU.

High performance code must work efficiently no matter when threads
are switched in and out of the CPU.


High performance code needs to be designed not to synchronize
extensively.

If the code does and there is a performance problem, then fix
the code.

There are no miracles.

Though giving a thread higher priority while it holds a shared lock isn't
exactly rocket science; VMS did it back in the early 80s. JVMs could do a
really nice job of this, noticing which monitors cause contention and how
long they tend to be held. A shame they don't.
 
Robert Klemme

Though giving a thread higher priority while it holds a shared lock
isn't exactly rocket science; VMS did it back in the early 80s. JVMs
could do a really nice job of this, noticing which monitors cause
contention and how long they tend to be held. A shame they don't.

I can imagine that changing a thread's priority frequently would cause
severe overhead because the OS scheduler has to adjust all the time.
Thread and process priorities are usually set once to indicate overall
processing priority - not to speed up certain operations. Also,
changing the priority does not guarantee anything - there could be other
threads with higher priority around.

I don't think it's a viable approach - especially if applied to fix
broken code (or even design).

Kind regards

robert
 
Kevin McMurtrie

Kevin McMurtrie said:
I've been assisting in load testing some new high performance servers
running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
suspending threads for time-slicing in very unfortunate locations. For
example, a thread might suspend in Hashtable.get(Object) after a call to
getProperty(String) on the system properties. It's a synchronized
global so a few hundred threads might pile up until the lock holder
resumes. Odds are that those hundreds of threads won't finish before
another one stops to time slice again. The performance hit has a ton of
hysteresis so the server doesn't recover until it has a lower load than
before the backlog started.

The brute force fix is of course to eliminate calls to shared
synchronized objects. All of the easy stuff has been done. Some
operations aren't well suited to simple CAS. Bottlenecks that are part
of well established Java APIs are time consuming to fix/avoid.

Is there JVM or Linux tuning that will change the behavior of thread
time slicing or preemption? I checked the JDK 6 options page but didn't
find anything that appears to be applicable.

To clarify a bit, this isn't hammering a shared resource. I'm talking
about 100 to 800 synchronizations on a shared object per second for a
duration of 10 to 1000 nanoseconds. Yes, nanoseconds. That shouldn't
cause a complete collapse of concurrency.

My older 4-core Mac Xeon can have 64 threads call getProperty(String)
on a shared Properties instance 2 million times each in only 21 real
seconds. That's one call every 164 ns. It's not as good as
ConcurrentHashMap (one per 0.30 ns) but it's no collapse.
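
A sketch of the kind of harness that would produce numbers like those
(names and details are illustrative, not the actual test code):

import java.util.Properties;

// Illustrative contention test: 64 threads doing 2 million lookups
// each against one shared Properties instance, whose Hashtable
// superclass synchronizes every get.
public class PropertyContention {
    public static void main(String[] args) throws InterruptedException {
        final Properties props = new Properties();
        props.setProperty("some.key", "some value");
        final int nThreads = 64;
        final int iterations = 2000000;

        Thread[] workers = new Thread[nThreads];
        long start = System.nanoTime();
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < iterations; j++) {
                        props.getProperty("some.key"); // contended monitor
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("ns per call: "
            + elapsed / ((long) nThreads * iterations));
    }
}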

Many of the basic Sun Java classes are synchronized. Eliminating all
shared synchronized objects without making a mess of 3rd party library
integration is no easy task.

Next up is looking at the Linux scheduler version and the HotSpot
spinlock timeout. Maybe the two don't mesh and a thread is very likely
to enter a semaphore right as its quantum runs out.
 
Mike Schilling

Robert Klemme said:
I can imagine that changing a thread's priority frequently is causing
severe overhead because the OS scheduler has to adjust all the time.
Thread and process priorities are usually set once to indicate overall
processing priority - not to speed up certain operations.

Not at all. In time-sharing systems, it's a common scheduling algorithm to
adjust the effective priority of a process dynamically, e.g. processes that
require user input get a boost above compute-bound ones, to help keep
response times low. As I said, I'm not inventing this: it was state of the
art about 30 years ago.
 
Robert Klemme

Not at all. In time-sharing systems, it's a common scheduling algorithm
to adjust the effective priority of a process dynamically, e.g.
processes that require user input get a boost above compute-bound ones,
to help keep response times low. As I said, I'm not inventing this: it
was state of the art about 30 years ago.

That's true, but in those cases it's the OS that does it - not the JVM.
From the OS point of view the JVM is just another process, and I doubt
there is an interface for adjusting the automatic priority (which in a
way would defeat "automatic"). The base priority, on the other hand,
indicates the general priority of a thread or process, and I still
don't think it's a good idea to change it all the time.

So either the OS honors thread state (mutex, IO etc.) and adjusts the
priority accordingly, or it doesn't. But I don't think it's the job of the JVM.

Cheers

robert
 
Robert Klemme

To clarify a bit, this isn't hammering a shared resource. I'm talking
about 100 to 800 synchronizations on a shared object per second for a
duration of 10 to 1000 nanoseconds. Yes, nanoseconds. That shouldn't
cause a complete collapse of concurrency.

It's the nature of locking issues. Up to a particular point it works
pretty well and then locking delays explode because of the positive
feedback.

If you have "a few hundred threads" accessing a single shared lock with
a frequency of 800Hz then you have a design issue - whether you call it
"hammering" or not. It's simply not scalable and if it doesn't break
now it likely breaks with the next step of load increasing.
My older 4-core Mac Xeon can have 64 threads call getProperty(String)
on a shared Properties instance 2 million times each in only 21 real
seconds. That's one call every 164 ns. It's not as good as
ConcurrentHashMap (one per 0.30 ns) but it's no collapse.

Well, then stick with the old CPU. :) It's not uncommon that moving to
newer hardware with increased processing resources uncovers issues like
this.
Many of the basic Sun Java classes are synchronized. Eliminating all
shared synchronized objects without making a mess of 3rd party library
integration is no easy task.

It would certainly help the discussion if you pointed out which exact
classes and methods you are referring to. I would readily agree that
Sun did a few things wrong initially in the std lib (Vector) which they
partly fixed later. But I am not inclined to believe in a massive (i.e.
affecting many areas) concurrency problem in the std lib.

If they synchronize, they do it for good reasons - and you simply need
to limit the number of threads that try to access the resource; one way
is sketched below. A globally synchronized, frequently accessed
resource in a system with several hundred threads is a design problem -
not necessarily in the implementation of the resource itself but rather
in its usage.
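
A minimal sketch of that limiting idea (the resource, permit count and
names are placeholders, not a recommendation for this particular server):

import java.util.concurrent.Semaphore;

// Hypothetical gate: cap how many threads may hit the contended
// resource at once, so hundreds cannot pile up on its monitor.
class GatedLookup {
    private final Semaphore gate = new Semaphore(8); // placeholder cap

    String lookup(String key) throws InterruptedException {
        gate.acquire();
        try {
            return System.getProperty(key); // the contended operation
        } finally {
            gate.release();
        }
    }
}
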
Next up is looking at the Linux scheduler version and the HotSpot
spinlock timeout. Maybe the two don't mesh and a thread is very likely
to enter a semaphore right as its quantum runs out.

By the way, as far as I can see you haven't yet disclosed how you found
out where the threads are being suspended. I'm still curious to learn
how you did it. It might be a valuable addition to my toolbox.

Kind regards

robert
 
Mike Amling

Arne said:
-XXdonotkickthreadoffcpuifitisabadtimeforperformance does not exist.

No, but someday there could be an option to let a Thread that is
synchronized on a monitor for which another Thread is waiting run a
little longer, in the hope that it will release the monitor.

--Mike Amling
 
Eric Sosman

If the kernel developers wanted to do something to solve priority
inversion, why not implement priority inheritance, a long studied
technique?

The "I've just acquired a lock; please don't slice me yet" hint
is independent of priority inheritance. Both are useful, either
alone or in combination. Solaris does both (or can, when asked
politely). I don't happen to know whether the JVM on Solaris makes
the appropriate prayers and sacrifices to get this to happen, but
it certainly could. Other OSes may offer similar capabilities,
and their JVMs may (or may not) use them.

I think we're in agreement, though, that it's not the sort of
thing that the Java coder should try to control directly by juggling
priorities or something. Scheduling is a system-wide allocation of
resources; the Java programmer's world view extends only to his own
application -- and maybe not even that far, if he's writing classes
that will be used in many applications. The Java programmer has
insufficient information to make informed decisions about system-wide
(and shifting) resource demands.
 
ClassCastException

The "I've just acquired a lock; please don't slice me yet" hint
is independent of priority inheritance.

Could this be a use for Thread.yield()? Yield just before acquiring a
lock, maybe less likely to get a context switch during the critical
section since the C.S. will be entered right at the start of your next
slice?
 
Eric Sosman

Could this be a use for Thread.yield()? Yield just before acquiring a
lock, maybe less likely to get a context switch during the critical
section since the C.S. will be entered right at the start of your next
slice?

You seem to be making a lot of assumptions about how the
scheduler works. Even the existence of "time slices" is not a
given (real-time schedulers, for example, quite often don't use
them). If the scheduler does in fact use time slices, how do
you know that a "no-op" yield() gets you a fresh slice instead
of just a continuation of the old one? And so on, and so on ...
 
ClassCastException

You seem to be making a lot of assumptions about how the
scheduler works.

No; the only assumption being made is that the shorter the interval since
a thread has resumed, the lower the probability of its being preempted in
the next x nanoseconds for some fairly small value of x. I don't think
that that's an unreasonable assumption; context switches are expensive
enough that a reasonably-designed OS scheduler is probably going to avoid
frequently putting two of them (on the same core) too close together in
time.
 
Tom Anderson

If the kernel developers wanted to do something to solve priority
inversion, why not implement priority inheritance, a long studied
technique?

I don't understand the details, but there are people who are not keen on
it. I think because (a) it can be slow and (b) it can let developers get
away with writing bad code, although the latter sounds like a bullshit
reason to me. Anyway, Linux has priority-inheriting mutexes in the
real-time edition, and as the infrastructure underlying futexes in recent
normal versions. If Java uses futexes on Linux, then it already has
priority inheritance.

However, priority inheritance (unless I'm very much mistaken) makes no
difference at all if the
threads involved are all of the same priority. If a priority-N thread
blocks on a lock held by a priority-N thread, there is no boost to apply,
and nothing in priority inheritance that will get the first thread off the
lock any quicker.

Something that might help is directed yields: if a thread blocks on a lock
with some time left in its timeslice, it donates it to the thread holding
the lock (which then uses all of it, or runs until it releases the lock,
or something). I've come across the idea in IPC mechanisms: process A runs
an IPC primitive which sends a message to process B and waits for a reply;
since A is blocked, it might as well give its time to B in the hope that B
will reply sooner. It's a fairly straightforward extension to apply that
to locking.

tom
 
