Serious concurrency problems on fast systems

  • Thread starter Kevin McMurtrie

Kevin McMurtrie

Robert Klemme said:
It's the nature of locking issues. Up to a particular point it works
pretty well and then locking delays explode because of the positive
feedback.

If you have "a few hundred threads" accessing a single shared lock with
a frequency of 800Hz then you have a design issue - whether you call it
"hammering" or not. It's simply not scalable, and if it doesn't break
now it will likely break with the next increase in load.


Well, then stick with the old CPU. :) It's not uncommon that moving to
newer hardware with increased processing resources uncovers issues like
this.


It would certainly help the discussion if you pointed out which exact
classes and methods you are referring to. I would readily agree that
Sun did a few things wrong initially in the std lib (Vector) which they
partly fixed later. But I am not inclined to believe in a massive (i.e.
affecting many areas) concurrency problem in the std lib.

If they synchronize they do it for good reasons - and you simply need to
limit the number of threads that try to access a resource. A globally
synchronized, frequently accessed resource in a system with several
hundred threads is a design problem - but not necessarily in the
implementation of the resource used but rather in the usage.


Btw, as far as I can see you didn't yet disclose how you found out about
the point where the thread is suspended. I'm still curious to learn how
you found out. Might be a valuable addition to my toolbox.

Kind regards

robert

I have tools based on java.lang.management that will trace thread
contention. Thread dumps from QUIT signals show it too. The threads
aren't permanently stuck, they're just passing through 100000 times
slower than normal.
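
For readers who want to try this kind of tracing, here is a minimal sketch built on java.lang.management. It is not the poster's actual tool; the class name ContentionReport and the report format are assumptions made for illustration. It lists every live thread that has blocked on a monitor at least once, with its blocked count and, where the JVM supports contention monitoring, its accumulated blocked time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration; not the tool mentioned in the post.
final class ContentionReport {

    /** Returns one line per live thread that has blocked on a monitor at least once. */
    static String snapshot() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Blocked-time accounting is optional and disabled by default.
        if (mx.isThreadContentionMonitoringSupported()
                && !mx.isThreadContentionMonitoringEnabled()) {
            mx.setThreadContentionMonitoringEnabled(true);
        }
        StringBuilder out = new StringBuilder();
        // maxDepth 0: skip stack traces, we only want the counters.
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds(), 0)) {
            if (info == null || info.getBlockedCount() == 0) {
                continue; // thread has died, or has never contended for a monitor
            }
            out.append(info.getThreadName())
               .append(" blockedCount=").append(info.getBlockedCount())
               // -1 when contention monitoring is unsupported or disabled
               .append(" blockedMillis=").append(info.getBlockedTime())
               // null unless the thread is blocked or waiting right now
               .append(" lock=").append(info.getLockName())
               .append(" owner=").append(info.getLockOwnerName())
               .append('\n');
        }
        return out.toString();
    }
}
```

Calling ContentionReport.snapshot() periodically and comparing successive reports shows which threads accumulate blocked time, and on which lock, during a stall.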

The problem with staying on the old system is that Oracle bought
Sun and some unpleasant changes are coming. MacOS X is only suited for
development machines.

Problem areas:

java.util.Properties - Removed from in-house code but still everywhere
else for everything. Used a lot by Sun and 3rd party code. Only
performs poorly on Linux.

org.springframework.context.support.ReloadableResourceBundleMessageSource
- Single-threaded methods down in the bowels of Spring. Only performs
poorly on Linux.

Log4J - Always sucks and needs to be replaced. In the meantime,
removing logging calls except when critical.

Pools, caches, and resource managers - In-house code that is expected to
run 100 - 300 times per second. Has no dependencies during
synchronization. Has been carefully tuned to be capable of millions of
calls per second on 2, 4, and 8 core hardware. They only stall on
high-end Linux boxes.
 

Tom Anderson

The problem with staying on the old system is that Oracle bought
Sun and some unpleasant changes are coming. MacOS X is only suited for
development machines.

BSD?

tom
 

Tom Anderson

Well, then use an immutable Hash map as Lew suggested and store it via
AtomicReference.

Or even a volatile variable. If you're not doing CAS, there's no advantage
to using an AtomicReference over a volatile. Mind you, there shouldn't be
any disadvantage either.
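
A minimal sketch of the volatile variant, assuming the shared data starts life as a java.util.Properties object. The class name ConfigSnapshot and its two methods are hypothetical and do not come from the thread. Writers build a fresh copy and publish it with a single volatile write; readers never take a lock.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical class for illustration of the copy-and-publish idea.
final class ConfigSnapshot {

    // volatile guarantees safe publication of each new snapshot.
    private static volatile Map<String, String> current =
            Collections.<String, String>emptyMap();

    /** Copies the Properties once and publishes an unmodifiable snapshot. */
    static void reload(Properties source) {
        Map<String, String> copy = new HashMap<String, String>();
        for (String name : source.stringPropertyNames()) {
            copy.put(name, source.getProperty(name));
        }
        // Nothing keeps a reference to 'copy', so the snapshot is effectively immutable.
        current = Collections.unmodifiableMap(copy);
    }

    /** Lock-free read: one volatile load followed by a plain HashMap lookup. */
    static String get(String key) {
        return current.get(key);
    }
}
```

An AtomicReference would only be needed here if writers had to use compareAndSet, for example to merge concurrent updates instead of replacing the whole map.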

tom
 

Robert Klemme

I have tools based on java.lang.management that will trace thread
contention.

Which tools?

Thread dumps from QUIT signals show it too. The threads
aren't permanently stuck, they're just passing through 100000 times
slower than normal.

I am not sure I understand how you found out with these tools that
threads are suspended "for time-slicing in very unfortunate locations".

The problem with staying on the old system is that Oracle bought
Sun and some unpleasant changes are coming. MacOS X is only suited for
development machines.

Which changes do you expect?

Problem areas:

java.util.Properties - Removed from in-house code but still everywhere
else for everything. Used a lot by Sun and 3rd party code. Only
performs poorly on Linux.

Even if not shared across threads?

org.springframework.context.support.ReloadableResourceBundleMessageSource
- Single-threaded methods down in the bowels of Spring. Only performs
poorly on Linux.

Log4J - Always sucks and needs to be replaced. In the meantime,
removing logging calls except when critical.

Hm, so far we haven't had issues with Log4J unless it is used for excessive
logging (i.e. running production in DEBUG, which is not really the intended
use). As long as you log to a single sink, any concurrently used
logging solution will have good potential for contention. :)

Pools, caches, and resource managers - In-house code that is expected to
run 100 - 300 times per second. Has no dependencies during
synchronization. Has been carefully tuned to be capable of millions of
calls per second on 2, 4, and 8 core hardware. They only stall on
high-end Linux boxes.

Since your high end box has more cores (does it?) and is generally
faster it will sooner exhibit bottlenecks via the cascading effect Lew
described earlier. Although I would readily concede that JVMs and Java
standard libraries do have bugs I am generally more inclined to believe
in a design level solution. For example: if you have a global
connection pool and all threads share it, increasing the number of
threads will at some point lead to contention. In that case you might
have to group threads with a fixed max group size and have a pool per
group. We did a similar thing with ThreadPoolExecutor where we created
several ThreadPoolExecutors and at enqueue time we use round robin to
schedule instances. This limits the number of threads competing for a
single queue's locks. Scheduling is done via
AtomicInteger.incrementAndGet().
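
A minimal sketch of that round-robin arrangement. The class name StripedExecutor, the fixed pool sizes, and the unbounded LinkedBlockingQueue are assumptions for illustration; the post does not say how the real pools were configured.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical class for illustration; pool sizing and queue type are assumptions.
final class StripedExecutor {

    private final ThreadPoolExecutor[] pools;
    private final AtomicInteger next = new AtomicInteger();

    StripedExecutor(int poolCount, int threadsPerPool) {
        pools = new ThreadPoolExecutor[poolCount];
        for (int i = 0; i < poolCount; i++) {
            // Each pool owns its queue, so producers contend on poolCount locks instead of one.
            pools[i] = new ThreadPoolExecutor(threadsPerPool, threadsPerPool,
                    0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
        }
    }

    /** Hands the task to the next pool in round-robin order and returns that pool's index. */
    int execute(Runnable task) {
        // Clear the sign bit so the index stays valid after the counter overflows.
        int index = (next.incrementAndGet() & Integer.MAX_VALUE) % pools.length;
        pools[index].execute(task);
        return index;
    }

    /** Stops accepting new tasks; already queued tasks still run. */
    void shutdown() {
        for (ThreadPoolExecutor pool : pools) {
            pool.shutdown();
        }
    }
}
```

The only state shared by all submitting threads is the AtomicInteger, which is updated without a lock; queue lock contention is limited to the threads that happen to land on the same pool.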

Kind regards

robert
 

Arne Vajhøj

It happened today again during testing of a different server class on
the same OS and hardware. This time it was under a microscope. There
were 10 gigabytes of idle RAM, no DB contention, no tenured GC, no disk
contention, and the total CPU was around 25%. There was no gridlock
effect - it always involved one synchronized method that did not depend
on other resources to complete. Throughput dropped to ~250 calls per
second at a specific method for several seconds then it recovered. Then
it happened again elsewhere, then recovered. After several minutes the
server was at top speed again. We then pushed traffic until its 1Gbps
Ethernet link saturated and there wasn't a trace of thread contention
ever returning.

That periodic behavior points to something related to GC.

You could experiment with various -XX options affecting GC
to see whether they change the behavior. If they do, that
goes some way toward confirming that it is related to GC.
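
Besides changing flags (or turning on -verbose:gc), GC activity can be sampled from inside the application and lined up against the timestamps of the stalls. The following is only a sketch; the class name GcSampler and the output format are assumptions for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration.
final class GcSampler {

    /** Total collections across all collectors; a collector reporting -1 (undefined) adds 0. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0L, gc.getCollectionCount());
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0L, gc.getCollectionTime());
        }
        return total;
    }

    /** One log line with a wall-clock timestamp and the running totals. */
    static String sampleLine() {
        return System.currentTimeMillis()
                + " collections=" + totalCollections()
                + " gcMillis=" + totalCollectionMillis();
    }
}
```

Logging GcSampler.sampleLine() once a second from a background thread gives a series to compare against the moments when throughput drops. If the totals do not move during a stall, GC is an unlikely cause.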

Another interesting thing would be to try another
JVM (from Sun, IBM, or BEA/Oracle).

Arne
 

Arne Vajhøj

The problem with staying on the old system is that Oracle bought
Sun and some unpleasant changes are coming. MacOS X is only suited for
development machines.

AFAIK Oracle has not announced any unpleasant things,
and unless you happen to be Larry's neighbor and get some
inside tips over the fence, it sounds like rumors.

Log4J - Always sucks and needs to be replaced. In the meantime,
removing logging calls except when critical.

Many people use log4j in high volume apps.

Arne
 

Arne Vajhøj

What if they use system properties promiscuously? Hypothetically:

1. My application receives XML messages.
2. I use a third-party library to deserialize the XML into Java objects.
3. The third-party library uses JAXP to find an XML parser.
4. JAXP always checks for a system property that points to the parser's
class name.

Even if the details are off (I don't know whether current versions of
JAXP cache the class name), you get the idea.

Given the relative time needed to:
- parse an XML document
- access a system property
I think you would need a lot of cores to hit a problem
in this scenario.

Arne
 

Arne Vajhøj

HotSpot has some (benchmark-driven?) optimizations for this case. It's
hard to not hit them when using simple tests on String and
ConcurrentHashMap.

There is still something wrong.

The numbers indicate that the entire get may have been optimized
away by the JIT compiler.

Properties is a biggie. A brute-force replacement of Properties caused
the system throughput to collapse to almost nothing in Spring's
ResourceBundleMessageSource. There's definitely a JVM/OS problem. The
next test is to disable hyperthreading.

Based on everything posted here, it sounds like an app problem.

Arne
 
