NIO multiplexing + thread pooling


Giovanni Azua

Hi all,

I need to build a Client-Middleware-Database architecture in the context of
a course in my grad studies. In a nutshell, multiple Client component
instances connect and send random DML statements to multiple Middleware
(server) instances, which in turn execute those DML statements against a
Database server and send the results back to the Clients. After carefully
studying the description and requirements I have two choices for the
communication between the Client and Middleware:

- Classic sockets + thread pool.
- NIO multiplexing.

I think I will end up implementing both, but I am curious about the NIO
multiplexing case: in your experience, would it increase performance to
set up a Thread Pool and associate one SelectionKey to one specific Thread
in the pool so that all read/write operations for each specific channel go
through one specific Thread from the Pool? From all code
examples/tutorials/books I have reviewed online the "Selector Thread" seems
like a bottleneck to me. I haven't seen this approach anywhere, so that's
why I am asking.

Thanks in advance,
Best regards,
Giovanni

PS: first time I post from Entourage .. hope the formatting etc is not too
horrible.
 

Tom Anderson

would it increase performance to set up a Thread Pool and associate one
SelectionKey to one specific Thread in the pool so that all read/write
operations for each specific channel go through one specific Thread from
the Pool?

That would not be a thread pool. That would be a thread-per-connection
model. I can't say that the performance would be worse than with a real
thread pool, but i doubt it would be better than using blocking IO.

From all code examples/tutorials/books I have reviewed online the
"Selector Thread" seems like a bottleneck to me.

If the selector thread's only job is to identify channels which are ready
to be worked on, handing them off to worker threads to be processed, then
it doesn't really have a lot of work of its own to do, and so it's
unlikely to be a bottleneck.

If you're processing the channels in the selector thread (in which case
it's not really a selector thread), then it could be a bottleneck. So, you
can have several of them; Selector.select() is threadsafe, so with a
little attention to locking, you can have many threads selecting and then
working. This is called the leader/followers model.
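
For illustration, here is a rough sketch of that division of labour - the
class name, pool size and buffer size are arbitrary and the error handling
is minimal:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Selector thread that only selects and dispatches; workers do the actual IO. */
final class Dispatcher implements Runnable {
    private final Selector selector;
    private final ExecutorService workers = Executors.newFixedThreadPool(8);

    Dispatcher(Selector selector) {
        this.selector = selector;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isValid() && key.isReadable()) {
                        // Stop watching the channel while a worker owns it,
                        // otherwise select() keeps reporting it as ready.
                        key.interestOps(0);
                        workers.submit(() -> handle(key));
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void handle(SelectionKey key) {
        SocketChannel channel = (SocketChannel) key.channel();
        ByteBuffer buffer = ByteBuffer.allocate(4096);
        try {
            if (channel.read(buffer) < 0) {
                channel.close();
                return;
            }
            // ... process the request here ...
            key.interestOps(SelectionKey.OP_READ); // hand the channel back to the selector
            key.selector().wakeup();               // so the restored interest takes effect
        } catch (IOException e) {
            try { channel.close(); } catch (IOException ignored) { }
        }
    }
}

The important bit is clearing the interest ops before the hand-off; without
that, the selector spins on the same ready channel.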

One thing you could consider, if using a single selector thread and a pool
of worker threads, is trying to create some affinity between channels and
threads, so that the same thread always handles the work on a particular
channel (and on several channels); on a multiprocessor machine, this
should increase cache performance, but is not otherwise particularly
valuable. This is not entirely straightforward, because of thread and
channel starvation issues; you should read up on work-stealing if you want
to do that.

tom
 

Giovanni Azua

Hi Tom,

Thank you for your insights.

If the selector thread's only job is to identify channels which are ready
to be worked on, handing them off to worker threads to be processed, then
it doesn't really have a lot of work of its own to do, and so it's
unlikely to be a bottleneck.

If you're processing the channels in the selector thread (in which case
it's not really a selector thread), then it could be a bottleneck. So, you
can have several of them; Selector.select() is threadsafe, so with a
little attention to locking, you can have many threads selecting and then
working. This is called the leader/followers model.

One thing you could consider, if using a single selector thread and a pool
of worker threads, is trying to create some affinity between channels and
threads, so that the same thread always handles the work on a particular
channel (and on several channels); on a multiprocessor machine, this
should increase cache performance, but is not otherwise particularly
valuable. This is not entirely straightforward, because of thread and
channel starvation issues; you should read up on work-stealing if you want
to do that.

This is exactly what I was asking in the OP: this "affinity between channels
and threads", as you call it. I didn't mean a "thread-per-connection model".
I will try to find a way to implement this affinity on top of the
ThreadPoolExecutor even though it is not very clear from its API that it can
be done.
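
One way I can imagine doing it - purely a sketch, since nothing in the
ThreadPoolExecutor API supports affinity directly - is to keep an array of
single-thread executors and pick one per channel by hashing, so all work for
a given channel always lands on the same thread (the names are made up):

import java.nio.channels.SocketChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Channel-to-thread affinity built from plain single-thread executors (sketch). */
final class AffinityPool {
    private final ExecutorService[] lanes;

    AffinityPool(int size) {
        lanes = new ExecutorService[size];
        for (int i = 0; i < size; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    /** All work for the same channel is executed by the same thread. */
    void execute(SocketChannel channel, Runnable task) {
        int lane = Math.floorMod(System.identityHashCode(channel), lanes.length);
        lanes[lane].execute(task);
    }
}

A real version would also need shutdown handling and, as you say, some care
about starvation if one lane ends up with all the busy channels.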

Among others, I was following this tutorial
<http://rox-xmlrpc.sourceforge.net/niotut/index.html>, and there the author
somewhat overloads the single "Selector Thread" with reading and sending
data. AFAIK there is no sample code doing Selector.select() from separate
threads.

Thank you again.
Best regards,
Giovanni
 

John B. Matthews

Giovanni Azua said:
I need to build a Client-Middleware-Database architecture in the
context of a course in my grad studies. In a nutshell, multiple
Client component instances connect and send random DML statements to
multiple Middleware (server) instances, which in turn execute those
DML statements against a Database server and send the results back to
the Clients. After carefully studying the description and requirements
I have two choices for the communication between the Client and Middleware:

- Classic sockets + thread pool.
- NIO multiplexing.

I think I will end up implementing both, but I am curious about the NIO
multiplexing case: in your experience, would it increase performance to
set up a Thread Pool and associate one SelectionKey to one specific
Thread in the pool so that all read/write operations for each specific
channel go through one specific Thread from the Pool? From all code
examples/tutorials/books I have reviewed online the "Selector Thread"
seems like a bottleneck to me. I haven't seen this approach anywhere,
so that's why I am asking.

This related thread may be helpful:

<http://groups.google.com/group/comp.lang.java.programmer/browse_thread/thread/cd054f226e2d5c82>
 

Giovanni Azua

Hello Matthews,

I found that presentation and the follow-up threads on the internet yesterday
and I must admit I was surprised :) I don't exactly know where this notion of
NIO being faster comes from, but it is already widespread knowledge that it
is supposed to be faster :)

I also read about blocking NIO, which might be more efficient than classic
IO. So apparently there are three choices to consider here.
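
(To make sure I understand it, by "blocking NIO" I mean simply leaving the
SocketChannel in its default blocking mode, roughly like this sketch; the
host, port and payload are of course made up:)

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class BlockingNioClient {
    public static void main(String[] args) throws Exception {
        // A SocketChannel is blocking by default; no Selector is involved at all.
        try (SocketChannel channel =
                 SocketChannel.open(new InetSocketAddress("localhost", 9090))) {
            channel.write(ByteBuffer.wrap(
                    "UPDATE t SET x = 1".getBytes(StandardCharsets.UTF_8)));
            ByteBuffer reply = ByteBuffer.allocate(4096);
            channel.read(reply); // blocks until some bytes arrive
            reply.flip();
            System.out.println(StandardCharsets.UTF_8.decode(reply));
        }
    }
}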

The meat of the course I am taking, "Advanced Performance System Analysis"
(which is a mandatory CS Master subject at ETH), is not so much about
building the architecture but more about modeling it quantitatively, i.e.
modeling throughput and runtime, modeling the performance using Queuing
Theory and Little's Law, and doing experimental analysis and ANOVA to find
which parameters (CUT) most drastically affect performance, e.g. the number
of threads in the ThreadPoolExecutor or the number of database servers.
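
(To make that concrete for myself, with completely made-up numbers, Little's
Law for the middleware would read:

  N = lambda * R   (mean number in system = arrival rate * mean response time)

  e.g. lambda = 200 requests/s and R = 0.05 s  =>  N = 200 * 0.05 = 10
  requests in the middleware on average.)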

That's why I was thinking that supporting both means of communication would
be interesting for me, so I would 1) finally learn NIO and 2) have this
choice of communication as another "degree of freedom" in the system as part
of the experimental performance analysis, i.e. try to answer whether NIO is
faster than classic IO, though I guess this will depend a lot on the quality
of my implementation.

Thank you.
Best regards,
Giovanni
 

Robert Klemme

That would not be a thread pool. That would be a thread-per-connection
model. I can't say that the performance would be worse than with a real
thread pool, but i doubt it would be better than using blocking IO.

As far as I can see Giovanni did not say that there was a 1:1
relationship between threads and channels. The approach does actually
make sense if there is a fixed limit on the number of channels one
thread is responsible for. The idea being that a single thread can pull
off only so much IO, and if the number of clients is not limited a
single thread may indeed be the bottleneck. On the other hand this
approach does create fewer threads than the thread-per-connection model
with blocking IO. Whether it is actually worthwhile, considering the
higher implementation effort of this scheme vs. the simplest approach
(thread-per-connection with blocking IO), is another question. If the
number of channels is limited in some way I'd rather go for the simple
implementation with blocking IO.
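
For reference, that simplest approach is small enough to sketch here; the
port number, pool size and the echo "handler" are of course just placeholders:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** The "simple" baseline: blocking IO, one pooled thread per live connection. */
public class BlockingServer {

    public static void main(String[] args) throws IOException {
        // The pool size caps how many connections are served at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(200);
        try (ServerSocket server = new ServerSocket(9090)) {
            while (true) {
                Socket client = server.accept();
                pool.execute(() -> handle(client));
            }
        }
    }

    private static void handle(Socket client) {
        try (Socket c = client) {
            InputStream in = c.getInputStream();
            OutputStream out = c.getOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                // ... parse the DML request and execute it against the database ...
                out.write(buffer, 0, n); // echo back as a placeholder reply
            }
        } catch (IOException e) {
            // client went away; nothing more to do in a sketch
        }
    }
}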

Some more reading
http://www.scs.stanford.edu/~dm/home/papers/dabek:event.pdf
http://www.eecs.harvard.edu/~mdw/papers/events.pdf

There is another paper which I cannot find at the moment and which
demonstrates advantages of a thread based implementation with a
threading library which has rather low overhead. I'll try to dig it up
after the weekend.

If the selector thread's only job is to identify channels which are
ready to be worked on, handing them off to worker threads to be
processed, then it doesn't really have a lot of work of its own to do,
and so it's unlikely to be a bottleneck.

If you're processing the channels in the selector thread (in which case
it's not really a selector thread), then it could be a bottleneck. So,
you can have several of them; Selector.select() is threadsafe, so with a
little attention to locking, you can have many threads selecting and
then working. This is called the leader/followers model.

I think the approach with fixed responsibilities of selector threads to
multiple channels scales better than using a single Selector for all
channels from multiple threads. But that may depend on the use case.
Especially in light of short-lived connections I can see some overhead
for the assignment of channels to threads. Also, rebalancing might be
an issue in case one thread has significantly fewer channels to handle
than others.

One thing you could consider, if using a single selector thread and a
pool of worker threads, is trying to create some affinity between
channels and threads, so that the same thread always handles the work on
a particular channel (and on several channels); on a multiprocessor
machine, this should increase cache performance, but is not otherwise
particularly valuable. This is not entirely straightforward, because of
thread and channel starvation issues; you should read up on
work-stealing if you want to do that.

It's an interesting topic and getting it right (high throughput, low
resource usage) is certainly not easy.

Kind regards

robert
 

Tom Anderson

As far as I can see Giovanni did not say that there was a 1:1 relationship
between threads and channels.

I think i read "associate one SelectionKey to one specific Thread" as
meaning that, but evidently wrongly.

The approach does actually make sense if there is a fixed limit on the
number of channels one thread is responsible for. The idea being that a
single thread can pull off only so much IO, and if the number of clients
is not limited a single thread may indeed be the bottleneck. On the
other hand this approach does create fewer threads than the
thread-per-connection model with blocking IO.

Hold on - there are two things here.

One, an N:M threading model, where you have N connections being served by
M threads, where N >> M. That, i readily agree, could offer better
performance than N:N or N:1 models.

But two, the idea that the M threads should have affinity for connections
- that, in effect, we would have an M*(N/M:1) model (i'm sure that
notation is crystal clear!). That, i am skeptical about.

I think the approach with fixed responsibilities of selector threads to
multiple channels scales better than using a single Selector for all
channels from multiple threads.

Yes. The latter is sort of N:1:M, and that has twice as many colons, where
colons are communication overhead.

But i don't see why the thread-affine M*(N/M:1) model would be any faster
than the simpler N:M (leader/followers) model.

tom
 

Robert Klemme

I think i read "associate one SelectionKey to one specific Thread" as
meaning that, but evidently wrongly.


Hold on - there are two things here.

One, an N:M threading model, where you have N connections being served
by M threads, where N >> M. That, i readily agree, could offer better
performance than N:N or N:1 models.

OK.

But two, the idea that the M threads should have affinity for
connections - that, in effect, we would have an M*(N/M:1) model (i'm
sure that notation is crystal clear!). That, i am skeptical about.

Why?


Yes. The latter is sort of N:1:M, and that has twice as many colons,
where colons are communication overhead.

More importantly, colons are _synchronization overhead_!

But i don't see why the thread-affine M*(N/M:1) model would be any
faster than the simpler N:M (leader/followers) model.

Because you reduce synchronization overhead. If you have 1 or X threads
doing the selecting and distributing the work over M >> X handler
threads for N >> M channels then you have much higher potential for
collisions than if you have X * (1 selector for M/X handler threads).
Basically this is a way to partition the application into independent parts.
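
A rough sketch of what I mean, with all names and sizes made up: each group
owns its own selector thread, its own handler pool and therefore its own
dispatch queue, so handler threads of different groups never contend on the
same synchronization point.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Partitioning sketch: X independent groups of (1 selector + M/X handlers). */
final class PartitionedReactor {

    static final class Group implements Runnable {
        private final Selector selector;
        private final ExecutorService handlers; // the M/X handlers private to this group
        private final Queue<SocketChannel> pending = new ConcurrentLinkedQueue<>();

        Group(int handlerThreads) throws IOException {
            selector = Selector.open();
            handlers = Executors.newFixedThreadPool(handlerThreads);
        }

        void add(SocketChannel channel) {
            pending.add(channel);
            selector.wakeup(); // let this group's selector pick the new channel up
        }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    selector.select();
                    SocketChannel c;
                    while ((c = pending.poll()) != null) {
                        c.configureBlocking(false);
                        c.register(selector, SelectionKey.OP_READ);
                    }
                    for (SelectionKey key : selector.selectedKeys()) {
                        if (key.isValid() && key.isReadable()) {
                            key.interestOps(0);               // parked while a handler owns it
                            handlers.submit(() -> read(key)); // contention only within this group
                        }
                    }
                    selector.selectedKeys().clear();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        private static void read(SelectionKey key) {
            SocketChannel channel = (SocketChannel) key.channel();
            try {
                ByteBuffer buffer = ByteBuffer.allocate(4096);
                if (channel.read(buffer) < 0) {
                    channel.close();
                    return;
                }
                // ... handle the request ...
                key.interestOps(SelectionKey.OP_READ); // hand the channel back to the selector
                key.selector().wakeup();
            } catch (IOException e) {
                try { channel.close(); } catch (IOException ignored) { }
            }
        }
    }

    private final Group[] groups;

    PartitionedReactor(int groupCount, int handlersPerGroup) throws IOException {
        groups = new Group[groupCount];
        for (int i = 0; i < groupCount; i++) {
            groups[i] = new Group(handlersPerGroup);
            new Thread(groups[i], "reactor-" + i).start();
        }
    }

    /** A channel is pinned to one group for its whole lifetime. */
    void register(SocketChannel channel) {
        groups[Math.floorMod(System.identityHashCode(channel), groups.length)].add(channel);
    }
}

Note that register() pins a channel to a group for its lifetime, which is
exactly where the rebalancing problem mentioned below comes in.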

The point at which this effect has a negative impact depends of course on
the characteristics of the communication, which I have mentioned
elsewhere IIRC. These are: the number of channels, of course, plus the
message rate and message size per channel. In other words, the higher
the throughput per channel and the higher the number of channels, the
earlier you will see the negative effects of too many threads having to go
through the synchronization necessary to distribute the work.

It will be tricky, though, to deal with changes in the number of open
channels in a way that the whole partitioning model is not violated too
much, while resources (threads) are not wasted to the point that there is
just one handler thread per channel, which could certainly happen if there
were no rebalancing (e.g. all but one of the channels a thread is
responsible for have died and the remaining channel is not reassigned to
another group).

Interesting stuff. :)

Kind regards

robert
 

markspace

Because you reduce synchronization overhead. If you have 1 or X threads
doing the selecting and distributing the work over M >> X handler
threads for N >> M channels then you have much higher potential for
collisions than if you have X * (1 selector for M/X handler threads).


<http://www.myfacewhen.net/uploads/954-not-sure-if-serious.jpg>

Really? Is synchronization, relative to an app that's doing network IO,
a serious concern? I can't imagine that synchronization overhead is
even within three orders of magnitude of the IO overhead.

Note that poor structuring, like organizing your app such that the whole
thing is serialized on one lock, isn't synchronization overhead. It's
just blocked.
 

Robert Klemme

Really? Is synchronization, relative to an app that's doing network IO,
a serious concern? I can't imagine that synchronization overhead is even
within three orders of magnitude of the IO overhead.

It all depends, of course, on how it's done. But imagine 10,000 open
channels with traffic, 1,000 reader threads and a single thread doing
selecting and dispatching to reader threads. If you do not partition
(and this is the case I was talking about) you probably put something
into a single queue from which those 1,000 reader threads read and the
selector thread is hammering it all the time from the other end. And it
gets worse if the number of channels increases. If OTOH you have a
fixed ratio between channels, selector thread and reader threads you
limit the concurrency on the synchronization point.
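
Concretely, the non-partitioned variant I have in mind degenerates into
something like this (the class name is made up and the handler body is
elided):

import java.nio.channels.SelectionKey;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** The single synchronization point: one queue, one producer, many consumers. */
final class SingleQueueDispatch {
    // Every ready key from all 10,000 channels funnels through this one queue.
    static final BlockingQueue<SelectionKey> readyKeys = new LinkedBlockingQueue<>();

    public static void main(String[] args) {
        for (int i = 0; i < 1_000; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        handle(readyKeys.take()); // 1,000 readers contend here ...
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "reader-" + i).start();
        }
        // ... while the single selector thread calls readyKeys.put(key) for every
        // ready channel, hammering the same queue from the other end.
    }

    static void handle(SelectionKey key) {
        // read from key.channel(), process, restore interest ops, etc.
    }
}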

Note that poor structuring, like organizing your app such that the whole
thing is serialized on one lock, isn't synchronization overhead. It's
just blocked.

Suggest a synchronization mechanism for the use case I described above
(1 selector, 10,000 channels, 1,000 readers) and then we can talk about
properties of that.

Kind regards

robert
 

markspace

It all depends, of course, on how it's done. But imagine 10,000 open channels
with traffic, 1,000 reader threads and a single thread doing selecting
and dispatching to reader threads.


How much traffic? One single 100 gigabit port running all out?

I think this scenario would depend on the number of packets received,
not the actual traffic volume. A hundred 1-byte packets will require a lot
more attention than one 100-byte packet. Packet reception should be where
the bottleneck is.

I think you'd have to profile an actual application to determine where
the time is really spent. I'd like to see what someone with 10k open
ports is actually doing, and what sort of hardware they're running on.

If you do not partition (and this is
the case I was talking about) you probably put something into a single
queue from which those 1,000 reader threads read and the selector thread
is hammering it all the time from the other end.


There's only one physical network controller. Running flat out and with
an average packet length of 100 bytes (something like a typical HTTP GET
request), I estimated that the network controller would have to write
out one packet to memory every 30 nanoseconds.

I'm not sure that's feasible.

Wikipedia indicates that DDR2 memory has a bandwidth of around 12 GB/s.
That's on the order of 100 Gb ethernet. But "on the order of" means
there's no bandwidth left for any other device, including the CPU or
hard disc.
Suggest a synchronization mechanism for the use case I described above
(1 selector, 10,000 channels, 1,000 readers) and then we can talk about
properties of that.


Show me an actual app first; I'm not sure what you've proposed is
even slightly realistic.

1 selector = 1 physical network device. I'm not sure banging on one
device with 10 threads is somehow better than just letting one thread
interface with the device.
 

Lew

markspace said:

Context switching is a different issue from synchronization. Context switches happen even in non-critical sections, where synchronization does not apply.

This does not invalidate the rest of Pete's points since they emanate from the cost of context switching, which is actually more frequent and a more impactful phenomenon than synchronization.

So in the simple-minded case where a separate single thread handles each connection there is possibly no synchronization at all between the threads, but context switching still militates against this technique for too many concurrent connections.

OTOH, choice of operating system and hardware can influence this equation. For example, the QNX operating system on Intel/AMD is famously fast at context switches.
 

markspace

Context switching is a different issue from synchronization. Context
switches happen even in non-critical sections, where synchronization
does not apply.


Thanks for pointing that out, since it is indeed true. Not all
synchronization requires a context switch, or even an OS call. Not all
multi-threading requires context switching either.

This does not invalidate the rest of Pete's points since they emanate
from the cost of context switching, which is actually more frequent
and a more impactful phenomenon than synchronization.


What I thought most interesting about that stackoverflow page was the
assertion that different methods work optimally on different systems.

That is, IOCP works well on Windows because the underlying system has
been optimized for IOCP, whereas Linux "notification" (selectors, I
think) works well on *nix because that path has been optimized for *nix
systems.

I guess my point was "test what you have, not what the other guy says
works." Just because Java doesn't use IOCP, or does use selectors,
doesn't mean it will/won't run well on your target system. There are lots
of ways of doing things "optimally," but you always gotta test to verify
it's working the way you think it is.
 

markspace



Imagine a situation where the number of threads is matched to the number
of CPUs. No need to switch out a context there (though it might happen
anyway, there isn't a *need*). Just pass data back and forth. Memory
barriers (happens-before in Java) are all you need in this case.
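
Something like this toy sketch, say. It assumes Java 9+ for
Thread.onSpinWait() and deliberately busy-waits instead of blocking, which
only makes sense when each thread really has a core to itself:

import java.util.concurrent.atomic.AtomicReference;

/** Passing data between two pinned threads with nothing but memory barriers. */
public class SpinHandoff {
    // getAndSet/compareAndSet on the AtomicReference establish the
    // happens-before edge; no lock, no OS call, no forced context switch.
    private static final AtomicReference<String> slot = new AtomicReference<>();

    public static void main(String[] args) {
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                while (!slot.compareAndSet(null, "message " + i)) {
                    Thread.onSpinWait(); // busy-wait instead of blocking (Java 9+)
                }
            }
        });
        Thread consumer = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                String msg;
                while ((msg = slot.getAndSet(null)) == null) {
                    Thread.onSpinWait();
                }
                System.out.println(msg);
            }
        });
        producer.start();
        consumer.start();
    }
}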
 

Robert Klemme

Imagine a situation where the number of threads is matched to the number
of CPUs. No need to switch out a context there (though it might happen
anyway, there isn't a *need*). Just pass data back and forth. Memory
barriers (happens-before in Java) are all you need in this case.

That sounds fairly theoretical. On the one hand, a system typically has
more processes (let alone threads) than cores. On the other hand, even if
the number of cores is large enough, you will have management overhead
through the OS's scheduler.

Kind regards

robert
 
