Massive Distributed System!

Wildfire_Heat · Nov 1, 2005

As part of my research I have created this huge Java distributed system
for calculations.

The system is composed of a single "Sender" which sends jobs to a farm
of "Calculators" running on about 1000 different host machines. The
Calculators send their results to a single "Receiver". Jobs can be
easy, e.g. 1+1, or hard, e.g. something like log(1/[cos(ln3)]). As a
result, some jobs finish in seconds while others may even take an hour.

I have implemented this system and it works fine. But I am sure there
is room for improvement and optimisation.

It needs to be:
more robust. E.g. I need good monitoring to deal with hosts crashing
and dying while performing a calculation. How do I detect this? Shall I
use the old heartbeat method? I don't think I can risk too many
messages on the network.
more scalable. E.g. what if I increase the number of hosts? Can RMI
handle it, or is there an alternative?
more adaptive to the type of job, etc.

Any other ideas or links to more info are welcome.

Thanks.

David N. Welton · Nov 1, 2005

Wildfire_Heat said:
I have implemented this system and it works fine. But I am sure there
is room for improvement and optimisation.

It needs to be:
more robust. E.g. I need good monitoring to deal with hosts crashing
and dying while performing a calculation. How do I detect this? Shall I
use the old heartbeat method? I don't think I can risk too many
messages on the network.
more scalable E.g. what if I increase the number of hosts. Can RMI
handle it or is there an alternative?
more adaptive to the type of job etc

Any other ideas or links to more info are welcome.

You might have a look at how the Erlang folks are doing things - there
are some good ideas there for reliable, distributed systems.

http://www.erlang.org/

Ciao,
--
David N. Welton
- http://www.dedasys.com/davidw/

Linux, Open Source Consulting
- http://www.dedasys.com/

Roedy Green · Nov 1, 2005

It needs to be:
more robust. E.g. I need good monitoring to deal with hosts crashing
and dying while performing a calculation. How do I detect this? Shall I
use the old heartbeat method?

If you have many machines doing the same task, you can presume, say,
that at most 10% of them will crash. So you wait for 90% of the
results to be in, then wait another 10% of that elapsed time, then you
send out the posse for the remainder -- an "are you still alive" UDP
packet.
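
A rough sketch of that rule in Java, in case it helps. The class name
StragglerTracker, the "ALIVE?" payload and the idea that each Calculator
runs a small UDP responder on some port are all assumptions made for
illustration; none of them come from the system described above. Times
are passed in explicitly so the timing logic can be tried without a
network.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: decides when to start probing hosts that have not answered. */
final class StragglerTracker {
    private final Set<String> pending = new LinkedHashSet<>();
    private final int total;
    private final long startMillis;
    private long probeAtMillis = -1; // stays -1 until 90% of the results are in

    StragglerTracker(Set<String> hosts, long startMillis) {
        pending.addAll(hosts);
        this.total = hosts.size();
        this.startMillis = startMillis;
    }

    /** Records a result; once 90% are in, schedules the probe after 10% more of the elapsed time. */
    void resultReceived(String host, long nowMillis) {
        if (!pending.remove(host)) {
            return; // unknown host or duplicate result
        }
        int done = total - pending.size();
        if (probeAtMillis < 0 && done * 10 >= total * 9) {
            long elapsed = nowMillis - startMillis;
            probeAtMillis = nowMillis + elapsed / 10;
        }
    }

    /** Returns the hosts to probe now; empty until the grace period has passed. */
    Set<String> hostsToProbe(long nowMillis) {
        if (probeAtMillis < 0 || nowMillis < probeAtMillis) {
            return new LinkedHashSet<>();
        }
        return new LinkedHashSet<>(pending);
    }

    /** Sends one UDP "are you still alive" packet and waits for any reply (assumed responder). */
    static boolean isAlive(InetAddress host, int port, int timeoutMillis) throws IOException {
        byte[] ping = "ALIVE?".getBytes(StandardCharsets.US_ASCII);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(timeoutMillis);
            socket.send(new DatagramPacket(ping, ping.length, host, port));
            byte[] buf = new byte[16];
            socket.receive(new DatagramPacket(buf, buf.length));
            return true;
        } catch (SocketTimeoutException e) {
            return false; // no reply in time; UDP can drop packets, so retry before giving up
        }
    }
}
```

Because nothing is sent until most of the results are in, the extra
network traffic is limited to roughly the slowest 10% of hosts. A single
lost UDP packet does not prove a host is dead, so a couple of retries
before re-sending the job elsewhere is probably wise.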

Patrick May · Nov 1, 2005

Wildfire_Heat said:
As part of my research I have created this huge java distributed
system for calculations.

The system is composed of a single "Sender" which sends jobs to a
farm of "Calculators" running on about 1000 different host
machines. The Calculators send their results to a single
"Receiver". Jobs could be easy e.g. 1+1 or hard e.g. something like
log(1/[cos(ln3)]). As a result some jobs finish in seconds while
others may even take an hour.

I have implemented this system and it works fine. But I am sure
there is room for improvement and optimisation.

It needs to be:
more robust. E.g. I need good monitoring to deal with hosts crashing
and dying while performing a calculation. How do I detect this?
Shall I use the old heartbeat method? I don't think I can risk too
many messages on the network.

This is a classic example of an application that could benefit
from Jini technology and JavaSpaces in particular. The basic idea is
known as the Master/Worker Pattern. Googling quickly turns up a
number of hits, including this one:

http://today.java.net/pub/a/today/2005/04/21/farm.html?page=last&x-maxdepth=0

Robustness is provided through a combination of transactional
interaction with the JavaSpace and Jini's leasing mechanism
(http://www.jini.org/Newsletter/DesignCorner/jini_intro_jun05.html).
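
To make the leasing idea concrete, here is a minimal plain-Java
sketch. It is NOT the Jini or JavaSpaces API; the class LeasedTaskBoard
and its methods are hypothetical and only illustrate the principle: a
worker holds a task under a time-limited lease, must renew the lease
while it is still working, and a task whose lease lapses goes back on
the queue for another worker. Task names are assumed to be unique.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical in-memory task board with leases (illustration only, not Jini). */
final class LeasedTaskBoard {
    private final Deque<String> queued = new ArrayDeque<>();
    private final Map<String, Long> leaseExpiry = new HashMap<>(); // task -> expiry time
    private final long leaseMillis;

    LeasedTaskBoard(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    synchronized void submit(String task) {
        queued.addLast(task);
    }

    /** Hands out the next task under a fresh lease, or null if nothing is queued. */
    synchronized String take(long nowMillis) {
        reclaimExpired(nowMillis);
        String task = queued.pollFirst();
        if (task != null) {
            leaseExpiry.put(task, nowMillis + leaseMillis);
        }
        return task;
    }

    /** Extends the lease of a long-running task; false if the lease has already lapsed. */
    synchronized boolean renew(String task, long nowMillis) {
        Long expiry = leaseExpiry.get(task);
        if (expiry == null || expiry <= nowMillis) {
            return false;
        }
        leaseExpiry.put(task, nowMillis + leaseMillis);
        return true;
    }

    /** Marks a task as done; false if the caller no longer holds a valid lease. */
    synchronized boolean complete(String task, long nowMillis) {
        Long expiry = leaseExpiry.get(task);
        if (expiry == null || expiry <= nowMillis) {
            return false;
        }
        leaseExpiry.remove(task);
        return true;
    }

    /** Re-queues every task whose worker stopped renewing its lease. */
    private void reclaimExpired(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = leaseExpiry.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= nowMillis) {
                queued.addLast(entry.getKey());
                it.remove();
            }
        }
    }
}
```

With this shape, a crashed host needs no special detection: it simply
stops renewing, and its job is handed to the next worker that asks. A
job that takes an hour just keeps renewing a short lease.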

more scalable E.g. what if I increase the number of hosts. Can RMI
handle it or is there an alternative? more adaptive to the type of
job etc

I strongly recommend that you visit http://www.jini.org and give
Jini a try. It is perfectly suited to your requirements.

Regards,

Patrick

Patrick May · Nov 1, 2005

Roedy Green said:
If you have many machines doing the same task, you can presume say
that at most 10% of them will crash. So you wait for 90% of the
results to be in, then wait another 10% of that elapsed time, then
you send out the posse for the remainder -- an "are you still alive"
UDP packet.

No need to reinvent the wheel. Jini's leasing mechanism provides
this kind of resiliency out of the box. See http://www.jini.org for
more details.

Regards,

Patrick

Roedy Green · Nov 1, 2005

I strongly recommend that you visit http://www.jini.org and give
Jini a try. It is perfectly suited to your requirements.

It depends on what you perceive as his requirements. Does he want to
solve the problem or is he trying to learn about the nuts and bolts
under the hood by constructing something himself from the ground up?

In either case he should at least examine Jini to get a rough idea of
how they solved the problem.
