Massive Distributed System!

Wildfire_Heat

As part of my research I have created this huge Java distributed system
for calculations.

The system is composed of a single "Sender" which sends jobs to a farm
of "Calculators" running on about 1000 different host machines. The
Calculators send their results to a single "Receiver". Jobs could be
easy e.g. 1+1 or hard e.g. something like log(1/[cos(ln3)]). As a
result some jobs finish in seconds while others may even take an hour.

I have implemented this system and it works fine. But I am sure there
is room for improvement and optimisation.

It needs to be:
- more robust, e.g. I need good monitoring to deal with hosts crashing
  and dying while performing a calculation. How do I detect this? Shall I
  use the old heartbeat method? I don't think I can risk too many
  messages on the network.
- more scalable, e.g. what if I increase the number of hosts? Can RMI
  handle it, or is there an alternative?
- more adaptive to the type of job, etc.

Any other ideas or links to more info are welcome.

Thanks.
 
David N. Welton

Wildfire_Heat said:
I have implemented this system and it works fine. But I am sure there
is room for improvement and optimisation.

It needs to be:
- more robust, e.g. I need good monitoring to deal with hosts crashing
  and dying while performing a calculation. How do I detect this? Shall I
  use the old heartbeat method? I don't think I can risk too many
  messages on the network.
- more scalable, e.g. what if I increase the number of hosts? Can RMI
  handle it, or is there an alternative?
- more adaptive to the type of job, etc.

Any other ideas or links to more info are welcome.

You might have a look at how the Erlang folks are doing things - there
are some good ideas there for reliable, distributed systems.

http://www.erlang.org/

Ciao,
--
David N. Welton
- http://www.dedasys.com/davidw/

Linux, Open Source Consulting
- http://www.dedasys.com/
 
Roedy Green

Wildfire_Heat said:
It needs to be:
- more robust, e.g. I need good monitoring to deal with hosts crashing
  and dying while performing a calculation. How do I detect this? Shall I
  use the old heartbeat method?

If you have many machines doing the same task, you can presume say
that at most 10% of them will crash. So you wait for 90% of the
results to be in, then wait another 10% of that elapsed time, then you
send out the posse for the remainder -- an "are you still alive" UDP
packet.
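
The timing rule above can be written down as a small policy class. This is only a sketch under stated assumptions: the class name, the percentages passed in, the port, and the "ALIVE?" payload are all made up for illustration and are not taken from any existing system.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Decides when to start probing hosts that have not yet reported a result. */
final class StragglerPolicy {
    private final int totalJobs;
    private final int expectedPercent; // share of results awaited before the grace timer starts, e.g. 90
    private final int gracePercent;    // extra wait as a share of the time elapsed so far, e.g. 10

    public StragglerPolicy(int totalJobs, int expectedPercent, int gracePercent) {
        this.totalJobs = totalJobs;
        this.expectedPercent = expectedPercent;
        this.gracePercent = gracePercent;
    }

    /** Number of results that must be in before the grace period starts (rounded up). */
    public int threshold() {
        return (totalJobs * expectedPercent + 99) / 100;
    }

    /** Elapsed time at which the missing hosts should be probed. */
    public Duration probeDeadline(Duration elapsedAtThreshold) {
        long extraMillis = elapsedAtThreshold.toMillis() * gracePercent / 100;
        return elapsedAtThreshold.plusMillis(extraMillis);
    }

    /** True once the threshold was reached, the grace period has passed, and results are still missing. */
    public boolean shouldProbe(int resultsReceived, Duration elapsedAtThreshold, Duration elapsedNow) {
        if (resultsReceived >= totalJobs || resultsReceived < threshold()) {
            return false;
        }
        return elapsedNow.compareTo(probeDeadline(elapsedAtThreshold)) >= 0;
    }

    /** Sends one "are you still alive" datagram; payload and port are hypothetical. */
    public static void sendProbe(InetAddress host, int port) throws IOException {
        byte[] payload = "ALIVE?".getBytes(StandardCharsets.US_ASCII);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length, host, port));
        }
    }
}
```

With 1000 jobs, `new StragglerPolicy(1000, 90, 10)` waits for 900 results. If those took 10 minutes, it starts probing the rest at the 11 minute mark. Nothing is sent on the network until then, which keeps the message count low compared with a constant heartbeat.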
 
Patrick May

Wildfire_Heat said:
As part of my research I have created this huge Java distributed
system for calculations.

The system is composed of a single "Sender" which sends jobs to a
farm of "Calculators" running on about 1000 different host
machines. The Calculators send their results to a single
"Receiver". Jobs could be easy e.g. 1+1 or hard e.g. something like
log(1/[cos(ln3)]). As a result some jobs finish in seconds while
others may even take an hour.

I have implemented this system and it works fine. But I am sure
there is room for improvement and optimisation.

It needs to be:
- more robust, e.g. I need good monitoring to deal with hosts crashing
  and dying while performing a calculation. How do I detect this?
  Shall I use the old heartbeat method? I don't think I can risk too
  many messages on the network.

This is a classic example of an application that could benefit
from Jini technology and JavaSpaces in particular. The basic idea is
known as the Master/Worker Pattern. Googling quickly turns up a
number of hits, including this one:

http://today.java.net/pub/a/today/2005/04/21/farm.html?page=last&x-maxdepth=0

Robustness is provided through a combination of transactional
interaction with the JavaSpace and Jini's leasing mechanism
(http://www.jini.org/Newsletter/DesignCorner/jini_intro_jun05.html).

Wildfire_Heat said:
- more scalable, e.g. what if I increase the number of hosts? Can RMI
  handle it, or is there an alternative?
- more adaptive to the type of job, etc.

I strongly recommend that you visit http://www.jini.org and give
Jini a try. It is perfectly suited to your requirements.
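
To make the Master/Worker pattern mentioned above concrete, here is a minimal single-JVM sketch. It deliberately does not use the JavaSpaces API: two `java.util.concurrent` queues stand in for the space, and the names `MasterWorkerSketch`, `Job`, `Result` and `run` are invented for this example. It shows only the shape of the pattern (workers pull jobs when they are free, so slow jobs do not block fast ones). It does not give the transactional or leasing guarantees described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.DoubleUnaryOperator;

final class MasterWorkerSketch {
    /** A unit of work; the fields are a hypothetical job shape. */
    static final class Job {
        final int id;
        final double input;
        Job(int id, double input) { this.id = id; this.input = input; }
    }

    /** The answer for one job, keyed by the job's id. */
    static final class Result {
        final int jobId;
        final double value;
        Result(int jobId, double value) { this.jobId = jobId; this.value = value; }
    }

    /** Master: publishes one job per input, then collects exactly as many results. */
    public static Map<Integer, Double> run(double[] inputs, int workers, DoubleUnaryOperator work) {
        BlockingQueue<Job> jobs = new LinkedBlockingQueue<>();       // stand-in for the shared space
        BlockingQueue<Result> results = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        Job job = jobs.take(); // blocks until a job is available
                        results.put(new Result(job.id, work.applyAsDouble(job.input)));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // pool was shut down: stop quietly
                }
            });
        }
        Map<Integer, Double> collected = new HashMap<>();
        try {
            for (int i = 0; i < inputs.length; i++) {
                jobs.put(new Job(i, inputs[i]));
            }
            for (int i = 0; i < inputs.length; i++) {
                Result r = results.take();
                collected.put(r.jobId, r.value);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } finally {
            pool.shutdownNow(); // interrupts the workers blocked in take()
        }
        return collected;
    }

    public static void main(String[] args) {
        // The "hard" job from the original post, evaluated at 3.
        System.out.println(run(new double[] {3.0}, 4, x -> Math.log(1.0 / Math.cos(Math.log(x)))));
    }
}
```

Note the weakness this sketch leaves open: if a worker dies after `take()`, its job is simply lost and the master waits forever. Closing that gap is what the transaction and lease support is for.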

Regards,

Patrick
 
Patrick May

Roedy Green said:
If you have many machines doing the same task, you can presume say
that at most 10% of them will crash. So you wait for 90% of the
results to be in, then wait another 10% of that elapsed time, then
you send out the posse for the remainder -- an "are you still alive"
UDP packet.

No need to reinvent the wheel. Jini's leasing mechanism provides
this kind of resiliency out of the box. See http://www.jini.org for
more details.
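
For readers who have not met leasing before, the idea is that a resource is granted for a limited time and is reclaimed unless the holder keeps renewing it. The sketch below illustrates that idea in plain Java. It is not Jini's lease API: `LeaseTable` and its methods are hypothetical names, and the caller passes in the current time so the logic is easy to follow and test.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Tracks a time-limited lease per outstanding job (illustrative only). */
final class LeaseTable {
    private final long leaseMillis;
    private final Map<String, Long> expiryByJob = new HashMap<>();

    public LeaseTable(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Called when a job is handed to a worker. */
    public synchronized void grant(String jobId, long nowMillis) {
        expiryByJob.put(jobId, nowMillis + leaseMillis);
    }

    /** Called by a live worker; returns false if the lease is unknown or has already lapsed. */
    public synchronized boolean renew(String jobId, long nowMillis) {
        Long expiry = expiryByJob.get(jobId);
        if (expiry == null || expiry <= nowMillis) {
            return false;
        }
        expiryByJob.put(jobId, nowMillis + leaseMillis);
        return true;
    }

    /** Called when the result for a job has arrived. */
    public synchronized void complete(String jobId) {
        expiryByJob.remove(jobId);
    }

    /** Removes and returns the jobs whose leases have lapsed, so they can be handed out again. */
    public synchronized List<String> reclaimExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = expiryByJob.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= nowMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A crashed host simply stops renewing, so its job shows up in `reclaimExpired` without anyone having to ping it. The lease length sets the trade-off: a longer lease means fewer renewal messages but slower detection of a dead host.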

Regards,

Patrick
 
Roedy Green

Patrick May said:
I strongly recommend that you visit http://www.jini.org and give
Jini a try. It is perfectly suited to your requirements.

It depends on what you perceive as his requirements. Does he want to
solve the problem, or is he trying to learn about the nuts and bolts
under the hood by constructing something himself from the ground up?

In either case he should at least examine Jini to get a rough idea of
how they solved the problem.
 
