Wildfire_Heat
As part of my research I have created a large distributed Java system
for calculations.
The system is composed of a single "Sender" which sends jobs to a farm
of "Calculators" running on about 1000 different host machines. The
Calculators send their results to a single "Receiver". Jobs can be
easy, e.g. 1+1, or hard, e.g. something like log(1/[cos(ln3)]). As a
result, some jobs finish in seconds while others may take up to an hour.
I have implemented this system and it works fine. But I am sure there
is room for improvement and optimisation.
It needs to be:
- more robust, e.g. I need good monitoring to deal with hosts crashing
  or dying while performing a calculation. How do I detect this? Should
  I use the old heartbeat method? I don't think I can risk too many
  messages on the network.
- more scalable, e.g. what if I increase the number of hosts? Can RMI
  handle it, or is there an alternative?
- more adaptive to the type of job, etc.
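
To make the heartbeat question concrete, here is a minimal sketch of what I have in mind. It is not code from my system: the class HeartbeatMonitor and its method names are hypothetical, and the timeout value is something I would still have to choose. It assumes each Calculator periodically sends a small "I am alive" message, and that the monitoring side only records when each host was last heard from and flags hosts that have been silent for too long.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration only; not part of the existing system.
class HeartbeatMonitor {
    // A host is considered silent once this much time has passed
    // since its last heartbeat (assumed value, chosen by the caller).
    private final long timeoutMillis;

    // Host name -> timestamp (in milliseconds) of the last heartbeat received.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message arrives from a Calculator host.
    void recordHeartbeat(String host, long nowMillis) {
        lastSeen.put(host, nowMillis);
    }

    // Returns the hosts whose last heartbeat is older than the timeout,
    // sorted by name so the result is deterministic.
    List<String> findSilentHosts(long nowMillis) {
        List<String> silent = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                silent.add(entry.getKey());
            }
        }
        Collections.sort(silent);
        return silent;
    }
}
```

With about 1000 hosts and, say, one heartbeat every 30 seconds, this would be roughly 33 small messages per second in total. That is the network cost I am unsure about. Jobs assigned to a host reported by findSilentHosts would then have to be resent to another Calculator.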
Any other ideas or links to more info are welcome.
Thanks.