T
Tony Arcieri
[Note: parts of this message were removed to make it a legal post.]
I'm looking at replacing our homebrew job queue with something better, and
I'm wondering what's out there. If there's something that meets our needs
I'd like to use it; otherwise I may end up building our next gen job queue
by assembling various components that exist already. I've seen a couple
dozen solutions for this in the Ruby world, but haven't scrutinized many of
them in depth and am unsure if there are any that fully meet our needs.
We have many different kinds of jobs, however the main ones I'm worried
about are rather CPU intensive, take a long time to execute, and operate on
large amounts of input and output data.
- Distributed: we certainly have too many jobs to run on a single computer.
We need to distribute them across a number of workers. Beyond that we'd
like to automatically provision more workers when the backlog of jobs gets
too large, and shut down workers if too many of them are idle.
- Fault Tolerant: as with any distributed system, fault tolerance is
important. An evil monkey should be able to futz with any part of the
system, crashing computers willy nilly, and it should continue to Just
Work. If a job breaks down and is somehow lost anywhere along the way, the
system should eventually detect this and retry the job. From a design
perspective, I would like to see as much of the system as stateless as
possible. The agents executing the jobs and message queues should be
stateless. Ideally the entirety of the state of the system is kept in the
database, and that's the only part of the system we need to ensure
recovering state from after a crash. If a worker or message queue goes down
we shouldn't need to worry about recovering jobs-in-flight, they should
simply be retried.
- Idempotent: going along with a stateless approach to fault tolerance, if
the system does misdetect a failed job and ends up executing it twice, the
system should detect this and discard the redundant results. Having the
same job accidentally complete multiple times should not screw up the state
of the system.
- Support for Temporary Failures: sometimes a job fails in such a way that
it should be retried (i.e. external resources needed to complete a job are
temporarily unavailable). The system should support retrying these jobs in
a sensible way, and it'd be nice to specify on a job-by-job basis how many
attempts should be made before the system should give up and consider it a
permanent failure.
- Support for Live Upgrades: we'd like to be able to add new types of jobs
to the system . So along with this, agents should know what types of jobs
they support, and not request unsupported jobs from the message queue.
I suppose some of my comments are presupposing the architecture: a list of
jobs stored in the database, command and control processes which load jobs
into and read results from the message queues, and agents which pull jobs
from and report jobs to the message queues, and the message queues
themselves. I'd be open to other designs, so long as they have the above
properties.
Given that, is there an existing job queue that meets my needs that I should
be checking out?
I'm looking at replacing our homebrew job queue with something better, and
I'm wondering what's out there. If there's something that meets our needs
I'd like to use it; otherwise I may end up building our next gen job queue
by assembling various components that exist already. I've seen a couple
dozen solutions for this in the Ruby world, but haven't scrutinized many of
them in depth and am unsure if there are any that fully meet our needs.
We have many different kinds of jobs, however the main ones I'm worried
about are rather CPU intensive, take a long time to execute, and operate on
large amounts of input and output data.
- Distributed: we certainly have too many jobs to run on a single computer.
We need to distribute them across a number of workers. Beyond that we'd
like to automatically provision more workers when the backlog of jobs gets
too large, and shut down workers if too many of them are idle.
- Fault Tolerant: as with any distributed system, fault tolerance is
important. An evil monkey should be able to futz with any part of the
system, crashing computers willy nilly, and it should continue to Just
Work. If a job breaks down and is somehow lost anywhere along the way, the
system should eventually detect this and retry the job. From a design
perspective, I would like to see as much of the system as stateless as
possible. The agents executing the jobs and message queues should be
stateless. Ideally the entirety of the state of the system is kept in the
database, and that's the only part of the system we need to ensure
recovering state from after a crash. If a worker or message queue goes down
we shouldn't need to worry about recovering jobs-in-flight, they should
simply be retried.
- Idempotent: going along with a stateless approach to fault tolerance, if
the system does misdetect a failed job and ends up executing it twice, the
system should detect this and discard the redundant results. Having the
same job accidentally complete multiple times should not screw up the state
of the system.
- Support for Temporary Failures: sometimes a job fails in such a way that
it should be retried (i.e. external resources needed to complete a job are
temporarily unavailable). The system should support retrying these jobs in
a sensible way, and it'd be nice to specify on a job-by-job basis how many
attempts should be made before the system should give up and consider it a
permanent failure.
- Support for Live Upgrades: we'd like to be able to add new types of jobs
to the system . So along with this, agents should know what types of jobs
they support, and not request unsupported jobs from the message queue.
I suppose some of my comments are presupposing the architecture: a list of
jobs stored in the database, command and control processes which load jobs
into and read results from the message queues, and agents which pull jobs
from and report jobs to the message queues, and the message queues
themselves. I'd be open to other designs, so long as they have the above
properties.
Given that, is there an existing job queue that meets my needs that I should
be checking out?