rake task dependencies: via timestamps table in database

jandot

Hi all,

There is some interest in the bioinformatics community for using rake
as a workflow tool (see e.g. http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/).
Rake could be ideal for this type of work: a typical workflow will
take data and perform a first set of conversions on it (i.e. a task),
followed by a second set of conversions (that is dependent on the
first task), and so on.

However, bioinformaticians try to keep their data in databases rather
than files. And we found we need some workarounds to get dependencies
working. Does anyone know if it would be very difficult to add
functionality to rake to check a meta table in a database for
timestamps of tasks rather than looking at timestamps of files? I was
thinking of a table looking like the one below:

table: meta
task                            modified_on
==============================================
001_load_data                   20080602_0831
002_calculate_averages          20080602_0845
003_make_histogram_of_averages  20080602_0851

The rakefile would then contain:

task "001_load_data" do
  # do stuff
  # automatically update record in meta table
end

task "002_calculate_averages" => ["001_load_data"] do
  # do stuff
  # automatically update record in meta table
end

task "003_make_histogram_of_averages" => ["002_calculate_averages"] do
  # do stuff
  # automatically update record in meta table
end

So if we had reloaded the data (001), then the timestamp for that task
in the meta table would be later than the one for task 002. As a
result, task 002 would automatically have to be rerun if we were to
run task 003.
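
To make the rule concrete, here is a minimal sketch in plain Ruby, without rake. The SAMPLE_META hash stands in for the meta table, the 0900 stamp for task 001 is made up to simulate a reload, and stale? and parse_stamp are hypothetical helper names, not part of rake.

```ruby
require "time"

# Stand-in for the meta table. The 0900 stamp is invented to simulate
# reloading the data after 002 and 003 last ran.
SAMPLE_META = {
  "001_load_data"                  => "20080602_0900",
  "002_calculate_averages"         => "20080602_0845",
  "003_make_histogram_of_averages" => "20080602_0851",
}

# Parse the YYYYMMDD_HHMM stamps used in the table.
def parse_stamp(text)
  Time.strptime(text, "%Y%m%d_%H%M")
end

# A task is stale when it has no record, or when any direct dependency
# has no record or a later stamp than the task itself.
def stale?(task, deps, meta)
  return true unless meta.key?(task)
  own = parse_stamp(meta[task])
  deps.any? { |dep| !meta.key?(dep) || parse_stamp(meta[dep]) > own }
end

puts stale?("002_calculate_averages", ["001_load_data"], SAMPLE_META)
```

Note that task 003 is not stale on a direct comparison with 002. It only becomes stale once 002 has been rerun and its record updated, which is why the dependencies have to be processed first.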

I'd very much like to know if anyone has an idea how rake can be
extended this way. Basically, the dependency checker has to be
extended to look into a fixed table in a database...

Many thanks,
Jan Aerts

-
=================================
Dr Jan Aerts
Senior Bioinformatician
Genome Dynamics and Evolution Group
Wellcome Trust Sanger Institute
Hinxton
Cambridge CB10 1SA
UK

phone: +44 (0)1223 - 494732
web: http://www.sanger.ac.uk/Teams/Team29/
 
Pit Capitain

2008/6/6 jandot said:
There is some interest in the bioinformatics community for using rake
as a workflow tool (...)
However, bioinformaticians try to keep their data in databases rather
than files. And we found we need some workarounds to get dependencies
working. Does anyone know if it would be very difficult to add
functionality to rake to check a meta table in a database for
timestamps of tasks rather than looking at timestamps of files?

Hi Jan, if you look at the source code of rake's FileTask, you'll see
that this shouldn't be very difficult. The code consists of only four
methods and is easy to read. Feel free to ask again if you have more
questions.

Regards,
Pit
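
For illustration, a rough sketch of what a FileTask-like task type could look like, overriding needed? and timestamp and recording a new stamp after each run. MetaTask and META_TABLE are hypothetical names, and the hash is only a stand-in for a real database table; this is not the code from any existing rake extension.

```ruby
require "rake"

# Hypothetical stand-in for the meta table: task name => time of last run.
# Task 001 carries the latest stamp, as if the data had just been reloaded.
META_TABLE = {
  "001_load_data"                  => Time.at(300),
  "002_calculate_averages"         => Time.at(100),
  "003_make_histogram_of_averages" => Time.at(200),
}

# Hypothetical task type that consults META_TABLE instead of file mtimes.
class MetaTask < Rake::Task
  # Run when there is no record yet, or when any prerequisite is newer.
  def needed?
    stamp = META_TABLE[name]
    stamp.nil? || prerequisite_tasks.any? { |pre| pre.timestamp > stamp }
  end

  # Report the recorded time; fall back to the current time if none exists.
  def timestamp
    META_TABLE.fetch(name) { Time.now }
  end

  # Run the task's block, then record the completion time.
  def execute(args = nil)
    super
    META_TABLE[name] = Time.now
  end
end

RAN = [] # names of the tasks whose blocks were executed

MetaTask.define_task("001_load_data") { |t| RAN << t.name }
MetaTask.define_task("002_calculate_averages" => ["001_load_data"]) { |t| RAN << t.name }
MetaTask.define_task("003_make_histogram_of_averages" => ["002_calculate_averages"]) { |t| RAN << t.name }

Rake::Task["003_make_histogram_of_averages"].invoke
puts RAN.inspect
```

With these sample records, task 001 is skipped while 002 and 003 run, because rake invokes the prerequisites before it asks each task whether it is needed.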
 
jandot

Pit Capitain said:
Hi Jan, if you look at the source code of rake's FileTask, you'll see
that this shouldn't be very difficult. The code consists of only four
methods and is easy to read. Feel free to ask again if you have more
questions.

Regards,
Pit

Thanks for that pointer, Pit. I think I got quite far based on
FileTask, but something is still wrong. The trouble is that I have no
idea where, so I can't really ask specific questions...
It looks like the block passed to a task is not executed.

I've put what I already have on github: http://github.com/jandot/biorake/tree/master

There's a sample directory with an example Rakefile that should work
once the extension is fixed. In addition, there are two test suites
copied from the file tests. Unfortunately, many of the tests still
fail.

If anybody could have a look at the tests and help to get them
running, I would be very thankful.

Cheers,
jan.
 
