Rake dependencies unknown prior to running tasks

Joe Wölfel · Sep 24, 2008

Say I don't know what all the dependencies are until I've already
begun executing tasks? To what extent can I add new tasks and
dependencies on the fly? At first I thought that adding tasks
during task execution didn't seem to be a safe thing to do. Then I
made a few toy examples that seem to confirm this. So I started
thinking that I need to have Rake call Rake, which seemed a bit
clumsy. But then I read this discussion mentioned by Jim Weirich
that seemed to imply that I ought to be able to make a single rake
file do the job:
http://markmail.org/message/jttfqf6wstvgahec#query:+page:1+mid:zlc5qjj5r6abfcse+state:results
But I couldn't find any details on how to accomplish this. Are there
better ways of handling this problem that don't involve multiple rake
files and rake calling rake?

Mike Gold · Sep 24, 2008

Joe said:
Say I don't know what all the dependencies are until I've already
begun executing tasks? To what extent can I add new tasks and
dependencies on the fly?

This is what 'import' is for,

import 'moretasks.rb'

moretasks.rb is run after the Rakefile has been loaded but before any
tasks are invoked.

Actually I see no reason why it has to be a file. It looks like
'import' should take an optional block.

Mike Gold · Sep 24, 2008

Mike said:
This is what 'import' is for

Sorry I just realized you meant that you actually create new tasks after
the task invocations have begun.

In this case, are you certain those things creating tasks should be
tasks? It seems like you should have normal ruby classes/methods which
determine which tasks to create, then create them. That is what I do.

I think this strategy covers all cases, even though you may need to
restructure your code. But in the end it's a cleaner approach, IMO.

Joe Wölfel · Sep 24, 2008

Cleaner, maybe. But inefficient in my case. That would mean a lot
of unnecessary rebuilding. Unfortunately, efficiency matters in
this case. It can take days or weeks even with parallel builds. And
it needs to be done often.

It seems like the wrong way to do it, but the only efficient solution
I've come up with so far is to have Rake call itself with a different
task. So basically I have dependency graph 1, which is known at the
outset and dependency graph 2 which is only known after running tasks
in dependency graph 1, and dependency graph 2 is itself dependent on
dependency graph 1.

It seems like a common problem. I've run into a number of build
systems that needed to be restarted several times to get around
similar issues. But if there's a better solution already out there
I'd like to use it.

Mike Gold · Sep 24, 2008

Joe said:
Cleaner, maybe. But inefficient in my case. That would mean a lot
of unnecessary rebuilding. Unfortunately, efficiency matters in
this case. It can take days or weeks even with parallel builds. And
it needs to be done often.

It seems like the wrong way to do it, but the only efficient solution
I've come up with so far is to have Rake call itself with a different
task. So basically I have dependency graph 1, which is known at the
outset and dependency graph 2 which is only known after running tasks
in dependency graph 1, and dependency graph 2 is itself dependent on
dependency graph 1.

It seems like a common problem. I've run into a number of build
systems that needed to be restarted several times to get around
similar issues. But if there's a better solution already out there
I'd like to use it.

I don't see why it would be inefficient or require unnecessary
rebuilding.

If you follow the strategy I mentioned, making your changes to the graph
before the first invoke, and avoiding tasks creating tasks (which is
forbidden anyway with the new parallel -j support in Drake), then you've
removed the dependency between graph 1 and graph 2 you describe.

By removing that dependency, it becomes *more* efficient because more
tasks can be parallelized, whereas before graph 1 and graph 2 had to be
executed sequentially (this may not be significant in your case, but is
very much so in other cases).

Any build system in which the only entry point is a task -- that is, you
must make a graph in order to make a graph -- would have to be re-run
to compensate for its lack of dynamic support. Makefiles, for example.
That is why Rake is different -- you have the whole ruby language to
define your tasks, and then you say "go". This two-step approach is the
solution you seek.
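
A minimal sketch of that two-step approach, under some assumptions: `discover_targets` is a hypothetical stand-in for whatever plain Ruby logic decides what to build, and the module name `TwoStep` and the task names are made up for illustration. Only `task` and `Rake::Task[...].invoke` come from Rake.

```ruby
require 'rake'

module TwoStep
  extend Rake::DSL # makes `task` available inside this module

  LOG = [] # records which task bodies ran, in order

  # Hypothetical discovery step: ordinary Ruby code, not a task.
  def self.discover_targets
    %w[alpha beta gamma]
  end

  # Step one: build the whole graph before anything is invoked.
  def self.define_tasks
    names = discover_targets
    names.each do |name|
      task "build_#{name}" do
        LOG << name
      end
    end
    prereqs = names.map { |n| "build_#{n}" }
    task :build_all => prereqs
  end
end

TwoStep.define_tasks              # define everything first...
Rake::Task[:build_all].invoke     # ...then say "go"
puts TwoStep::LOG.inspect
```

Because the graph is complete before the first invoke, nothing here needs a task to create another task.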

Joe Wölfel · Sep 24, 2008

Mike said:
I don't see why it would be inefficient or require unnecessary
rebuilding.

The reason is because I have to build things before I know (or can
even determine programmatically) what other things need to be built.

Mike Gold · Sep 24, 2008

Joe said:
The reason is because I have to build things before I know (or can
even determine programmatically) what other things need to be built.

If you can't determine programmatically what is built, then how does a
program build it?

Even C/C++ dependencies, where you have no clue what g++ -MM is going to
spit out, can be handled with 'import' and the makefile loader.

If you are executing some other program which generates stuff, perhaps
you can add a flag where the program outputs what it *would* generate.
Capture that and 'import' it.

And if you can't add that flag, or if you otherwise don't know what is
being generated, then your hands are tied anyway. You can't know what's
going to happen, so you can't do anything about it. The two graphs are
worlds apart, and never the twain shall meet. In this case I wonder
what solution you could have expected.
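
The "capture that and import it" idea can be sketched roughly as follows. The file name `generated_tasks.rake`, the task `:from_import`, and the module `ImportDemo` are hypothetical. In a real Rakefile the rake command processes imports by itself after loading the Rakefile; the explicit call to `Rake.application.load_imports` is there only so the sketch runs as a standalone script, and it leans on an internal method, so treat it as an assumption rather than supported API.

```ruby
require 'rake'
require 'tmpdir'

module ImportDemo
  extend Rake::DSL # makes `file` and `import` available here

  RAN = [] # records which imported tasks actually executed

  def self.run
    Dir.mktmpdir do |dir|
      Dir.chdir(dir) do
        # A file task that writes out task definitions. In practice this
        # would capture the output of the external program.
        file 'generated_tasks.rake' do |t|
          File.write(t.name, <<~RUBY)
            task :from_import do
              ImportDemo::RAN << :from_import
            end
          RUBY
        end

        # Queue the file for import. Rake builds it first if a file task
        # for it exists, then loads it.
        import 'generated_tasks.rake'

        # Standalone scripts must trigger import processing themselves.
        Rake.application.load_imports

        Rake::Task[:from_import].invoke
      end
    end
  end
end

ImportDemo.run
puts ImportDemo::RAN.inspect
```

This only helps when the generated definitions can be produced before the main tasks start running, which is the limitation the thread goes on to discuss.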

Joe Wölfel · Sep 24, 2008

I didn't say what was being built couldn't be determined
programmatically. I said it couldn't be determined until certain
portions were already built. To build those initial things I
need a build tool, such as Rake. If the suggestion is that I
shouldn't actually execute any Rake tasks until after I've determined
all possible tasks then the catch-22 you're talking about actually
occurs. The only practical solution I've come up with so far is to
have Rake build the initial targets and then call itself again to
determine the rest of the dependency graph and build the remaining
targets. If there were a way to augment the initial dependency
graph dynamically then this wouldn't be necessary. I just don't
happen to know of one.

James M. Lawrence · Sep 24, 2008

Joe said:
I didn't say what was being built couldn't be determined
programmatically. I said it couldn't be determined until certain
portions were already built. To build those initial things I
need a build tool, such as Rake. If the suggestion is that I
shouldn't actually execute any Rake tasks until after I've determined
all possible tasks then the catch-22 you're talking about actually
occurs. The only practical solution I've come up with so far is to
have Rake build the initial targets and then call itself again to
determine the rest of the dependency graph and build the remaining
targets. If there were a way to augment the initial dependency
graph dynamically then this wouldn't be necessary. I just don't
happen to know of one.

If you really cannot know what is going to be built, for example if a
program generates files whose names are taken from /dev/random and then
other tasks depend on those files, then you are in a pickle. Normally
this kind of thing is handled by 'import', but this assumes tasks can be
determined (for example examining the makedepend output).

What do you think of this:

task :setup_a do
  puts "setup_a"
end

task :setup_b do
  puts "setup_b"
end

task :setup => [:setup_a, :setup_b] do
  puts "setup phase complete. defining new tasks..."

  task :main_a do
    puts "main_a"
  end

  task :main_b do
    puts "main_b"
  end

  puts "restarting..."
  throw :restart
end

task :main => [:main_a, :main_b] do
  puts "main phase complete."
end
task :main_a => :setup
task :main_b => :setup

task :default => :main do
  puts "all done."
end

% rake -f test/Rakefile.restart-flag
(in /Users/jlawrence/work/rake)
setup_a
setup_b
setup phase complete. defining new tasks...
restarting...
main_a
main_b
main phase complete.
all done.

I may be inflicting hardship on myself since this would complicate drake
(http://drake.rubyforge.org), but anyway... This patch is for regular
rake; the git branch is the same thing.

% git clone git://github.com/quix/rake.git
% cd rake
% git checkout -b restart-flag origin/restart-flag

diff --git a/lib/rake.rb b/lib/rake.rb
index 7c84f57..3010261 100755
--- a/lib/rake.rb
+++ b/lib/rake.rb
@@ -560,8 +560,15 @@ module Rake

# Invoke the task if it is needed. Prerequites are invoked first.
def invoke(*args)
- task_args = TaskArguments.new(arg_names, args)
- invoke_with_call_chain(task_args, InvocationChain::EMPTY)
+ catch(:done) {
+ loop {
+ catch(:restart) {
+ task_args = TaskArguments.new(arg_names, args)
+ invoke_with_call_chain(task_args, InvocationChain::EMPTY)
+ throw :done
+ }
+ }
+ }
end

# Same as invoke, but explicitly pass a call chain to detect
@@ -573,8 +580,8 @@ module Rake
puts "** Invoke #{name} #{format_trace_flags}"
end
return if @already_invoked
- @already_invoked = true
invoke_prerequisites(task_args, new_chain)
+ @already_invoked = true
execute(task_args) if needed?
end
end

James M. Lawrence · Sep 24, 2008

If only the Internet came with an Undo button...

Since in the previous example Rake complains unless main_a and main_b
are defined, it sort of defeats the whole purpose. This works:

task :setup_a do
  puts "setup_a"
end

task :setup_b do
  puts "setup_b"
end

task :setup => [:setup_a, :setup_b] do
  puts "setup phase complete. defining new tasks..."

  task :main_a do
    puts "main_a"
  end

  task :main_b do
    puts "main_b"
  end

  puts "restarting..."
  throw :restart
end

task :main => [:main_a, :main_b] do
  puts "main phase complete."
end

task :default => [:setup, :main] do
  puts "all done."
end

However this defeats Drake, which I suppose is another matter.

Joe Wölfel · Sep 25, 2008

Thanks for the patch. Here's a clunkier variation on your
suggestion that seems to work with Drake. Stage 1 serializes an
unpredictable set of tasks. Stage 2 creates instances of them and
runs them if necessary. There might be a better way that involves
making the dependency tree modifiable dynamically. I think allowing
all possible dependency changes would get complicated. Maybe that
would require reevaluating the entire tree constantly and there's no
way to un-execute a task anyway. But most of the real world problems
I can think of seem to involve adding tasks that wouldn't have been
exercised yet anyway. Could this be solved with an improved
dependency tree walking algorithm?

require 'rake/clean'

# Stage 1 puts a random set of numbers in a file
STAGE_ONE_RESULTS = "s1.txt"
file STAGE_ONE_RESULTS do
  File.open(STAGE_ONE_RESULTS, 'wb') do |file|
    (1..5).map { rand 10 }.uniq.each do |i|
      puts "stage1 creating dependency #{i}"
      file.puts i
    end
  end
end
task :stage1 => STAGE_ONE_RESULTS

# Stage 2 creates tasks based on those random numbers
task :stage2 => :stage1
if File.exist? STAGE_ONE_RESULTS
  # Strip trailing newlines so the task names are clean
  IO.readlines(STAGE_ONE_RESULTS).map(&:chomp).each do |task_info|
    task task_info do
      puts "stage2 executing #{task_info}"
    end
    task :stage2 => task_info
  end
end

# Relaunch the build so stage 2 sees the tasks recorded by stage 1
task :all => :stage1 do
  puts `drake -j4 stage2`
end

CLEAN.include STAGE_ONE_RESULTS
task :default => :all

James M. Lawrence · Sep 25, 2008

The following is a better implementation which could be made to work
with Drake. A patch for regular Rake follows.

task :setup_a do
  puts "setup_a"
end

task :setup_b do
  puts "setup_b"
end

task :setup => [:setup_a, :setup_b] do
  puts "setup phase complete. defining new tasks..."

  task :main_a do
    puts "main_a"
  end

  task :main_b do
    puts "main_b"
  end

  task :main => [:main_a, :main_b] do
    puts "main phase complete."
  end

  puts "restarting..."
  throw :restart
end

task :main

task :default => [:setup, :main] do
  puts "all done."
end

diff --git a/lib/rake.rb b/lib/rake.rb
index 36c2734..1e360a6 100755
--- a/lib/rake.rb
+++ b/lib/rake.rb
@@ -573,8 +573,8 @@ module Rake
puts "** Invoke #{name} #{format_trace_flags}"
end
return if @already_invoked
- @already_invoked = true
invoke_prerequisites(task_args, new_chain)
+ @already_invoked = true
execute(task_args) if needed?
end
end
@@ -1994,7 +1994,14 @@ module Rake
elsif options.show_prereqs
display_prerequisites
else
- top_level_tasks.each { |task_name| invoke_task(task_name) }
+ catch(:done) {
+ loop {
+ catch(:restart) {
+ top_level_tasks.each { |task_name| invoke_task(task_name) }
+ throw :done
+ }
+ }
+ }
end
end
end

Joe said:
Thanks for the patch. Here's a clunkier variation on your
suggestion that seems to work with Drake. Stage 1 serializes an
unpredictable set of tasks. Stage 2 creates instances of them and
runs them if necessary. There might be a better way that involves
making the dependency tree modifiable dynamically. I think allowing
all possible dependency changes would get complicated. Maybe that
would require reevaluating the entire tree constantly and there's no
way to un-execute a task anyway. But most of the real world problems
I can think of seem to involve adding tasks that wouldn't have been
exercised yet anyway. Could this be solved with an improved
dependency tree walking algorithm?

I think the best strategy is to cache the dynamic changes and update
only when the :restart flag is thrown. Fortunately, Drake is already
structured to work this way.

Drake does a dry run to collect all tasks to be executed, then passes
the dependency graph to my CompTree package which executes it in
parallel (CompTree is a kind of modest Erlang-in-Ruby).

It would be safe to add tasks during execution, as CompTree is running a
shallow copy of the dependency graph and will be unaware of any new
tasks or dependencies. I don't foresee any serious issues with simply
restarting the computation with a new shallow copy of the appended
graph.
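
As a rough illustration of the snapshot idea, here is a toy graph held in a Hash. This is an assumption-laden sketch in plain Ruby and not Drake's or CompTree's actual data structure; the names `GRAPH`, `SNAPSHOT`, and `RESTARTED` are made up.

```ruby
# Toy dependency graph: task name => list of prerequisite names.
GRAPH = { default: [:setup, :main], setup: [], main: [] }

# What the executor works from: a shallow copy taken before the run starts.
SNAPSHOT = GRAPH.dup

# A task adds work mid-run. The update is written non-destructively:
# `+=` builds a new array. Using `GRAPH[:main] << :main_a` instead would
# mutate the array that the shallow copy shares, so the change would
# leak into the snapshot.
GRAPH[:main_a] = []
GRAPH[:main] += [:main_a]

# After a restart, a fresh shallow copy picks up the additions.
RESTARTED = GRAPH.dup

puts SNAPSHOT.key?(:main_a)    # false: the running copy never saw the new task
puts RESTARTED[:main].inspect  # [:main_a]
```

The running computation keeps its old view until it is restarted from a new copy, which is the caching effect described above.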

Though Drake copies the dependency tree for unrelated reasons, it turns
out to be coincidentally useful here because it acts as a cache while
the user can append to the original.

Though this is mostly brainstorming, I do see a need for a restart
feature, whether or not these ideas pan out. The :setup phase may
execute non-trivial tasks which will be obliviously re-executed by :main
in the separate process. We could assume the two stages comprise
disjoint sets; however, that would be difficult to enforce. It's an
artificial restriction which will eventually fail.

In the example above, the :main gets executed in Rake and Drake for
entirely different reasons. In single-threaded Rake, the restart
happens before it even gets to :main, so :main did not get marked as
@already_invoked. In multi-threaded Drake, :main *does* get marked,
however after the restart its newly-created child nodes will still be
executed because CompTree will not even *consider* a node until all its
children have been computed.

One difference: in single-threaded Rake you must be careful to add tasks
"in the future", some place ahead in the sequential order of execution.
In my example :setup modifies :main, which is fine since the order given
is [:setup, :main]. In multi-threaded Drake you don't have to worry
about it, for reasons mentioned in the previous paragraph.
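
To make the restart mechanics concrete, here is a toy model in plain Ruby. `MiniTask` is hypothetical and is not Rake's Task class; it only mimics the two changes in the patch: the catch/throw loop around the top-level invoke, and setting the already-invoked flag after the prerequisites have run instead of before.

```ruby
# Toy model of the patched behaviour; not Rake itself.
class MiniTask
  TASKS = {}
  LOG = [] # order in which task bodies ran

  # Create the task, or add prerequisites and an action to an existing one.
  def self.define(name, prereqs = [], &action)
    task = (TASKS[name] ||= new(name))
    task.enhance(prereqs, &action)
  end

  # Top-level loop: start over from the top whenever a task throws :restart.
  def self.run(name)
    catch(:done) do
      loop do
        catch(:restart) do
          TASKS.fetch(name).invoke
          throw :done
        end
      end
    end
  end

  def initialize(name)
    @name = name
    @prereqs = []
    @actions = []
    @already_invoked = false
  end

  def enhance(prereqs, &action)
    @prereqs |= prereqs
    @actions << action if action
    self
  end

  def invoke
    return if @already_invoked
    @prereqs.each { |name| TASKS.fetch(name).invoke }
    @already_invoked = true # set only after prerequisites finish, as in the patch
    @actions.each(&:call)
  end
end

MiniTask.define(:setup) do
  MiniTask::LOG << :setup
  MiniTask.define(:main_a) { MiniTask::LOG << :main_a }
  MiniTask.define(:main_b) { MiniTask::LOG << :main_b }
  MiniTask.define(:main, [:main_a, :main_b]) # :main has not run yet
  throw :restart
end
MiniTask.define(:main) { MiniTask::LOG << :main }
MiniTask.define(:default, [:setup, :main]) { MiniTask::LOG << :default }

MiniTask.run(:default)
puts MiniTask::LOG.inspect
```

In this model :setup runs once, because its flag is set before its body throws, while :default is retried because its flag is still unset when the throw unwinds through it. Adding prerequisites to a task whose flag is already set would have no effect, which is the "in the future" caveat.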

James M. Lawrence

Joe Wölfel · Sep 29, 2008

I've had a chance to play with your solution a bit. It is a much
better solution than my earlier crude solution that relaunched
Rake. With your solution it seems all you need to do is throw
:restart at the end of a task that creates new tasks. It's very
simple. Everything just seems to work and previously executed tasks
aren't executed twice. In my solution they are executed twice, which
is bad (or at least expensive). Also, in my solution I have to hard
code Rake parameters and it gets especially hairy when I have more
than one task that creates other tasks.

I noticed that your patch doesn't seem to work with multitask for
some reason. I'm not sure why. Also, as simple as it is to throw
:restart, I'm wondering if it's possible that this could be done
automatically, maybe with a warning for users who do it
inadvertently. If a task is defined at any point while a task is
executing then a restart could be thrown automatically when the
executing task completes. Then Rake could automatically support
dynamic task creation. Would that make sense?

