Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpoint targets #88

Open
myronahn opened this issue Sep 5, 2013 · 11 comments
Open

Checkpoint targets #88

myronahn opened this issue Sep 5, 2013 · 11 comments
Labels

Comments

@myronahn
Copy link
Contributor

myronahn commented Sep 5, 2013

(Will fill in more detail and a proposal soon)

Allow the ability to specify checkpoint targets in the workflow - when the workflow generates a checkpoint target, it will delete all intermediate targets generated to reach that checkpoint, but will not delete intermediate targets needed for subsequent targets or that are checkpoint targets themselves.

@aboytsov
Copy link
Contributor

Would love an example :)

@myronahn
Copy link
Contributor Author

Prototype: c0947a5

@myronahn
Copy link
Contributor Author

I've been thinking about this - it might be better to mark parts of the workflow that are "temporary" (i.e. intermediate targets) rather than mark ones that are permanent. This way, everything by default is permanent (a "checkpoint") and you can gradually subtract out what you don't need permanently.

Major components:

  • Syntax of marking a target as "temporary"
    • Mark a target as temporary by prepending a tilde, e.g. ~a
    • Can mark a target as temporary in either the input or output list of a step
    • If a target is marked anywhere as temporary, it is "globally" considered to be a temporary file, even in steps where it is not marked as temporary.
  • What happens during the build?
    • They are mostly treated as normal targets
    • As soon as all the steps that rely on a temporary target are built, the target is deleted (if possible in the underlying fs)
    • Have an option to turn off deletion of temp targets --keep-temp-files
  • Effects on the dependency tree
    • The problem is, we don't want deleted temp targets to unnecessarily trigger the re-execution of a step
    • e.g. if we have steps C <- ~B and ~B <- A, if B is deleted, we don't want to re-trigger these steps as long as C is newer than A
    • Algorithm: when checking input target timestamps against output target timestamps:
      • If a temp input target has data, treat it as a normal target
      • If a temp input target has no data, ignore it
      • If a temp output target has data, treat it as a normal target
      • If a temp output target has no data, recursively expand it to be the output targets of all the steps that depend on this step (that are in the current build tree). Recursively apply the rules for temp output targets.
      • Then you can compare input target dates against output target dates normally

I have a prototype of this behavior now.

@myronahn
Copy link
Contributor Author

Simple example:

~b <- a
  commands to build b from a

~c <- b
  commands to build c from b

~d <- c
  commands to build d from c

e <- d
  commands to build e from d

f <- d
  commands to build f from d

b, c, and d are temporary or intermediate files, and are globally marked as such.

Running from scratch:

  • b is built
  • c is built
  • b is deleted after c is built
  • d is built
  • c is deleted after d is built
  • e is built
  • f is built
  • d is deleted after e and f are built

@myronahn
Copy link
Contributor Author

It is probably good to establish a convention of marking temp targets in the output list - I think it is more clear this way.

@aboytsov
Copy link
Contributor

aboytsov commented Nov 1, 2013

I like the direction it's going to. I agree this is much clearer than defining checkpoints, and I think this feature could be useful. I wanted to make a comment that temp status should only be defined in outputs, but then I've noticed you made that comment already. :)

I think timestamp-wise, you can simply propagate the dates of inputs to a temp output, i.e.:

~c <- a, b
d <- ~c
   if c is missing, max(a, b) will be assumed to be its timestamp during the dependency evaluation stage

I do feel, however, that it we could be missing a bunch of things this behavior can affect, and we should make sure we cover everything, for example:

  1. invocation: is it drake +^~c or drake +^c? I would vote for ~ to be outside of referencing.
  2. if drake +^c is invoked, c should be built and then d. but drake +=c is logically a no-op, correct? the distinction between explicitly specifying a target in the command-line and pulling it through dependencies is actually a moot one and I don't recommend relying on it.
  3. this can make things tricky if one wants to build things in stages, which could theoretically be addressed by your ---keep_temp_files flag.
  4. it's also a bit unclear if a subset of the dependency tree is selected, how you will judge if a temporary file should be deleted. it's probably risky to evaluate this relative to the subset rather than the whole graph, but the latter has nuances, too. i think it's possible to come up with a consistent behavior, but we should be able to articulate it in a way easily understood.
  5. not sure what you mean by "has no data" - you mean "missing", right? i read "has no data" as "exists, but empty", but i think this is not what you meant.

Finally, I am really excited to see Drake's development marches on! You've been doing awesome work on it, lately, Myron, and I think this is useful to a lot of people outside Factual as well! You guys rock!

@myronahn
Copy link
Contributor Author

myronahn commented Nov 1, 2013

@aboytsov I'm really glad you were able to take a look at this, and also very happy that you like the direction of these changes! Finally, I'm glad to see that some of my work is appreciated 👍

To address your comments:

re: timestamp checking - I considered several methods for timestamp checking on temp files that are missing.

  • Reach upwards (what you proposed, I think): if an input is temp/missing, recursively reach upwards and check timestamps of the parent's input files instead.
  • Reach downwards (current implementation): if an output is temp/missing, recursively reach downwards check timestamps of the children's output files instead.
  • Reach upwards/downwards: basically a combination of the two.
    It turns out that reaching downwards is easier, since when you reach upwards, you have to consider the special case of a step having no inputs and forcing the build based on this. There is no special case reaching downwards.

@myronahn
Copy link
Contributor Author

myronahn commented Nov 1, 2013

More comments:

  1. I agree, I think drake +^c is better - coincidentally, this is how it works now.
  2. Hmm, good question. I think drake +=c should work the same as if c were not a temp target, that is, it should try to build it w/o building dependencies, since the user is asking for c. I believe this is how it works now.
  3. Yes, true enough - on the other hand, I'd assume that the workflow would define appropriate checkpoints that correspond to stages. And you're right - if all else fails, the --keep-temp-files flag can be used. I'd assume that people would start using temp targets once a workflow is fairly tested and mature and would use --keep-temp-files (or not even use temp targets) if they were actively developing it.
  4. Good point, I thought about this as well. Right now if a subset of a dependency tree is selected, I only delete the file based on the dependencies on the temp file in the subset. This can be a bit surprising, you're right (especially if you want to run other parts of the workflow later), and you're right that the alternative (delete a target based on the dependencies in the full tree) is also surprising, and in my opinion, even more full of pitfalls. I'm open to suggestions here - perhaps a flag to flip between the two alternatives?
  5. Ah, my bad, I meant missing, or more specifically, (not (fs data/in? target))

@myronahn
Copy link
Contributor Author

myronahn commented Nov 1, 2013

As an overall philosophy, I tried to make it so that if the user doesn't use temp targets, then Drake will act exactly the same as before.

If the user wants to use temp targets, he/she should definitely be warned in the docs that:

  • It is an advanced feature that may be surprising.
  • It will automatically recursively delete stuff off of the file system.
  • It might slow things down if expensive temp targets are constantly being rebuilt.
  • One should only use it on a fully debugged/mature workflow as a form of space optimization.
  • One should definitely know about the --keep-temp-files flag.

@myronahn
Copy link
Contributor Author

myronahn commented Nov 1, 2013

Oh, and I'll modify the parser so temp files can only be defined in the output list.

myronahn pushed a commit that referenced this issue Nov 1, 2013
Squashed all the commits into one - in another branch
myronahn pushed a commit that referenced this issue Nov 1, 2013
myronahn pushed a commit that referenced this issue Nov 1, 2013
myronahn pushed a commit that referenced this issue Nov 1, 2013
Squashed all the commits into one - in another branch
myronahn pushed a commit that referenced this issue Nov 5, 2013
myronahn pushed a commit that referenced this issue Nov 8, 2013
myronahn pushed a commit that referenced this issue Nov 14, 2013
#106
Standardize on name "temp targets"
Use -> for cleaner code
Add comments for ramifications of error when deleting temp target
Also: made sure temp target testing is in the regtest suite
myronahn pushed a commit that referenced this issue Jan 22, 2014
myronahn pushed a commit that referenced this issue Jan 22, 2014
#106
Standardize on name "temp targets"
Use -> for cleaner code
Add comments for ramifications of error when deleting temp target
Also: made sure temp target testing is in the regtest suite

Conflicts:
	resources/regtest/run-all.sh
	src/drake/core.clj
@calfzhou
Copy link
Contributor

calfzhou commented May 6, 2014

This feature is just what i'm looking for. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants