-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve usability of running selected tasks #719
Comments
I agree that we need to step back and take a more holistic look at how to deal with this. We don't really have a great analogue for "things on disk" in Taskcluster. The closest is the cached tasks, but obviously we've run into a number of issues with that in the past, and most of the tools listed above have been workarounds for issues we've had with it. To be quite honest, if the level of flexibility that Snakemake provided (ie: being able to manipulate it by adjusting things on disk) is needed, it may be worth considering reviving it, and perhaps adding the ability to scale into the cloud with Slurm. I seriously doubt that Taskcluster will be able to provide the same level of flexibility and simplicity at the same time. With that all said, I think we can probably make things a lot better than they are. If the goal is "be able to override anything, anywhere", one idea is to require training configs to have all previous artifacts specified. Perhaps this would take the form of pointers to either tasks, or task artifacts that have been uploaded to GCS. Anything not present would then be scheduled. Whatever the goals and/or use cases are, having a complete list of them written down would be very helpful. We can't design a solution until we know the extent of the problem! |
Incidentally, #683 has largely implemented this already when it comes to pulling from tasks (we can still only pull pretrained models from elsewhere). As we use it a bit more, it may help us evaluate this option. One implication of taking things to this extreme is that I imagine we'd need a fairly good cli tool and/or UI to make it practical to do this. It wouldn't be shocking to end up pointing at 5 or 6 different locations (be they task groups or buckets), and it would be unfortunate if it ends up taking hours to collect all of the existing tasks or locations of artifacts and/or it ends up error prone. |
@eu9ene, @ahal, and I met today to talk about this. There's some fairly extensive notes taken, but here's a high level summary with the main takeaways:
|
While working on the big training and bug fixes I ran into many issues with scheduling specific tasks. Basically, the graph and caches can be in an arbitrary state and we still should be able to run the pipeline starting with specific stages and reusing the stages that ran before.
We currently have several tools to work with:
target-stage
start_stage
andprevious_group_ids
existing_tasks
Usually, I see there's an issue when it starts scheduling tasks I don't need to schedule and I try to come up with a workaround using those tools. Also, using the current tools adds a significant mental load and is hard to use when training 10s of languages with fixes at the same time (see this PR). We should rethink this approach to make it more flexible and easy to use.
Maybe introducing a concept of state similar to the data on disk in Snakemake can help here. There was an option to skip smart scheduling based on file creation dates and information about the past runs and just treat everything present on disk as completed tasks and schedule the rest.
The text was updated successfully, but these errors were encountered: