Improve usability of running selected tasks #719

eu9ene · 2024-07-02T18:12:28Z

While working on the big training and bug fixes I ran into many issues with scheduling specific tasks. Basically, the graph and caches can be in an arbitrary state, and we should still be able to run the pipeline starting from specific stages while reusing the stages that ran before.

We currently have several tools to work with:

- target-stage
- start_stage and previous_group_ids
- existing_tasks
- pre-trained models
- Git branches
- Adding extra tasks to the graph

Usually, I notice a problem when the pipeline starts scheduling tasks I don't need, and I then try to come up with a workaround using those tools. The current tools also add significant mental load and are hard to use when training tens of languages with fixes at the same time (see this PR). We should rethink this approach to make it more flexible and easier to use.

Maybe introducing a concept of state, similar to the data on disk in Snakemake, can help here. Snakemake had an option to skip smart scheduling based on file creation dates and information about past runs, and instead treat everything present on disk as completed tasks and schedule the rest.
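To make the "treat everything present as completed and schedule the rest" idea concrete, here is a minimal sketch. It is not how Snakemake or taskgraph implement scheduling; the function name and the stage names in the example are made up for illustration.

```python
from typing import Dict, List, Set


def tasks_to_schedule(
    deps: Dict[str, List[str]], completed: Set[str], target: str
) -> List[str]:
    """Return the tasks needed to produce `target`, dependencies first.

    Anything in `completed` is treated as done and is never traversed,
    so its own upstream tasks are not scheduled either.
    """
    order: List[str] = []
    seen: Set[str] = set()

    def visit(task: str) -> None:
        if task in completed or task in seen:
            return
        seen.add(task)
        for upstream in deps.get(task, []):
            visit(upstream)
        order.append(task)  # post-order: dependencies come first

    visit(target)
    return order


# Hypothetical stage names, for illustration only.
deps = {
    "clean": [],
    "train-teacher": ["clean"],
    "distill": ["train-teacher"],
    "evaluate": ["distill"],
}
print(tasks_to_schedule(deps, completed={"train-teacher"}, target="evaluate"))
# ['distill', 'evaluate']
```

Note that "clean" is not scheduled even though it is not marked as completed: the traversal stops at the first completed task on each path, which is the behaviour described above.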


bhearsum · 2024-07-09T17:16:50Z

I agree that we need to step back and take a more holistic look at how to deal with this.

We don't really have a great analogue for "things on disk" in Taskcluster. The closest is the cached tasks, but obviously we've run into a number of issues with that in the past, and most of the tools listed above have been workarounds for issues we've had with it. To be quite honest, if the level of flexibility that Snakemake provided (ie: being able to manipulate it by adjusting things on disk) is needed, it may be worth considering reviving it, and perhaps adding the ability to scale into the cloud with Slurm. I seriously doubt that Taskcluster will be able to provide the same level of flexibility and simplicity at the same time.

With that all said, I think we can probably make things a lot better than they are. If the goal is "be able to override anything, anywhere", one idea is to require training configs to have all previous artifacts specified. Perhaps this would take the form of pointers to either tasks, or task artifacts that have been uploaded to GCS. Anything not present would then be scheduled.
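As a rough illustration of that idea, the sketch below splits a graph's task labels into those reused through a pointer and those left to schedule. Everything here is an assumption made for the example: the shape of the `upstream` mapping, the convention that a `gs://` prefix means an uploaded artifact and anything else is a task ID, and the labels and values used.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Pointer:
    kind: str   # "task" or "gcs"
    value: str  # a task ID, or a gs:// URL


def parse_pointer(raw: str) -> Pointer:
    # Assumed convention: gs:// URLs point at uploaded artifacts,
    # anything else is taken to be a Taskcluster task ID.
    if raw.startswith("gs://"):
        return Pointer("gcs", raw)
    return Pointer("task", raw)


def plan(
    all_labels: List[str], upstream: Dict[str, str]
) -> Tuple[Dict[str, Pointer], List[str]]:
    """Split labels into those reused via a pointer and those to schedule."""
    unknown = sorted(set(upstream) - set(all_labels))
    if unknown:
        # A pointer for a label that is not in the graph is likely a typo.
        raise ValueError(f"pointers given for unknown tasks: {unknown}")
    reused = {label: parse_pointer(raw) for label, raw in upstream.items()}
    to_schedule = [label for label in all_labels if label not in reused]
    return reused, to_schedule


# Placeholder labels and values, for illustration only.
labels = ["clean", "train-teacher", "distill"]
upstream = {
    "clean": "EXAMPLE_TASK_ID",
    "train-teacher": "gs://example-bucket/models/teacher",
}
reused, to_schedule = plan(labels, upstream)
print(to_schedule)
```

Rejecting pointers for unknown labels is a design choice in this sketch: with many overrides in one config, a silent typo would otherwise cause an expensive task to be rescheduled.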

Whatever the goals and/or use cases are, having a complete list of them written down would be very helpful. We can't design a solution until we know the extent of the problem!

bhearsum · 2024-08-13T18:36:02Z

> With that all said, I think we can probably make things a lot better than they are. If the goal is "be able to override anything, anywhere", one idea is to require training configs to have all previous artifacts specified. Perhaps this would take the form of pointers to either tasks, or task artifacts that have been uploaded to GCS. Anything not present would then be scheduled.

Incidentally, #683 has largely implemented this already when it comes to pulling from tasks (we can still only pull pretrained models from elsewhere). As we use it a bit more, it may help us evaluate this option.

One implication of taking things to this extreme is that I imagine we'd need a fairly good CLI tool and/or UI to make it practical. It wouldn't be shocking to end up pointing at 5 or 6 different locations (be they task groups or buckets), and it would be unfortunate if collecting all of the existing tasks or artifact locations ended up taking hours or being error-prone.

bhearsum · 2024-10-08T19:48:27Z

@eu9ene, @ahal, and I met today to talk about this. Fairly extensive notes were taken, but here's a high-level summary with the main takeaways:

- existing_tasks is the most flexible mechanism for adjusting dependencies that taskgraph provides, and is the closest equivalent to the way we used to use files on disk to move things along with Snakemake.
- Populating existing_tasks is not necessarily trivial. Ideally, all upstream tasks would end up in it, and these will often come from many different task groups. To do this effectively, we need good tooling that can help us find and sort through possible candidates for upstream tasks we wish to reuse.
  - It is unclear if a UI is a necessity for this, or if a CLI tool will be enough.
  - No tweaking of any of the taskcluster code should be needed to try this out, as it will already prioritize existing_tasks ahead of anything else.
- One thing that might help us find these is one central place that holds all of the translations tasks (this information is in the Taskcluster database, but there’s no way to query it, e.g. by experiment name or language pair).
  - One idea we came up with here is to add this information to a different database either when training actions run, or possibly when each task begins to execute.
  - Another way we might be able to do this is to add additional indexes to tasks (e.g. ones with the experiment name in them, and a unique identifier to avoid being overridden by reruns). This would allow us to query the Taskcluster index by some prefix and find all of the tasks within it.
- On a different note, we also talked about some ways to make task building and DAG generation different/easier. @ahal suggested that one way to try this out could be to write a custom loader that would execute a Python script to build the initial kind data that gets fed to transforms. (This would replace the existing kinds.)
  - This is very speculative and would be experimental, but it’s the most promising idea we have for making DAGs more dynamically configurable.
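To give a feel for the tooling discussed in the takeaways above, here is a minimal sketch of one piece of it: reducing candidate tasks gathered from several task groups to a single label-to-task-ID mapping of the kind existing_tasks expects. The `Candidate` record, the "most recently completed task wins" rule, and the sample data are assumptions for illustration; how candidates would be collected (from task groups, the index, or a separate database) is left open here.

```python
from typing import Dict, Iterable, NamedTuple


class Candidate(NamedTuple):
    label: str     # task label as it appears in the graph
    task_id: str   # Taskcluster task ID
    state: str     # e.g. "completed", "failed", "exception"
    resolved: str  # ISO 8601 UTC timestamp of when the task resolved


def build_existing_tasks(candidates: Iterable[Candidate]) -> Dict[str, str]:
    """Pick one task ID per label: the most recently completed one."""
    best: Dict[str, Candidate] = {}
    for c in candidates:
        if c.state != "completed":
            continue  # never reuse failed or unfinished tasks
        current = best.get(c.label)
        # Timestamps in the same ISO 8601 format compare correctly as strings.
        if current is None or c.resolved > current.resolved:
            best[c.label] = c
    return {label: c.task_id for label, c in sorted(best.items())}


# Placeholder labels and IDs, as if gathered from two task groups.
candidates = [
    Candidate("train-teacher", "ID_A", "completed", "2024-07-01T10:00:00Z"),
    Candidate("train-teacher", "ID_B", "failed", "2024-07-05T10:00:00Z"),
    Candidate("clean", "ID_C", "completed", "2024-06-20T08:00:00Z"),
    Candidate("clean", "ID_D", "completed", "2024-06-25T08:00:00Z"),
]
print(build_existing_tasks(candidates))
# {'clean': 'ID_D', 'train-teacher': 'ID_A'}
```

A real tool would probably let the user review and override these choices, since the newest completed task is not always the one to reuse (for example, when it was produced from a branch with a different fix).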

eu9ene added the taskcluster label (Issues related to the Taskcluster implementation of the training pipeline) Jul 2, 2024

eu9ene mentioned this issue Jul 2, 2024

[meta] Make the pipeline reliable enough to train many languages #311

Open

bhearsum mentioned this issue Jul 9, 2024

Can't restart the pipeline to run distillation from where it left #711

Open

eu9ene mentioned this issue Jul 9, 2024

Exclude start stage tasks from existing tasks #713

Open

bhearsum mentioned this issue Jul 29, 2024

start_stage often reruns amost all "evaluate" tasks #728

Open

eu9ene mentioned this issue Oct 21, 2024

[meta] Retrain older high resource models #891

Open
