Can't restart the pipeline to run distillation from where it left off #711

Open
Tracked by #311
eu9ene opened this issue Jun 28, 2024 · 1 comment
Labels
taskcluster Issues related to the Taskcluster implementation of the training pipeline

Comments

@eu9ene
Collaborator

eu9ene commented Jun 28, 2024

I'm trying to continue the pipeline after we get the results of the evaluate-teacher-ensemble task.

By doing something like:

target-stage: all
start-stage: translate-corpus
previous_group_ids: ["N1O85rIASLmCwKfUpCvTlw"]

or

start-stage: score

I'm getting an error:

[vcs 2024-06-28T00:15:22.294Z] TinderboxPrint:<a href='https://github.com/mozilla/firefox-translations-training/commit/e65c859948775bdf70b1eda8a7c233936d8e53c4' title='Built from firefox-translations-training commit e65c859948775bdf70b1eda8a7c233936d8e53c4'>e65c859948775bdf70b1eda8a7c233936d8e53c4</a>
[task 2024-06-28T00:15:22.294Z] executing ['bash', '-cx', 'cd /builds/worker/checkouts/src && ln -s /builds/worker/artifacts artifacts && taskgraph action-callback\n']
[task 2024-06-28T00:15:22.296Z] + cd /builds/worker/checkouts/src
[task 2024-06-28T00:15:22.296Z] + ln -s /builds/worker/artifacts artifacts
[task 2024-06-28T00:15:22.297Z] + taskgraph action-callback
[task 2024-06-28T00:15:23.187Z] Traceback (most recent call last):
[task 2024-06-28T00:15:23.188Z]   File "/usr/local/lib/python3.11/dist-packages/taskgraph/main.py", line 708, in action_callback
[task 2024-06-28T00:15:23.189Z]     return trigger_action_callback(
[task 2024-06-28T00:15:23.190Z]            ^^^^^^^^^^^^^^^^^^^^^^^^
[task 2024-06-28T00:15:23.191Z]   File "/usr/local/lib/python3.11/dist-packages/taskgraph/actions/registry.py", line 345, in trigger_action_callback
[task 2024-06-28T00:15:23.191Z]     cb(Parameters(**parameters), graph_config, input, task_group_id, task_id)
[task 2024-06-28T00:15:23.191Z]   File "/builds/worker/checkouts/src/taskcluster/translations_taskgraph/actions/train.py", line 397, in train_action
[task 2024-06-28T00:15:23.191Z]     start_task_ids.append(label_to_task_id[label])
[task 2024-06-28T00:15:23.191Z]                           ~~~~~~~~~~~~~~~~^^^^^^^
[task 2024-06-28T00:15:23.191Z] KeyError: 'translate-corpus-da-en-1/20'

Completed group

log1
log2
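A minimal sketch of why the lookup fails. It assumes, as the traceback suggests, that `label_to_task_id` maps the task labels found in the previous group to their task IDs. Because the previous group only ran up to `evaluate-teacher-ensemble`, it contains no `translate-corpus-*` tasks, so indexing with `'translate-corpus-da-en-1/20'` raises `KeyError`. The labels, task IDs, and the `lookup_start_task` helper are hypothetical placeholders, not the actual code in `train.py`.

```python
# Hypothetical label -> task ID mapping for a previous group that stopped at
# evaluate-teacher-ensemble. Labels and IDs are illustrative placeholders.
label_to_task_id = {
    "train-teacher-da-en-1": "TASKID_TEACHER",
    "evaluate-teacher-ensemble-da-en": "TASKID_EVAL",
}


def lookup_start_task(label_to_task_id: dict[str, str], label: str) -> str:
    """Return the task ID for `label`, raising a more descriptive KeyError
    than a bare `label_to_task_id[label]` lookup when it is missing."""
    try:
        return label_to_task_id[label]
    except KeyError:
        raise KeyError(
            f"{label!r} is not in any previous group; start-stage must name "
            "a stage whose tasks ran in one of previous_group_ids"
        ) from None


# Same failure mode as the traceback: no translate-corpus tasks exist yet.
try:
    lookup_start_task(label_to_task_id, "translate-corpus-da-en-1/20")
except KeyError as err:
    print(err)
```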

Workaround:

start-stage: evaluate-teacher-ensemble

It reruns the evaluation, but at least it schedules the other tasks properly.

@eu9ene eu9ene added the taskcluster Issues related to the Taskcluster implementation of the training pipeline label Jun 28, 2024
@bhearsum
Collaborator

bhearsum commented Jul 9, 2024

The start-stage system requires that one of the previous groups contains the start-stage task. (This is a documented caveat in https://github.com/mozilla/firefox-translations-training/blob/main/docs/task-cluster.md#running-only-later-parts-of-the-pipeline (originally from #377)).
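The requirement could be checked up front instead of surfacing as a `KeyError` deep inside the action. This is a sketch under two assumptions: the tasks from the previous groups are available as a label -> kind mapping, and start-stage names correspond to kind names. `check_start_stage` is a hypothetical helper. With a group like the one in this issue, `evaluate-teacher-ensemble` (the workaround above) passes, while `translate-corpus` is rejected with a message listing the stages that did run.

```python
def check_start_stage(start_stage: str, previous_tasks: dict[str, str]) -> None:
    """Raise ValueError unless some task in the previous groups belongs to
    `start_stage`. `previous_tasks` maps task label -> kind (hypothetical)."""
    kinds_present = set(previous_tasks.values())
    if start_stage not in kinds_present:
        raise ValueError(
            f"start-stage {start_stage!r} has no tasks in previous_group_ids; "
            f"stages present: {sorted(kinds_present)}"
        )


# Illustrative previous group that ran up to evaluate-teacher-ensemble.
previous_tasks = {
    "train-teacher-da-en-1": "train-teacher",
    "evaluate-teacher-ensemble-da-en": "evaluate-teacher-ensemble",
}

check_start_stage("evaluate-teacher-ensemble", previous_tasks)  # passes
try:
    check_start_stage("translate-corpus", previous_tasks)
except ValueError as err:
    print(err)
```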

This is because whenever multiple previous_group_ids are specified, more than one group could contain a task with the same label, and we could end up replacing it with the wrong task. (And in general, the behaviour would become non-deterministic, which is not great...)
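To illustrate the conflict, here is a minimal sketch that merges per-group label -> task ID maps. `merge_groups` is a hypothetical helper, not the action's code. When two groups contain a task with the same label, whichever group is processed last wins, so the result depends on the order of `previous_group_ids`; the helper also reports every label that maps to more than one task ID.

```python
from collections import defaultdict


def merge_groups(
    groups: list[dict[str, str]],
) -> tuple[dict[str, str], dict[str, set[str]]]:
    """Merge label -> task ID maps, one per previous group.

    Later groups overwrite earlier ones, so the merged map depends on group
    order. Also returns each label that maps to more than one task ID.
    """
    merged: dict[str, str] = {}
    seen: defaultdict[str, set[str]] = defaultdict(set)
    for group in groups:
        for label, task_id in group.items():
            merged[label] = task_id
            seen[label].add(task_id)
    conflicts = {label: ids for label, ids in seen.items() if len(ids) > 1}
    return merged, conflicts


# Two hypothetical groups that both ran the same teacher-training task.
group_a = {"train-teacher-da-en-1": "A1", "evaluate-teacher-ensemble-da-en": "A2"}
group_b = {"train-teacher-da-en-1": "B1"}

merged_ab, conflicts = merge_groups([group_a, group_b])
merged_ba, _ = merge_groups([group_b, group_a])
print(merged_ab["train-teacher-da-en-1"], merged_ba["train-teacher-da-en-1"])  # B1 A1
print(conflicts)
```

Surfacing `conflicts` as an error (rather than silently picking one ID) would be one way to keep the behaviour deterministic.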

We could conceivably fix this by allowing start_task_ids to be specified explicitly (which we'd use either in addition to or instead of the automatically detected ones). Or perhaps we should wait until we've discussed #719 further before adding more hacks here?
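A sketch of the explicit `start_task_ids` idea, assuming the action would accept an optional list of task IDs plus a switch choosing between the two behaviours mentioned above. The `resolve_start_task_ids` name and its `mode` parameter are hypothetical; they are not an existing option of the training action.

```python
from typing import Optional, Sequence


def resolve_start_task_ids(
    detected: Sequence[str],
    explicit: Optional[Sequence[str]] = None,
    mode: str = "add",
) -> list[str]:
    """Combine automatically detected start task IDs with explicit ones.

    mode="add":     detected IDs followed by explicit ones, duplicates dropped
    mode="replace": explicit IDs only
    If no explicit IDs are given, the detected IDs are used unchanged.
    """
    if not explicit:
        return list(detected)
    if mode == "replace":
        return list(explicit)
    if mode == "add":
        # dict.fromkeys keeps first-seen order while removing duplicates.
        return list(dict.fromkeys([*detected, *explicit]))
    raise ValueError(f"unknown mode: {mode!r}")


detected = ["TASKID_EVAL"]
print(resolve_start_task_ids(detected, ["TASKID_TRANSLATE"]))  # ['TASKID_EVAL', 'TASKID_TRANSLATE']
print(resolve_start_task_ids(detected, ["TASKID_TRANSLATE"], mode="replace"))  # ['TASKID_TRANSLATE']
```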
