Student training continuation is regressed #893

Closed
gregtatum opened this issue Oct 22, 2024 · 3 comments · Fixed by #895
Labels: bug (Something is broken or not correct)

Comments

@gregtatum (Member)

It looks like #881 broke training continuation for students. There is some misdirection around the train taskcluster script that I think needs updating. I don't think we have run_task tests for training continuation. My guess is that the cause is some of the argument manipulation in taskcluster/scripts/pipeline/train_taskcluster.py, which I don't understand.

train.py: error: argument --student_model: invalid StudentModel value: 'continue'
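
For reference, this is the message argparse emits when the callable given as `type=` rejects a value. Below is a minimal sketch that reproduces the same wording, assuming `StudentModel` is an `Enum` used as the argument type. The member names and the helper functions are hypothetical and are not taken from train.py.

```python
import argparse
from enum import Enum
from typing import List, Optional


class StudentModel(Enum):
    # Hypothetical members; the real values in train.py may differ.
    TINY = "tiny"
    BASE = "base"


def build_parser() -> argparse.ArgumentParser:
    # exit_on_error=False (Python 3.9+) raises ArgumentError instead of exiting.
    parser = argparse.ArgumentParser(prog="train.py", exit_on_error=False)
    # argparse calls StudentModel(value); a ValueError becomes "invalid ... value".
    parser.add_argument("--student_model", type=StudentModel)
    return parser


def parse_error(argv: List[str]) -> Optional[str]:
    """Return the argparse error text for argv, or None if parsing succeeds."""
    try:
        build_parser().parse_args(argv)
    except argparse.ArgumentError as err:
        return str(err)
    return None


if __name__ == "__main__":
    # 'continue' is not a StudentModel member, so it is rejected.
    print(parse_error(["--student_model", "continue"]))
```

Seeing 'continue' in that slot suggests a value meant for a different argument reached `--student_model`, which would be consistent with the argument-manipulation guess above.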
@gregtatum added the bug (Something is broken or not correct) label Oct 22, 2024
@eu9ene (Collaborator) commented Oct 22, 2024

I'll have a look. Do I understand correctly that we need this working to use preemptible instances? @bhearsum

@eu9ene self-assigned this Oct 22, 2024
@bhearsum (Collaborator)

Yeah, we should make sure continuation works if we're using preemptible instances. The other options are to not use preemptible instances, or to change it so it doesn't try to continue training (which really isn't a good option...).

It looks like https://github.com/mozilla/firefox-translations-training/blob/b0b5f25d0289a90619a12e645683cfd671332a85/taskcluster/scripts/pipeline/train_taskcluster.py#L35-L38 just needs a bump. Sorry for not catching that in review.
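
To illustrate why an index might need a bump: when a wrapper script reads a value out of an argument list by position, inserting a new argument ahead of it shifts every later index by one. The sketch below is purely hypothetical. The argument order, the index values, and the helper name are assumptions for illustration and are not copied from train_taskcluster.py.

```python
from typing import List

# Hypothetical argument layout before a new argument was introduced.
OLD_ARGS = ["student", "finetune", "en", "ru", "continue"]
# Hypothetical layout afterwards: a student model name is inserted at index 2.
NEW_ARGS = ["student", "finetune", "tiny", "en", "ru", "continue"]

STALE_INDEX = 4   # position of the pretrained-model mode in the old layout
BUMPED_INDEX = 5  # the same argument's position in the new layout


def pretrained_mode(args: List[str], index: int) -> str:
    """Read the pretrained-model mode from a fixed position in the list."""
    return args[index]


if __name__ == "__main__":
    print(pretrained_mode(OLD_ARGS, STALE_INDEX))   # the mode, as intended
    print(pretrained_mode(NEW_ARGS, STALE_INDEX))   # a neighbouring value
    print(pretrained_mode(NEW_ARGS, BUMPED_INDEX))  # the mode again
```

Looking arguments up by name rather than position would avoid this class of breakage, at the cost of a larger change than an index bump.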


3 participants