Integrate OpusTrainer #161
I looked into OpusTrainer and I think we can use it for:
Question: If we do 1 and 2, how should our distillation recipe change? If we do on-the-fly augmentation only for the teacher and don't augment the student corpus, it might not learn to handle those edge cases as well as the teacher. Based on this paper, to perform sequence-level knowledge distillation a student is supposed to be trained on the exact outputs of the teacher, so doing on-the-fly augmentation for the student using OpusTrainer might break this process because the augmented training examples will not be equal to the results from the teacher. Alternatively, we could run the augmentation on the corpus, save it and then train both teacher and student on this corpus, but that's not how OpusTrainer works. @kpu @XapaJIaMnu What would be your recommendation for this? @marco-c FYI |
How much the student can retain from the teacher is an empirical question and depends on the size of the student. We have done limited experiments, and it seems that students behave differently depending on the language pair. As a starting point, I would assume that a teacher can perfectly learn to copy noise from source to target and translate title-case and all-caps text (not an entirely true assumption, but it's what we aim for with a teacher). Then I would just distil the training data with the teacher and apply the augmenters on top of it when training the student. Alternatively, you can use OpusTrainer to produce the augmented training data and translate it with the teacher, but then the variety of the augmented text would not be as rich. Unfortunately, there isn't really any research on the subject, and which approach gets you the most robust students is an empirical question.
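To make the recommended ordering concrete, here is a minimal Python sketch of "distil first, augment second": the teacher only ever sees clean source text, and augmentation is applied on top of the (source, teacher output) pairs that the student trains on. The `teacher_translate` callable and the casing modifiers are placeholders for illustration, not OpusTrainer's actual API.

```python
import random
from typing import Callable, Iterable, Iterator, Tuple

def augment_pair(src: str, trg: str) -> Tuple[str, str]:
    """Toy stand-ins for OpusTrainer-style casing modifiers: occasionally
    upper-case or title-case both sides so the student sees cased variants
    of the teacher's clean translations."""
    r = random.random()
    if r < 0.05:
        return src.upper(), trg.upper()
    if r < 0.10:
        return src.title(), trg.title()
    return src, trg

def student_training_stream(
    source_lines: Iterable[str],
    teacher_translate: Callable[[str], str],  # placeholder for the distillation step
) -> Iterator[Tuple[str, str]]:
    """Distil first, augment second: the teacher translates clean source
    text (sequence-level KD targets), then augmentation is applied on the
    fly while feeding the student."""
    for src in source_lines:
        trg = teacher_translate(src)
        yield augment_pair(src, trg)
```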
OK, let's start with the assumption that the student model will properly approximate a "perfect" teacher under on-the-fly augmentation. I will then integrate OpusTrainer with our main training script, and we'll be able to add augmentation to any training stage. I also suggest we augment the evaluation datasets for specific use cases like casing and separately run an evaluation of all models on them. Then we can run experiments and see what happens.
I recently presented work at the Machine Translation Marathon that does just that: https://mtm23.cs.ut.ee/wp-content/uploads/2023/09/Nikolay_Bogoychev_Robustness.pdf I haven't had the chance to write it up properly, but basically: take a test set and use OpusTrainer to augment it with the traits you are looking for.
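As a concrete illustration of that idea, the sketch below writes all-caps and title-case variants of an evaluation set. The file paths and the two casing traits are just examples; in the actual setup the augmentation would come from OpusTrainer's modifiers rather than this hand-rolled code.

```python
from pathlib import Path

def write_cased_eval_variants(src_path: str, ref_path: str, out_dir: str) -> None:
    """Create casing-stressed copies of an evaluation set. References are
    transformed the same way as sources, so a model is rewarded for
    preserving the casing of its input."""
    src = Path(src_path).read_text(encoding="utf-8").splitlines()
    ref = Path(ref_path).read_text(encoding="utf-8").splitlines()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, transform in (("upper", str.upper), ("title", str.title)):
        (out / f"eval.{name}.src").write_text(
            "\n".join(transform(line) for line in src) + "\n", encoding="utf-8")
        (out / f"eval.{name}.ref").write_text(
            "\n".join(transform(line) for line in ref) + "\n", encoding="utf-8")
```

Each model can then be evaluated on the original and augmented variants side by side, which shows how much robustness the training-time augmentation actually buys.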
OpusTrainer is a new training and data augmentation tool developed as a part of the HPLT project.
This tool can potentially help solve quality issues like casing (see #129, #73). However, it seems that the tool is designed for more advanced use cases and in the future will run multiple training runs end-to-end, including training of the backward model. That doesn't fit our architecture, and backward training is already implemented as a separate step. Maybe we can use it only as a data augmentation tool.
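If we end up using it purely for data augmentation, the simplest mode would be an offline pass that writes an augmented copy of a parallel corpus to disk, which any later training step can consume. The sketch below is illustrative only: the casing logic mimics what OpusTrainer's modifiers do, but as far as I understand OpusTrainer itself applies modifiers on the fly while streaming data to the trainer rather than writing files.

```python
import random

def augment_parallel_corpus(src_in: str, trg_in: str,
                            src_out: str, trg_out: str,
                            seed: int = 1111) -> None:
    """Offline augmentation pass: apply the same casing change to both
    sides of each sentence pair so the corpus stays parallel, then save
    the result for later training runs."""
    random.seed(seed)
    with open(src_in, encoding="utf-8") as fs, open(trg_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as gs, open(trg_out, "w", encoding="utf-8") as gt:
        for s, t in zip(fs, ft):
            r = random.random()
            if r < 0.05:          # all-caps a small fraction of pairs
                s, t = s.upper(), t.upper()
            elif r < 0.10:        # title-case another small fraction
                s, t = s.title(), t.title()
            gs.write(s)
            gt.write(t)
```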