Integrate OpusTrainer #161

Closed · Tracked by #238 · Fixed by #219
eu9ene opened this issue Aug 5, 2023 · 5 comments

Labels: quality (Improving robustness and translation quality)

eu9ene (Collaborator) commented Aug 5, 2023

OpusTrainer is a new training and data augmentation tool developed as a part of the HPLT project.

This tool can potentially help solve quality issues like casing (see #129 #73). However, it seems the tool is designed for more advanced use cases and in the future will run multiple training runs end-to-end, including training of the backward model. This doesn't fit our architecture, and backward training is already implemented as a separate step. Maybe we can use it only as a data augmentation tool.

eu9ene added the "question (Further information is requested)" and "quality (Improving robustness and translation quality)" labels on Aug 5, 2023
eu9ene self-assigned this on Sep 22, 2023
eu9ene (Collaborator, Author) commented Oct 4, 2023

I looked into OpusTrainer and I think we can use it for:

  1. On-the-fly data augmentation using modifiers, which should help fix issues like casing, translation of unseen words and symbols, etc. (see the modifiers sketch after this list).
  2. Simplifying the pipeline by replacing the two-step teacher training (pre-train and fine-tune) with a single run. We could use an OpusTrainer config like this one:
datasets:
  original: path/to/original # Original parallel corpus
  backtranslated: path/to/backtranslated # Back-translated data

stages:
  - pretrain
  - finetune

pretrain:
  - original 0.5
  - backtranslated 0.5
  - until original 2 # General training until 2 epochs of original

finetune:
  - original 1.0
  - backtranslated 0.0
  - until original inf # Fine-tuning only on original until the early stopping
  3. Later, curriculum learning to get more fine-grained control over how datasets of different quality are sequenced. We would likely need to split the datasets section of the pipeline config into several sections (dirty, mid, clean) for that, similar to how it was done here (config, splitting). I found some papers (1, 2, 3), but it's not clear to me how much this would improve quality or training speed in our case.
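
For point 1, the modifiers block could look roughly like this (a sketch based on the modifiers documented in the OpusTrainer README; the exact set and probabilities would need to be tuned for our pipeline):

modifiers:                # sketch only: names from the OpusTrainer README, probabilities illustrative
  - UpperCase: 0.05       # upper-case ~5% of sentence pairs (source and target)
  - TitleCase: 0.05       # title-case ~5% of sentence pairs
  - Typos: 0.05           # inject keyboard-style typos into ~5% of source sentences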

Question: If we do 1 and 2, how should our distillation recipe change? If we do on-the-fly augmentation only for the teacher and don't augment the student corpus, the student might not learn to handle those edge cases as well as the teacher. Based on this paper, to perform sequence-level knowledge distillation a student is supposed to be trained on the exact outputs of the teacher, so doing on-the-fly augmentation for the student using OpusTrainer might break this process because the augmented training examples will no longer match the teacher's outputs. Alternatively, we could run the augmentation on the corpus, save it, and then train both the teacher and the student on it, but that's not how OpusTrainer works. @kpu @XapaJIaMnu What would be your recommendation? @marco-c FYI

XapaJIaMnu (Contributor) commented
How much the student can retain from the teacher is an empirical question and depends on the size of the student. We have done limited experiments, and it seems that students behave differently depending on the language pair.

As a starter, I would first assume that a teacher can perfectly learn to copy noise from source to target and translate title-case and all-caps text (it's not an entirely true assumption, but it's what we aim for in a teacher).

Then, I would just distil the training data with the teacher and apply the augmenters on top of it when training the student.

Alternatively, you can use OpusTrainer to produce the augmented training data and translate it with the teacher, but then the variety of the augmented text would not be as rich.

Unfortunately, there isn't really any research on the subject, and what gets you the most robust students is an empirical question.

eu9ene (Collaborator, Author) commented Oct 5, 2023

> Then, I would just distil the training data with the teacher and apply the augmenters on top of it when training the student.

OK, let's start with the assumption that on-the-fly augmentation imitating a "perfect" teacher will be properly approximated by the student model. I will then integrate OpusTrainer with our main training script, and we'll be able to add augmentation to any training stage.
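
Concretely, the student run could then use a config along these lines (a rough sketch; the dataset path is a placeholder and the modifier probabilities are illustrative, not tuned): the student trains on the teacher-distilled corpus, with the augmenters applied on the fly on top of it.

datasets:
  distilled: path/to/distilled   # placeholder: original source sentences paired with teacher translations

stages:
  - train

train:
  - distilled 1.0
  - until distilled inf          # train on the distilled corpus until early stopping

modifiers:                       # applied on the fly, on top of the distilled data
  - UpperCase: 0.05              # illustrative probabilities, not tuned
  - TitleCase: 0.05
  - Typos: 0.05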

I suggest we also augment evaluation datasets for specific use cases like casing and separately run an evaluation of all models on them.

Then we can run experiments and see what happens.

XapaJIaMnu (Contributor) commented Oct 5, 2023

> I suggest we also augment evaluation datasets for specific use cases like casing and separately run an evaluation of all models on them.

I recently presented work at the Machine Translation Marathon that does just that: https://mtm23.cs.ut.ee/wp-content/uploads/2023/09/Nikolay_Bogoychev_Robustness.pdf I haven't had the chance to write it up properly, but basically: take a test set and use OpusTrainer to augment it with the traits you are looking for.
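
Roughly, a config like this would do it (a sketch; the dataset path is a placeholder and the 1.0 probability is intentionally extreme so every line of the test set gets the trait you are probing), with the output captured to a file instead of being fed to marian, e.g. by pointing the trainer command at cat, if I remember the CLI correctly:

datasets:
  testset: path/to/testset     # placeholder: any held-out test set in OpusTrainer's tab-separated format

stages:
  - augment

augment:
  - testset 1.0
  - until testset 1            # a single pass over the test set

modifiers:
  - UpperCase: 1.0             # e.g. produce an all-caps variant of the whole test set

Swapping UpperCase for TitleCase or Typos would give the other trait-specific variants.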

eu9ene (Collaborator, Author) commented Oct 6, 2023

Wow, it's impressive how the quality improves as you add more fixes. I'll post the results here for reference:

[Two screenshots: results tables from the presentation slides]

eu9ene removed the "question (Further information is requested)" label on Oct 10, 2023
eu9ene changed the title from "Consider integration with OpusTrainer" to "Integrate OpusTrainer" on Oct 10, 2023