Integrate OpusTrainer #161

Closed · Tracked by #238 · Fixed by #219
eu9ene opened this issue Aug 5, 2023 · 5 comments

Labels: quality (Improving robustness and translation quality)

eu9ene (Collaborator) commented Aug 5, 2023

OpusTrainer is a new training and data augmentation tool developed as a part of the HPLT project.

This tool can potentially help solve quality issues like casing (see #129 #73). However, it seems the tool is designed for more advanced use cases and in the future will run multiple training runs end-to-end, including training of the backward model. This doesn't fit our architecture, and backward training is already implemented as a separate step. Maybe we can use it only as a data augmentation tool.

eu9ene added the "question (Further information is requested)" and "quality (Improving robustness and translation quality)" labels on Aug 5, 2023
eu9ene self-assigned this on Sep 22, 2023
eu9ene (Collaborator, Author) commented Oct 4, 2023

I looked into OpusTrainer and I think we can use it for:

  1. On-the-fly data augmentation using modifiers, which should help fix issues like casing, translation of unseen words and symbols, etc. (see the modifiers sketch after this list).
  2. Simplifying the pipeline by replacing the two-step teacher training (pre-train and fine-tune) with a single run. We could use an OpusTrainer config like this one:
datasets:
  original: path/to/original # Original parallel corpus
  backtranslated: path/to/backtranslated # Back-translated data

stages:
  - pretrain
  - finetune

pretrain:
  - original 0.5
  - backtranslated 0.5
  - until original 2 # General training until 2 epochs of original

finetune:
  - original 1.0
  - backtranslated 0.0
  - until original inf # Fine-tuning only on original until the early stopping
  3. Later, curriculum learning to get more fine-grained control over how datasets of different quality are sequenced. We would likely need to split the datasets section of the pipeline config into several sections (dirty, mid, clean) for that, similar to how it was done here (config, splitting). I found some papers (1, 2, 3), but it's not clear to me how much this would improve quality or training speed in our case.
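
For point 1, the modifiers block could look roughly like this (a sketch based on the modifiers documented in the OpusTrainer README; the exact set and probabilities would need to be tuned for our pipeline):

modifiers:                # sketch only: names from the OpusTrainer README, probabilities illustrative
  - UpperCase: 0.05       # upper-case ~5% of sentence pairs (source and target)
  - TitleCase: 0.05       # title-case ~5% of sentence pairs
  - Typos: 0.05           # inject keyboard-style typos into ~5% of source sentences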

Question: If we do 1 and 2, how should our distillation recipe change? If we do on-the-fly augmentation only for the teacher and don't augment the student corpus, the student might not learn to handle those edge cases as well as the teacher. Based on this paper, to perform sequence-level knowledge distillation a student is supposed to be trained on the exact outputs of the teacher, so doing on-the-fly augmentation for the student using OpusTrainer might break this process because the augmented training examples will no longer match the teacher's outputs. Alternatively, we could run the augmentation on the corpus, save it, and then train both the teacher and the student on it, but that's not how OpusTrainer works. @kpu @XapaJIaMnu What would be your recommendation? @marco-c FYI

XapaJIaMnu (Contributor) commented
How much the student can retain from the teacher is an empirical question and depends on the size of the student. We have done limited experiments, and it seems that students behave differently depending on the language pair.

As a starter, I would first assume that a teacher can perfectly learn to copy noise from source to target and translate title-case and all-caps text (it's not an entirely true assumption, but it's what we aim for in a teacher).

Then, I would just distil the training data with the teacher and apply the augmenters on top of it when training the student.

Alternatively, you can use OpusTrainer to produce the augmented training data and translate it with the teacher, but then the variety of the augmented text would not be as rich.

Unfortunately, there isn't really any research on the subject, and what gets you the most robust students is an empirical question.

eu9ene (Collaborator, Author) commented Oct 5, 2023

> Then, I would just distil the training data with the teacher and apply the augmenters on top of it when training the student.

OK, let's start with the assumption that on-the-fly augmentation imitating a "perfect" teacher will be properly approximated by the student model. I will then integrate OpusTrainer with our main training script, and we'll be able to add augmentation to any training stage.
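
Concretely, the student run could then use a config along these lines (a rough sketch; the dataset path is a placeholder and the modifier probabilities are illustrative, not tuned): the student trains on the teacher-distilled corpus, with the augmenters applied on the fly on top of it.

datasets:
  distilled: path/to/distilled   # placeholder: original source sentences paired with teacher translations

stages:
  - train

train:
  - distilled 1.0
  - until distilled inf          # train on the distilled corpus until early stopping

modifiers:                       # applied on the fly, on top of the distilled data
  - UpperCase: 0.05              # illustrative probabilities, not tuned
  - TitleCase: 0.05
  - Typos: 0.05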

I suggest we also augment evaluation datasets for specific use cases like casing and separately run an evaluation of all models on them.

Then we can run experiments and see what happens.

XapaJIaMnu (Contributor) commented Oct 5, 2023

> I suggest we also augment evaluation datasets for specific use cases like casing and separately run an evaluation of all models on them.

I recently presented work at the Machine Translation Marathon that does just that: https://mtm23.cs.ut.ee/wp-content/uploads/2023/09/Nikolay_Bogoychev_Robustness.pdf I haven't had the chance to write it up properly, but basically: take a test set and use OpusTrainer to augment it with the traits you are looking for.
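
Roughly, a config like this would do it (a sketch; the dataset path is a placeholder and the 1.0 probability is intentionally extreme so every line of the test set gets the trait you are probing), with the output captured to a file instead of being fed to marian, e.g. by pointing the trainer command at cat, if I remember the CLI correctly:

datasets:
  testset: path/to/testset     # placeholder: any held-out test set in OpusTrainer's tab-separated format

stages:
  - augment

augment:
  - testset 1.0
  - until testset 1            # a single pass over the test set

modifiers:
  - UpperCase: 1.0             # e.g. produce an all-caps variant of the whole test set

Swapping UpperCase for TitleCase or Typos would give the other trait-specific variants.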

eu9ene (Collaborator, Author) commented Oct 6, 2023

Wow, it's impressive how the quality improves as you add more fixes. I'll post the results here for reference:

[Two screenshots: results tables from the presentation slides]

eu9ene removed the "question (Further information is requested)" label on Oct 10, 2023
eu9ene changed the title from "Consider integration with OpusTrainer" to "Integrate OpusTrainer" on Oct 10, 2023