From ec14ae5e613f51f7c1fbc03684999caf216b03d2 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Tue, 31 Oct 2023 14:30:05 -0700 Subject: [PATCH 01/26] Update training guide --- Makefile | 4 +- docs/cleaning.md | 92 ++++++++++++++ docs/data.md | 31 ----- docs/opus-cleaner.md | 47 ------- docs/references.md | 1 - docs/training-guide.md | 269 ++++++++++++++++++++++++++++++----------- 6 files changed, 294 insertions(+), 150 deletions(-) create mode 100644 docs/cleaning.md delete mode 100644 docs/opus-cleaner.md diff --git a/Makefile b/Makefile index e6c7914e0..169ec6c56 100644 --- a/Makefile +++ b/Makefile @@ -125,13 +125,13 @@ tensorboard: ################################################ # OpusCleaner is a data cleaner for training corpus -# More details are in docs/opus-cleaner.md +# More details are in docs/cleaning.md opuscleaner-ui: poetry install --only opuscleaner opuscleaner-server serve --host=0.0.0.0 --port=8000 # Utils to find corpus etc -install utils: +install-utils: poetry install --only utils # Black is a code formatter for Python files. Running this command will check that diff --git a/docs/cleaning.md b/docs/cleaning.md new file mode 100644 index 000000000..d38fb065e --- /dev/null +++ b/docs/cleaning.md @@ -0,0 +1,92 @@ +# Data cleaning + +Making datasets less noisy to improve quality of translation. + +## Regular pipeline + + +Config setting: +``` + use-opuscleaner: false +``` + +### Dataset fixing + +Some datasets require fixes like detokenization. +Dataset and language specific fixes are implemented in [/pipeline/clean/fixes](/pipeline/clean/fixes). +Naming convention: +- `.sh` for parallel dataset cleaning +- `..sh` for language specific cleaning of parallel or monolingual dataset +- `/` in dataset name should be replaced with `_` + +### Cleaning scripts + +Make sure the language is present in [clean_parallel](/pipeline/clean/tools/clean_parallel.py#L19) script. + + +### Bicleaner + +It is recommended to use Bicleaner ML models to filter noisy data. +Check that the bicleaner-ai model is [available](https://object.pouta.csc.fi/OPUS-ELRC-3069-wikipedia_health) +and add filtering thresholds to the config. + +- `0.5` should be a good default value. +- Noisier datasets like OpenSubtitles should have higher threshold. +- Set the threshold to `0` to skip cleaning entirely, for example for ParaCrawl dataset that comes already cleaned. + +``` + bicleaner: + default-threshold: 0.5 + dataset-thresholds: + opus_CCAligned/v1: 0.7 + opus_OpenSubtitles/v2018: 0.8 + opus_ParaCrawl/v8: 0 + ... +``` + +## OpusCleaner + +Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project. + +Config setting: +``` + use-opuscleaner: true +``` + +## Custom filter configs +The idea behind the OpusCleaner is customizing filter rules for each language pair and dataset +to get a training corpus with less noise and train higher quality translation models. + +Filtering rules can be tuned in an interactive UI. + +### Installation + +Install the OpusCleaner UI on a server. +See the installation instructions in the [OpusCleaner readme](https://github.com/hplt-project/OpusCleaner). + +For local usage: run from a poetry shell `make opuscleaner-ui`. +Then go to `http://0.0.0.0:8000`. + +### Making filters + +Choose a language pair and download the required OPUS datasets. +They will correspond to `opus_...` training datasets in the training pipeline config. + +Configure cleaning rules for the datasets in the UI. 
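+
+For illustration, the UI stores the tuned rules as one JSON file per dataset.
+The snippet below is only a rough, made-up sketch of the general shape of such a file;
+the real schema, filter names and parameters are defined by OpusCleaner, so always start
+from the files the UI actually produces rather than from this sketch:
+```
+{
+  "filters": [
+    {"filter": "max_length", "parameters": {"MAXLENGTH": "150"}, "language": null}
+  ]
+}
+```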
+ +Copy JSON files for the produced filters `data/train-parts/*.filter.json` to +`pipeline/clean/opuscleaner/configs/-/`. + +### Default config + +If no custom config was specifed for the dataset, +the [default config template](/pipeline/clean/opuscleaner/configs/default.filters.json) will be used. + +Modify if needed. Some rules require specifying source or target language. +The `` and `` in the template will be automatically replaced with the trained language pair. +The generated default config will be copied to the target dataset cleaning directory. + +### Running + +Enable OpusCleaner in the training pipeline config and run the pipeline as usual. +OpusCleaner will replace the default [clean-corpus](/pipeline/clean/clean-corpus.sh) script. diff --git a/docs/data.md b/docs/data.md index 3ef36848a..8e7e4468f 100644 --- a/docs/data.md +++ b/docs/data.md @@ -40,34 +40,3 @@ Make sure to check licenses of the datasets before using them. Just add a shell script to [corpus](/pipeline/data/importers/corpus) or [mono](/pipeline/data/importers/mono) which is named as `.sh` and accepts the same parameters as the other scripts from the same folder. - -## Dataset fixing - -Some datasets require fixes like detokenization. Dataset and language specific fixes are implemented in [pipeline/clean/fixes](/pipeline/clean/fixes). -Naming convention: -- `.sh` for parallel dataset cleaning -- `..sh` for language specific cleaning of parallel or monolingual dataset -- `/` in dataset name should be replaced with `_` - -## Dataset cleaning -Some parallel datasets require more aggressive filtering. -Dataset specific Bicleaner thresholds can be set in config. -`0` means skipping filtering entirely (useful for Paracrawl). - -Example: - -``` -experiment: -... - bicleaner: - default-threshold: 0.5 - dataset-thresholds: - opus_ParaCrawl/v8: 0 - mtdata_neulab_tedtalksv1_train: 0.6 -``` - -### OpusCleaner - -Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project. - -See more details in the [dedicated doc](opus-cleaner.md). diff --git a/docs/opus-cleaner.md b/docs/opus-cleaner.md deleted file mode 100644 index 29a031af1..000000000 --- a/docs/opus-cleaner.md +++ /dev/null @@ -1,47 +0,0 @@ -# OpusCleaner - -The instructions on using the [OpusCleaner](https://github.com/hplt-project/OpusCleaner) tool. - -## Custom filter configs -The idea behind the OpusCleaner is customizing filter rules for each language pair and dataset -to get a training corpus with less noise and train higher quality translation models. - -Filtering rules can be tuned in an interactive UI. - -### Installation - -Install the OpusCleaner UI on a server. -See the installation instructions in the [OpusCleaner readme](https://github.com/hplt-project/OpusCleaner). - -For local usage: run from a poetry shell `make opuscleaner-ui`. -Then go to `http://0.0.0.0:8000`. - -### Making filters - -Choose a language pair and download the required OPUS datasets. -They will correspond to `opus_...` training datasets in the training pipeline config. - -Configure cleaning rules for the datasets in the UI. - -Copy JSON files for the produced filters `data/train-parts/*.filter.json` to -`pipeline/clean/opuscleaner/configs/-/`. - -## Default config - -If no custom config was specifed for the dataset, -the [default config template](/pipeline/clean/opuscleaner/configs/default.filters.json) will be used. - -Modify if needed. Some rules require specifying source or target language. 
-The `` and `` in the template will be automatically replaced with the trained language pair. -The generated default config will be copied to the target dataset cleaning directory. - -## Running - -Enable OpusCleaner in the training pipeline config -``` -experiment: - ... - use-opuscleaner: true -``` - -Run the pipeline as usual. OpusCleaner will replace the default [clean-corpus](/pipeline/clean/clean-corpus.sh) script. diff --git a/docs/references.md b/docs/references.md index 0069ddca6..751f6ca3e 100644 --- a/docs/references.md +++ b/docs/references.md @@ -15,7 +15,6 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020 3. Mölder F, Jablonski KP, Letcher B, et al. [Sustainable data analysis with Snakemake](https://pubmed.ncbi.nlm.nih.gov/34035898/). F1000Res. 2021;10:33. Published 2021 Jan 18. doi:10.12688/f1000research.29032.2 - 4. [Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task](https://aclanthology.org/2020.ngt-1.26) (Bogoychev et al., NGT 2020) 5. [From Research to Production and Back: Ludicrously Fast Neural Machine Translation](https://aclanthology.org/D19-5632) (Kim et al., EMNLP 2019) diff --git a/docs/training-guide.md b/docs/training-guide.md index 5f02be1d0..88715096f 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -1,17 +1,165 @@ # Model training guide -First of all, choose a language pair to train. +A step-by-step guide on how to train a translation model. -## Configuration -Clone the repo and follow the instructions that correspond to the workflow manager you will be using -([Taskcluster](task-cluster.md), [Snakemake](snakemake.md)). +The configuration of the training run happens mostly in the training configuration file. +Look at the examples of the full production configs for [Taskcluster](/configs/tc.prod.yml) and [Snakemake](/configs/config.prod.yml). -The Marian workspace is usually safe to set to about 3/4 of available GPU memory -(in a [profile for Snakemake](/pipeline/train/train.sh) and throughout the ci steps in Task cluster). +## 1. Choose a language + +First, choose a language pair to train. + +Considerations: +- The size of the parallel corpus on [OPUS](https://opus.nlpl.eu/) +- Availability of monolingual data. The pipeline requires monolingual data in both source and target languages. + Currently we support automatic donwloading only for [news crawl](https://data.statmt.org/news-crawl/) +- Availability of [bicleaner-ai models](https://object.pouta.csc.fi/OPUS-ELRC-3069-wikipedia_health) + +Set the language pair and a name of the experiment in the config: +``` +experiment: + name: test-quality + src: ru + trg: en +``` + +## 2. Find datasets + +### Parallel corpus +1. Go to [OPUS](https://opus.nlpl.eu/) and see how much data is available for the language pair +2. Go to [statmt22](https://www.statmt.org/wmt22/translation-task.html), [statmt21](https://www.statmt.org/wmt21/translation-task.html) etc. + and check if the language pair participated in a competition. + If yes, there's a good chance some extra data is available for training. +3. Use [find-corpus](/utils/find-corpus.py) tool to get OPUS datasets. +Install [poetry](https://python-poetry.org/) first, then run: +``` +make install-utils +python utils/find-corpus.py en ru opus +``` +5. In the same way search for mtdata datasets +``` +python utils/find-corpus.py en ru mtdata +``` +6. 
Look what's there and remove old versions of datasets + (for example there should be only mtdata paracrawl v9 left like `mtdata_ParaCrawl-paracrawl-9-eng-swe`) +7. Deduplicate datasets between OPUS and mtdata (for example, remove `opus_ParaCrawl/v8`). + If the versions are the same I prefer OPUS ones as a more stable resource. + +Copy the datasets in the training config: +``` +datasets: + train: + - opus_ada83/v1 + - mtdata_Statmt-news_commentary-15-eng-rus + ... +``` +It's hard to say how much data is required to train something useful. +My guess would be at least 10 million sentences. Ideally 100M+. + + +### Evaluation datasets +- There might be statmt datasets available. For example `sacrebleu_wmt20`. + Run find-corpus to search using the [SacreBLEU tool](https://github.com/mjpost/sacrebleu): +``` +python utils/find-corpus.py en ru sacrebleu +``` +- Use some datasets for validation while training (`datasets.devtest` section) and others for evaluation (`datasets.test`). +- Flores dataset is available for 100 languages, so it's always a good idea to add `flores_dev` for validation and `flores_devtest` for the final evaluation of the model. +- Make sure that training, validation and evaluation datasets are different. + +``` + # datasets to merge for validation while training + devtest: + - flores_dev + - sacrebleu_wmt19 + - sacrebleu_wmt17 + # datasets for evaluation + test: + - flores_devtest + - sacrebleu_wmt20 + - sacrebleu_wmt18 +``` + +### Monolingual corpus +It's almost always a good idea to use back-translations to augment training data and to use monolingual corpus to augment data for decoding by the teachers, especially for low-resource languages. The only limitation is probably available computational resources. + +Find monolingual data and add it to `datasets.mono-src` and `datasets.mono-trg`. +I usually use [News Crawl](https://data.statmt.org/news-crawl/) datasets from statmt +because thye are relatively clean and we have an automatic downloading for them. +``` + # to be translated by the ensemble of teacher models + mono-src: + - news-crawl_news.2020 + - news-crawl_news.2019 + ... + # to be translated by the backward model to augment teacher corpus with back-translations + mono-trg: + - news-crawl_news.2020 + - news-crawl_news.2019 + ... +``` + +### Custom datasets + +It is also possible to use manually downloaded datasets with prefix `custom_`. + +Find more details about the supported dataset importers [here](data.md). + +## 3. Configure data cleaning + +To use the default data cleaining pipline set: +``` + use-opuscleaner: false +``` +Make sure the language is present in [clean_parallel](/pipeline/clean/tools/clean_parallel.py#L19) script. + +For more advanced cleaning and using OpusCleaner look at the [Data cleaning](cleaning.md) doc. + +### Bicleaner +It is recommended to use Bicleaner ML models to filter noisy data. +Check that the bicleaner-ai model is [available](https://object.pouta.csc.fi/OPUS-ELRC-3069-wikipedia_health) +and add filtering thresholds to the config. + +- `0.5` should be a good default value. +- Noisier datasets like OpenSubtitles should have higher threshold. +- Set the threshold to `0` to skip cleaning entirely, for example for ParaCrawl dataset that comes already cleaned. + +``` + bicleaner: + default-threshold: 0.5 + dataset-thresholds: + opus_CCAligned/v1: 0.7 + opus_OpenSubtitles/v2018: 0.8 + opus_ParaCrawl/v8: 0 + ... +``` + +## 4. 
Set hyperparameters + +The pipeline supports overriding the default [Marian settings](https://marian-nmt.github.io/docs/cmd/marian/) in the training config. + +### Model training +I often increase early stopping for teachers to make sure the training converges. +However, this depends on language and might not bring much benefit but will make the training longer. +So, you can start with `early-stopping: 20`, monitor the training and increase if it stop too early. +``` +marian-args: +# these configs override pipeline/train/configs + training-backward: + # change based on available training data + after: 10e + training-teacher-base: + # remove for low resource languages or if training without augmentation + after: 2e + early-stopping: 20 + training-teacher-finetuned: + early-stopping: 40 +``` -### Optimizaiton +### Decoding (translation) -`mini-batch-words` can be set depending on GPUs and the number of teachers +`mini-batch-words` can be set depending on available GPU memory and the number of teachers. +It affects the batch size and decoding speed. ``` marian-args: ... @@ -23,9 +171,10 @@ marian-args: mini-batch-words: 1000 ``` -### Half precision decoding +#### Half precision decoding -Make sure to use it only for teacher models and on GPUs that support it . +Make sure to use it only for teacher models and on GPUs that support it. +Is speed up decoding but can slighly decrease quality ``` marian-args: ... @@ -34,93 +183,75 @@ marian-args: precision: float16 ``` -## Mozilla Slurm cluster +## 5. Run the pipeline -I usually set just one GPU partition per run in the [cluster config](/pipeline/train/train.sh). It simplifies configuration and monitoring. +Follow the instructions that correspond to the workflow manager you will be using +([Taskcluster](task-cluster.md), [Snakemake](snakemake.md)). -Make sure to not set `precision: float16` on `txp` partition. +### Hardware specific configuaiton + +The Marian workspace is usually safe to set to about 3/4 of available GPU memory +(in a [profile for Snakemake](/pipeline/train/train.sh) and throughout the ci steps in Task cluster). +### Taskcluster -## Finding datasets +Follow [this guide](task-cluster.md) to run the pipeline on Taskcluster. -### Parallel corpus for training -1. Go to [opus](https://opus.nlpl.eu/) and see how much data is available for the language pair -2. Go to [paracrawl](https://paracrawl.eu/) and see if it's available there -3. Go to [statmt22](https://www.statmt.org/wmt22/translation-task.html), [statmt21](https://www.statmt.org/wmt21/translation-task.html) etc. and check if the language pair participated in the competition. If yes, there's a good chance some data is available for training. -4. It's hard to say how much data is required to train something useful. My guess would be at least 10 million sentences. Ideally 100M+. -5. Use [find-corpus](/utils/find-corpus.py) tool to get opus datasets and copy to `datasets.train` section in the [prod config](/configs/config.prod.yml). -Example: +You can run it up to a specific step using the config setting: ``` -conda env create -f envs/corpus.yml -conda activate corpus -python utils/find-corpus.py en ru opus +target-stage: train-teacher ``` -4. In the same way obtain and copy mtdata datasets `python utils/find-corpus.py en ru mtdata` -5. Look what's there and remove old versions of datasets (for example there should be only mtdata paracrawl v9 left like `mtdata_ParaCrawl-paracrawl-9-eng-swe`) -6. Deduplicate datasets between opus and mtdata (for example, remove `opus_ParaCrawl/v8`). 
If the versions are the same I prefer opus ones as a more stable resource. - -### Evaluation datasets -Use `python utils/find-corpus.py en ru sacrebleu` first. There might be some statmt datasets available. For example `sacrebleu_wmt20`. -Add some datasets for validation while training to `datasets.devtest` and other datasets for evaluation to `datasets.test`. +### Snakemake -Flores dataset is available for 100 languages, so it's always a good idea to add `flores_dev` to `datasets.devtest` and `flores_devtest` to `datasets.test` - -Make sure that training, validation and evaluation datasets are different. +After everything is configured do `make run`. It will compile Marian and other tools first which is important to do on the target machine in cluster mode. -### Monolingual corpus -It's almost always a good idea to use back translations to augment training data and to use monolingual corpus to augment data for decoding by the teachers, especially for low-resource languages. The only limitation is probably available computational resources. +If you want to inspect data first, run +``` +make run TARGET=merge_corpus +``` -Find monolingual data and add it to `datasets.mono-src` and `datasets.mono-trg`. I usually use [News Crawl](https://data.statmt.org/news-crawl/) datasets from statmt. Example: `news-crawl_news.2020` +Find more details in the [Snakemake doc](snakemake.md). -### Custom datasets +#### Mozilla Slurm cluster -It is also possible to use manually downloaded datasets with prefix `custom_`. +I usually set just one GPU partition per run in the [cluster config](/pipeline/train/train.sh). It simplifies configuration and monitoring. -## Cleaning +Make sure to not set `precision: float16` on `txp` partition. -Make sure the language is present in [clean_parallel](/pipeline/clean/tools/clean_parallel.py#L19) script. +## 6. Monitor progress -It is recommended to use bicleaner for noisy data like OpenSubtitles. Check that the bicleaner model is available and add `opus_OpenSubtitles/v2018: 0.8` to `experiment.bicleaner.dataset-thresholds` section of the prod config. Set to 0 to skip cleaning explicitly, for example for ParaCrawl that comes already cleaned. +You can check training logs to see Marian output or run Tensorboard to look at training curves (currently requires restarting after a new model was added, because the tool that converts Marian logs to Tensorboard doesn't do it automatically). -You can also add some dataset specific fixes like detokenizaiton [here](/pipeline/clean/fixes). +Also, check `models///evaluation` folder to see BLEU and chrF numbers on evaluation datasets. -## Running (Snakemake) -After everything is configured do `make run`. It will compile Marian and other tools first which is important to do on the target machine in cluster mode. +## Troubleshooting -Then it will start downloading the data. It often fails on some datasets either because of hitting the rate limits of the servers or because some resources are just unavailable. It's a good idea to restart several times and then after inspecting the logs remove broken datasets from the config. +### Dataset downloading fails -When datasets are downloaded, cleaning procedures start. +Sometime external resources we download the dataset from are unavailable. +Retry the downloading steps. +If it still fails, remove those datasets from the config. +Taskcluster retries automatically. 
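+
+For example, a dataset that keeps failing can be dropped or commented out in the
+`datasets` section of the config until the resource becomes available again
+(the dataset names below are just the ones used in the examples above):
+```
+datasets:
+  train:
+    - opus_ada83/v1
+    # temporarily disabled, the download kept failing:
+    # - mtdata_Statmt-news_commentary-15-eng-rus
+```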
-If you want to inspect data first, run `make run TARGET=merge_corpus` +### Out-of-memory -## Training +Usually, by the time we train the student, it's so much data that it might not fit in 128 GB of RAM. +For very high-resource languages like French it can happen even earlier, on the backward/teacher training stage. +The workaround is to remove `--shuffle-in-ram` from the [training script](/pipeline/train/train.sh) +and add `--shuffle batches` to the student [training script](/pipeline/train/train.sh). +More details in the [issue](https://github.com/mozilla/firefox-translations-training/issues/21). -### Hyperparameters -I usually increase early stopping for teachers to make sure the models converge. -``` -marian-args: -# these configs override pipeline/train/configs - training-backward: - # change based on available training data - after: 10e - training-teacher-base: - # remove for low resource languages or if training without augmentation - after: 2e - early-stopping: 20 - training-teacher-finetuned: - early-stopping: 40 -``` +### Out of GPU memory -### Monitoring +Reduce the Marian workspace or batch size. -You can check training logs to see Marian output or run Tensorboard to look at training curves (currently requires restarting after a new model was added, because the tool that converts Marian logs to Tensorboard doesn't do it automatically). +### Out of disk -Also, check `models///evaluation` folder to see BLEU and chrF numbers on evaluation datasets. +It happens on Taskcluster, because we train on increasingly large datasets especially close to the end of the pipeline. +Just increase the disk size, it's cheap compared to the GPUs. -### Out-of-memory issues -Usually, by the time we train the student, it's so much data that it might not fit in 128 GB of RAM. For very high-resource languages like French it can happen in a teacher training state. The workaround is to remove `--shuffle-in-ram` from the [training script](/pipeline/train/train.sh) and add `--shuffle batches` to the student [training script](/pipeline/train/train.sh). More details in the [issue](https://github.com/mozilla/firefox-translations-training/issues/21). From 074163c4076a3eac590a6072dd9cf2140610252b Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Tue, 31 Oct 2023 15:09:37 -0700 Subject: [PATCH 02/26] Fix docs --- docs/data.md | 8 ++---- docs/training-guide.md | 65 +++++++++++++++++++++++++++++------------- 2 files changed, 47 insertions(+), 26 deletions(-) diff --git a/docs/data.md b/docs/data.md index 8e7e4468f..0984ab823 100644 --- a/docs/data.md +++ b/docs/data.md @@ -1,8 +1,4 @@ -# Data - -This section includes instructions on how to find and configure datasets and cleaning procedures. - -## Dataset importers +# Dataset importers Dataset importers can be used in `datasets` sections of the [training config](/configs/config.test.yml). @@ -36,7 +32,7 @@ python utils/find-corpus.py en ru sacrebleu ``` Make sure to check licenses of the datasets before using them. -### Adding a new importer +## Adding a new importer Just add a shell script to [corpus](/pipeline/data/importers/corpus) or [mono](/pipeline/data/importers/mono) which is named as `.sh` and accepts the same parameters as the other scripts from the same folder. diff --git a/docs/training-guide.md b/docs/training-guide.md index 3538d94b4..95d06232d 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -13,7 +13,7 @@ Considerations: - The size of the parallel corpus on [OPUS](https://opus.nlpl.eu/) - Availability of monolingual data. 
The pipeline requires monolingual data in both source and target languages. Currently we support automatic donwloading only for [news crawl](https://data.statmt.org/news-crawl/) -- Availability of [bicleaner-ai models](https://object.pouta.csc.fi/OPUS-ELRC-3069-wikipedia_health) +- Availability of [bicleaner-ai models](https://github.com/bitextor/bicleaner-ai-data/releases) Set the language pair and a name of the experiment in the config: ``` @@ -65,6 +65,7 @@ python utils/find-corpus.py en ru sacrebleu ``` - Use some datasets for validation while training (`datasets.devtest` section) and others for evaluation (`datasets.test`). - Flores dataset is available for 100 languages, so it's always a good idea to add `flores_dev` for validation and `flores_devtest` for the final evaluation of the model. +- Some OPUS and mtdata datasets provide dev and devtest versions, so it's a good idea to add them to evaluation. - Make sure that training, validation and evaluation datasets are different. ``` @@ -81,11 +82,14 @@ python utils/find-corpus.py en ru sacrebleu ``` ### Monolingual corpus -It's almost always a good idea to use back-translations to augment training data and to use monolingual corpus to augment data for decoding by the teachers, especially for low-resource languages. The only limitation is probably available computational resources. +It is recommended to always use back-translations to augment training data and to use +monolingual corpus to augment data for decoding by the teachers, even for high resource lanugages. +It will be especially useful for low-resource ones though. +The only limitation is probably available computational resources. Find monolingual data and add it to `datasets.mono-src` and `datasets.mono-trg`. I usually use [News Crawl](https://data.statmt.org/news-crawl/) datasets from statmt -because thye are relatively clean and we have an automatic downloading for them. +because they are relatively clean and we have an automatic downloading for them. ``` # to be translated by the ensemble of teacher models mono-src: @@ -107,17 +111,17 @@ Find more details about the supported dataset importers [here](data.md). ## 3. Configure data cleaning -To use the default data cleaining pipline set: +To use the default data cleaning pipeline set: ``` use-opuscleaner: false ``` Make sure the language is present in [clean_parallel](/pipeline/clean/tools/clean_parallel.py#L19) script. -For more advanced cleaning and using OpusCleaner look at the [Data cleaning](cleaning.md) doc. +For more advanced cleaning and for using OpusCleaner look at the [Data cleaning](cleaning.md) doc. ### Bicleaner It is recommended to use Bicleaner ML models to filter noisy data. -Check that the bicleaner-ai model is [available](https://object.pouta.csc.fi/OPUS-ELRC-3069-wikipedia_health) +Check that the bicleaner-ai model is [available](https://github.com/bitextor/bicleaner-ai-data/releases) and add filtering thresholds to the config. - `0.5` should be a good default value. @@ -140,8 +144,8 @@ The pipeline supports overriding the default [Marian settings](https://marian-nm ### Model training I often increase early stopping for teachers to make sure the training converges. -However, this depends on language and might not bring much benefit but will make the training longer. -So, you can start with `early-stopping: 20`, monitor the training and increase if it stop too early. +However, it depends on the language and might not bring much benefit but will make the training longer. 
+So, you can start with `early-stopping: 20`, monitor the training and increase it if the model stops training too early. ``` marian-args: # these configs override pipeline/train/configs @@ -159,7 +163,7 @@ marian-args: ### Decoding (translation) `mini-batch-words` can be set depending on available GPU memory and the number of teachers. -It affects the batch size and decoding speed. +It affects the batch size and decoding speed for the `traslate` steps. ``` marian-args: ... @@ -174,7 +178,7 @@ marian-args: #### Half precision decoding Make sure to use it only for teacher models and on GPUs that support it. -Is speed up decoding but can slighly decrease quality +It speeds up decoding but can slightly decrease quality. ``` marian-args: ... @@ -188,17 +192,20 @@ marian-args: Follow the instructions that correspond to the workflow manager you will be using ([Taskcluster](task-cluster.md), [Snakemake](snakemake.md)). -### Hardware specific configuaiton +Find the full description of the pipeline steps [here](pipeline-steps.md). + +### Cluster specific configuaiton The Marian workspace is usually safe to set to about 3/4 of available GPU memory (in a [profile for Snakemake](/pipeline/train/train.sh) and throughout the ci steps in Task cluster). - +Setting a higher value speeds up training but might lead to out of GPU memory error. ### Taskcluster Follow [this guide](task-cluster.md) to run the pipeline on Taskcluster. -You can run it up to a specific step using the config setting: +You can run it up to a specific step using a config setting. +For example to only train the teacher model: ``` target-stage: train-teacher ``` @@ -222,6 +229,20 @@ Make sure to not set `precision: float16` on `txp` partition. ## 6. Monitor progress +### Logs + +Look at the logs of the pipeline steps and +specifically at `train.log` for the training steps (`train-...`, `finetune-...`). + +### Metrics + +Check logs or output files `*.metrics` for `evaluate` steps to see the BLEU and chrF metrics calculated on evaluation datasets. + +For Snakemake check `models///evaluation` folder. + + +### Tensorboard + It is possible to look at the training graphs in Tensorboard. #### Taskcluster @@ -233,9 +254,8 @@ LOGS_TASK_GROUP=DClbX0cjSCeQuoE1fW-Ehw make download-logs ##### Snakemake Adjust the path to match the model directories in makefile `tensorboard` command and remove `--offline` to automtically update while training. -#### Tensorboard +#### Run server -Run Tensorboard ``` make tensorboard ``` @@ -247,12 +267,17 @@ Then go to `http://localhost:6006` in the browser Known issue: the [marian-tensorboard](https://github.com/marian-nmt/marian-tensorboard) tool we're using parses the trainig logs only for the student models and validation logs for all models for some reason. -#### Metrics +## 7. Download the final model -Check logs or output of `evaluate` steps to see the BLEU and chrF metrics for evaluation datasets. - -For Snakemake check `models///evaluation` folder. +The small quantized model is available in bergamot-translator compatible format as an output of the `export` step. +It includes three files: model, vocab and shortlist. +For example: +``` +model.ruen.intgemm.alphas.bin.gz +lex.50.50.ruen.s2t.bin.gz +vocab.ruen.spm.gz +``` ## Troubleshooting @@ -268,7 +293,7 @@ Taskcluster retries automatically. Usually, by the time we train the student, it's so much data that it might not fit in 128 GB of RAM. For very high-resource languages like French it can happen even earlier, on the backward/teacher training stage. 
The workaround is to remove `--shuffle-in-ram` from the [training script](/pipeline/train/train.sh) -and add `--shuffle batches` to the student [training script](/pipeline/train/train.sh). +and add `--shuffle batches` instead. More details in the [issue](https://github.com/mozilla/firefox-translations-training/issues/21). From 01b5fc9afad08c75371b4aa3a529d392f62c438c Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Tue, 31 Oct 2023 15:55:24 -0700 Subject: [PATCH 03/26] Add index file --- docs/index.md | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) create mode 100644 docs/index.md diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 000000000..817fd3e91 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,18 @@ +# Firefox Translations training +Training pipelines for Firefox Translations machine translation models. + +[Training guide](training-guide.md) + +[Cleaning](cleaning.md) + +[Datasets](data.md) + +[Pipeline steps](pipeline-steps.md) + +Workflow managers: +- [Taskcluster](task-cluster.md) +- [Snakemake](snakemake.md) + +[Development](development.md) + +[References](references.md) From 839cd84194c9252d09a8b07cc84bce1486e31a19 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Tue, 31 Oct 2023 15:57:33 -0700 Subject: [PATCH 04/26] Remove header --- docs/index.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/index.md b/docs/index.md index 817fd3e91..2bd08fb07 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,4 +1,3 @@ -# Firefox Translations training Training pipelines for Firefox Translations machine translation models. [Training guide](training-guide.md) From 5cbbf1113038750071152064cf000bfb8fe59c27 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Wed, 1 Nov 2023 11:41:45 -0700 Subject: [PATCH 05/26] Fix docs link --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 2a436604b..427362872 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ power the Firefox web page translation starting with version 118. The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser. -[Documentation](/docs) +[Documentation](https://mozilla.github.io/firefox-translations-training/) ## Pipeline From 8a3121446dfdefbd90401f3b05108032062ee6a7 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Wed, 1 Nov 2023 11:41:59 -0700 Subject: [PATCH 06/26] Remove tensorboard section --- docs/snakemake.md | 13 ------------- 1 file changed, 13 deletions(-) diff --git a/docs/snakemake.md b/docs/snakemake.md index 1fb9fc2d4..55d5850ac 100644 --- a/docs/snakemake.md +++ b/docs/snakemake.md @@ -284,16 +284,3 @@ The main directories inside `SHARED_ROOT` are: │ └ ru-en │ └ test │ └ clean_corpus.log - - -## Utilities - -### Tensorboard - -To see training graphs run tensorboard: - -``` -make install-tensorboard -make tensorboard -``` -Then port forward 6006. 
From f7cb17690746457589571ad2cb6123a6e01412ab Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Wed, 1 Nov 2023 12:18:26 -0700 Subject: [PATCH 07/26] Add theme --- docs/_config.yml | 4 ++++ 1 file changed, 4 insertions(+) create mode 100644 docs/_config.yml diff --git a/docs/_config.yml b/docs/_config.yml new file mode 100644 index 000000000..190bd9a4c --- /dev/null +++ b/docs/_config.yml @@ -0,0 +1,4 @@ +remote_theme: just-the-docs/just-the-docs +#color_scheme: dark +title: Firefox Translations Training +description: Training pipelines for Firefox Translations From 0fde7dc7e20f2da9faffd5ecb218bd7ce4b264f7 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Wed, 1 Nov 2023 12:48:10 -0700 Subject: [PATCH 08/26] Update navigation --- docs/_config.yml | 5 +++++ docs/cleaning.md | 6 ++++++ docs/data.md | 6 ++++++ docs/development.md | 6 ++++++ docs/index.md | 39 ++++++++++++++++++++++++++++++--------- docs/orchestrators.md | 17 +++++++++++++++++ docs/pipeline-steps.md | 5 +++++ docs/references.md | 6 ++++++ docs/snakemake.md | 7 +++++++ docs/task-cluster.md | 7 +++++++ docs/training-guide.md | 7 +++++++ 11 files changed, 102 insertions(+), 9 deletions(-) create mode 100644 docs/orchestrators.md diff --git a/docs/_config.yml b/docs/_config.yml index 190bd9a4c..3b507a5c5 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -2,3 +2,8 @@ remote_theme: just-the-docs/just-the-docs #color_scheme: dark title: Firefox Translations Training description: Training pipelines for Firefox Translations +# Aux links for the upper right navigation +aux_links: + "GitHub": + - "https://github.com/mozilla/firefox-translations-training" + diff --git a/docs/cleaning.md b/docs/cleaning.md index d38fb065e..2eec77e37 100644 --- a/docs/cleaning.md +++ b/docs/cleaning.md @@ -1,3 +1,9 @@ +--- +layout: default +title: Data cleaning +nav_order: 5 +--- + # Data cleaning Making datasets less noisy to improve quality of translation. diff --git a/docs/data.md b/docs/data.md index 0984ab823..d00a4ea06 100644 --- a/docs/data.md +++ b/docs/data.md @@ -1,3 +1,9 @@ +--- +layout: default +title: Datasets +nav_order: 4 +--- + # Dataset importers Dataset importers can be used in `datasets` sections of the [training config](/configs/config.test.yml). diff --git a/docs/development.md b/docs/development.md index 6f004281e..52f63c63b 100644 --- a/docs/development.md +++ b/docs/development.md @@ -1,3 +1,9 @@ +--- +layout: default +title: Development +nav_order: 7 +--- + # Development ## Architecture diff --git a/docs/index.md b/docs/index.md index 2bd08fb07..b3950b09e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,17 +1,38 @@ +--- +layout: default +title: Home +nav_order: 1 +description: "Firefox Translations Training documentation." +permalink: / +--- + +# Firefox Translations training Training pipelines for Firefox Translations machine translation models. -[Training guide](training-guide.md) +The trained models are hosted in [firefox-translations-models](https://github.com/mozilla/firefox-translations-models/) repository, +compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator) and +power the Firefox web page translation starting with version 118. + +The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser. + +## Training pipeline -[Cleaning](cleaning.md) +The pipeline is capable of training a translation model for a language pair end to end. 
+Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters. +Some settings, especially low resource languages might require extra tuning. -[Datasets](data.md) +We use fast translation engine [Marian](https://marian-nmt.github.io). -[Pipeline steps](pipeline-steps.md) +## Learning resources -Workflow managers: -- [Taskcluster](task-cluster.md) -- [Snakemake](snakemake.md) +- High level overview [post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/) +- [Model training guide](training-guide.md) - practical advice on how to use the pipeline +- [Reference papers](references.md) -[Development](development.md) -[References](references.md) +## Acknowledgements +This project uses materials developed by: +- Bergamot project ([github](https://github.com/browsermt), [website](https://browser.mt/)) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303 +- HPLT project ([github](https://github.com/hplt-project), [website](https://hplt-project.org/)) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546] +- OPUS-MT project ([github](https://github.com/Helsinki-NLP/Opus-MT), [website](https://opus.nlpl.eu/)) +- Many other open source projects and research papers (see [References](references.md)) diff --git a/docs/orchestrators.md b/docs/orchestrators.md new file mode 100644 index 000000000..c5a608375 --- /dev/null +++ b/docs/orchestrators.md @@ -0,0 +1,17 @@ +--- +layout: default +title: Orchestrators +nav_order: 6 +has_children: true +has_toc: false +--- + +# Orchestrators + +An orchestrator is responsible for workflow management and parallelization. + +- [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI. + It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability. + [Usage instructions](task-cluster.md). +- [Snakemake](https://snakemake.github.io/) - a file based orchestrator that can be used to run the pipeline locally or on a Slurm cluster. + [Usage instructions](snakemake.md). (The integration will not be actively maintained, since Mozilla is switching to Taskcluster) diff --git a/docs/pipeline-steps.md b/docs/pipeline-steps.md index 398b0317e..7cc84fbe2 100644 --- a/docs/pipeline-steps.md +++ b/docs/pipeline-steps.md @@ -1,3 +1,8 @@ +--- +layout: default +title: Pipeline steps +nav_order: 3 +--- # Pipeline steps diff --git a/docs/references.md b/docs/references.md index 751f6ca3e..2acac3f9a 100644 --- a/docs/references.md +++ b/docs/references.md @@ -1,3 +1,9 @@ +--- +layout: default +title: References +nav_order: 8 +--- + # References Here is a list of selected publications on which the training pipeline is based. 
diff --git a/docs/snakemake.md b/docs/snakemake.md index 55d5850ac..8344f9f95 100644 --- a/docs/snakemake.md +++ b/docs/snakemake.md @@ -1,3 +1,10 @@ +--- +layout: default +title: Snakemake +nav_order: 2 +parent: Orchestrators +--- + # Snakemake This section included the instructions on how to run the pipeline diff --git a/docs/task-cluster.md b/docs/task-cluster.md index fd64eecc7..e1bd57b70 100644 --- a/docs/task-cluster.md +++ b/docs/task-cluster.md @@ -1,3 +1,10 @@ +--- +layout: default +title: Taskcluster +nav_order: 1 +parent: Orchestrators +--- + # Taskcluster [Taskcluster](https://taskcluster.net/) is a Mozilla task execution framework. It powers Firefox CI and diff --git a/docs/training-guide.md b/docs/training-guide.md index 95d06232d..9cec38111 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -1,3 +1,10 @@ +--- +layout: default +title: Model training guide +nav_order: 2 +has_toc: true +--- + # Model training guide A step-by-step guide on how to train a translation model. From df4e819aa3c41ea59c2c3ec1cad3637c425ee70e Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Wed, 1 Nov 2023 13:08:35 -0700 Subject: [PATCH 09/26] Add logo --- docs/_config.yml | 4 +++- docs/img/logo.svg | 4 ++++ docs/training-guide.md | 1 - 3 files changed, 7 insertions(+), 2 deletions(-) create mode 100644 docs/img/logo.svg diff --git a/docs/_config.yml b/docs/_config.yml index 3b507a5c5..8208fef4c 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -1,7 +1,9 @@ remote_theme: just-the-docs/just-the-docs #color_scheme: dark title: Firefox Translations Training -description: Training pipelines for Firefox Translations +description: Documentaiton for the Firefox Translations training pipelines +heading_anchors: true +favicon_ico: "img/logo.svg" # Aux links for the upper right navigation aux_links: "GitHub": diff --git a/docs/img/logo.svg b/docs/img/logo.svg new file mode 100644 index 000000000..fdc83d310 --- /dev/null +++ b/docs/img/logo.svg @@ -0,0 +1,4 @@ + + + + diff --git a/docs/training-guide.md b/docs/training-guide.md index 9cec38111..4a5b53458 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -2,7 +2,6 @@ layout: default title: Model training guide nav_order: 2 -has_toc: true --- # Model training guide From 0a2ae9b80b4efcf956a49597b30b0c2af771a3a2 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Wed, 1 Nov 2023 13:34:54 -0700 Subject: [PATCH 10/26] Use absolute links --- docs/_config.yml | 1 + docs/cleaning.md | 8 ++++---- docs/data.md | 4 ++-- docs/pipeline-steps.md | 6 +++--- docs/task-cluster.md | 2 +- docs/training-guide.md | 8 ++++---- 6 files changed, 15 insertions(+), 14 deletions(-) diff --git a/docs/_config.yml b/docs/_config.yml index 8208fef4c..1117f352c 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -3,6 +3,7 @@ remote_theme: just-the-docs/just-the-docs title: Firefox Translations Training description: Documentaiton for the Firefox Translations training pipelines heading_anchors: true +# doesn't work favicon_ico: "img/logo.svg" # Aux links for the upper right navigation aux_links: diff --git a/docs/cleaning.md b/docs/cleaning.md index 2eec77e37..df0948bbc 100644 --- a/docs/cleaning.md +++ b/docs/cleaning.md @@ -19,7 +19,7 @@ Config setting: ### Dataset fixing Some datasets require fixes like detokenization. -Dataset and language specific fixes are implemented in [/pipeline/clean/fixes](/pipeline/clean/fixes). 
+Dataset and language specific fixes are implemented in [https://github.com/mozilla/firefox-translations-training/pipeline/clean/fixes](https://github.com/mozilla/firefox-translations-training/pipeline/clean/fixes). Naming convention: - `.sh` for parallel dataset cleaning - `..sh` for language specific cleaning of parallel or monolingual dataset @@ -27,7 +27,7 @@ Naming convention: ### Cleaning scripts -Make sure the language is present in [clean_parallel](/pipeline/clean/tools/clean_parallel.py#L19) script. +Make sure the language is present in [clean_parallel](https://github.com/mozilla/firefox-translations-training/pipeline/clean/tools/clean_parallel.py#L19) script. ### Bicleaner @@ -86,7 +86,7 @@ Copy JSON files for the produced filters `data/train-parts/*.filter.json` to ### Default config If no custom config was specifed for the dataset, -the [default config template](/pipeline/clean/opuscleaner/configs/default.filters.json) will be used. +the [default config template](https://github.com/mozilla/firefox-translations-training/pipeline/clean/opuscleaner/configs/default.filters.json) will be used. Modify if needed. Some rules require specifying source or target language. The `` and `` in the template will be automatically replaced with the trained language pair. @@ -95,4 +95,4 @@ The generated default config will be copied to the target dataset cleaning direc ### Running Enable OpusCleaner in the training pipeline config and run the pipeline as usual. -OpusCleaner will replace the default [clean-corpus](/pipeline/clean/clean-corpus.sh) script. +OpusCleaner will replace the default [clean-corpus](https://github.com/mozilla/firefox-translations-training/pipeline/clean/clean-corpus.sh) script. diff --git a/docs/data.md b/docs/data.md index d00a4ea06..bfb3ec029 100644 --- a/docs/data.md +++ b/docs/data.md @@ -27,7 +27,7 @@ Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel da [Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html) Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz" -You can also use [find-corpus](/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config. +You can also use [find-corpus](https://github.com/mozilla/firefox-translations-traininghttps://github.com/mozilla/firefox-translations-training/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config. Set up a local [poetry](https://python-poetry.org/) environment. ``` @@ -40,5 +40,5 @@ Make sure to check licenses of the datasets before using them. ## Adding a new importer -Just add a shell script to [corpus](/pipeline/data/importers/corpus) or [mono](/pipeline/data/importers/mono) which is named as `.sh` +Just add a shell script to [corpus](https://github.com/mozilla/firefox-translations-training/pipeline/data/importers/corpus) or [mono](https://github.com/mozilla/firefox-translations-training/pipeline/data/importers/mono) which is named as `.sh` and accepts the same parameters as the other scripts from the same folder. 
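+
+As a rough sketch, a new parallel-corpus importer could look like the script below.
+The argument order and output naming here are assumptions made for illustration only;
+copy them from an existing script in the same folder (for example the OPUS importer)
+rather than from this sketch:
+```
+#!/bin/bash
+# Hypothetical importer: pipeline/data/importers/corpus/mycorpus.sh
+set -euo pipefail
+
+src=$1            # assumed: source language code
+trg=$2            # assumed: target language code
+output_prefix=$3  # assumed: prefix for the gzipped output files
+dataset=$4        # assumed: dataset name from the training config
+
+# Download a hypothetical tab-separated corpus and split it into the
+# .gz files the rest of the pipeline expects.
+wget -qO corpus.tsv.gz "https://example.com/${dataset}.${src}-${trg}.tsv.gz"
+gunzip -c corpus.tsv.gz | cut -f1 | gzip > "${output_prefix}.${src}.gz"
+gunzip -c corpus.tsv.gz | cut -f2 | gzip > "${output_prefix}.${trg}.gz"
+```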
diff --git a/docs/pipeline-steps.md b/docs/pipeline-steps.md index 7cc84fbe2..aa4b77b7a 100644 --- a/docs/pipeline-steps.md +++ b/docs/pipeline-steps.md @@ -15,14 +15,14 @@ Step | Description | Bottleneck | Comments --- | --- | --- | --- Installation | Installing dependencies and compiling | CPU | Takes ~1 hour Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation. -Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/tools/clean_parallel.py). +Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](https://github.com/mozilla/firefox-translations-training/pipeline/clean/tools/clean_parallel.py). Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are no ones for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset, see [Dataset cleaning](##Dataset cleaning). Merge and dedupe | Merges clean dataset and applies deduplicaiton | CPU, Disk | Training vocabulary | Trains [SentencePiece](https://github.com/google/sentencepiece) vocabulary/tokenizer model on parallel corpus. | CPU | Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece). Augmentation with back-translations | Translates mono corpus combined from monolingual datasets in target language using shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others. -Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size. -Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size. +Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size. +Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size. 
Translation by teacher | Translates a corpus and monolingual data combined from configurable `dataset.mono-src` using the ensemble of teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode. Cross-entropy filtering | Scores translated corpus with backward s2s model and removes a part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets. Very disk intensive. Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts lexical shortlist using [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Good CPU parallelization. diff --git a/docs/task-cluster.md b/docs/task-cluster.md index e1bd57b70..9e18031dc 100644 --- a/docs/task-cluster.md +++ b/docs/task-cluster.md @@ -86,7 +86,7 @@ For example, to download, clean and merge the training corpus use: ``` target-stage: merge-corpus ``` -that corresponds to `stage: merge-corpus` in [/taskcluster/ci/merge-corpus/kind.yml](/taskcluster/ci/merge-corpus/kind.yml): +that corresponds to `stage: merge-corpus` in [/taskcluster/ci/merge-corpus/kind.yml](https://github.com/mozilla/firefox-translations-training/taskcluster/ci/merge-corpus/kind.yml): ``` tasks: merge-corpus: diff --git a/docs/training-guide.md b/docs/training-guide.md index 4a5b53458..ded5edf93 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -121,7 +121,7 @@ To use the default data cleaning pipeline set: ``` use-opuscleaner: false ``` -Make sure the language is present in [clean_parallel](/pipeline/clean/tools/clean_parallel.py#L19) script. +Make sure the language is present in [clean_parallel](https://github.com/mozilla/firefox-translations-training/pipeline/clean/tools/clean_parallel.py#L19) script. For more advanced cleaning and for using OpusCleaner look at the [Data cleaning](cleaning.md) doc. @@ -203,7 +203,7 @@ Find the full description of the pipeline steps [here](pipeline-steps.md). ### Cluster specific configuaiton The Marian workspace is usually safe to set to about 3/4 of available GPU memory -(in a [profile for Snakemake](/pipeline/train/train.sh) and throughout the ci steps in Task cluster). +(in a [profile for Snakemake](https://github.com/mozilla/firefox-translations-training/pipeline/train/train.sh) and throughout the ci steps in Task cluster). Setting a higher value speeds up training but might lead to out of GPU memory error. ### Taskcluster @@ -229,7 +229,7 @@ Find more details in the [Snakemake doc](snakemake.md). #### Mozilla Slurm cluster -I usually set just one GPU partition per run in the [cluster config](/pipeline/train/train.sh). It simplifies configuration and monitoring. +I usually set just one GPU partition per run in the [cluster config](https://github.com/mozilla/firefox-translations-training/pipeline/train/train.sh). It simplifies configuration and monitoring. Make sure to not set `precision: float16` on `txp` partition. @@ -298,7 +298,7 @@ Taskcluster retries automatically. Usually, by the time we train the student, it's so much data that it might not fit in 128 GB of RAM. For very high-resource languages like French it can happen even earlier, on the backward/teacher training stage. 
-The workaround is to remove `--shuffle-in-ram` from the [training script](/pipeline/train/train.sh) +The workaround is to remove `--shuffle-in-ram` from the [training script](https://github.com/mozilla/firefox-translations-training/pipeline/train/train.sh) and add `--shuffle batches` instead. More details in the [issue](https://github.com/mozilla/firefox-translations-training/issues/21). From 7f16f62eae11f2d8344c63ae093c76d2d3ef401b Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Wed, 1 Nov 2023 14:08:58 -0700 Subject: [PATCH 11/26] Fix code links --- docs/cleaning.md | 8 ++++---- docs/data.md | 6 +++--- docs/pipeline-steps.md | 6 +++--- docs/task-cluster.md | 2 +- docs/training-guide.md | 10 +++++----- 5 files changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/cleaning.md b/docs/cleaning.md index df0948bbc..e29a1d035 100644 --- a/docs/cleaning.md +++ b/docs/cleaning.md @@ -19,7 +19,7 @@ Config setting: ### Dataset fixing Some datasets require fixes like detokenization. -Dataset and language specific fixes are implemented in [https://github.com/mozilla/firefox-translations-training/pipeline/clean/fixes](https://github.com/mozilla/firefox-translations-training/pipeline/clean/fixes). +Dataset and language specific fixes are implemented in [https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes). Naming convention: - `.sh` for parallel dataset cleaning - `..sh` for language specific cleaning of parallel or monolingual dataset @@ -27,7 +27,7 @@ Naming convention: ### Cleaning scripts -Make sure the language is present in [clean_parallel](https://github.com/mozilla/firefox-translations-training/pipeline/clean/tools/clean_parallel.py#L19) script. +Make sure the language is present in [clean_parallel](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script. ### Bicleaner @@ -86,7 +86,7 @@ Copy JSON files for the produced filters `data/train-parts/*.filter.json` to ### Default config If no custom config was specifed for the dataset, -the [default config template](https://github.com/mozilla/firefox-translations-training/pipeline/clean/opuscleaner/configs/default.filters.json) will be used. +the [default config template](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used. Modify if needed. Some rules require specifying source or target language. The `` and `` in the template will be automatically replaced with the trained language pair. @@ -95,4 +95,4 @@ The generated default config will be copied to the target dataset cleaning direc ### Running Enable OpusCleaner in the training pipeline config and run the pipeline as usual. -OpusCleaner will replace the default [clean-corpus](https://github.com/mozilla/firefox-translations-training/pipeline/clean/clean-corpus.sh) script. +OpusCleaner will replace the default [clean-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/clean-corpus.sh) script. diff --git a/docs/data.md b/docs/data.md index bfb3ec029..39f59b58d 100644 --- a/docs/data.md +++ b/docs/data.md @@ -6,7 +6,7 @@ nav_order: 4 # Dataset importers -Dataset importers can be used in `datasets` sections of the [training config](/configs/config.test.yml). +Dataset importers can be used in `datasets` sections of the [training config](/tree/main/configs/config.test.yml). 
Example: ``` @@ -27,7 +27,7 @@ Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel da [Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html) Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz" -You can also use [find-corpus](https://github.com/mozilla/firefox-translations-traininghttps://github.com/mozilla/firefox-translations-training/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config. +You can also use [find-corpus](https://github.com/mozilla/firefox-translations-traininghttps://github.com/mozilla/firefox-translations-training/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config. Set up a local [poetry](https://python-poetry.org/) environment. ``` @@ -40,5 +40,5 @@ Make sure to check licenses of the datasets before using them. ## Adding a new importer -Just add a shell script to [corpus](https://github.com/mozilla/firefox-translations-training/pipeline/data/importers/corpus) or [mono](https://github.com/mozilla/firefox-translations-training/pipeline/data/importers/mono) which is named as `.sh` +Just add a shell script to [corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/corpus) or [mono](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/mono) which is named as `.sh` and accepts the same parameters as the other scripts from the same folder. diff --git a/docs/pipeline-steps.md b/docs/pipeline-steps.md index aa4b77b7a..73df3d126 100644 --- a/docs/pipeline-steps.md +++ b/docs/pipeline-steps.md @@ -15,14 +15,14 @@ Step | Description | Bottleneck | Comments --- | --- | --- | --- Installation | Installing dependencies and compiling | CPU | Takes ~1 hour Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation. -Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](https://github.com/mozilla/firefox-translations-training/pipeline/clean/tools/clean_parallel.py). +Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py). Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are no ones for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset, see [Dataset cleaning](##Dataset cleaning). 
Merge and dedupe | Merges clean dataset and applies deduplicaiton | CPU, Disk | Training vocabulary | Trains [SentencePiece](https://github.com/google/sentencepiece) vocabulary/tokenizer model on parallel corpus. | CPU | Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece). Augmentation with back-translations | Translates mono corpus combined from monolingual datasets in target language using shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others. -Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size. -Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size. +Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size. +Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size. Translation by teacher | Translates a corpus and monolingual data combined from configurable `dataset.mono-src` using the ensemble of teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode. Cross-entropy filtering | Scores translated corpus with backward s2s model and removes a part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets. Very disk intensive. Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts lexical shortlist using [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Good CPU parallelization. diff --git a/docs/task-cluster.md b/docs/task-cluster.md index 9e18031dc..ced7491f1 100644 --- a/docs/task-cluster.md +++ b/docs/task-cluster.md @@ -37,7 +37,7 @@ We use [Taskcluster taskgraph](https://taskcluster-taskgraph.readthedocs.io/en/l ![Choose action](img/tc-train-action.png) -6. Copy a config prepared in advance and press "train". See the example TC config [here](/configs/tc.prod.yml). +6. Copy a config prepared in advance and press "train". See the example TC config [here](/tree/main/configs/tc.prod.yml). You can find directions on how to configure training in the [Model training guide](training-guide.md). 
![Start training](img/tc-train.png) diff --git a/docs/training-guide.md b/docs/training-guide.md index ded5edf93..8ccde0875 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -9,7 +9,7 @@ nav_order: 2 A step-by-step guide on how to train a translation model. The configuration of the training run happens mostly in the training configuration file. -Look at the examples of the full production configs for [Taskcluster](/configs/tc.prod.yml) and [Snakemake](/configs/config.prod.yml). +Look at the examples of the full production configs for [Taskcluster](/tree/main/configs/tc.prod.yml) and [Snakemake](/tree/main/configs/config.prod.yml). ## 1. Choose a language @@ -121,7 +121,7 @@ To use the default data cleaning pipeline set: ``` use-opuscleaner: false ``` -Make sure the language is present in [clean_parallel](https://github.com/mozilla/firefox-translations-training/pipeline/clean/tools/clean_parallel.py#L19) script. +Make sure the language is present in [clean_parallel](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script. For more advanced cleaning and for using OpusCleaner look at the [Data cleaning](cleaning.md) doc. @@ -203,7 +203,7 @@ Find the full description of the pipeline steps [here](pipeline-steps.md). ### Cluster specific configuaiton The Marian workspace is usually safe to set to about 3/4 of available GPU memory -(in a [profile for Snakemake](https://github.com/mozilla/firefox-translations-training/pipeline/train/train.sh) and throughout the ci steps in Task cluster). +(in a [profile for Snakemake](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh) and throughout the ci steps in Task cluster). Setting a higher value speeds up training but might lead to out of GPU memory error. ### Taskcluster @@ -229,7 +229,7 @@ Find more details in the [Snakemake doc](snakemake.md). #### Mozilla Slurm cluster -I usually set just one GPU partition per run in the [cluster config](https://github.com/mozilla/firefox-translations-training/pipeline/train/train.sh). It simplifies configuration and monitoring. +I usually set just one GPU partition per run in the [cluster config](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh). It simplifies configuration and monitoring. Make sure to not set `precision: float16` on `txp` partition. @@ -298,7 +298,7 @@ Taskcluster retries automatically. Usually, by the time we train the student, it's so much data that it might not fit in 128 GB of RAM. For very high-resource languages like French it can happen even earlier, on the backward/teacher training stage. -The workaround is to remove `--shuffle-in-ram` from the [training script](https://github.com/mozilla/firefox-translations-training/pipeline/train/train.sh) +The workaround is to remove `--shuffle-in-ram` from the [training script](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh) and add `--shuffle batches` instead. More details in the [issue](https://github.com/mozilla/firefox-translations-training/issues/21). 
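A minimal sketch of what the `--shuffle-in-ram` workaround described above amounts to in the Marian invocation. Only the flag names come from the text; the dataset paths and the stripped-down command are placeholders, not the actual contents of `pipeline/train/train.sh`:

```bash
# Default behaviour assumed here: Marian keeps the whole shuffled training corpus in RAM,
# which can exhaust memory once the augmented/distilled corpus gets very large.
marian --train-sets corpus.src.gz corpus.trg.gz --shuffle-in-ram

# Workaround for huge corpora: drop --shuffle-in-ram and shuffle batches instead,
# trading some shuffling quality and speed for a much smaller memory footprint.
marian --train-sets corpus.src.gz corpus.trg.gz --shuffle batches
```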
From 982df976280ebd1e5e2e026471a97104a275ceca Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Wed, 1 Nov 2023 15:24:42 -0700 Subject: [PATCH 12/26] Fix code links --- docs/data.md | 2 +- docs/task-cluster.md | 2 +- docs/training-guide.md | 6 ++++-- 3 files changed, 6 insertions(+), 4 deletions(-) diff --git a/docs/data.md b/docs/data.md index 39f59b58d..0cbfc09a2 100644 --- a/docs/data.md +++ b/docs/data.md @@ -6,7 +6,7 @@ nav_order: 4 # Dataset importers -Dataset importers can be used in `datasets` sections of the [training config](/tree/main/configs/config.test.yml). +Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/config.test.yml). Example: ``` diff --git a/docs/task-cluster.md b/docs/task-cluster.md index ced7491f1..5873ea73e 100644 --- a/docs/task-cluster.md +++ b/docs/task-cluster.md @@ -37,7 +37,7 @@ We use [Taskcluster taskgraph](https://taskcluster-taskgraph.readthedocs.io/en/l ![Choose action](img/tc-train-action.png) -6. Copy a config prepared in advance and press "train". See the example TC config [here](/tree/main/configs/tc.prod.yml). +6. Copy a config prepared in advance and press "train". See the example TC config [here](https://github.com/mozilla/firefox-translations-training/tree/main/configs/tc.prod.yml). You can find directions on how to configure training in the [Model training guide](training-guide.md). ![Start training](img/tc-train.png) diff --git a/docs/training-guide.md b/docs/training-guide.md index 8ccde0875..960d2faff 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -9,7 +9,9 @@ nav_order: 2 A step-by-step guide on how to train a translation model. The configuration of the training run happens mostly in the training configuration file. -Look at the examples of the full production configs for [Taskcluster](/tree/main/configs/tc.prod.yml) and [Snakemake](/tree/main/configs/config.prod.yml). +Look at the examples of the full production configs for +[Taskcluster](https://github.com/mozilla/firefox-translations-training/tree/main/configs/tc.prod.yml) and +[Snakemake](https://github.com/mozilla/firefox-translations-training/tree/main/configs/config.prod.yml). ## 1. Choose a language @@ -36,7 +38,7 @@ experiment: 2. Go to [statmt22](https://www.statmt.org/wmt22/translation-task.html), [statmt21](https://www.statmt.org/wmt21/translation-task.html) etc. and check if the language pair participated in a competition. If yes, there's a good chance some extra data is available for training. -3. Use [find-corpus](/utils/find-corpus.py) tool to get OPUS datasets. +3. Use [find-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/utils/find-corpus.py) tool to get OPUS datasets. Install [poetry](https://python-poetry.org/) first, then run: ``` make install-utils From 157d6797e7b971fe9ba905c64b129c65f102e06b Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Thu, 2 Nov 2023 15:08:18 -0700 Subject: [PATCH 13/26] Fix link --- docs/training-guide.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/training-guide.md b/docs/training-guide.md index 960d2faff..41a956ffe 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -231,7 +231,7 @@ Find more details in the [Snakemake doc](snakemake.md). #### Mozilla Slurm cluster -I usually set just one GPU partition per run in the [cluster config](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh). 
It simplifies configuration and monitoring. +I usually set just one GPU partition per run in the [cluster config](https://github.com/mozilla/firefox-translations-training/tree/main/profiles/slurm-moz/config.cluster.yaml). It simplifies configuration and monitoring. Make sure to not set `precision: float16` on `txp` partition. @@ -259,7 +259,7 @@ For example for [this task group](https://firefox-ci-tc.services.mozilla.com/tas ``` LOGS_TASK_GROUP=DClbX0cjSCeQuoE1fW-Ehw make download-logs ``` -##### Snakemake +#### Snakemake Adjust the path to match the model directories in makefile `tensorboard` command and remove `--offline` to automtically update while training. #### Run server From 37a5e28f5e16a3a5e975d12e2acf7a908dcca5d2 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Fri, 3 Nov 2023 16:29:39 -0700 Subject: [PATCH 14/26] Clarify what config is --- docs/training-guide.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/training-guide.md b/docs/training-guide.md index 41a956ffe..09b9f04c5 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -23,7 +23,10 @@ Considerations: Currently we support automatic donwloading only for [news crawl](https://data.statmt.org/news-crawl/) - Availability of [bicleaner-ai models](https://github.com/bitextor/bicleaner-ai-data/releases) -Set the language pair and a name of the experiment in the config: + +Copy the [example config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/tc.prod.yml) from the `/configs` directory to modify. + +Then change the language pair and the name of the experiment: ``` experiment: name: test-quality @@ -142,7 +145,7 @@ and add filtering thresholds to the config. dataset-thresholds: opus_CCAligned/v1: 0.7 opus_OpenSubtitles/v2018: 0.8 - opus_ParaCrawl/v8: 0 + opus_ParaCrawl/v9: 0 ... ``` From d63e50fda790cccec9448bd264ea594917b8f789 Mon Sep 17 00:00:00 2001 From: Evgeny Pavlov Date: Fri, 3 Nov 2023 16:30:46 -0700 Subject: [PATCH 15/26] Fix note for bicleaner Co-authored-by: Marco Castelluccio --- docs/training-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/training-guide.md b/docs/training-guide.md index 09b9f04c5..8d1ddd611 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -137,7 +137,7 @@ and add filtering thresholds to the config. - `0.5` should be a good default value. - Noisier datasets like OpenSubtitles should have higher threshold. -- Set the threshold to `0` to skip cleaning entirely, for example for ParaCrawl dataset that comes already cleaned. +- Set the threshold to `0` to skip cleaning entirely, for example for ParaCrawl dataset that comes already cleaned by bicleaner. 
``` bicleaner: From 189fb006b271a649a90a92f01783cd96823339e6 Mon Sep 17 00:00:00 2001 From: Evgeny Pavlov Date: Fri, 3 Nov 2023 16:31:17 -0700 Subject: [PATCH 16/26] Fix typo Co-authored-by: Greg Tatum --- docs/_config.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_config.yml b/docs/_config.yml index 1117f352c..efeef4739 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -1,7 +1,7 @@ remote_theme: just-the-docs/just-the-docs #color_scheme: dark title: Firefox Translations Training -description: Documentaiton for the Firefox Translations training pipelines +description: Documentation for the Firefox Translations training pipelines heading_anchors: true # doesn't work favicon_ico: "img/logo.svg" From d605b770fe8cb7cf85b9b0e333cf23fe2e36d3b8 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Fri, 3 Nov 2023 16:34:13 -0700 Subject: [PATCH 17/26] Fix link --- docs/data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/data.md b/docs/data.md index 0cbfc09a2..c2e664595 100644 --- a/docs/data.md +++ b/docs/data.md @@ -27,7 +27,7 @@ Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel da [Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html) Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz" -You can also use [find-corpus](https://github.com/mozilla/firefox-translations-traininghttps://github.com/mozilla/firefox-translations-training/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config. +You can also use [find-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config. Set up a local [poetry](https://python-poetry.org/) environment. ``` From 8761ba5591c4b5f1295cdc0547a6650fb369514f Mon Sep 17 00:00:00 2001 From: Evgeny Pavlov Date: Fri, 3 Nov 2023 16:39:23 -0700 Subject: [PATCH 18/26] Fix mentioning of Marian Co-authored-by: Greg Tatum --- docs/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/index.md b/docs/index.md index b3950b09e..a7f6c3b58 100644 --- a/docs/index.md +++ b/docs/index.md @@ -21,7 +21,7 @@ The pipeline is capable of training a translation model for a language pair end Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters. Some settings, especially low resource languages might require extra tuning. -We use fast translation engine [Marian](https://marian-nmt.github.io). +We use [Marian](https://marian-nmt.github.io), the fast neural machine translation engine . ## Learning resources From 4558e30b25d174236f519aabebe7dce957b07bc1 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Fri, 3 Nov 2023 16:43:28 -0700 Subject: [PATCH 19/26] Remove "my" --- docs/training-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/training-guide.md b/docs/training-guide.md index 8d1ddd611..a1a9062c2 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -65,7 +65,7 @@ datasets: ... ``` It's hard to say how much data is required to train something useful. -My guess would be at least 10 million sentences. Ideally 100M+. 
+Probably, at least 10 million sentences. Ideally 100M+ to get the best quality. ### Evaluation datasets From 416f7996e0ccd5d2847b8fb4fdd5cf6bd9b9d082 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Fri, 3 Nov 2023 16:47:26 -0700 Subject: [PATCH 20/26] Make note about snakemake more visible --- docs/orchestrators.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/orchestrators.md b/docs/orchestrators.md index c5a608375..f371fbbf1 100644 --- a/docs/orchestrators.md +++ b/docs/orchestrators.md @@ -10,8 +10,12 @@ has_toc: false An orchestrator is responsible for workflow management and parallelization. +Currently supported orchestrators: + - [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI. It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability. [Usage instructions](task-cluster.md). - [Snakemake](https://snakemake.github.io/) - a file based orchestrator that can be used to run the pipeline locally or on a Slurm cluster. - [Usage instructions](snakemake.md). (The integration will not be actively maintained, since Mozilla is switching to Taskcluster) + [Usage instructions](snakemake.md). + +Mozilla is currently switching to Taskcluster and the Snakemake workflow will be less actively maintained in the future. From 28c380c61c4da8c11b18d5df6def33f0c304905c Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Fri, 3 Nov 2023 16:48:08 -0700 Subject: [PATCH 21/26] Fix phrasing --- docs/orchestrators.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/orchestrators.md b/docs/orchestrators.md index f371fbbf1..a4668cc69 100644 --- a/docs/orchestrators.md +++ b/docs/orchestrators.md @@ -10,7 +10,7 @@ has_toc: false An orchestrator is responsible for workflow management and parallelization. -Currently supported orchestrators: +Supported orchestrators: - [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI. It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability. From 4a9803b18417598da5bda6db79d87039a97abe41 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Fri, 3 Nov 2023 17:19:00 -0700 Subject: [PATCH 22/26] Add link to bilceaner paper --- docs/references.md | 1 + docs/training-guide.md | 5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/references.md b/docs/references.md index 2acac3f9a..b0b19acdc 100644 --- a/docs/references.md +++ b/docs/references.md @@ -37,3 +37,4 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020 14. Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). [A Simple, Fast, and Effective Reparameterization of IBM Model 2](http://www.ark.cs.cmu.edu/cdyer/fast_valign.pdf). In Proc. of NAACL. 15. [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162) (Sennrich et al., ACL 2016) 16. [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/abs/1804.10959) (Taku Kudo, 2018) +17. [Bicleaner AI: Bicleaner Goes Neural](https://aclanthology.org/2022.lrec-1.87.pdf) (Zaragoza-Bernabeu et al., LREC 2022) diff --git a/docs/training-guide.md b/docs/training-guide.md index a1a9062c2..7ae623ed9 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -135,9 +135,10 @@ It is recommended to use Bicleaner ML models to filter noisy data. 
Check that the bicleaner-ai model is [available](https://github.com/bitextor/bicleaner-ai-data/releases) and add filtering thresholds to the config. -- `0.5` should be a good default value. +- `0.5` should be a [good default value](https://github.com/bitextor/bicleaner-ai/wiki/How-to-train-your-Bicleaner-AI#bicleaning-a-corpus). - Noisier datasets like OpenSubtitles should have higher threshold. -- Set the threshold to `0` to skip cleaning entirely, for example for ParaCrawl dataset that comes already cleaned by bicleaner. +- Set the threshold to `0` to skip cleaning entirely, for example for ParaCrawl dataset that comes already cleaned by bicleaner + (see [Bicleaner AI: Bicleaner Goes Neural](https://aclanthology.org/2022.lrec-1.87.pdf), section 4.2.2). ``` bicleaner: From 7c31e16112d805fd2d7073760a7e6e13de99ae36 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Fri, 3 Nov 2023 17:52:19 -0700 Subject: [PATCH 23/26] Add clarifications --- docs/references.md | 1 + docs/training-guide.md | 22 +++++++++++++++------- 2 files changed, 16 insertions(+), 7 deletions(-) diff --git a/docs/references.md b/docs/references.md index b0b19acdc..be9fe6b8e 100644 --- a/docs/references.md +++ b/docs/references.md @@ -38,3 +38,4 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020 15. [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162) (Sennrich et al., ACL 2016) 16. [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/abs/1804.10959) (Taku Kudo, 2018) 17. [Bicleaner AI: Bicleaner Goes Neural](https://aclanthology.org/2022.lrec-1.87.pdf) (Zaragoza-Bernabeu et al., LREC 2022) +18. [Sequence-Level Knowledge Distillation](https://arxiv.org/abs/1606.07947) (Yoon Kim, Alexander M. Rush, EMNLP 2016) diff --git a/docs/training-guide.md b/docs/training-guide.md index 7ae623ed9..e07b47f9b 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -93,9 +93,14 @@ python utils/find-corpus.py en ru sacrebleu ``` ### Monolingual corpus -It is recommended to always use back-translations to augment training data and to use -monolingual corpus to augment data for decoding by the teachers, even for high resource lanugages. -It will be especially useful for low-resource ones though. +It is recommended to use back-translations to augment training data by training a model in reversed direction and then +translating a monolingual corpus in target language to the source language +(see [Improving Neural Machine Translation Models with Monolingual Data](https://aclanthology.org/P16-1009.pdf)). + +It is also important to use monolingual corpus in source language to augment data for decoding by the teachers +to improve teacher-student knowledge distillation (see [Sequence-Level Knowledge Distillation](https://arxiv.org/abs/1606.07947)). + +Those techniques are useful even for high-resource languages but especially useful for low-resource ones. The only limitation is probably available computational resources. Find monolingual data and add it to `datasets.mono-src` and `datasets.mono-trg`. @@ -131,13 +136,16 @@ Make sure the language is present in [clean_parallel](https://github.com/mozilla For more advanced cleaning and for using OpusCleaner look at the [Data cleaning](cleaning.md) doc. ### Bicleaner -It is recommended to use Bicleaner ML models to filter noisy data. 
-Check that the bicleaner-ai model is [available](https://github.com/bitextor/bicleaner-ai-data/releases) -and add filtering thresholds to the config. +It is recommended to use [Bicleaner](https://github.com/bitextor/bicleaner-ai) ML models to filter noisy data. +Bicleaner classifier scores parallel sentences from 0 to 1 where 0 means a very noisy translation and 1 is a good translation. +Most of the scores will be between 0 and 1. + +Check that the bicleaner-ai model is [available](https://github.com/bitextor/bicleaner-ai-data/releases) for you language pair +and add filtering thresholds to the config. - `0.5` should be a [good default value](https://github.com/bitextor/bicleaner-ai/wiki/How-to-train-your-Bicleaner-AI#bicleaning-a-corpus). - Noisier datasets like OpenSubtitles should have higher threshold. -- Set the threshold to `0` to skip cleaning entirely, for example for ParaCrawl dataset that comes already cleaned by bicleaner +- Set the threshold to `0` to skip cleaning entirely, for example for ParaCrawl dataset that comes already cleaned by Bicleaner (see [Bicleaner AI: Bicleaner Goes Neural](https://aclanthology.org/2022.lrec-1.87.pdf), section 4.2.2). ``` From d6c16938b900578751e90516a653ab0a0f828956 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Fri, 3 Nov 2023 17:57:50 -0700 Subject: [PATCH 24/26] Add links to default training configs --- docs/training-guide.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/training-guide.md b/docs/training-guide.md index e07b47f9b..36c1251b6 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -161,6 +161,9 @@ and add filtering thresholds to the config. ## 4. Set hyperparameters The pipeline supports overriding the default [Marian settings](https://marian-nmt.github.io/docs/cmd/marian/) in the training config. +The default settings are in the `pipeline/train/configs` directory, +for example [teacher.train.yml](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) +and in the [train.sh](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh) script. ### Model training I often increase early stopping for teachers to make sure the training converges. From 68cb740d8fe3173b7270108be00ec364e953cf91 Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Fri, 3 Nov 2023 18:07:10 -0700 Subject: [PATCH 25/26] Add reference to bilceaner section --- docs/cleaning.md | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) diff --git a/docs/cleaning.md b/docs/cleaning.md index e29a1d035..a7597fe70 100644 --- a/docs/cleaning.md +++ b/docs/cleaning.md @@ -33,22 +33,8 @@ Make sure the language is present in [clean_parallel](https://github.com/mozilla ### Bicleaner It is recommended to use Bicleaner ML models to filter noisy data. -Check that the bicleaner-ai model is [available](https://object.pouta.csc.fi/OPUS-ELRC-3069-wikipedia_health) -and add filtering thresholds to the config. +See more details on how to configure it in the [Model training guide, Bicleaner section](training-guide.md/#bicleaner). -- `0.5` should be a good default value. -- Noisier datasets like OpenSubtitles should have higher threshold. -- Set the threshold to `0` to skip cleaning entirely, for example for ParaCrawl dataset that comes already cleaned. - -``` - bicleaner: - default-threshold: 0.5 - dataset-thresholds: - opus_CCAligned/v1: 0.7 - opus_OpenSubtitles/v2018: 0.8 - opus_ParaCrawl/v8: 0 - ... 
-``` ## OpusCleaner From 9baa0aa227313a935551efa0c21c5cac2121621a Mon Sep 17 00:00:00 2001 From: evgeny pavlov Date: Fri, 3 Nov 2023 18:15:49 -0700 Subject: [PATCH 26/26] Small fixes --- docs/training-guide.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/training-guide.md b/docs/training-guide.md index 36c1251b6..95ab774ad 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -104,8 +104,8 @@ Those techniques are useful even for high-resource languages but especially usef The only limitation is probably available computational resources. Find monolingual data and add it to `datasets.mono-src` and `datasets.mono-trg`. -I usually use [News Crawl](https://data.statmt.org/news-crawl/) datasets from statmt -because they are relatively clean and we have an automatic downloading for them. +Using [News Crawl](https://data.statmt.org/news-crawl/) datasets from statmt is preferable +because they are relatively clean, and the pipeline supports automatic downloading for them. ``` # to be translated by the ensemble of teacher models mono-src: @@ -140,7 +140,7 @@ It is recommended to use [Bicleaner](https://github.com/bitextor/bicleaner-ai) M Bicleaner classifier scores parallel sentences from 0 to 1 where 0 means a very noisy translation and 1 is a good translation. Most of the scores will be between 0 and 1. -Check that the bicleaner-ai model is [available](https://github.com/bitextor/bicleaner-ai-data/releases) for you language pair +Check that the bicleaner-ai model is [available](https://github.com/bitextor/bicleaner-ai-data/releases) for your language pair and add filtering thresholds to the config. - `0.5` should be a [good default value](https://github.com/bitextor/bicleaner-ai/wiki/How-to-train-your-Bicleaner-AI#bicleaning-a-corpus).