Update the training guide (#239)
* Update training guide

* Fix docs

* Add index file

* Remove header

* Fix docs link

* Remove tensorboard section

* Add theme

* Update navigation

* Add logo

* Use absolute links

* Fix code links

* Fix code links

* Fix link

* Clarify what config is

* Fix note for bicleaner

Co-authored-by: Marco Castelluccio <[email protected]>

* Fix typo

Co-authored-by: Greg Tatum <[email protected]>

* Fix link

* Fix mentioning of Marian

Co-authored-by: Greg Tatum <[email protected]>

* Remove "my"

* Make note about snakemake more visible

* Fix phrasing

* Add link to bilceaner paper

* Add clarifications

* Add links to default training configs

* Add reference to bilceaner section

* Small fixes

---------

Co-authored-by: Marco Castelluccio <[email protected]>
Co-authored-by: Greg Tatum <[email protected]>
3 people authored Nov 6, 2023
1 parent cf51faa commit 2df0a3a
Showing 15 changed files with 465 additions and 184 deletions.
4 changes: 2 additions & 2 deletions Makefile
@@ -119,13 +119,13 @@ dag:
################################################

# OpusCleaner is a data cleaner for training corpus
# More details are in docs/opus-cleaner.md
# More details are in docs/cleaning.md
opuscleaner-ui:
poetry install --only opuscleaner
opuscleaner-server serve --host=0.0.0.0 --port=8000

# Utils to find corpus etc
install utils:
install-utils:
poetry install --only utils

# Black is a code formatter for Python files. Running this command will check that
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ power the Firefox web page translation starting with version 118.

The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.

[Documentation](/docs)
[Documentation](https://mozilla.github.io/firefox-translations-training/)

## Pipeline

12 changes: 12 additions & 0 deletions docs/_config.yml
@@ -0,0 +1,12 @@
remote_theme: just-the-docs/just-the-docs
#color_scheme: dark
title: Firefox Translations Training
description: Documentation for the Firefox Translations training pipelines
heading_anchors: true
# doesn't work
favicon_ico: "img/logo.svg"
# Aux links for the upper right navigation
aux_links:
"GitHub":
- "https://github.com/mozilla/firefox-translations-training"

84 changes: 84 additions & 0 deletions docs/cleaning.md
@@ -0,0 +1,84 @@
---
layout: default
title: Data cleaning
nav_order: 5
---

# Data cleaning

Making datasets less noisy to improve quality of translation.

## Regular pipeline


Config setting:
```
use-opuscleaner: false
```

### Dataset fixing

Some datasets require fixes like detokenization.
Dataset and language specific fixes are implemented in [https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes).
Naming convention:
- `<dataset_name>.sh` for parallel dataset cleaning
- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset
- `/` in dataset name should be replaced with `_`
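
The naming convention above can be sketched in a few lines of Python. This is only an illustration: the helper `fix_script_names` is hypothetical and is not part of the pipeline, and the dataset name is taken exactly as it appears in the training config.

```python
from typing import List, Optional


def fix_script_names(dataset_name: str, lang: Optional[str] = None) -> List[str]:
    """Return the fix script file names implied by the naming convention.

    Hypothetical helper for illustration only; the pipeline may resolve
    fix scripts differently.
    """
    # `/` in the dataset name is replaced with `_`
    safe_name = dataset_name.replace("/", "_")
    # `<dataset_name>.sh` applies to the parallel dataset as a whole
    names = [f"{safe_name}.sh"]
    if lang is not None:
        # `<dataset_name>.<lang>.sh` applies to one language side only
        names.append(f"{safe_name}.{lang}.sh")
    return names


print(fix_script_names("opus_ParaCrawl/v8", "en"))
# ['opus_ParaCrawl_v8.sh', 'opus_ParaCrawl_v8.en.sh']
```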

### Cleaning scripts

Make sure the language is present in the [clean_parallel](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script.


### Bicleaner

It is recommended to use Bicleaner ML models to filter noisy data.
See more details on how to configure it in the [Model training guide, Bicleaner section](training-guide.md/#bicleaner).


## OpusCleaner

Another option is to use the all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by the HPLT project.

Config setting:
```
use-opuscleaner: true
```

## Custom filter configs
The idea behind OpusCleaner is to customize filter rules for each language pair and dataset
to get a training corpus with less noise and train higher quality translation models.

Filtering rules can be tuned in an interactive UI.

### Installation

Install the OpusCleaner UI on a server.
See the installation instructions in the [OpusCleaner readme](https://github.com/hplt-project/OpusCleaner).

For local usage, run `make opuscleaner-ui` from a poetry shell.
Then open `http://0.0.0.0:8000` in a browser.

### Making filters

Choose a language pair and download the required OPUS datasets.
They will correspond to `opus_...` training datasets in the training pipeline config.

Configure cleaning rules for the datasets in the UI.

Copy JSON files for the produced filters `data/train-parts/*.filter.json` to
`pipeline/clean/opuscleaner/configs/<src-lang-code>-<trg-lang-code>/`.
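
The copy step can also be scripted. The following is a minimal sketch, assuming the filters were saved under `data/train-parts/` as described above; the function `copy_filters` and its parameters are hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path
from typing import List


def copy_filters(filters_dir: Path, repo_root: Path, src: str, trg: str) -> List[Path]:
    """Copy `*.filter.json` files into the per-language-pair config directory.

    Hypothetical helper: `filters_dir` is the OpusCleaner output directory
    (for example `data/train-parts`), `repo_root` is the checkout of the
    training repository.
    """
    target_dir = (
        repo_root / "pipeline" / "clean" / "opuscleaner" / "configs" / f"{src}-{trg}"
    )
    # Create `<src-lang-code>-<trg-lang-code>/` if it does not exist yet
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for filter_file in sorted(filters_dir.glob("*.filter.json")):
        # copy2 returns the path of the newly created file
        copied.append(Path(shutil.copy2(filter_file, target_dir)))
    return copied
```

For example, `copy_filters(Path("data/train-parts"), Path("."), "en", "ru")` would place the filters under `pipeline/clean/opuscleaner/configs/en-ru/`.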

### Default config

If no custom config was specified for the dataset,
the [default config template](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used.

Modify it if needed. Some rules require specifying the source or target language.
The `<src>` and `<trg>` in the template will be automatically replaced with the trained language pair.
The generated default config will be copied to the target dataset cleaning directory.
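
The placeholder substitution can be pictured roughly as follows. This is a sketch under stated assumptions: the template string, the filter name `example_lang_filter` and the function `render_default_filters` are invented for illustration and do not reflect the actual contents of `default.filters.json` or the pipeline's own implementation.

```python
import json

# Hypothetical template text; the real default.filters.json may look different.
TEMPLATE = """
{
  "filters": [
    {
      "filter": "example_lang_filter",
      "parameters": {"LANG1": "<src>", "LANG2": "<trg>"}
    }
  ]
}
"""


def render_default_filters(template_text: str, src: str, trg: str) -> dict:
    """Replace `<src>` and `<trg>` with language codes and parse the result."""
    rendered = template_text.replace("<src>", src).replace("<trg>", trg)
    # Parsing confirms the rendered config is still valid JSON
    return json.loads(rendered)


config = render_default_filters(TEMPLATE, "en", "ru")
print(config["filters"][0]["parameters"])
```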

### Running

Enable OpusCleaner in the training pipeline config and run the pipeline as usual.
OpusCleaner will replace the default [clean-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/clean-corpus.sh) script.
49 changes: 10 additions & 39 deletions docs/data.md
@@ -1,10 +1,12 @@
# Data
---
layout: default
title: Datasets
nav_order: 4
---

This section includes instructions on how to find and configure datasets and cleaning procedures.
# Dataset importers

## Dataset importers

Dataset importers can be used in `datasets` sections of the [training config](/configs/config.test.yml).
Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/config.test.yml).

Example:
```
@@ -25,7 +27,7 @@ Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel da
[Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html)
Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz"

You can also use [find-corpus](/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
You can also use [find-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.

Set up a local [poetry](https://python-poetry.org/) environment.
```
@@ -36,38 +38,7 @@ python utils/find-corpus.py en ru sacrebleu
```
Make sure to check licenses of the datasets before using them.

### Adding a new importer
## Adding a new importer

Just add a shell script to [corpus](/pipeline/data/importers/corpus) or [mono](/pipeline/data/importers/mono) which is named as `<prefix>.sh`
Just add a shell script to [corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/corpus) or [mono](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/mono) which is named as `<prefix>.sh`
and accepts the same parameters as the other scripts from the same folder.

## Dataset fixing

Some datasets require fixes like detokenization. Dataset and language specific fixes are implemented in [pipeline/clean/fixes](/pipeline/clean/fixes).
Naming convention:
- `<dataset_name>.sh` for parallel dataset cleaning
- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset
- `/` in dataset name should be replaced with `_`

## Dataset cleaning
Some parallel datasets require more aggressive filtering.
Dataset specific Bicleaner thresholds can be set in config.
`0` means skipping filtering entirely (useful for Paracrawl).

Example:

```
experiment:
...
bicleaner:
default-threshold: 0.5
dataset-thresholds:
opus_ParaCrawl/v8: 0
mtdata_neulab_tedtalksv1_train: 0.6
```

### OpusCleaner

Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project.

See more details in the [dedicated doc](opus-cleaner.md).
6 changes: 6 additions & 0 deletions docs/development.md
@@ -1,3 +1,9 @@
---
layout: default
title: Development
nav_order: 7
---

# Development

## Architecture
4 changes: 4 additions & 0 deletions docs/img/logo.svg
38 changes: 38 additions & 0 deletions docs/index.md
@@ -0,0 +1,38 @@
---
layout: default
title: Home
nav_order: 1
description: "Firefox Translations Training documentation."
permalink: /
---

# Firefox Translations training
Training pipelines for Firefox Translations machine translation models.

The trained models are hosted in the [firefox-translations-models](https://github.com/mozilla/firefox-translations-models/) repository,
are compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator) and
power the Firefox web page translation starting with version 118.

The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.

## Training pipeline

The pipeline is capable of training a translation model for a language pair end to end.
Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters.
Some settings, especially for low-resource languages, might require extra tuning.

We use [Marian](https://marian-nmt.github.io), the fast neural machine translation engine.

## Learning resources

- High level overview [post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/)
- [Model training guide](training-guide.md) - practical advice on how to use the pipeline
- [Reference papers](references.md)


## Acknowledgements
This project uses materials developed by:
- Bergamot project ([github](https://github.com/browsermt), [website](https://browser.mt/)) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303
- HPLT project ([github](https://github.com/hplt-project), [website](https://hplt-project.org/)) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]
- OPUS-MT project ([github](https://github.com/Helsinki-NLP/Opus-MT), [website](https://opus.nlpl.eu/))
- Many other open source projects and research papers (see [References](references.md))
47 changes: 0 additions & 47 deletions docs/opus-cleaner.md

This file was deleted.

21 changes: 21 additions & 0 deletions docs/orchestrators.md
@@ -0,0 +1,21 @@
---
layout: default
title: Orchestrators
nav_order: 6
has_children: true
has_toc: false
---

# Orchestrators

An orchestrator is responsible for workflow management and parallelization.

Supported orchestrators:

- [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI.
It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability.
[Usage instructions](task-cluster.md).
- [Snakemake](https://snakemake.github.io/) - a file based orchestrator that can be used to run the pipeline locally or on a Slurm cluster.
[Usage instructions](snakemake.md).

Mozilla is currently switching to Taskcluster and the Snakemake workflow will be less actively maintained in the future.
11 changes: 8 additions & 3 deletions docs/pipeline-steps.md
@@ -1,3 +1,8 @@
---
layout: default
title: Pipeline steps
nav_order: 3
---

# Pipeline steps

@@ -10,14 +15,14 @@ Step | Description | Bottleneck | Comments
--- | --- | --- | ---
Installation | Installing dependencies and compiling | CPU | Takes ~1 hour
Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation.
Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/tools/clean_parallel.py).
Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py).
Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are none for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset, see [Dataset cleaning](##Dataset cleaning).
Merge and dedupe | Merges clean dataset and applies deduplication | CPU, Disk |
Training vocabulary | Trains [SentencePiece](https://github.com/google/sentencepiece) vocabulary/tokenizer model on parallel corpus. | CPU |
Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece).
Augmentation with back-translations | Translates mono corpus combined from monolingual datasets in target language using shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others.
Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size.
Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size.
Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size.
Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size.
Translation by teacher | Translates a corpus and monolingual data combined from configurable `dataset.mono-src` using the ensemble of teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode.
Cross-entropy filtering | Scores translated corpus with backward s2s model and removes a part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets. Very disk intensive.
Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts lexical shortlist using [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Good CPU parallelization.
9 changes: 8 additions & 1 deletion docs/references.md
@@ -1,3 +1,9 @@
---
layout: default
title: References
nav_order: 8
---

# References

Here is a list of selected publications on which the training pipeline is based.
@@ -15,7 +21,6 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020

3. Mölder F, Jablonski KP, Letcher B, et al. [Sustainable data analysis with Snakemake](https://pubmed.ncbi.nlm.nih.gov/34035898/). F1000Res. 2021;10:33. Published 2021 Jan 18. doi:10.12688/f1000research.29032.2


4. [Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task](https://aclanthology.org/2020.ngt-1.26) (Bogoychev et al., NGT 2020)

5. [From Research to Production and Back: Ludicrously Fast Neural Machine Translation](https://aclanthology.org/D19-5632) (Kim et al., EMNLP 2019)
@@ -32,3 +37,5 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020
14. Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). [A Simple, Fast, and Effective Reparameterization of IBM Model 2](http://www.ark.cs.cmu.edu/cdyer/fast_valign.pdf). In Proc. of NAACL.
15. [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162) (Sennrich et al., ACL 2016)
16. [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/abs/1804.10959) (Taku Kudo, 2018)
17. [Bicleaner AI: Bicleaner Goes Neural](https://aclanthology.org/2022.lrec-1.87.pdf) (Zaragoza-Bernabeu et al., LREC 2022)
18. [Sequence-Level Knowledge Distillation](https://arxiv.org/abs/1606.07947) (Yoon Kim, Alexander M. Rush, EMNLP 2016)
20 changes: 7 additions & 13 deletions docs/snakemake.md
@@ -1,3 +1,10 @@
---
layout: default
title: Snakemake
nav_order: 2
parent: Orchestrators
---

# Snakemake

This section includes the instructions on how to run the pipeline
@@ -284,16 +291,3 @@ The main directories inside `SHARED_ROOT` are:
│ └ ru-en
│ └ test
│ └ clean_corpus.log


## Utilities

### Tensorboard

To see training graphs run tensorboard:

```
make install-tensorboard
make tensorboard
```
Then port forward 6006.

1 comment on commit 2df0a3a

@firefoxci-taskcluster

Uh oh! Looks like an error! Details

Taskcluster-GitHub attempted to create a task for this event with the following scopes:

["assume:repo:github.com/mozilla/firefox-translations-training:tag:0.4.0","queue:route:checks","queue:scheduler-id:taskcluster-github"]

The expansion of these scopes is not sufficient to create the task, leading to the following:

Client ID static/taskcluster/github does not have sufficient scopes and is missing the following scopes:

assume:repo:github.com/mozilla/firefox-translations-training:branch:0.4.0

This request requires the client to satisfy the following scope expression:

{
  "AllOf": [
    "assume:repo:github.com/mozilla/firefox-translations-training:branch:0.4.0",
    "queue:route:checks",
    "queue:route:tc-treeherder.v2.firefox-translations-training.2df0a3a905a26fed7e6a6e48ccb2156f29282b4a",
    "queue:route:index.translations.v2.firefox-translations-training.latest.taskgraph.decision",
    "queue:route:index.translations.v2.firefox-translations-training.revision.2df0a3a905a26fed7e6a6e48ccb2156f29282b4a.taskgraph.decision",
    "queue:create-task:project:none",
    "queue:scheduler-id:translations-level-1",
    {
      "AnyOf": [
        "queue:create-task:highest:translations-1/decision-gcp",
        "queue:create-task:very-high:translations-1/decision-gcp",
        "queue:create-task:high:translations-1/decision-gcp",
        "queue:create-task:medium:translations-1/decision-gcp",
        "queue:create-task:low:translations-1/decision-gcp",
        "queue:create-task:very-low:translations-1/decision-gcp"
      ]
    }
  ]
}

  • method: createTask
  • errorCode: InsufficientScopes
  • statusCode: 403
  • time: 2023-11-06T18:20:21.815Z
