Releases · mozilla/translations · GitHub

06 Nov 18:20

eu9ene

0.4.0 Taskcluster (Latest)

Added Taskcluster support, which provides more scalable training and better observability
New page with updated documentation
Integrated OpusCleaner as optional data cleaner
Full test pipeline runs on pull request CI
Code checks and linting
Local tools run in a Poetry environment

06 Dec 23:09

eu9ene

0.3.1 Quality improvements

Update teacher hyperparameters
Use the chrF metric to select the best model (https://discourse.translatelocally.com/t/marian-configuration-to-use/24)
Update SacreBLEU and add chrF metric to evaluation.
Add evaluation of each model and the whole ensemble of teachers.
Early stopping based on ce-mean-words instead of bleu-detok
Continue teacher training on parallel data only (train on augmented data for N epochs first)
Do cleaning per dataset
Add per-dataset fixes from https://github.com/ZJaume/clean/tree/master/fixes.
Use bicleaner per dataset with customizable thresholds.
Remove punctuation normalization
Add alphabets for more languages in the cleaning scripts
Replace absolute paths with relative ones
Add Snakemake cross-workflow caching. Caching works, but there appears to be a bug in Snakemake: it doesn't recognize symlinks after caching. Disabled for now.
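
To illustrate the early-stopping change in this release, here is a minimal sketch of patience-based early stopping on a lower-is-better metric such as ce-mean-words. It is not the pipeline's or Marian's implementation; the function name `early_stop_step`, the `patience` parameter, and the validation scores are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional


def early_stop_step(
    scores: Iterable[float],
    patience: int,
    lower_is_better: bool = True,
) -> Optional[int]:
    """Return the 1-based validation step at which training would stop.

    Training stops once the metric has failed to improve for `patience`
    consecutive validations. Returns None if that never happens.
    """
    best: Optional[float] = None
    stalled = 0
    for step, score in enumerate(scores, start=1):
        if best is None:
            improved = True
        elif lower_is_better:
            improved = score < best   # e.g. a cross-entropy style metric
        else:
            improved = score > best   # e.g. a BLEU or chrF style metric
        if improved:
            best = score
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                return step
    return None


# Hypothetical validation scores, invented for illustration only.
ce_mean_words = [3.2, 2.9, 2.8, 2.8, 2.81, 2.79, 2.80, 2.80, 2.80]
print(early_stop_step(ce_mean_words, patience=3, lower_is_better=True))
```

The only difference between stopping on a cross-entropy metric and stopping on a BLEU-like metric in this sketch is the direction of the comparison, which is why `lower_is_better` is a parameter.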

28 Oct 18:10

eu9ene

0.3.0 Workflow manager

Workflow management using Snakemake
Parallelization to run on a cluster
Singularity containerization support
Slurm support
Teacher ensemble support

17 Aug 20:23

eu9ene

0.2.1 Improvements

Flores dataset importer
Custom dataset importer
Ability to use a pre-trained backward model
Save experiment config on start
Stubs for dataset caching (decided to sync the implementation with the workflow manager integration)
Use best BLEU models instead of best ce-mean-words models
Fix linting warnings

26 Jul 17:02

eu9ene

0.2.0 Bicleaner

SacreBLEU is a regular importer now and evaluation is not limited to sacrebleu datasets.
Added bicleaner-ai and bicleaner filtering (one or the other, depending on the available pretrained language packs).
Added a script to find all datasets based on language pair and importer type, ready to use in config.
Fixed conda environment activation to be reproducible on GCP.
Other minor reproducibility fixes.

12 Jul 21:49

eu9ene

0.1.0 Basic pipeline

The initial pipeline allows training a language pair end to end on a standalone machine.

A test ru-en model was trained on the OPUS ParaCrawl corpus.

There might be reproducibility issues depending on the machine, language pair, and dataset configuration.
