Releases: mozilla/translations

0.4.0 Taskcluster

06 Nov 18:20
2df0a3a
  • Added Taskcluster support, which provides more scalable training and better observability
  • New page with updated documentation
  • Integrated OpusCleaner as optional data cleaner
  • Full test pipeline runs on pull request CI
  • Code checks and linting
  • Local tools run in a Poetry environment

0.3.1 Quality improvements

06 Dec 23:09
3b3f33b
  • Update teacher hyperparameters
  • Use the chrF metric to select the best model (https://discourse.translatelocally.com/t/marian-configuration-to-use/24)
  • Update SacreBLEU and add the chrF metric to evaluation
  • Add evaluation of each model and of the whole ensemble of teachers
  • Stop early based on ce-mean-words instead of bleu-detok
  • Continue teacher training on parallel data only (train on augmented data for N epochs first)
  • Do cleaning per dataset
  • Add per-dataset fixes from https://github.com/ZJaume/clean/tree/master/fixes
  • Use bicleaner per dataset with customizable thresholds
  • Remove punctuation normalization
  • Add alphabets for more languages in the cleaning scripts
  • Replace absolute paths with relative ones
  • Add Snakemake cross-workflow caching. Caching works, but Snakemake apparently has a bug: it doesn't recognize symlinks after caching. Disabled for now.

0.3.0 Workflow manager

28 Oct 18:10
a09b0ac
  • Workflow management using Snakemake
  • Parallelization to run on a cluster
  • Singularity containerization support
  • Slurm support
  • Teacher ensemble support

0.2.1 Improvements

17 Aug 20:23
0f6e64c
  • Flores dataset importer
  • Custom dataset importer
  • Ability to use a pre-trained backward model
  • Save experiment config on start
  • Stubs for dataset caching (decided to sync the implementation with the workflow manager integration)
  • Use the best BLEU models instead of the best ce-mean-words models
  • Fix linting warnings

0.2.0 Bicleaner

26 Jul 17:02
ec783cf
  • SacreBLEU is now a regular importer, and evaluation is no longer limited to SacreBLEU datasets
  • Added bicleaner-ai and bicleaner filtering (one or the other, depending on which pretrained language packs are available)
  • Added a script that finds all datasets for a given language pair and importer type, with output ready to use in the config
  • Fixed conda environment activation to be reproducible on GCP
  • Other minor reproducibility fixes

0.1.0 Basic pipeline

12 Jul 21:49
af2abbf

The initial pipeline allows training a language pair end to end on a standalone machine.

A test ru-en model was trained on the OPUS ParaCrawl corpus.

There might be reproducibility issues depending on the machine, language pair, and dataset configuration.