Releases · mozilla/translations
0.4.0 Taskcluster
- Added support for Taskcluster, which provides more scalable training and better observability
- New page with updated documentation
- Integrated OpusCleaner as an optional data cleaner
- Full test pipeline runs on pull request CI
- Code checks and linting
- Local tools run under a Poetry environment
0.3.1 Quality improvements
- Update teacher hyperparameters
- Use the chrF metric to select the best model (https://discourse.translatelocally.com/t/marian-configuration-to-use/24)
- Update SacreBLEU and add the chrF metric to evaluation (see the evaluation sketch after this list)
- Add evaluation of each individual model and of the whole teacher ensemble
- Early stopping based on ce-mean-words instead of bleu-detok
- Continue teacher training on parallel data only (train on augmented data for N epochs first)
- Do cleaning per dataset
- Add per-dataset fixes from https://github.com/ZJaume/clean/tree/master/fixes
- Use bicleaner per dataset with customizable thresholds
- Remove punctuation normalization
- Add alphabets for more languages in the cleaning scripts
- Replace absolute paths with relative ones
- Add Snakemake cross-workflow caching (disabled for now: Snakemake appears to have a bug where it does not recognize symlinks after caching)
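As a rough illustration of the BLEU and chrF evaluation mentioned above, here is a minimal sketch using sacrebleu's Python API. The file names are hypothetical and this is not the pipeline's actual evaluation script.

```python
# Minimal BLEU + chrF scoring sketch using the sacrebleu 2.x API.
# "model.out" and "ref.en" are hypothetical file names.
from sacrebleu.metrics import BLEU, CHRF

with open("model.out", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("ref.en", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

# corpus_score takes a list of hypotheses and a list of reference streams.
print(BLEU().corpus_score(hypotheses, [references]))  # e.g. "BLEU = 35.1 ..."
print(CHRF().corpus_score(hypotheses, [references]))  # e.g. "chrF2 = 58.3"
```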
0.3.0 Workflow manager
- Workflow management using Snakemake (see the rule sketch after this list)
- Parallelization to run on a cluster
- Singularity containerization support
- Slurm support
- Teacher ensemble support
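To make the Snakemake, Singularity, and Slurm items concrete, here is a minimal hypothetical Snakefile fragment; the rule name, paths, image, and Marian invocation are illustrative, not the pipeline's actual rules. Run with `snakemake --use-singularity --profile <slurm-profile>` to execute rules inside the container and submit them as cluster jobs.

```python
# Hypothetical Snakefile fragment (Snakefiles are a Python superset).
# With --use-singularity, Snakemake runs the shell command inside the image;
# with a Slurm profile, the rule is submitted as a cluster job.
rule train_teacher:
    input:
        src="data/corpus.ru.gz",
        trg="data/corpus.en.gz",
    output:
        "models/teacher/model.npz",
    container:
        "docker://example/marian:latest"  # hypothetical image
    shell:
        "marian -t {input.src} {input.trg} -m {output}"
```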
0.2.1 Improvements
- Flores dataset importer
- Custom dataset importer (see the sketch after this list)
- Ability to use a pre-trained backward model
- Save experiment config on start
- Stubs for dataset caching (implementation deferred to be synced with the workflow manager integration)
- Use best BLEU models instead of best ce-mean-words ones
- Fix linting warnings
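As an illustration of what a custom dataset importer can boil down to, here is a hedged sketch: fetch one gzipped file per language side from a user-provided URL prefix. The function name, URL layout, and paths are hypothetical, not the repository's actual importer.

```python
# Hypothetical custom-corpus importer; names and URL layout are illustrative only.
import urllib.request

def import_custom(url_prefix: str, src: str, trg: str, out_prefix: str) -> None:
    # Download <url_prefix>.<lang>.gz for each side of the language pair.
    for lang in (src, trg):
        url = f"{url_prefix}.{lang}.gz"
        urllib.request.urlretrieve(url, f"{out_prefix}.{lang}.gz")
        print(f"downloaded {url}")

# Example: a ru-en corpus hosted at a custom location.
import_custom("https://example.com/mycorpus", "ru", "en", "data/custom/mycorpus")
```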
0.2.0 Bicleaner
- SacreBLEU is a regular importer now and evaluation is not limited to SacreBLEU datasets
- Added bicleaner-ai and bicleaner filtering (one or the other, based on the available pretrained language packs; see the sketch after this list)
- Added a script to find all datasets based on language pair and importer type, ready to use in the config
- Fixed conda environment activation to be reproducible on GCP
- Other minor reproducibility fixes
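A hedged sketch of the selection logic described above: prefer bicleaner-ai when a pretrained language pack for the pair exists, fall back to classic bicleaner otherwise. The directory layout, pack locations, and threshold parameter are hypothetical.

```python
# Hypothetical pack lookup: pick bicleaner-ai when its pack exists for the
# language pair, otherwise fall back to classic bicleaner; skip if neither.
from pathlib import Path

def pick_bicleaner(packs_dir: str, src: str, trg: str, threshold: float = 0.5):
    pair = f"{src}-{trg}"
    if (Path(packs_dir) / "bicleaner-ai" / pair).is_dir():
        return ("bicleaner-ai", threshold)  # neural classifier preferred
    if (Path(packs_dir) / "bicleaner" / pair).is_dir():
        return ("bicleaner", threshold)     # classic classifier as fallback
    return None                             # no pack available: skip filtering

print(pick_bicleaner("packs", "ru", "en"))
```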
0.1.0 Basic pipeline
The initial pipeline allows training a language pair end to end on a standalone machine.
A test ru-en model was trained on the OPUS ParaCrawl corpus.
There might be reproducibility issues depending on the machine, language pair, and dataset configuration.