This repository contains scripts to perform Neural Machine Translation (NMT) experiments.
NMT models can be trained and evaluated on
- Europarl (v7)
- OpenSubtitles (v2018)
- News-Commentary
- MTNT
- Foursquare
and evaluated (but not trained) on
- PFSMB (aka Cr#pbank).
OAR scripts are also available in the oar
branch.
Python 3.6 with packages
- sacrebleu 1.14.4
- sentencepiece 0.1.91
- fastText
- nltk 3.5
- pandas 1.1.5 (only required by
statitics/
)
Environment variables
export MOSES_SCRIPTS=$HOME/marian-dev/examples/tools/moses-scripts/scripts
export TOOLS=$HOME/robust_bench/tools
export DATA=/data/almanach/user/$(whoami)
or in a conda environement (see documentation)
conda activate nmt
conda env config vars set MOSES_SCRIPTS=$HOME/marian-dev/examples/tools/moses-scripts/scripts
conda env config vars set TOOLS=$HOME/robust_bench/tools
conda env config vars set DATA=/data/almanach/user/$(whoami)
conda activate nmt
The copora are expected to be located in $DATA/
.
marian-dev (commit 467b15e)
export MARIAN=$HOME/marian-dev/build
- gcc 7.3.0
- CUDA 9.2
Commit c8a0659
- Pytorch 1.5.0
- fairseq (commit c8a0659)
- CUDA 11.0
- CUDNN 8.0
- cmake 3.10.1
export MKL_THREADING_LAYER=GNU
From src/
directory:
./run_experiment.sh \
-src $src -tgt $tgt \
--framework fairseq \
--dataset MTNT \
-sseg char \
--nwordssrc 128 \
-tseg bpe \
-nwordstgt 16000 \
-arch convtransformer \
--model $DATA/models.fairseq/convTrb.Europarl.MTNT.char.bpe.fr-en \
--gpus "0" \
--output-dir $DATA/translations/Trb.Europarl.MTNT.char.bpe.fr-en
Arguments:
-f
or--framework
: marian or fairseq-arch
or--architecture
: transformer, convtransformer-s
or--source
: en, fr-t
or--target
: en, fr-sseg
or--src_segmentation
: char, bpe-tseg
or--tgt_segmentation
: char, bpe--joint-dictionary
: Uses the same dictionary for source and target.-nws
or--nwordssrc
: Size of the source side vocabulary (ignored if src_segmentation is char).-nwt
or--nwordstgt
: Size of the target side vocabulary (ignored if joint-vocabulary or if tgt_segmentation is char).-d
or--dataset
: MTNT, News-Commentary, Europarl, Europarl_small, OpenSubtitles, OpenSubtitles_small, Crapbank, Foursquare--no-shuffle
: uses unshuffled MTNT training data (MTNT only).-r
or--ratio
: uses a subsampled training data (MTNT only)-m
or--model
: model path-ckpt
or--checkpoint
: checkpoint to use (default:checkpoint_best.pt
)-btm
or--back-translation-model
-o
or--output-dir
: path of the output translations of the development set-vo
or--val-output-dir
: path of the output translations at each validation step--gpus
: gpus to use (only for marian, fairseq uses everything by default)
guild.ai can be used for logging experiments.
From src/
directory:
guild run -y --label $label \
src=$src tgt=$tgt \
framework=fairseq \
dataset=MTNT \
sseg=char \
nwordssrc=128 \
tseg=bpe \
nwordstgt=16000 \
arch=transformer \
jointdict="" \
model=$DATA/models.fairseq/Trb.Europarl.MTNT.char.bpe.fr-en \
gpus="0" \
outputdir=$DATA/translations/Trb.Europarl.MTNT.char.bpe.fr-en
Experiments logic is handled by src/run_experiments
, successively
- Parsing the command line with the function
parse_cli
fromtools/parse_cli.sh
, - Preprocessing the dataset used in the experiment (if the corpus has not already been preprocessed in a previous expriment) by calling the relevant
preprocessing/
script, - Perfoming data augmentation (augmentation have not all been implemented),
- Training a model (if not already trained, i.e. if
checkpoint_best.pt
does not exist in the model checkpoints directory) by calling thetrain
function fromsrc/scripts_$framework/train_generate.sh
, - Translating the dataset source development side by calling the
translate_dev
function fromsrc/scripts_$framework/train_generate.sh
- Computing the BLEU score on the development set with
sacrebleu
tools/spm/
can be replaced withspm
if sentencepiece was installed from source,- Reusing the sentencepiece vocabulary when using
marian
was not tested and may require to adapt the code. - Simultaneous preprocessing of a single data may cause failures. As a dataset only has to be preprocessed once, do not launch batch of experiments on a dataset before the preprocessing of these dataset has been completed during a previous experiment.