Informative Language Representation Learning for Massively Multilingual Neural Machine Translation.

Requirements

Python >= 3.7
PyTorch >= 1.6.0
Fairseq with commit ID d3890e5
sacrebleu < 2.0

Fairseq Installation

We build the multilingual neural machine translation models on top of the Fairseq library. Please install it first:

cd fairseq_dir
pip install -e ./

Data Preprocessing

There are four steps in the data preprocessing pipeline:

BPE model training
Subword segmentation
Removing long sentences
Binarizing the data with fairseq-preprocess

We provide example scripts that run this pipeline on parallel corpora covering multiple language pairs:

scripts/ted/data_process/multilingual_preprocess.sh
scripts/opus-100/data_process/multilingual_preprocess.sh
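As an illustration of the "Removing long sentences" step, here is a minimal sketch. It is not the code used by the scripts above: the function name `filter_long_pairs` and the default threshold `max_len=100` are hypothetical, and it assumes that sentences have already been segmented into space-separated subword tokens.

```python
from typing import Iterable, Iterator, Tuple


def filter_long_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_len: int = 100,  # hypothetical threshold, not taken from the repository
) -> Iterator[Tuple[str, str]]:
    """Yield only pairs whose source and target both have 1..max_len tokens."""
    for src, tgt in pairs:
        # Assumption: tokens are separated by whitespace after subword segmentation.
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        if 0 < src_len <= max_len and 0 < tgt_len <= max_len:
            yield src, tgt


# A pair is dropped as soon as either side exceeds the limit.
kept = list(filter_long_pairs([("a b", "c d"), ("a b c", "d")], max_len=2))
```

Filtering both sides together keeps the source and target files aligned line by line, which the binarization step relies on.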

The raw TED-59 dataset has an artificial English tag __en__ prepended to every non-English sentence, which may affect model training and bias BLEU scores. Before running the preprocessing pipeline for the TED-59 dataset (scripts/ted/data_process/multilingual_preprocess.sh), please execute scripts/ted/data_process/remove_start_token.sh to remove this tag.
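To make the effect of the tag removal concrete, the following sketch strips a leading `__en__` token from a single line. It is only an illustration under stated assumptions, not the logic of remove_start_token.sh: the helper name `strip_leading_tag` is hypothetical, and it assumes the tag is separated from the sentence by a single space and that the line has no trailing newline.

```python
def strip_leading_tag(line: str, tag: str = "__en__") -> str:
    """Remove `tag` when it is the first space-delimited token of `line`."""
    # Assumption: `line` has already had its trailing newline removed.
    head, _, rest = line.partition(" ")
    if head == tag:
        return rest
    # Lines that do not start with the tag are returned unchanged.
    return line


cleaned = strip_leading_tag("__en__ hello world")
untouched = strip_leading_tag("hello __en__ world")
```

Only a tag in the first position is removed, so a sentence that happens to contain the same string elsewhere is left intact.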

Before executing these shell scripts, please add the path of the nmt-multi directory to the import path so that the Python interpreter can find the nmt package. For example:

# modify the environment variable PYTHONPATH
export PYTHONPATH=/path/to/nmt-multi:${PYTHONPATH}
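A quick way to confirm that the setting took effect is to ask Python whether the package can be located. The sketch below uses the standard-library function `importlib.util.find_spec`; the helper name `is_importable` is hypothetical and is not part of this repository.

```python
import importlib.util


def is_importable(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(package) is not None


# A standard-library module is always found; a made-up name is not.
found_json = is_importable("json")
found_fake = is_importable("surely_not_a_real_package_xyz")
```

Once PYTHONPATH is exported as shown above, `is_importable("nmt")` should return True in the same shell session.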

Model Training

Once the data has been preprocessed, the multilingual neural machine translation models can be trained with the shell scripts in the scripts/opus-100/train and scripts/ted/train folders. Please set the variables in these scripts properly before executing them:

bash scripts/opus-100/train/fairseq_train.many-many.laa.sh

Other models (e.g., token_src, token_tgt, lee) can be trained in a similar way; please refer to the scripts/opus-100/train and scripts/ted/train folders for more details.

Evaluation

The evaluation pipeline is composed of three steps:

Translate the validation sets (only for supervised language pairs) with the saved checkpoints

bash scripts/opus-100/evaluation/eval.valid.many-many.laa.sh

Select the best checkpoint according to the average BLEU on the validation sets

# calculate BLEU score
bash scripts/opus-100/evaluation/report_bleu.valid.many-many.laa.sh > report_bleu.valid.many-many.laa.logs

# convert report_bleu.valid.many-many.laa.logs into json format
# --input denotes the path of report_bleu.valid.many-many.laa.logs
# --output_json_data denotes the path of the output json file
bash scripts/opus-100/evaluation/multilingual_bleu_statistics.sh

# report the average BLEU score for each checkpoint
# the checkpoints will be printed in descending order of average BLEU
# the checkpoint with the highest average BLEU was selected as the best checkpoint in our work
bash scripts/opus-100/evaluation/get_best_checkpoint.sh

Translate the test sets with the selected checkpoint for both supervised and zero-shot translation

For supervised translation:

# translate the test sets of supervised language pairs
bash scripts/opus-100/evaluation/eval.test.many-many.laa.sh

# calculate BLEU score
bash scripts/opus-100/evaluation/report_bleu.test.many-many.laa.sh > report_bleu.test.many-many.laa.logs

# convert report_bleu.test.many-many.laa.logs into json format
# --input denotes the path of report_bleu.test.many-many.laa.logs
# --output_json_data denotes the path of the output json file
bash scripts/opus-100/evaluation/multilingual_bleu_statistics.sh

For zero-shot translation:

# translate the test sets of zero-shot language pairs
bash scripts/opus-100/evaluation/eval.zero-shot.many-many.laa.sh

# calculate BLEU score
bash scripts/opus-100/evaluation/report_bleu.zero-shot.many-many.laa.sh > report_bleu.zero-shot.many-many.laa.logs

# convert report_bleu.zero-shot.many-many.laa.logs into json format
# --input denotes the path of report_bleu.zero-shot.many-many.laa.logs
# --output_json_data denotes the path of the output json file
bash scripts/opus-100/evaluation/multilingual_bleu_statistics.sh

Other models (e.g., token_src, token_tgt, lee) can be evaluated in a similar way; please refer to the scripts/opus-100/evaluation and scripts/ted/evaluation folders for more details.
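The checkpoint selection in the second evaluation step (average BLEU over the validation sets, best first) can be sketched as follows. This is an illustration only, not the implementation behind get_best_checkpoint.sh: the function name `rank_checkpoints`, the nested-dictionary layout, and the scores are all hypothetical, since the actual JSON format produced by multilingual_bleu_statistics.sh is not described here.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_checkpoints(
    bleu_by_checkpoint: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Return (checkpoint, average BLEU) pairs in descending order of average."""
    averages = [
        (name, mean(list(scores.values())))
        for name, scores in bleu_by_checkpoint.items()
        if scores  # skip checkpoints without any scored language pair
    ]
    return sorted(averages, key=lambda item: item[1], reverse=True)


# Made-up numbers for illustration only; the layout is an assumption.
example = {
    "checkpoint_1.pt": {"en-de": 20.0, "de-en": 24.0},
    "checkpoint_2.pt": {"en-de": 23.0, "de-en": 25.0},
}
ranking = rank_checkpoints(example)
best_checkpoint, best_average = ranking[0]
```

Averaging over language pairs gives every pair equal weight regardless of the size of its validation set, which matches the "average BLEU" criterion stated above.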

Citation

@inproceedings{DBLP:conf/coling/JinX22,
  author    = {Renren Jin and
               Deyi Xiong},
  editor    = {Nicoletta Calzolari and
               Chu{-}Ren Huang and
               Hansaem Kim and
               James Pustejovsky and
               Leo Wanner and
               Key{-}Sun Choi and
               Pum{-}Mo Ryu and
               Hsin{-}Hsi Chen and
               Lucia Donatelli and
               Heng Ji and
               Sadao Kurohashi and
               Patrizia Paggio and
               Nianwen Xue and
               Seokhwan Kim and
               Younggyun Hahm and
               Zhong He and
               Tony Kyungil Lee and
               Enrico Santus and
               Francis Bond and
               Seung{-}Hoon Na},
  title     = {Informative Language Representation Learning for Massively Multilingual
               Neural Machine Translation},
  booktitle = {Proceedings of the 29th International Conference on Computational
               Linguistics, {COLING} 2022, Gyeongju, Republic of Korea, October 12-17,
               2022},
  pages     = {5158--5174},
  publisher = {International Committee on Computational Linguistics},
  year      = {2022},
  url       = {https://aclanthology.org/2022.coling-1.458},
  timestamp = {Thu, 13 Oct 2022 17:29:38 +0200},
  biburl    = {https://dblp.org/rec/conf/coling/JinX22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
