RCSLS [1] is a method for cross-lingual word embedding alignment, originally implemented in NumPy. This is a PyTorch implementation of the algorithm, based on the MUSE codebase.
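For context, RCSLS maximizes, for each training pair, the similarity between the mapped source word and its translation while penalizing the average similarity to the k nearest neighbors on each side (the relaxed CSLS criterion). Below is a minimal PyTorch sketch of that loss; the tensor names and the choice of the neighbor vocabularies `Z_src`/`Z_tgt` are illustrative, see the paper and `supervised.py` for the actual training loop.

```python
import torch

def rcsls_loss(X, Y, W, Z_src, Z_tgt, k=10):
    """Relaxed CSLS loss (Joulin et al., 2018), minimal sketch.

    X, Y: (n, d) source/target vectors of the training dictionary pairs,
          assumed L2-normalized.
    Z_src, Z_tgt: (m, d) source/target vocabularies used to find the
          k nearest neighbors in the CSLS penalty terms.
    W: (d, d) linear mapping applied to the source space.
    """
    WX = X @ W.t()
    # Alignment term: similarity between mapped source and target pairs.
    align = 2.0 * (WX * Y).sum(dim=1)
    # Penalty 1: mean similarity of Wx to its k nearest target neighbors.
    knn_xt = (WX @ Z_tgt.t()).topk(k, dim=1).values.mean(dim=1)
    # Penalty 2: mean similarity of y to its k nearest mapped-source neighbors.
    knn_ty = (Y @ (Z_src @ W.t()).t()).topk(k, dim=1).values.mean(dim=1)
    # RCSLS is maximized; return the negative so it can be minimized.
    return -(align - knn_xt - knn_ty).mean()
```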
Setup instructions from MUSE:
- Python 2/3 with NumPy/SciPy
- PyTorch
- Faiss (recommended) for fast nearest neighbor search (CPU or GPU).
MUSE is available on CPU or GPU, in Python 2 or 3. Faiss is optional for GPU users - though Faiss-GPU will greatly speed up nearest neighbor search - and highly recommended for CPU users. Faiss can be installed using "conda install faiss-cpu -c pytorch" or "conda install faiss-gpu -c pytorch".
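Faiss matters because MUSE repeatedly runs large nearest-neighbor searches (e.g. for CSLS and dictionary induction). As a rough illustration of the kind of query it accelerates (a sketch with random data, not code from this repo):

```python
import numpy as np
import faiss  # conda install faiss-cpu -c pytorch

d = 300                                            # embedding dimension
xb = np.random.rand(50000, d).astype('float32')    # target embeddings (illustrative)
xq = np.random.rand(10, d).astype('float32')       # query embeddings
faiss.normalize_L2(xb)                             # cosine similarity via inner product
faiss.normalize_L2(xq)

index = faiss.IndexFlatIP(d)                       # exact inner-product search
index.add(xb)
scores, ids = index.search(xq, 10)                 # top-10 nearest neighbors per query
```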
Get the monolingual and cross-lingual word embedding evaluation datasets:
- Our 110 bilingual dictionaries
- 28 monolingual word similarity tasks for 6 languages, and the English word analogy task
- Cross-lingual word similarity tasks from SemEval2017
- Sentence translation retrieval with Europarl corpora
All of them can be downloaded by simply running (in data/):
./get_evaluation.sh
Note: this requires bash 4. The Europarl download is disabled by default (it is slow); you can enable it in get_evaluation.sh.
# English fastText Wikipedia embeddings
wget -c "https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/data/wiki.en.vec" -P data/
# Spanish fastText Wikipedia embeddings
wget -c "https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/data/wiki.es.vec" -P data/
Replace en with other language codes to download embeddings for other languages.
To align the embeddings, run:
python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --seed 2
By default, dico_train will point to our ground-truth dictionaries (downloaded above); when set to "identical_char" it will use identical character strings between source and target languages to form a vocabulary. Logs and embeddings will be saved in the dumped/ directory.
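To illustrate the "identical_char" fallback: the seed dictionary is simply the set of strings spelled identically in both vocabularies. A minimal sketch (function and variable names are made up):

```python
# Words spelled identically in both vocabularies become the seed dictionary.
def identical_char_pairs(src_words, tgt_words):
    shared = set(src_words) & set(tgt_words)
    return sorted((w, w) for w in shared)

print(identical_char_pairs(["york", "new", "house"], ["casa", "york", "nueva"]))
# [('york', 'york')]
```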
The results reported below are with the seed fixed to 2.
MUSE also includes a simple script to evaluate the quality of monolingual or cross-lingual word embeddings on several tasks:
Monolingual
python evaluate.py --src_lang en --src_emb data/wiki.en.vec --max_vocab 200000
Cross-lingual
python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000
By default, the aligned embeddings are exported to a text format at the end of the experiment (--export txt). Exporting embeddings to a text file can take a while if you have a lot of embeddings. For a much faster export, set --export pth to write a PyTorch binary file, or simply disable the export (--export "").
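As a rough illustration of why the text export is slower (the exact file layout written by this repo may differ):

```python
import torch

emb = torch.randn(200000, 300)   # stand-in for the aligned embeddings

# Text export: one line per word, every float rendered as text -> slow, large file.
with open('vectors.txt', 'w') as f:
    f.write('%i %i\n' % emb.shape)
    for i in range(emb.shape[0]):
        f.write('word%i ' % i + ' '.join('%.4f' % v for v in emb[i].tolist()) + '\n')

# PyTorch binary export: a single serialization call -> fast and compact.
torch.save(emb, 'vectors.pth')
```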
When loading embeddings, the model can load:
- PyTorch binary files previously generated by MUSE (.pth files)
- fastText binary files previously generated by fastText (.bin files)
- text files (text file with one word embedding per line)
The first two options are very fast and can load 1 million embeddings in a few seconds, while loading from a text file can take a while.
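For reference, the text format is the standard fastText .vec layout: a header line with the vocabulary size and dimension, then one word and its vector per line. A minimal reader (a sketch; MUSE ships its own loader with more options):

```python
import io
import numpy as np

def load_vec(path, max_vocab=200000):
    """Read a .vec text file: first line "<count> <dim>", then one word per line."""
    words, vecs = [], []
    with io.open(path, 'r', encoding='utf-8', newline='\n', errors='ignore') as f:
        n, dim = map(int, f.readline().split())
        for i, line in enumerate(f):
            if i >= max_vocab:
                break
            tokens = line.rstrip().split(' ')
            words.append(tokens[0])
            vecs.append(np.asarray(tokens[1:], dtype=np.float32))
    return words, np.vstack(vecs)

# words, emb = load_vec('data/wiki.en.vec')
```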
The reported precision scores are for a fixed number of epochs (10) and for a new unsupervised model-selection criterion: after every epoch, a dictionary is induced from the current cross-lingual alignment, and its size is used as the selection signal, since a larger induced dictionary appears to be indicative of a better alignment.
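As a rough sketch of this criterion, using mutual nearest neighbors as a stand-in for the CSLS-based induction actually used (names are illustrative):

```python
import torch

def induced_dict_size(src_emb, tgt_emb, W, max_vocab=10000):
    """Count mutual nearest neighbors between mapped source and target words.

    Illustrative proxy for the model-selection signal: a larger induced
    dictionary suggests a better alignment.
    """
    x = torch.nn.functional.normalize(src_emb[:max_vocab] @ W.t(), dim=1)
    y = torch.nn.functional.normalize(tgt_emb[:max_vocab], dim=1)
    sim = x @ y.t()
    s2t = sim.argmax(dim=1)            # best target for each source word
    t2s = sim.argmax(dim=0)            # best source for each target word
    mutual = (t2s[s2t] == torch.arange(len(s2t))).sum().item()
    return mutual
```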
Method | en-es | es-en | en-fr | fr-en | en-de | de-en | en-ru | ru-en | en-zh | zh-en | avg |
---|---|---|---|---|---|---|---|---|---|---|---|
Joulin et al. [1] | 84.1 | 86.3 | 83.3 | 84.1 | 79.1 | 76.3 | 57.9 | 67.2 | 45.9 | 46.4 | 71.1 |
This implementation (10 epochs) | 84.2 | 86.6 | 83.9 | 84.7 | 78.3 | 76.6 | 57.6 | 66.7 | 47.6 | 47.4 | 71.4 |
This implementation (unsup. model selection) | 84.3 | 86.6 | 83.9 | 85.0 | 78.7 | 76.7 | 57.6 | 67.1 | 47.6 | 47.4 | 71.5 |
[1] A. Joulin, P. Bojanowski, T. Mikolov, H. Jégou, E. Grave. Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion. EMNLP 2018.
@InProceedings{joulin2018loss,
title={Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion},
author={Joulin, Armand and Bojanowski, Piotr and Mikolov, Tomas and J\'egou, Herv\'e and Grave, Edouard},
year={2018},
booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
}