TLDR; The authors train a single Neural Machine Translation model that can translate between N*M language pairs, with a parameter space that grows linearly with the number of languages. The model uses a single attention mechanism shared across all encoders and decoders. The authors demonstrate that the model performs particularly well for resource-constrained languages, outperforming single-pair models trained on the same data.
- Attention mechanism: Both the encoder and the decoder output attention-specific vectors, which are then combined. Thus, adding a new source/target language does not result in a quadratic explosion of parameters (see the code sketch at the end of these notes).
- Bidirectional RNN encoders, 620-dimensional embeddings, GRUs with 1k units, a 1k-unit affine layer with tanh. Trained with Adam on minibatches of 60 examples; only sentences up to length 50 are used.
- Model clearly outperforms single-pair models when the parallel corpora are constrained to small sizes; the advantage largely disappears for large corpora.
- The single model doesn't fit on a GPU.
- Can in theory be used to translate between pairs that didn't have a bilingual training corpus, but the authors don't evaluate this in the paper.
- Main difference from "Multi-task Sequence to Sequence Learning": this model uses an attention mechanism.
- I don't see anything that would force the encoders to map sequences from different languages into the same representation (as the authors briefly mention). Perhaps the encoding just carries language-specific information that the decoders can use to figure out which source language it was?
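
Below is a minimal sketch of the shared-attention idea referenced above, written in PyTorch rather than the authors' implementation. The class names (`SharedAttention`, `SourceEncoder`, `TargetAttentionProj`), the 30k vocabulary size, and the example languages are my own assumptions; only the embedding/GRU dimensions follow the hyperparameters listed in these notes. The point is that each new language adds only its own encoder (or decoder-side projection), while the attention scorer is reused for every source/target combination, so parameters grow roughly as N + M rather than N * M.

```python
import torch
import torch.nn as nn

# Dimensions taken from the hyperparameters above; everything else is assumed.
EMB_DIM, HID_DIM, ATT_DIM = 620, 1000, 1000


class SharedAttention(nn.Module):
    """One additive (Bahdanau-style) scorer shared by all encoder/decoder pairs."""

    def __init__(self, att_dim: int = ATT_DIM):
        super().__init__()
        self.v = nn.Linear(att_dim, 1, bias=False)  # the only attention-specific parameters

    def forward(self, enc_att, dec_att):
        # enc_att: (batch, src_len, att_dim) -- encoder's attention-specific vectors
        # dec_att: (batch, att_dim)          -- decoder's attention-specific vector
        scores = self.v(torch.tanh(enc_att + dec_att.unsqueeze(1)))  # (batch, src_len, 1)
        return torch.softmax(scores, dim=1)  # attention weights over source positions


class SourceEncoder(nn.Module):
    """Per-source-language encoder: bidirectional GRU plus its own projection
    into the shared attention space."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, HID_DIM, bidirectional=True, batch_first=True)
        self.att_proj = nn.Linear(2 * HID_DIM, ATT_DIM, bias=False)  # language-specific

    def forward(self, src_tokens):
        annotations, _ = self.rnn(self.emb(src_tokens))  # (batch, src_len, 2*HID_DIM)
        return annotations, self.att_proj(annotations)


class TargetAttentionProj(nn.Module):
    """Per-target-language projection of the decoder hidden state into the
    shared attention space (the decoder RNN itself is omitted here)."""

    def __init__(self):
        super().__init__()
        self.att_proj = nn.Linear(HID_DIM, ATT_DIM, bias=False)  # language-specific

    def forward(self, dec_state):  # dec_state: (batch, HID_DIM)
        return self.att_proj(dec_state)


# Adding a language adds one encoder and/or one decoder-side projection;
# the single SharedAttention instance is reused for every pairing.
attention = SharedAttention()
encoders = {lang: SourceEncoder(vocab_size=30000) for lang in ["en", "de", "fi"]}
dec_projs = {lang: TargetAttentionProj() for lang in ["en", "de", "fi"]}
```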