This project is inspired by Stanford's CS224N NMT Project
Dataset used in this project: News Commentary v14
This is primarily a learning project to familiarize myself with PyTorch, machine translation, and NLP model training.
To investigate how different setups of the recurrent layer affect final performance, I compared the training efficiency and effectiveness of several encoder RNN configurations, changing one feature at a time while holding all other parameters fixed (a configurable-encoder sketch follows this list):

- RNN type
  - GRU
  - LSTM
- Activation function on the output layer
  - Tanh
  - ReLU
  - LeakyReLU
- Number of layers
  - single layer
  - double layer
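One way to wire up such a comparison is a single encoder class whose recurrent layer type, output activation, and depth are constructor arguments. This is only a sketch of the idea; the class and argument names are illustrative and not the actual code in nmt_model.py, and the activation is applied to the encoder's output projection purely for illustration.

```python
import torch.nn as nn

class ConfigurableEncoder(nn.Module):
    """Encoder whose RNN type, output activation, and depth can be swapped per experiment."""
    RNNS = {"gru": nn.GRU, "lstm": nn.LSTM}
    ACTS = {"tanh": nn.Tanh, "relu": nn.ReLU, "leaky_relu": nn.LeakyReLU}

    def __init__(self, embed_size=512, hidden_size=512,
                 rnn_type="lstm", activation="tanh", num_layers=1):
        super().__init__()
        self.rnn = self.RNNS[rnn_type](embed_size, hidden_size,
                                       num_layers=num_layers,
                                       bidirectional=True, batch_first=True)
        # project concatenated forward/backward states back to hidden_size
        self.out_proj = nn.Linear(2 * hidden_size, hidden_size)
        self.act = self.ACTS[activation]()

    def forward(self, src_embedded):
        enc_hiddens, _ = self.rnn(src_embedded)       # (batch, src_len, 2 * hidden_size)
        return self.act(self.out_proj(enc_hiddens))   # (batch, src_len, hidden_size)
```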
 
 
_/
├─ utils.py # utilities
├─ vocab.py # generate vocab
├─ model_embeddings.py # embedding layer
├─ nmt_model.py # nmt model definition
├─ run.py # training and testing
- source: 相反,这意味着合作的基础应当是共同的长期战略利益,而不是共同的价值观。
  - target: Instead, it means that cooperation must be anchored not in shared values, but in shared long-term strategic interests.
  - translation: On the contrary, that means cooperation should be a common long-term strategic interests, rather than shared values.

- source: 但这个问题其实很简单: 谁来承受这些用以降低预算赤字的紧缩措施的冲击。
  - target: But the issue is actually simple: Who will bear the brunt of measures to reduce the budget deficit?
  - translation: But the question is simple: Who is to bear the impact of austerity measures to reduce budget deficits?

- source: 上述合作对打击恐怖主义、贩卖人口和移民可能发挥至关重要的作用。
  - target: Such cooperation is essential to combat terrorism, human trafficking, and migration.
  - translation: Such cooperation is essential to fighting terrorism, trafficking, and migration.

- source: 与此同时, 政治危机妨碍着政府追求艰难的改革。
  - target: At the same time, political crisis is impeding the government’s pursuit of difficult reforms.
  - translation: Meanwhile, political crises hamper the government’s pursuit of difficult reforms.
 
 
Preprocessing Colab notebook
- using jieba to separate Chinese words by spaces
- Input: training data of Chinese and English
- Output: a vocab file mapping (sub)words to ids for Chinese and English -- a limited-size vocab is selected using SentencePiece (essentially Byte Pair Encoding of character n-grams) to cover around 99.95% of the training data (see the sketch below)
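A rough sketch of these two preprocessing steps; the file names and vocab size below are placeholders, and the actual notebook may differ in details:

```python
import jieba
import sentencepiece as spm

# 1) Segment Chinese with jieba so that words are separated by spaces.
with open("train.zh", encoding="utf-8") as fin, \
     open("train.zh.seg", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(jieba.cut(line.strip())) + "\n")

# 2) Train a SentencePiece (BPE) model with a limited vocab size whose
#    character coverage is ~99.95% of the training data.
spm.SentencePieceTrainer.train(
    input="train.zh.seg",
    model_prefix="src_bpe",
    vocab_size=21000,            # placeholder size
    character_coverage=0.9995,
    model_type="bpe",
)

# Encode a sentence into subword pieces with the trained model.
sp = spm.SentencePieceProcessor(model_file="src_bpe.model")
print(sp.encode("合作的基础是共同的长期战略利益", out_type=str))
```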
 
- a Seq2Seq model with attention (the architecture image is from the book Dive into Deep Learning)
- Encoder
  - a recurrent layer (the component varied across the experiment setups above)
- Decoder
  - LSTMCell (hidden_size=512)
- Attention
  - multiplicative attention (see the sketch below)
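Below is a minimal sketch of multiplicative attention between a decoder state and the encoder hidden states; the tensor shapes and names are illustrative rather than copied from nmt_model.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplicativeAttention(nn.Module):
    """score(h_dec, h_enc) = h_dec^T W h_enc (also called general / bilinear attention)."""

    def __init__(self, enc_hidden_size, dec_hidden_size):
        super().__init__()
        self.W = nn.Linear(enc_hidden_size, dec_hidden_size, bias=False)

    def forward(self, dec_state, enc_hiddens, enc_masks=None):
        # dec_state: (batch, dec_hidden); enc_hiddens: (batch, src_len, enc_hidden)
        scores = torch.bmm(self.W(enc_hiddens), dec_state.unsqueeze(2)).squeeze(2)
        if enc_masks is not None:                      # enc_masks: True at pad positions
            scores = scores.masked_fill(enc_masks, float("-inf"))
        alpha = F.softmax(scores, dim=-1)              # attention weights over source positions
        context = torch.bmm(alpha.unsqueeze(1), enc_hiddens).squeeze(1)  # (batch, enc_hidden)
        return context, alpha
```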
 
Training Colab notebook
- Hyperparameters:
  - Embedding Size & Hidden Size: 512
  - Dropout Rate: 0.25
  - Starting Learning Rate: 5e-4
  - Batch Size: 32
  - Beam Size for Beam Search: 10
- NOTE: the BLEU scores reported here are computed on the Test Set, so they should only be used to compare the relative effectiveness of the models trained on this data (a corpus-BLEU sketch follows this list)
- Dataset: the data is split randomly into a training set (~260,000), a validation set (~20,000), and a test set (~20,000); the splits are identical across experiment groups
- Max Number of Iterations: 50,000
- NOTE: I tried a vanilla RNN (nn.RNN) in several configurations, but its BLEU score turned out to be extremely low (the absence of residual connections might be the issue); I decided not to include it in the comparison until the issue is resolved
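For reference, here is a sketch of a corpus-level BLEU computation (using NLTK; the actual notebook may use a different implementation). The sentence pair is taken from the good-example section above.

```python
from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis is paired with a list of reference token lists.
references = [[["such", "cooperation", "is", "essential", "to", "combat",
                "terrorism", ",", "human", "trafficking", ",", "and", "migration", "."]]]
hypotheses = [["such", "cooperation", "is", "essential", "to", "fighting",
               "terrorism", ",", "trafficking", ",", "and", "migration", "."]]

bleu = corpus_bleu(references, hypotheses) * 100  # scale to the usual 0-100 range
print(f"BLEU = {bleu:.2f}")
```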
 
 
Bidirectional 2-layer LSTM with Tanh, embed_size & hidden_size of 1024, trained for 11517.19 sec (44,000 iterations), BLEU score 17.95

| Model | Training Time (sec) | BLEU Score on Test Set | Training Perplexities | Validation Perplexities |
|---|---|---|---|---|
| Best Model | 11517.19 | 17.95 | | |
- LSTM tends to perform better than GRU (it has an extra set of parameters; see the parameter-count sketch below)
- Tanh tends to work better, since less information is lost
- Making the LSTM deeper (more layers) can improve performance, but it costs more time to train
- Surprisingly, the training times for A, B, and D are roughly the same
  - the dataset may not be large enough, or the cloud service I used to train the models may not perform consistently
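The parameter gap between the two cell types is easy to check directly: per layer and direction, an LSTM has four gate matrices where a GRU has three.

```python
import torch.nn as nn

gru = nn.GRU(input_size=512, hidden_size=512, bidirectional=True)
lstm = nn.LSTM(input_size=512, hidden_size=512, bidirectional=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"GRU parameters:  {count(gru):,}")   # 3 gates per direction
print(f"LSTM parameters: {count(lstm):,}")  # 4 gates per direction, roughly 4/3 as many
```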
 
 
- source: 全球目击组织(Global Witness)的报告记录, 光是2015年就有16个国家的185人被杀。
  - target: A Global Witness report documented 185 killings across 16 countries in 2015 alone.
  - translation: According to the Global eye, the World Health Organization reported that 185 people were killed in 2015.
  - problems:
    - Information Loss: 16 countries
    - Unknown Proper Noun: Global Witness

- source: 大自然给了足以满足每个人需要的东西, 但无法满足每个人的贪婪。
  - target: Nature provides enough for everyone’s needs, but not for everyone’s greed.
  - translation: Nature provides enough to satisfy everyone.
  - problems:
    - Huge Information Loss

- source: 我衷心希望全球经济危机和巴拉克·奥巴马当选总统能对新冷战的荒唐理念进行正确的评估。
  - target: It is my hope that the global economic crisis and Barack Obama’s presidency will put the farcical idea of a new Cold War into proper perspective.
  - translation: I do hope that the global economic crisis and President Barack Obama will be corrected for a new Cold War.
  - problems:
    - Action Sender And Receiver Exchanged
    - Failed To Translate Complex Sentence

- source: 人们纷纷猜测欧元区将崩溃。
  - target: Speculation about a possible breakup was widespread.
  - translation: The eurozone would collapse.
  - problems:
    - Significant Information Loss
 
 
 
- Dataset
  - The dataset is fairly small, and the model has not been trained thoroughly on all of the data
  - Being a native Chinese speaker, I could not understand what some of the source sentences were saying
  - The target sentences are not informationally complete; they themselves need context to be understood (e.g. the target sentence in the last "Bad Example")
  - Even for a human, some of the source sentences are too hard to translate
- Model Architecture
  - CNN & Transformer
  - character-based model
  - make the model even larger & deeper (... I need GPUs)
- Tricks that might help
  - add a proper-noun dictionary to translate unknown proper nouns word-by-word (phrase-by-phrase)
  - initialize (sub)word embeddings with pretrained embeddings (see the sketch below)
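A minimal sketch of the pretrained-embedding idea, assuming a pretrained vector matrix already aligned with the vocab ids (the random matrix below is only a stand-in for real vectors):

```python
import numpy as np
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_SIZE = 32000, 512  # placeholder sizes
pretrained = np.random.randn(VOCAB_SIZE, EMBED_SIZE).astype("float32")  # stand-in matrix

embedding = nn.Embedding(VOCAB_SIZE, EMBED_SIZE, padding_idx=0)
with torch.no_grad():
    embedding.weight.copy_(torch.from_numpy(pretrained))  # initialize from pretrained vectors
embedding.weight.requires_grad = True  # keep fine-tuning the embeddings during NMT training
```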
 
 
- Download the dataset you desire, and change every "./zh_en_data" path in run.sh to the path where your data is stored
- To run locally on a CPU (mostly for sanity checks; a CPU cannot realistically train the model)
  - set up the environment using conda/miniconda: `conda env create --file local_env.yml`
- To run on a GPU
  - set up the environment and run the training process following the Colab notebook
 
 
If you have any questions or have trouble running the code, feel free to contact me via email.