Case Marking Versus Word Order in Neural Machine Translation

This project compares the accuracy of an NMT system on two fixed word order languages versus a flexible word order language that uses case marking.

Getting Started

These instructions will get a copy of the project up and running on the Peregrine HPC cluster. They assume you are already logged in on the cluster.
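If you are not logged in yet, a minimal sketch of connecting to the cluster is shown below; the hostname and username format are assumptions and may differ for your account.

ssh your_username@peregrine.hpc.rug.nl   # hostname and username format are assumptions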

Step 1: Clone this repository

Clones the content of this repository to the cluster.

git clone https://github.com/573phn/cm-vs-wo.git /home/$USER/cm-vs-wo

Step 2: Change working directory

Changes the current working directory to the newly created cm-vs-wo directory.

cd /home/$USER/cm-vs-wo

Step 3: Execute setup.sh

Clones the OpenNMT-py repository and prepares a virtual Python 3 environment in /data/$USER/cm-vs-wo.

./setup.sh
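The script itself is authoritative; a hypothetical outline of the kind of steps it performs, using the directory layout shown below, might look like:

# hypothetical outline of setup.sh; the actual script in this repository is authoritative
mkdir -p /data/$USER/cm-vs-wo
git clone https://github.com/OpenNMT/OpenNMT-py.git /data/$USER/cm-vs-wo/OpenNMT-py
python3 -m venv /data/$USER/cm-vs-wo/env           # virtual Python 3 environment
source /data/$USER/cm-vs-wo/env/bin/activate
pip install -r /data/$USER/cm-vs-wo/OpenNMT-py/requirements.txt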

Directory structure after setup

After going through the steps above, the directory structure should look as follows:

/
├── data
│   └── $USER
│       └── cm-vs-wo
│           ├── data
│           │   ├── mix
│           │   ├── vos
│           │   └── vso
│           ├── env
│           └── OpenNMT-py
└── home
    └── $USER
        └── cm-vs-wo
            ├── data
            │   ├── mix
            │   ├── vos
            │   └── vso
            └── slurm

Usage

After getting the project up and running, a job script can be submitted to the Peregrine HPC cluster. Depending on what you want to do, you can use one of the following commands:

Reproduce thesis research

sbatch thesis.sh

This command will reproduce the research and results of my thesis. The steps it will take are as follows:

  1. Pre-processes each corpus
  2. Trains the following models for each corpus:
  • 2-layer LSTM with attention
  • 2-layer LSTM without attention
  • 6-layer Transformer with Adam optimization with label smoothing
  • 6-layer Transformer with SGD optimization with label smoothing
  • 2-layer Transformer with Adam optimization without label smoothing
  • 2-layer Transformer with SGD optimization without label smoothing
  • 2-layer Transformer with SGD optimization with label smoothing
  3. Tests each model and calculates its accuracy per training checkpoint (a sketch of the equivalent manual job submissions is given after this list)
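The same steps can also be submitted by hand with the individual job scripts described below. The following is a minimal sketch, not the contents of thesis.sh itself; the mapping of model sizes to script arguments and the handling of job dependencies are assumptions.

# hypothetical manual equivalent of thesis.sh; argument mapping is assumed
for corpus in vso vos mix; do
  sbatch preprocess.sh $corpus
  sbatch train.sh $corpus rnn general                 # LSTM with attention
  sbatch train.sh $corpus rnn none                    # LSTM without attention
  sbatch train.sh $corpus transformer adam large 0.1  # assumed: large = 6-layer
  sbatch train.sh $corpus transformer sgd large 0.1
  sbatch train.sh $corpus transformer adam small 0.0  # assumed: small = 2-layer
  sbatch train.sh $corpus transformer sgd small 0.0
  sbatch train.sh $corpus transformer sgd small 0.1
done
# each trained model is then evaluated with the corresponding translate.sh call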

Pre-process a corpus

sbatch preprocess.sh [vso|vos|mix]

Creates the following files in /data/$USER/cm-vs-wo/data/[vso|vos|mix]:

  • ppd.vocab.pt: serialized PyTorch file containing vocabulary data
  • ppd.valid.0.pt: serialized PyTorch file containing validation data
  • ppd.train.0.pt: serialized PyTorch file containing training data
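Under the hood, preprocess.sh presumably wraps OpenNMT-py's preprocess.py; a minimal sketch of such a call is shown below, with placeholder source/target file names (the actual paths are defined in the script).

# hypothetical OpenNMT-py preprocessing call; source/target file names are placeholders
python OpenNMT-py/preprocess.py \
  -train_src src-train.txt -train_tgt tgt-train.txt \
  -valid_src src-val.txt -valid_tgt tgt-val.txt \
  -save_data /data/$USER/cm-vs-wo/data/vso/ppd   # produces ppd.train.0.pt, ppd.valid.0.pt, ppd.vocab.pt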

Train a model

sbatch train.sh [vso|vos|mix] rnn [none|general]
sbatch train.sh [vso|vos|mix] transformer [sgd|adam] [large|small] [0.0|0.1]

Creates the following files in /data/$USER/cm-vs-wo/data/[vso|vos|mix]:

  • trained_model_[vso|vos|mix]_[rnn|transformer]_[adam|sgd]_[large|small|onesize]_ls[0_0|0_1]_step_N.pt: the trained model, where
    • [vso|vos|mix] is the word order
    • [rnn|transformer] is the model used
    • [adam|sgd] is the optimization method
    • [large|small|onesize] is the size of the model; large or small for Transformer, always onesize for RNN
    • [0_0|0_1] is the amount of label smoothing (0.0 or 0.1); always 0.0 for RNN
    • N is the number of steps (a checkpoint is saved after every 50 steps)
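For example (the argument combinations below are illustrative):

sbatch train.sh mix transformer adam large 0.1   # Transformer, Adam, label smoothing 0.1
sbatch train.sh vso rnn general                  # LSTM with (general) attention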

Translate the test set using a trained model

sbatch translate.sh [vso|vos|mix] rnn [none|general]
sbatch translate.sh [vso|vos|mix] transformer [sgd|adam] [large|small] [0.0|0.1]

Creates the following files in /data/$USER/cm-vs-wo/data/[vso|vos|mix]:

  • out_test_[vso|vos|mix]_[rnn|transformer]_[adam|sgd]_[large|small|onesize]_ls[0_0|0_1]_step_N.txt: sentences as translated by the model, where
    • [vso|vos|mix] is the word order
    • [rnn|transformer] is the model used
    • [adam|sgd] is the optimization method
    • [large|small|onesize] is the size of the model; large or small for Transformer, always onesize for RNN
    • [0_0|0_1] is the amount of label smoothing (0.0 or 0.1); always 0.0 for RNN
    • N is the number of steps the model has been trained for

Accuracy scores are printed to the slurm log file in /home/$USER/cm-vs-wo/slurm/translate-job-ID.log, where ID is the job ID.
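For example (the arguments are illustrative and must match those used when training the model):

sbatch translate.sh mix transformer adam large 0.1
sbatch translate.sh vso rnn general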
