Case Marking Versus Word Order in Neural Machine Translation

This project compares the accuracy of an NMT system on two fixed word order languages versus a flexible word order language that uses case marking.

Getting Started

These instructions will get a copy of the project up and running on the Peregrine HPC cluster. They assume you are already logged in on the cluster.
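If you are not logged in yet, a minimal sketch of connecting to the cluster is shown below; the hostname and username format are assumptions and may differ for your account.

ssh your_username@peregrine.hpc.rug.nl   # hostname and username format are assumptions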

Step 1: Clone this repository

Clones the content of this repository to the cluster.

git clone https://github.com/573phn/cm-vs-wo.git /home/$USER/cm-vs-wo

Step 2: Change working directory

Changes the current working directory to the newly created cm-vs-wo directory.

cd /home/$USER/cm-vs-wo

Step 3: Execute setup.sh

Clones the OpenNMT-py repository and prepares a virtual Python 3 environment in /data/$USER/cm-vs-wo.

./setup.sh
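The script itself is authoritative; a hypothetical outline of the kind of steps it performs, using the directory layout shown below, might look like:

# hypothetical outline of setup.sh; the actual script in this repository is authoritative
mkdir -p /data/$USER/cm-vs-wo
git clone https://github.com/OpenNMT/OpenNMT-py.git /data/$USER/cm-vs-wo/OpenNMT-py
python3 -m venv /data/$USER/cm-vs-wo/env           # virtual Python 3 environment
source /data/$USER/cm-vs-wo/env/bin/activate
pip install -r /data/$USER/cm-vs-wo/OpenNMT-py/requirements.txt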

Directory structure after setup

After going through the steps above, the directory structure should look as follows:

/
├── data
│   └── $USER
│       └── cm-vs-wo
│           ├── data
│           │   ├── mix
│           │   ├── vos
│           │   └── vso
│           ├── env
│           └── OpenNMT-py
└── home
    └── $USER
        └── cm-vs-wo
            ├── data
            │   ├── mix
            │   ├── vos
            │   └── vso
            └── slurm

Usage

After getting the project up and running, a job script can be submitted to the Peregrine HPC cluster. Depending on what you want to do, you can use one of the following commands:

Reproduce thesis research

sbatch thesis.sh

This command will reproduce the research and results of my thesis. The steps it will take are as follows:

  1. Pre-processes each corpus
  2. Trains the following models for each corpus:
  • 2-layer LSTM with attention
  • 2-layer LSTM without attention
  • 6-layer Transformer with Adam optimization with label smoothing
  • 6-layer Transformer with SGD optimization with label smoothing
  • 2-layer Transformer with Adam optimization without label smoothing
  • 2-layer Transformer with SGD optimization without label smoothing
  • 2-layer Transformer with SGD optimization with label smoothing
  3. Tests each model and calculates its accuracy per training checkpoint (a sketch of the equivalent manual job submissions is given after this list)
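The same steps can also be submitted by hand with the individual job scripts described below. The following is a minimal sketch, not the contents of thesis.sh itself; the mapping of model sizes to script arguments and the handling of job dependencies are assumptions.

# hypothetical manual equivalent of thesis.sh; argument mapping is assumed
for corpus in vso vos mix; do
  sbatch preprocess.sh $corpus
  sbatch train.sh $corpus rnn general                 # LSTM with attention
  sbatch train.sh $corpus rnn none                    # LSTM without attention
  sbatch train.sh $corpus transformer adam large 0.1  # assumed: large = 6-layer
  sbatch train.sh $corpus transformer sgd large 0.1
  sbatch train.sh $corpus transformer adam small 0.0  # assumed: small = 2-layer
  sbatch train.sh $corpus transformer sgd small 0.0
  sbatch train.sh $corpus transformer sgd small 0.1
done
# each trained model is then evaluated with the corresponding translate.sh call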

Pre-process a corpus

sbatch preprocess.sh [vso|vos|mix]

Creates the following files in /data/$USER/cm-vs-wo/data/[vso|vos|mix]:

  • ppd.vocab.pt: serialized PyTorch file containing vocabulary data
  • ppd.valid.0.pt: serialized PyTorch file containing validation data
  • ppd.train.0.pt: serialized PyTorch file containing training data
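Under the hood, preprocess.sh presumably wraps OpenNMT-py's preprocess.py; a minimal sketch of such a call is shown below, with placeholder source/target file names (the actual paths are defined in the script).

# hypothetical OpenNMT-py preprocessing call; source/target file names are placeholders
python OpenNMT-py/preprocess.py \
  -train_src src-train.txt -train_tgt tgt-train.txt \
  -valid_src src-val.txt -valid_tgt tgt-val.txt \
  -save_data /data/$USER/cm-vs-wo/data/vso/ppd   # produces ppd.train.0.pt, ppd.valid.0.pt, ppd.vocab.pt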

Train a model

sbatch train.sh [vso|vos|mix] rnn [none|general]
sbatch train.sh [vso|vos|mix] transformer [sgd|adam] [large|small] [0.0|0.1]

Creates the following files in /data/$USER/cm-vs-wo/data/[vso|vos|mix]:

  • trained_model_[vso|vos|mix]_[rnn|transformer]_[adam|sgd]_[large|small|onesize]_ls[0_0|0_1]_step_N.pt: the trained model, where
    • [vso|vos|mix] is the word order
    • [rnn|transformer] is the model used
    • [adam|sgd] is the optimization method
    • [large|small|onesize] is the size of the model; large or small for Transformer, always onesize for RNN
    • [0_0|0_1] is the amount of label smoothing (0.0 or 0.1); always 0.0 for RNN
    • N is the number of steps (a checkpoint is saved after every 50 steps)
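For example (the argument combinations below are illustrative):

sbatch train.sh mix transformer adam large 0.1   # Transformer, Adam, label smoothing 0.1
sbatch train.sh vso rnn general                  # LSTM with (general) attention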

Translate the test set using a trained model

sbatch translate.sh [vso|vos|mix] rnn [none|general]
sbatch translate.sh [vso|vos|mix] transformer [sgd|adam] [large|small] [0.0|0.1]

Creates the following files in /data/$USER/cm-vs-wo/data/[vso|vos|mix]:

  • out_test_[vso|vos|mix]_[rnn|transformer]_[adam|sgd]_[large|small|onesize]_ls[0_0|0_1]_step_N.txt: sentences as translated by the model, where
    • [vso|vos|mix] is the word order
    • [rnn|transformer] is the model used
    • [adam|sgd] is the optimization method
    • [large|small|onesize] is the size of the model; large or small for Transformer, always onesize for RNN
    • [0_0|0_1] is the amount of label smoothing (0.0 or 0.1); always 0.0 for RNN
    • N is the number of steps the model has been trained for

Accuracy scores are printed to the slurm log file in /home/$USER/cm-vs-wo/slurm/translate-job-ID.log, where ID is the job ID.
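For example (the arguments are illustrative and must match those used when training the model):

sbatch translate.sh mix transformer adam large 0.1
sbatch translate.sh vso rnn general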
