Cross-Lingual Transfer Learning project, conducted as part of the Natural Language Processing 2 course at the Universiteit van Amsterdam. This work was done by Aman Hussain and Emil Dudev, with Dennis Ulmer as supervisor.
Cross-lingual fine-tuning has been widely used to bridge the gap between high-resource and low-resource languages. In this paper, we study the evolution of the learned representations during cross-lingual fine-tuning. We fine-tune a pre-trained multi-lingual BERT on a small Dutch corpus. A BERT model pre-trained exclusively on Dutch is used as a comparative baseline. We show that our transferred multi-lingual BERT learns a different representation subspace than the native model. Additionally, we explore the loss in multi-lingual capacity during fine-tuning.
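As a rough illustration of the setup described above, the sketch below fine-tunes a multilingual BERT on Dutch text via masked language modelling with the `transformers` library. The model name, masking rate, and hyperparameters are illustrative only and not necessarily the values used in our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Multilingual BERT to be fine-tuned on Dutch text via masked language modelling.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def mlm_step(sentences):
    """One masked-language-modelling update on a batch of Dutch sentences."""
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].clone()
    # Mask ~15% of the non-special tokens; the model is trained to recover them.
    mask = (torch.rand(labels.shape) < 0.15) \
        & (labels != tokenizer.cls_token_id) \
        & (labels != tokenizer.sep_token_id) \
        & (labels != tokenizer.pad_token_id)
    batch["input_ids"][mask] = tokenizer.mask_token_id
    labels[~mask] = -100  # only masked positions contribute to the loss
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```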
Our library dependencies are listed in `./requirements.txt`.
They can be installed with `pip3 install -r requirements.txt`.
This environment setup is already present in `trainer.ipynb`.
The backbone of our code is located in `./cross_lingual/`, while training and testing are managed by `./main.py`.
Sample commands:
$ ./main.py --config=configs.finetune_bert --train
$ ./main.py --config=configs.finetune_bert_gradual --test
Configurations for the training and test processes are located in `./configs/`; each config comprises a single dictionary of parameters.
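As a hypothetical illustration (the actual key names in `./configs/` may differ), such a config module could look like this:

```python
# Hypothetical config module; the real keys in ./configs/ may be named differently.
config = {
    "model_name": "bert-base-multilingual-cased",  # model to fine-tune
    "train_file": "data/dutch_train.txt",          # training corpus
    "batch_size": 16,
    "learning_rate": 2e-5,
    "epochs": 3,
    "checkpoint_dir": "./checkpoints/",            # where checkpoints are written
    "results_dir": "./results/",                   # where logs and hidden states go
}
```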
During the training process, checkpoints (as well as the best trained model) are stored in a new directory, `./checkpoints/`, while logs and hidden state values are stored in `./results/`.
After training is (sufficiently) complete, analysis can be performed on the trained models and logged hidden states.
All our analysis code is present in `./notebooks/`.
Our core source code is organized under `./cross_lingual/` as follows:
- `datasets/`: Dataset classes for reading the Dutch and XNLI corpora
- `trainer.py`: implements the train, validation, and test routines
- `utils.py`: utility functions for logging experiments
- `hidden_state_utils.py`: functions for experimenting with hidden states (see the sketch after this list)
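The following is an illustrative sketch of the kind of functionality behind the hidden state utilities; the function and variable names are our own, not necessarily those used in `hidden_state_utils.py`.

```python
import torch
from transformers import AutoTokenizer, AutoModel

def get_layer_states(text, model_name="bert-base-multilingual-cased"):
    """Return the hidden states of every layer (plus the embedding output)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Tuple of (num_layers + 1) tensors, each of shape (1, seq_len, hidden_size).
    return outputs.hidden_states
```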
All notebooks (including `trainer.ipynb`) have been run on Google Colab.
Note that our saved checkpoints and activations take about 1 GiB of disk space per item; total disk usage for the whole experiment amounted to about 2 TiB.
Initially, the XNLI dataset is reduced in size to match our Dutch dataset size (see the linked paper).
This is done in `data_preparation.ipynb`.
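A minimal sketch of how such a subsample can be drawn is shown below; the actual sampling logic lives in `data_preparation.ipynb` and may differ in detail.

```python
import random

def subsample(xnli_examples, target_size, seed=42):
    """Randomly reduce XNLI to the same number of examples as the Dutch corpus."""
    random.seed(seed)
    return random.sample(xnli_examples, target_size)
```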
Afterwards, the models we examined are trained (see `./configs/`) using the `trainer.ipynb` notebook.
Its first cell sets up the environment, be it locally or on Google Colab.
The other notebooks contain various experiments.
`embedding_analysis.ipynb` looks at differences in the embeddings of similar words in English and Dutch; these results are not included in our report.
`perplexity_analysis.ipynb` contains the analysis and graph generation for our perplexity experiments across models and datasets.
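One common way to score a masked language model is pseudo-perplexity: mask each token in turn, collect the negative log-likelihood of the original token, and exponentiate the mean. The sketch below assumes that approach; the notebook may compute perplexity differently.

```python
import math
import torch

def pseudo_perplexity(model, tokenizer, sentence):
    """Pseudo-perplexity of a masked LM (e.g. AutoModelForMaskedLM) on one sentence."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    model.eval()
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):   # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            nlls.append(-log_probs[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))
```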
Similarly, `hidden-state-graphs.ipynb` is used to generate all our graphs for the hidden state analysis.
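For reference, one simple way to compare the representations of two models layer by layer is the per-layer cosine similarity of mean-pooled sentence vectors, sketched below; the actual measures used in the notebook may differ.

```python
import torch

def layer_similarity(states_a, states_b):
    """Cosine similarity per layer between two tuples of hidden-state tensors,
    each of shape (1, seq_len, hidden_size) as produced with output_hidden_states=True."""
    sims = []
    for h_a, h_b in zip(states_a, states_b):
        v_a = h_a.mean(dim=1).squeeze(0)   # mean-pool over tokens
        v_b = h_b.mean(dim=1).squeeze(0)
        sims.append(torch.cosine_similarity(v_a, v_b, dim=0).item())
    return sims
```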