Assuming you have initialized the git submodules and installed the wiki-bert-pipeline dependencies as instructed here, you can download and preprocess a Wikipedia dump for a given language as follows:
# Example for Korean, working directory should be the repo's main folder
export LANG="ko"
# Pipeline will download the Wiki dump, extract it, and tokenize it using UDPipe
./lib/wiki-bert-pipeline/run.sh $LANG
# Keep one of the files for evaluation (replace tokenized-texts by filtered-texts for Finnish)
mv lib/wiki-bert-pipeline/data/$LANG/tokenized-texts/AA/wiki_00 lib/wiki-bert-pipeline/data/$LANG/eval_file
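After the pipeline has finished, you can quickly confirm that the held-out evaluation file is in place (same path as in the mv command above):
# Sanity check: the evaluation file should exist and be non-empty
wc -l lib/wiki-bert-pipeline/data/$LANG/eval_file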
Important: This setup downloads the latest Wikipedia dump for the specified language. In our experiments, we used older dumps from July 20, 2020 (e.g. fiwiki-20200720-pages-articles.xml.bz2 for Finnish). If you would like to use an older dump, change the dump's download link in the lib/wiki-bert-pipeline/languages/$LANG.json file.
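A minimal sketch of pinning the Finnish dump to the 2020-07-20 version, assuming the language config stores a standard Wikimedia download URL containing "latest" (the exact key names depend on your wiki-bert-pipeline version, so inspect the file first):
# Locate the download link in the language config
grep -n "latest" lib/wiki-bert-pipeline/languages/fi.json
# Point it at the dated dump instead, e.g.
# https://dumps.wikimedia.org/fiwiki/20200720/fiwiki-20200720-pages-articles.xml.bz2
sed -i 's/latest/20200720/g' lib/wiki-bert-pipeline/languages/fi.json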
You can start pretraining a MonoModel-MonoTok model as shown below.
Important:
- Make sure to pass --overwrite_cache, because many of the wiki files share the same names and would otherwise not be read correctly.
- For the Finnish MonoModel-MonoTok, additionally pass the flag --whole_word_mask and replace tokenized-texts with filtered-texts in the data path (see the sketch after the commands below).
# Example for Korean, working directory is repo's main folder
export LANG="ko"
export TOKENIZER="snunlp/KR-BERT-char16424"
# First phase of pretraining with block size 128
python pretraining/run_pretraining.py \
--output_dir $LANG-monomodel-monotok-p1 \
--model_type bert \
--tokenizer_name $TOKENIZER \
--do_train \
--train_data_files lib/wiki-bert-pipeline/data/$LANG/tokenized-texts \
--do_eval \
--eval_data_file lib/wiki-bert-pipeline/data/$LANG/eval_file \
--mlm \
--max_steps 900000 \
--warmup_steps 10000 \
--weight_decay 0.01 \
--learning_rate 1e-4 \
--per_device_train_batch_size 64 \
--gradient_accumulation_steps 1 \
--block_size 128 \
--save_steps 10000 \
--save_total_limit 5 \
--evaluate_during_training \
--logging_steps 500 \
--fp16 \
--overwrite_cache
# Second phase of pretraining with block size 512
python pretraining/run_pretraining.py \
--output_dir $LANG-monomodel-monotok-p2 \
--model_type bert \
--model_name_or_path $LANG-monomodel-monotok-p1 \
--tokenizer_name $TOKENIZER \
--do_train \
--train_data_files lib/wiki-bert-pipeline/data/$LANG/tokenized-texts \
--do_eval \
--eval_data_file lib/wiki-bert-pipeline/data/$LANG/eval_file \
--mlm \
--max_steps 100000 \
--warmup_steps 10000 \
--weight_decay 0.01 \
--learning_rate 1e-4 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--block_size 512 \
--save_steps 10000 \
--save_total_limit 5 \
--evaluate_during_training \
--logging_steps 500 \
--fp16 \
--overwrite_cache
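For the Finnish MonoModel-MonoTok mentioned in the note above, the same two-phase commands apply; only the data path and masking flag change (the choice of Finnish tokenizer is left open here). A sketch of the differing parts:
export LANG="fi"
# ...same two commands as above, but in both phases pass:
#   --train_data_files lib/wiki-bert-pipeline/data/$LANG/filtered-texts \
#   --whole_word_mask \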
You can start pretraining a MonoModel-MbertTok model as shown below.
Important:
- Make sure to pass --overwrite_cache, because many of the wiki files share the same names and would otherwise not be read correctly.
- For the Finnish MonoModel-MbertTok, additionally pass the flag --whole_word_mask and replace tokenized-texts with filtered-texts in the data path.
- You can use the bert-base-multilingual tokenizer directly. However, we have reduced the vocabulary of mBERT's tokenizer using reduce_tokenizer.py to facilitate training; the script contains usage instructions. We have also uploaded our reduced tokenizers to reduced_tokenizers (a quick sanity check follows below).
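As a quick, optional sanity check (assuming the transformers library from the pretraining requirements), you can confirm that a reduced tokenizer loads and has a smaller vocabulary than the original mBERT tokenizer:
# Compare vocabulary sizes of the original and the reduced mBERT tokenizer (Korean example)
python -c "from transformers import BertTokenizer; \
print('mbert:  ', BertTokenizer.from_pretrained('bert-base-multilingual-cased').vocab_size); \
print('reduced:', BertTokenizer.from_pretrained('pretraining/reduced_tokenizers/ko-mbert-reduced').vocab_size)"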
# Example for Korean, working directory is repo's main folder
export LANG="ko"
export TOKENIZER="pretraining/reduced_tokenizers/$LANG-mbert-reduced"
# First phase of pretraining with block size 128
python pretraining/run_pretraining.py \
--output_dir $LANG-monomodel-mberttok-p1 \
--model_type bert \
--tokenizer_name $TOKENIZER \
--do_train \
--train_data_files lib/wiki-bert-pipeline/data/$LANG/tokenized-texts \
--do_eval \
--eval_data_file lib/wiki-bert-pipeline/data/$LANG/eval_file \
--mlm \
--max_steps 900000 \
--warmup_steps 10000 \
--weight_decay 0.01 \
--learning_rate 1e-4 \
--per_device_train_batch_size 64 \
--gradient_accumulation_steps 1 \
--block_size 128 \
--save_steps 10000 \
--save_total_limit 5 \
--evaluate_during_training \
--logging_steps 500 \
--fp16 \
--overwrite_cache
# Second phase of pretraining with block size 512
python pretraining/run_pretraining.py \
--output_dir $LANG-monomodel-mberttok-p2 \
--model_type bert \
--model_name_or_path $LANG-monomodel-mberttok-p1 \
--tokenizer_name $TOKENIZER \
--do_train \
--train_data_files lib/wiki-bert-pipeline/data/$LANG/tokenized-texts \
--do_eval \
--eval_data_file lib/wiki-bert-pipeline/data/$LANG/eval_file \
--mlm \
--max_steps 100000 \
--warmup_steps 10000 \
--weight_decay 0.01 \
--learning_rate 1e-4 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--block_size 512 \
--save_steps 10000 \
--save_total_limit 5 \
--evaluate_during_training \
--logging_steps 500 \
--fp16 \
--overwrite_cache
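Note that both two-phase recipes above keep the same effective batch size per device across phases; only the block size grows from 128 to 512. A quick arithmetic check:
# 64 sequences x 1 accumulation step in phase 1, 16 x 4 in phase 2
echo "phase 1: $((64 * 1)) sequences/update, phase 2: $((16 * 4)) sequences/update"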
You can start pretraining an MbertModel-MonoTok model as shown below.
Important:
- Make sure to pass --overwrite_cache, because many of the wiki files share the same names and would otherwise not be read correctly.
- For the Finnish MbertModel-MonoTok, replace tokenized-texts with filtered-texts in the data path.
# Example for Korean, working directory is repo's main folder
export LANG="ko"
export TOKENIZER="snunlp/KR-BERT-char16424"
python pretraining/run_pretraining.py \
--output_dir $LANG-mbertmodel-monotok \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--tokenizer_name $TOKENIZER \
--do_train \
--train_data_files lib/wiki-bert-pipeline/data/$LANG/tokenized-texts \
--do_eval \
--eval_data_file lib/wiki-bert-pipeline/data/$LANG/eval_file \
--mlm \
--max_steps 250000 \
--warmup_steps 10000 \
--weight_decay 0.01 \
--learning_rate 1e-4 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--block_size 512 \
--save_steps 10000 \
--save_total_limit 5 \
--evaluate_during_training \
--logging_steps 500 \
--fp16 \
--new_embeddings \
--freeze_base_model \
--overwrite_cache
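Optionally, once training has finished you can check that the saved checkpoint loads together with the monolingual tokenizer before using it downstream (a minimal sketch, assuming the final model was written to the output directory above):
# Load the MbertModel-MonoTok checkpoint and the Korean tokenizer used during pretraining
python -c "from transformers import BertForMaskedLM, BertTokenizer; \
BertForMaskedLM.from_pretrained('ko-mbertmodel-monotok'); \
BertTokenizer.from_pretrained('snunlp/KR-BERT-char16424'); \
print('checkpoint and tokenizer load correctly')"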
You can start pretraining an MbertModel-MbertTok model as shown below.
Important:
- Make sure to pass --overwrite_cache, because many of the wiki files share the same names and would otherwise not be read correctly.
- For the Finnish MbertModel-MbertTok, replace tokenized-texts with filtered-texts in the data path.
# Example for Korean, working directory is repo's main folder
export LANG="ko"
export TOKENIZER="pretraining/reduced_tokenizers/$LANG-mbert-reduced"
python pretraining/run_pretraining.py \
--output_dir $LANG-mbertmodel-mberttok \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--tokenizer_name $TOKENIZER \
--do_train \
--train_data_files lib/wiki-bert-pipeline/data/$LANG/tokenized-texts \
--do_eval \
--eval_data_file lib/wiki-bert-pipeline/data/$LANG/eval_file \
--mlm \
--max_steps 250000 \
--warmup_steps 10000 \
--weight_decay 0.01 \
--learning_rate 1e-4 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--block_size 512 \
--save_steps 10000 \
--save_total_limit 5 \
--evaluate_during_training \
--logging_steps 500 \
--fp16 \
--new_embeddings \
--freeze_base_model \
--overwrite_cache
You can start pretraining an MbertModel-MonoTok model together with an injected language adapter as shown below.
Important:
- Make sure to pass --overwrite_cache, because many of the wiki files share the same names and would otherwise not be read correctly.
- For the Finnish model, replace tokenized-texts with filtered-texts in the data path.
# Example for Korean, working directory is repo's main folder
export LANG="ko"
export TOKENIZER="snunlp/KR-BERT-char16424"
python pretraining/run_pretraining.py \
--output_dir $LANG-mbertmodel-monotok-adapter \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--tokenizer_name $TOKENIZER \
--do_train \
--train_data_files lib/wiki-bert-pipeline/data/$LANG/tokenized-texts \
--do_eval \
--eval_data_file lib/wiki-bert-pipeline/data/$LANG/eval_file \
--mlm \
--max_steps 250000 \
--warmup_steps 10000 \
--weight_decay 0.01 \
--learning_rate 1e-4 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--block_size 512 \
--save_steps 10000 \
--save_total_limit 5 \
--evaluate_during_training \
--logging_steps 500 \
--fp16 \
--new_embeddings \
--freeze_base_model \
--train_adapter \
--language $LANG \
--overwrite_cache
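The output directory name passed to --output_dir above is only a placeholder following the naming pattern of the other commands. Where exactly run_pretraining.py stores the trained language adapter depends on how it saves adapters, so a simple way to see what was written is to list the output directory after training:
# Inspect what was written to the output directory after adapter training
ls -R $LANG-mbertmodel-monotok-adapter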