The pre-training code of CPT is based on Megatron-LM.
For the setup and data processing of CPT, you can refer to the README of Megatron-LM. In addition, the package `jieba_fast` is required for Whole Word Masking pre-training.
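If `jieba_fast` is not installed yet, it can be installed from PyPI (a minimal setup sketch):

```bash
pip install jieba_fast
```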
First, prepare files in the following folders:

- `dataset/`: Place the `.bin` and `.idx` files preprocessed from raw text.
- `vocab/`: Place the vocab files and the model config file.
- `roberta_zh/`: Place the checkpoint of Chinese RoBERTa, as CPT initializes its encoder from this checkpoint.
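The resulting layout might look like this (a sketch; the file names of the preprocessed data, vocab, and config are assumptions and depend on your setup):

```
dataset/
├── <corpus>.bin
└── <corpus>.idx
vocab/
├── vocab.txt
└── config.json
roberta_zh/
└── <Chinese RoBERTa checkpoint files>
```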
Then, use the scripts `run_pretrain_bart.sh` and `run_pretrain_cpt.sh` to train Chinese BART and CPT, respectively.
NOTE: the training scripts are distributed examples for 8 GPUs. You may alter the number of GPUs and change the training steps to meet your needs.
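For example (a minimal sketch; paths, GPU count, and training steps are configured inside the scripts and may need adjusting for your environment):

```bash
# Launch Chinese BART pre-training (distributed example, 8 GPUs by default)
bash run_pretrain_bart.sh

# Launch CPT pre-training
bash run_pretrain_cpt.sh
```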
The main modifications to Megatron-LM are as follows:

- Add `bart_model` and `cpt_model` under `megatron/model`, so that Megatron can train BART and CPT.
- Add `_HfBertTokenizer` in `megatron/tokenizer/tokenizer.py`, so that Megatron can use tokenizers from Huggingface-Transformers.
- Add `bart_dataset` and `cpt_dataset` under `megatron/data` to produce data for Whole Word Masking (WWM) and Denoising Auto-Encoder (DAE) pre-training.
- Add `tools/convert_ckpt.py` to convert Megatron checkpoints into Huggingface-Transformers format.
- Add `tools/preprocess_data.py` to preprocess and chunk large amounts of raw text into the binary format used by Megatron (see the example invocation after this list).
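As a sketch of the preprocessing step, an invocation might look like the following. The flags are assumptions borrowed from upstream Megatron-LM's `preprocess_data.py` (the modified script here may differ; check `python tools/preprocess_data.py --help`), and the paths and tokenizer type are illustrative:

```bash
# Convert a loose-JSON corpus (one {"text": ...} record per line) into the
# .bin/.idx binary format under dataset/. Flag names follow upstream
# Megatron-LM and are assumptions for this modified script.
python tools/preprocess_data.py \
    --input corpus.json \
    --output-prefix dataset/corpus \
    --vocab-file vocab/vocab.txt \
    --tokenizer-type BertWordPieceLowerCase \
    --dataset-impl mmap \
    --workers 8
```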