
# exBERT

The details of the model are described in the paper.

## Pre-train an exBERT model (only the extension part)

On the command line:

```shell
python Pretraining.py -e 1 \
  -b 256 \
  -sp path_to_storage \
  -dv 0 1 2 3 \
  -lr 1e-04 \
  -str exBERT \
  -config path_to_config_file_of_the_OFF_THE_SHELF_MODEL ./config_and_vocab/exBERT/bert_config_ex_s3.json \
  -vocab ./config_and_vocab/exBERT/exBERT_vocab.txt \
  -pm_p path_to_state_dict_of_the_OFF_THE_SHELF_MODEL \
  -dp path_to_your_training_data \
  -ls 128 \
  -p 1
```

You can replace `path_to_config_file_of_the_OFF_THE_SHELF_MODEL` and `path_to_state_dict_of_the_OFF_THE_SHELF_MODEL` with those of any well pre-trained model in the BERT architecture. `./config_and_vocab/exBERT/bert_config_ex_s3.json` defines the size of the extension module.

## Pre-train an exBERT model (whole model)

```shell
python Pretraining.py -e 1 \
  -b 256 \
  -sp path_to_storage \
  -dv 0 1 2 3 \
  -lr 1e-04 \
  -str exBERT \
  -config path_to_config_file_of_the_OFF_THE_SHELF_MODEL ./config_and_vocab/exBERT/bert_config_ex_s3.json \
  -vocab ./config_and_vocab/exBERT/exBERT_vocab.txt \
  -pm_p path_to_state_dict_of_the_OFF_THE_SHELF_MODEL \
  -dp path_to_your_training_data \
  -ls 128 \
  -p 1 \
  -t_ex_only ""
```

Passing `-t_ex_only ""` enables training of the whole model.

## Pre-train an exBERT model with no vocabulary extension

```shell
python Pretraining.py -e 1 \
  -b 256 \
  -sp path_to_storage \
  -dv 0 1 2 3 \
  -lr 1e-04 \
  -str exBERT \
  -config path_to_config_file_of_the_OFF_THE_SHELF_MODEL config_and_vocab/exBERT_no_ex_vocab/bert_config_ex_s3.json \
  -vocab path_to_vocab_file_of_the_OFF_THE_SHELF_MODEL \
  -pm_p path_to_state_dict_of_the_OFF_THE_SHELF_MODEL \
  -dp path_to_your_training_data \
  -ls 128 \
  -p 1 \
  -t_ex_only ""
```

## Data preparation

Input data for the pre-training script should be a `.pkl` file containing a list with two elements, e.g. `[list1, list2]`. `list1` and `list2` should contain sentences in the form `[CLS] sentence A [SEP] sentence B [SEP]`. The only difference between `list1` and `list2` is whether the relationship between sentence A and sentence B is IsNext or NotNext. Please check `example_data.pkl` for reference.
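As an illustration, here is a minimal sketch of writing such a file with Python's `pickle`. The sentence strings are placeholders, and it assumes each element is stored as a plain string; check `example_data.pkl` for the exact representation the repository uses.

```python
import pickle

# Two placeholder sentence pairs in the "[CLS] A [SEP] B [SEP]" form.
# Assumption: each element is a plain string; verify against example_data.pkl.
list1 = ["[CLS] sentence A [SEP] sentence B [SEP]"]  # IsNext pairs
list2 = ["[CLS] sentence A [SEP] sentence C [SEP]"]  # NotNext pairs

# The pre-training script expects a single list with the two lists inside.
with open("your_data.pkl", "wb") as f:
    pickle.dump([list1, list2], f)

# Round-trip to confirm the structure survives serialization.
with open("your_data.pkl", "rb") as f:
    data = pickle.load(f)
```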

We also provide a simple script to generate the data from a raw text file:

```shell
python data_preprocess.py -voc path_to_vocab_file -ls 128 -dp path_to_txt_file -n_c 5 -rd 1 -sp ./your_data.pkl
```

Replace `128` with the maximum sequence length you want. For example, try:

```shell
python data_preprocess.py -voc ./exBERT_vocab.txt -ls 128 -dp ./example_raw_text.txt -n_c 5 -rd 1 -sp ./example_data.pkl
```

Alternatively, you can do your own data preparation and organize the data in the format mentioned above.
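For instance, here is a hedged sketch of one possible home-grown preparation, not the repository's `data_preprocess.py`: consecutive sentences form IsNext pairs, while NotNext pairs draw sentence B at random from elsewhere in the corpus. The sentences and output path are placeholders.

```python
import pickle
import random

def build_pairs(sentences, seed=0):
    """Build [IsNext list, NotNext list] from an ordered list of sentences.

    A sketch under the format described above: each element is a string
    of the form "[CLS] A [SEP] B [SEP]".
    """
    rng = random.Random(seed)
    is_next, not_next = [], []
    for a, b in zip(sentences, sentences[1:]):
        # Consecutive sentences: the IsNext relationship holds.
        is_next.append(f"[CLS] {a} [SEP] {b} [SEP]")
        # Randomly sampled sentence B: the NotNext relationship.
        rand_b = rng.choice(sentences)
        not_next.append(f"[CLS] {a} [SEP] {rand_b} [SEP]")
    return [is_next, not_next]

# Placeholder corpus; in practice, read and sentence-split your raw text.
sentences = [
    "the patient was admitted with chest pain .",
    "an ecg was performed on arrival .",
    "troponin levels were within normal limits .",
]
data = build_pairs(sentences)
with open("your_data.pkl", "wb") as f:
    pickle.dump(data, f)
```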
