
[BERT] Distributed Training Support #478

Closed
eric-haibin-lin opened this issue Dec 21, 2018 · 6 comments

@eric-haibin-lin
Member

eric-haibin-lin commented Dec 21, 2018

As title.
Related issues:
apache/mxnet#14124
apache/incubator-mxnet#14073
apache/incubator-mxnet#14072
apache/mxnet#11061
apache/mxnet#14125
apache/mxnet#14126

@ymjiang
Member

ymjiang commented Jan 25, 2019

@eric-haibin-lin
Hi Eric, may I know if there has been any recent progress on distributed training support? I am especially interested in training BERT and the Transformer in a distributed way. I made a preliminary modification to train_transformer.py by adding kvstore as a parameter to the gluon.Trainer, but I am not sure this is the correct way to do it. Besides, I haven't tried splitting the dataset yet.

Official support would be better and more reliable. Thanks.
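
For reference, a minimal sketch of the kind of change described above, assuming the usual gluon.Trainer signature; `model`, `optimizer`, and `optimizer_params` are placeholders for whatever train_transformer.py actually defines:

```python
# Sketch only, not the actual diff to train_transformer.py:
# pass a distributed kvstore to gluon.Trainer instead of the
# default local 'device' kvstore.
from mxnet import gluon

trainer = gluon.Trainer(model.collect_params(),   # placeholder model
                        optimizer,                # e.g. 'adam'
                        optimizer_params,
                        kvstore='dist_sync_device')
```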

@eric-haibin-lin
Member Author

Hi @ymjiang I have some code locally and I am still testing it. In general you need to:

  • Create a SplitSampler and DatasetStream for the DataLoader so that each worker samples a subset of the dataset
  • Create the Trainer with a "dist_sync_device" KVStore and an LRScheduler
  • Adjust trainer.step based on the global batch size (see the sketch below)

I am thinking about writing a tutorial, too, but that will happen after I'm done with my current task. Feel free to ask questions if any come up.
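
A rough sketch of the Trainer and trainer.step points, assuming `net` and `lr_scheduler` are defined elsewhere (both are placeholder names):

```python
# Sketch only: distributed Trainer setup and step-size adjustment.
import mxnet as mx
from mxnet import gluon

kv = mx.kv.create('dist_sync_device')    # one kvstore handle per worker process

per_worker_batch_size = 32               # local batch size on this worker
global_batch_size = per_worker_batch_size * kv.num_workers

trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 1e-4,
                         'lr_scheduler': lr_scheduler},
                        kvstore=kv)

# ... forward/backward on this worker's shard of the batch ...

# A dist_sync kvstore sums gradients across workers, so normalize by
# the global batch size rather than the per-worker one.
trainer.step(global_batch_size)
```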

@ymjiang
Member

ymjiang commented Jan 27, 2019

@eric-haibin-lin Thank you very much for the tips. I have another question about changing the FixedBucketSampler (which is used in training the Transformer) to a SplitSampler. Will that affect training (e.g., accuracy)? I am not sure whether bucketing is an important mechanism for NLP tasks. Forgive me if this is a silly question.

@eric-haibin-lin
Member Author

Hi @ymjiang, a bucket sampler is a way to create data batches of similar lengths from a dataset; see https://github.com/szha/KDD18-Gluon/blob/master/05_data_pipeline/2-data-pipeline.ipynb for an overview.

GluonNLP also has the concept of "streams". In particular, a dataset stream is an iterator that loads one dataset (typically one file) at a time. If your training data consists of multiple files, you can pass the SplitSampler to the DatasetStream so that each machine iterates through a subset of the files.
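
A rough sketch of that file-level split, assuming GluonNLP's SplitSampler and SimpleDatasetStream APIs (names and signatures may differ across versions; verify against your install). Since the split happens at the file level, a FixedBucketSampler can still bucket by length within each worker's shard:

```python
# Sketch only: give each worker a disjoint subset of the training files.
import glob
import mxnet as mx
import gluonnlp as nlp

file_pattern = 'data/part-*.txt'          # placeholder path
num_files = len(glob.glob(file_pattern))

# Or reuse the kvstore created for the Trainer sketch above.
kv = mx.kv.create('dist_sync_device')

# Each worker iterates over roughly num_files / num_workers of the files.
split_sampler = nlp.data.SplitSampler(num_files,
                                      num_parts=kv.num_workers,
                                      part_index=kv.rank)

stream = nlp.data.SimpleDatasetStream(
    nlp.data.CorpusDataset,               # one dataset object per file
    file_pattern,
    split_sampler)

for dataset in stream:                    # this worker's shard of the files
    pass                                  # bucket, batch, and train here
```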

@ymjiang
Member

ymjiang commented Jan 29, 2019

@eric-haibin-lin I will take a look, thanks!

@eric-haibin-lin
Member Author

Added in #665
