
[BERT] Distributed Training Support #478

Closed
eric-haibin-lin opened this issue Dec 21, 2018 · 6 comments

@eric-haibin-lin
Member

eric-haibin-lin commented Dec 21, 2018

As title.
Related issues:
apache/mxnet#14124
apache/incubator-mxnet#14073
apache/incubator-mxnet#14072
apache/mxnet#11061
apache/mxnet#14125
apache/mxnet#14126

@ymjiang
Member

ymjiang commented Jan 25, 2019

@eric-haibin-lin
Hi Eric, may I know if there has been any recent progress on distributed training support? I am especially interested in training BERT and the Transformer in a distributed way. I made a preliminary modification to train_transformer.py by adding kvstore as a parameter to the gluon.Trainer, but I am not sure this is the correct way to do it. Besides, I haven't tried splitting the dataset yet.

Official support would be better and more reliable. Thanks.
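
For reference, a minimal sketch of the kind of change described above, assuming the usual gluon.Trainer signature; `model`, `optimizer`, and `optimizer_params` are placeholders for whatever train_transformer.py actually defines:

```python
# Sketch only, not the actual diff to train_transformer.py:
# pass a distributed kvstore to gluon.Trainer instead of the
# default local 'device' kvstore.
from mxnet import gluon

trainer = gluon.Trainer(model.collect_params(),   # placeholder model
                        optimizer,                # e.g. 'adam'
                        optimizer_params,
                        kvstore='dist_sync_device')
```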

@eric-haibin-lin
Member Author

Hi @ymjiang I have some code locally and I am still testing it. In general you need to:

  • Create a SplitSampler and DatasetStream for the DataLoader so that each worker samples a subset of the dataset
  • Create the Trainer with a "dist_sync_device" KVStore and an LRScheduler
  • Adjust trainer.step based on the global batch size (see the sketch below)

I am thinking about writing a tutorial, too, but that will happen after I'm done with my current task. Feel free to ask questions if any come up.
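
A rough sketch of the Trainer and trainer.step points, assuming `net` and `lr_scheduler` are defined elsewhere (both are placeholder names):

```python
# Sketch only: distributed Trainer setup and step-size adjustment.
import mxnet as mx
from mxnet import gluon

kv = mx.kv.create('dist_sync_device')    # one kvstore handle per worker process

per_worker_batch_size = 32               # local batch size on this worker
global_batch_size = per_worker_batch_size * kv.num_workers

trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 1e-4,
                         'lr_scheduler': lr_scheduler},
                        kvstore=kv)

# ... forward/backward on this worker's shard of the batch ...

# A dist_sync kvstore sums gradients across workers, so normalize by
# the global batch size rather than the per-worker one.
trainer.step(global_batch_size)
```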

@ymjiang
Member

ymjiang commented Jan 27, 2019

@eric-haibin-lin Thank you very much for the tips. I have another question about changing the FixedBucketSampler (which is used in training the Transformer) to a SplitSampler. Will that affect training (e.g., accuracy)? I am not sure whether bucketing is an important mechanism for NLP tasks. Forgive me if this is a silly question.

@eric-haibin-lin
Member Author

Hi @ymjiang, a bucket sampler is a way to create data batches of similar lengths from a dataset; see https://github.com/szha/KDD18-Gluon/blob/master/05_data_pipeline/2-data-pipeline.ipynb for an overview.

GluonNLP also has the concept of "streams". In particular, a dataset stream is an iterator that loads one dataset (typically one file) at a time. If your training data consists of multiple files, you can pass the SplitSampler to the DatasetStream so that each machine iterates through a subset of the files.
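
A rough sketch of that file-level split, assuming GluonNLP's SplitSampler and SimpleDatasetStream APIs (names and signatures may differ across versions; verify against your install). Since the split happens at the file level, a FixedBucketSampler can still bucket by length within each worker's shard:

```python
# Sketch only: give each worker a disjoint subset of the training files.
import glob
import mxnet as mx
import gluonnlp as nlp

file_pattern = 'data/part-*.txt'          # placeholder path
num_files = len(glob.glob(file_pattern))

# Or reuse the kvstore created for the Trainer sketch above.
kv = mx.kv.create('dist_sync_device')

# Each worker iterates over roughly num_files / num_workers of the files.
split_sampler = nlp.data.SplitSampler(num_files,
                                      num_parts=kv.num_workers,
                                      part_index=kv.rank)

stream = nlp.data.SimpleDatasetStream(
    nlp.data.CorpusDataset,               # one dataset object per file
    file_pattern,
    split_sampler)

for dataset in stream:                    # this worker's shard of the files
    pass                                  # bucket, batch, and train here
```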

@ymjiang
Member

ymjiang commented Jan 29, 2019

@eric-haibin-lin I will take a look, thanks!

@eric-haibin-lin
Member Author

Added in #665
