-
Notifications
You must be signed in to change notification settings - Fork 538
[FEATURE]Horovod support for training transformer (PART 2) #1301
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1301 +/- ##
==========================================
+ Coverage 84.14% 84.45% +0.31%
==========================================
Files 42 42
Lines 6397 6422 +25
==========================================
+ Hits 5383 5424 +41
+ Misses 1014 998 -16
Continue to review full report at Codecov.
|
For the diff check failure, it might be caused by the fact that the exception is not triggered: gluon-nlp/src/gluonnlp/utils/misc.py Lines 562 to 568 in 32e87d4
Thus, we may try to add a test-case which downloads an non-existing file and use |
assuming no race issue in result upload, this can be a result of non-determinism in tests, and the code path in the same test can vary. if that's the case, we can mock the randomness to deterministically test each path. or if we still require the randomness, we can increase the trial number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
commit d8b68c6 Author: Xingjian Shi <[email protected]> Date: Thu Aug 20 08:47:56 2020 -0700 [Numpy] Fix AWS Batch + Add Docker Support (dmlc#1302) * Update submit-job.py Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * try to add back mxnet support * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * update * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * fix issues * update commit 6ae558e Author: ht <[email protected]> Date: Thu Aug 20 23:47:30 2020 +0800 [FEATURE]Horovod support for training transformer (PART 2) (dmlc#1301) * set default shuffle=True for boundedbudgetsampler * fix * fix log condition * use horovod to train transformer * fix * add mirror wmt dataset * fix * rename wmt.txt to wmt.json and remove part of urls * fix * tuning params * use get_repo_url() * update average checkpoint cli * paste result of transformer large * fix * fix logging in train_transformer * fix * fix * fix * add transformer base config * fix * change to wmt14/full * print more sacrebleu info * fix * add test for num_parts and update behavior of boundedbudgetsampler with even_size * fix * fix * fix * fix logging when using horovd * udpate doc of train transformer * add test case for fail downloading * add a ShardedIterator * fix * fix * fix * change mpirun to horovodrun * make the horovod command complete * use print(sampler) to cover the codes of __repr__ func * empty commit * add test case test_sharded_iterator_even_size Co-authored-by: Hu <[email protected]>
commit 7525618 Author: ZheyuYe <[email protected]> Date: Fri Aug 21 11:25:38 2020 +0800 Squashed commit of the following: commit d8b68c6 Author: Xingjian Shi <[email protected]> Date: Thu Aug 20 08:47:56 2020 -0700 [Numpy] Fix AWS Batch + Add Docker Support (dmlc#1302) * Update submit-job.py Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * try to add back mxnet support * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * update * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * fix issues * update commit 6ae558e Author: ht <[email protected]> Date: Thu Aug 20 23:47:30 2020 +0800 [FEATURE]Horovod support for training transformer (PART 2) (dmlc#1301) * set default shuffle=True for boundedbudgetsampler * fix * fix log condition * use horovod to train transformer * fix * add mirror wmt dataset * fix * rename wmt.txt to wmt.json and remove part of urls * fix * tuning params * use get_repo_url() * update average checkpoint cli * paste result of transformer large * fix * fix logging in train_transformer * fix * fix * fix * add transformer base config * fix * change to wmt14/full * print more sacrebleu info * fix * add test for num_parts and update behavior of boundedbudgetsampler with even_size * fix * fix * fix * fix logging when using horovd * udpate doc of train transformer * add test case for fail downloading * add a ShardedIterator * fix * fix * fix * change mpirun to horovodrun * make the horovod command complete * use print(sampler) to cover the codes of __repr__ func * empty commit * add test case test_sharded_iterator_even_size Co-authored-by: Hu <[email protected]> commit 1403c6e Author: ZheyuYe <[email protected]> Date: Fri Aug 21 11:15:44 2020 +0800 update uncased_bert_large commit 733a4b6 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 20:16:39 2020 +0800 adjust uncased_bert_large commit 770f079 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 15:10:57 2020 +0800 Revert "merge xingjian's" This reverts commit ea1f1aa. commit fe74dda Author: ZheyuYe <[email protected]> Date: Thu Aug 20 14:07:36 2020 +0800 update electra small commit 8972343 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 14:00:57 2020 +0800 add command to readme commit 8fcde49 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 12:30:47 2020 +0800 revise commit 7a625c4 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 12:21:58 2020 +0800 update reamde commit 071c6dd Author: ZheyuYe <[email protected]> Date: Wed Aug 19 17:14:53 2020 +0800 update bert squad command commit ea1f1aa Author: ZheyuYe <[email protected]> Date: Tue Aug 18 18:07:01 2020 +0800 merge xingjian's commit 859ab4d Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:47:01 2020 +0800 dummy example commit 633e683 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:36:31 2020 +0800 list_backbone_names commit b4aac59 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:32:51 2020 +0800 update readme commit 54301d9 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 13:59:06 2020 +0800 revise batch squad commit e019e27 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 13:58:49 2020 +0800 bash convert commit e01eda0 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 11:10:51 2020 +0800 update roberta commit 1730ff7 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 10:15:27 2020 +0800 revise submit commit de0b4c9 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 16:07:58 2020 +0800 upload batch files commit 175de01 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 16:05:02 2020 +0800 fix commit 0460ed3 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 15:48:52 2020 +0800 upload commands
* Squashed commit of the following: commit 7525618 Author: ZheyuYe <[email protected]> Date: Fri Aug 21 11:25:38 2020 +0800 Squashed commit of the following: commit d8b68c6 Author: Xingjian Shi <[email protected]> Date: Thu Aug 20 08:47:56 2020 -0700 [Numpy] Fix AWS Batch + Add Docker Support (#1302) * Update submit-job.py Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * try to add back mxnet support * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * update * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * fix issues * update commit 6ae558e Author: ht <[email protected]> Date: Thu Aug 20 23:47:30 2020 +0800 [FEATURE]Horovod support for training transformer (PART 2) (#1301) * set default shuffle=True for boundedbudgetsampler * fix * fix log condition * use horovod to train transformer * fix * add mirror wmt dataset * fix * rename wmt.txt to wmt.json and remove part of urls * fix * tuning params * use get_repo_url() * update average checkpoint cli * paste result of transformer large * fix * fix logging in train_transformer * fix * fix * fix * add transformer base config * fix * change to wmt14/full * print more sacrebleu info * fix * add test for num_parts and update behavior of boundedbudgetsampler with even_size * fix * fix * fix * fix logging when using horovd * udpate doc of train transformer * add test case for fail downloading * add a ShardedIterator * fix * fix * fix * change mpirun to horovodrun * make the horovod command complete * use print(sampler) to cover the codes of __repr__ func * empty commit * add test case test_sharded_iterator_even_size Co-authored-by: Hu <[email protected]> commit 1403c6e Author: ZheyuYe <[email protected]> Date: Fri Aug 21 11:15:44 2020 +0800 update uncased_bert_large commit 733a4b6 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 20:16:39 2020 +0800 adjust uncased_bert_large commit 770f079 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 15:10:57 2020 +0800 Revert "merge xingjian's" This reverts commit ea1f1aa. commit fe74dda Author: ZheyuYe <[email protected]> Date: Thu Aug 20 14:07:36 2020 +0800 update electra small commit 8972343 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 14:00:57 2020 +0800 add command to readme commit 8fcde49 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 12:30:47 2020 +0800 revise commit 7a625c4 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 12:21:58 2020 +0800 update reamde commit 071c6dd Author: ZheyuYe <[email protected]> Date: Wed Aug 19 17:14:53 2020 +0800 update bert squad command commit ea1f1aa Author: ZheyuYe <[email protected]> Date: Tue Aug 18 18:07:01 2020 +0800 merge xingjian's commit 859ab4d Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:47:01 2020 +0800 dummy example commit 633e683 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:36:31 2020 +0800 list_backbone_names commit b4aac59 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:32:51 2020 +0800 update readme commit 54301d9 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 13:59:06 2020 +0800 revise batch squad commit e019e27 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 13:58:49 2020 +0800 bash convert commit e01eda0 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 11:10:51 2020 +0800 update roberta commit 1730ff7 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 10:15:27 2020 +0800 revise submit commit de0b4c9 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 16:07:58 2020 +0800 upload batch files commit 175de01 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 16:05:02 2020 +0800 fix commit 0460ed3 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 15:48:52 2020 +0800 upload commands * add mobilebert * replace remote * fix branch * fix typo Co-authored-by: Yuma1L <[email protected]>
Checklist
Essentials
Changes