[FEATURE]Horovod support for training transformer (PART 2) #1301

hutao965 · 2020-08-17T08:52:35Z

Checklist

Essentials

PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented

Changes

add a horovodrun example in machine_translation/README.md
add transformer base/large results (test and validation BLEU)
add ShardedIterator (to split a sampler) and its test case
add a test case in test_util_misc.py for the situation where downloading fails
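
To illustrate the idea behind splitting a sampler across workers, here is a minimal standalone sketch. It is not the gluonnlp `ShardedIterator` implementation: `shard_batches` and `part_index` are hypothetical names, while `num_parts` and `even_size` are borrowed from the commit messages in this PR. The assumption is that each worker takes a contiguous slice of the batches and that `even_size` pads the list so every worker gets the same number of batches, which matters under Horovod because workers that run different numbers of steps can stall on collective operations.

```python
import itertools
import math


def shard_batches(batches, num_parts, part_index, even_size=False):
    """Return the contiguous slice of `batches` owned by worker `part_index`.

    Hypothetical helper for illustration only; not the gluonnlp code.
    """
    if not 0 <= part_index < num_parts:
        raise ValueError('part_index must be in [0, num_parts)')
    batches = list(batches)
    if even_size and batches:
        # Repeat batches from the start until the total is a multiple of
        # num_parts, so that every worker runs the same number of steps.
        target = math.ceil(len(batches) / num_parts) * num_parts
        batches = list(itertools.islice(itertools.cycle(batches), target))
    per_part = math.ceil(len(batches) / num_parts)
    start = per_part * part_index
    return batches[start:start + per_part]


batches = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
uneven = [shard_batches(batches, 2, i) for i in range(2)]
even = [shard_batches(batches, 2, i, even_size=True) for i in range(2)]
print([len(p) for p in uneven], [len(p) for p in even])
```

With five batches and two workers, the uneven split gives 3 and 2 batches, while `even_size=True` repeats the first batch so both workers get 3.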

codecov · 2020-08-17T09:46:30Z

Codecov Report

Merging #1301 into master will increase coverage by 0.31%.
The diff coverage is 97.14%.

@@            Coverage Diff             @@
##           master    #1301      +/-   ##
==========================================
+ Coverage   84.14%   84.45%   +0.31%     
==========================================
  Files          42       42              
  Lines        6397     6422      +25     
==========================================
+ Hits         5383     5424      +41     
+ Misses       1014      998      -16
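
As a sanity check on the figures above, coverage is hits divided by lines, and hits plus misses should equal lines. The sketch below recomputes the percentages from the reported counts; `coverage_bp` is a hypothetical helper, and the use of truncation (rather than rounding) is an assumption that happens to reproduce the reported values for these numbers.

```python
def coverage_bp(hits, lines):
    """Coverage in basis points (hundredths of a percent), truncated."""
    return hits * 10000 // lines


# Counts taken from the Codecov diff above.
before = coverage_bp(5383, 6397)   # master
after = coverage_bp(5424, 6422)    # this PR

# Hits + misses should add up to the total number of lines.
assert 5383 + 1014 == 6397
assert 5424 + 998 == 6422

print(before / 100, after / 100, (after - before) / 100)
```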

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/gluonnlp/data/sampler.py | `96.55% <97.14%> (+0.32%)` | ⬆️ |
| src/gluonnlp/utils/misc.py | `52.53% <0.00%> (+0.63%)` | ⬆️ |
| src/gluonnlp/data/loading.py | `83.39% <0.00%> (+5.28%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 32e87d4...db8afdb.

sxjscience · 2020-08-17T19:41:19Z

The diff check failure might be because the exception branch is never triggered:

gluon-nlp/src/gluonnlp/utils/misc.py

Lines 562 to 568 in 32e87d4

    
    except Exception as e:
        retries -= 1
        if retries <= 0:
            raise e
        print('download failed due to {}, retrying, {} attempt{} left'
              .format(repr(e), retries, 's' if retries > 1 else ''))

Thus, we may try to add a test case that downloads a non-existent file and uses pytest.raises to check that the exception is raised.
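
A test along these lines could look like the sketch below. To keep it self-contained it does not call the real download utility in `gluonnlp/utils/misc.py`: `download_with_retries` and `failing_fetch` are hypothetical stand-ins that reproduce only the retry loop quoted above, and the URL is a placeholder.

```python
def download_with_retries(fetch, url, retries=3):
    """Hypothetical stand-in mirroring the retry loop quoted above."""
    while True:
        try:
            return fetch(url)
        except Exception as e:
            retries -= 1
            if retries <= 0:
                raise e
            print('download failed due to {}, retrying, {} attempt{} left'
                  .format(repr(e), retries, 's' if retries > 1 else ''))


calls = []


def failing_fetch(url):
    """Simulates a request for a file that does not exist."""
    calls.append(url)
    raise FileNotFoundError('no such file: {}'.format(url))


def test_download_raises_after_retries():
    calls.clear()
    try:
        download_with_retries(failing_fetch,
                              'https://example.invalid/missing.txt',
                              retries=3)
    except FileNotFoundError:
        pass  # expected: the error propagates once retries are exhausted
    else:
        raise AssertionError('expected FileNotFoundError')
    # Every attempt went through the except branch.
    assert len(calls) == 3


test_download_raises_after_retries()
```

Under pytest, the `try`/`except`/`else` can be replaced with `with pytest.raises(FileNotFoundError):`, which is the shorter form suggested in the comment.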

sxjscience · 2020-08-19T06:59:22Z

I find that Codecov is not very stable, especially the diff hit. @szha @leezu Do you know how we could improve it?

szha · 2020-08-19T07:17:16Z

Assuming there is no race in the result upload, this can be caused by non-determinism in the tests: the code path taken within the same test can vary from run to run.

If that's the case, we can mock the randomness to test each path deterministically. Or, if we still need the randomness, we can increase the number of trials.
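
As a rough illustration of mocking the randomness, the sketch below patches `random.random` so that each branch of a randomized function is exercised on every run. `maybe_reverse` is a hypothetical function invented for the example and is not taken from the gluonnlp code base.

```python
import random
from unittest import mock


def maybe_reverse(items, prob=0.5):
    """Hypothetical function with a random branch: reverse with probability `prob`."""
    if random.random() < prob:
        return list(reversed(items))
    return list(items)


def test_both_paths():
    # Force the "reverse" branch.
    with mock.patch('random.random', return_value=0.0):
        assert maybe_reverse([1, 2, 3]) == [3, 2, 1]
    # Force the "keep order" branch.
    with mock.patch('random.random', return_value=0.9):
        assert maybe_reverse([1, 2, 3]) == [1, 2, 3]


test_both_paths()
```

Because both branches are hit deterministically, the set of covered lines no longer depends on which random values a given CI run happens to draw.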

sxjscience

LGTM

commit d8b68c6
Author: Xingjian Shi
Date: Thu Aug 20 08:47:56 2020 -0700

[Numpy] Fix AWS Batch + Add Docker Support (dmlc#1302)

* Update submit-job.py Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile
* Update ubuntu18.04-devel-gpu.Dockerfile
* try to add back mxnet support
* Update ubuntu18.04-devel-gpu.Dockerfile
* Update ubuntu18.04-devel-gpu.Dockerfile
* update
* Update ubuntu18.04-devel-gpu.Dockerfile
* Update ubuntu18.04-devel-gpu.Dockerfile
* Update ubuntu18.04-devel-gpu.Dockerfile
* fix issues
* update

commit 6ae558e
Author: ht
Date: Thu Aug 20 23:47:30 2020 +0800

[FEATURE]Horovod support for training transformer (PART 2) (dmlc#1301)

* set default shuffle=True for boundedbudgetsampler
* fix
* fix log condition
* use horovod to train transformer
* fix
* add mirror wmt dataset
* fix
* rename wmt.txt to wmt.json and remove part of urls
* fix
* tuning params
* use get_repo_url()
* update average checkpoint cli
* paste result of transformer large
* fix
* fix logging in train_transformer
* fix
* fix
* fix
* add transformer base config
* fix
* change to wmt14/full
* print more sacrebleu info
* fix
* add test for num_parts and update behavior of boundedbudgetsampler with even_size
* fix
* fix
* fix
* fix logging when using horovd
* udpate doc of train transformer
* add test case for fail downloading
* add a ShardedIterator
* fix
* fix
* fix
* change mpirun to horovodrun
* make the horovod command complete
* use print(sampler) to cover the codes of __repr__ func
* empty commit
* add test case test_sharded_iterator_even_size

Co-authored-by: Hu

Hu and others added 30 commits July 28, 2020 12:31

set default shuffle=True for boundedbudgetsampler

c564420

fix

7f27b85

fix log condition

20d2fe1

use horovod to train transformer

54a1abf

Merge pull request #1 from dmlc/numpy

5b39c75

Numpy

fix

5685601

add mirror wmt dataset

5815f12

fix

4001821

Merge pull request #2 from dmlc/numpy

e280a65

Numpy

rename wmt.txt to wmt.json and remove part of urls

5ba3789

fix

c99503f

tuning params

cf8bcd3

Merge branch 'numpy' of https://github.com/hymzoque/gluon-nlp into numpy

3c1c5c0

use get_repo_url()

cc760d4

update average checkpoint cli

93243de

paste result of transformer large

73b942e

fix

3a969d5

fix logging in train_transformer

48d1fb9

fix

983d1ab

fix

9f7c087

fix

7dae5ad

add transformer base config

af18aca

fix

eadf4db

Merge branch 'numpy' of https://github.com/dmlc/gluon-nlp into numpy

192da91

change to wmt14/full

cfe7705

print more sacrebleu info

e200440

fix

a84a2d0

add test for num_parts and update behavior of boundedbudgetsampler with even_size

681dc00

fix

179b8db

fix

eccceb7

hutao965 requested a review from sxjscience August 17, 2020 08:53

zheyuye reviewed Aug 17, 2020
