This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[FEATURE]Horovod support for training transformer (PART 2) #1301

Merged
merged 44 commits into from
Aug 20, 2020

Conversation

@hutao965 hutao965 commented Aug 17, 2020

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Add a horovodrun example to machine_translation/README.md
  • Report transformer base/large results (test and validation BLEU)
  • Add ShardedIterator (to split a sampler into parts) and its test case
  • Add a test case in test_util_misc.py for the situation where downloading fails
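
To illustrate the idea of splitting a sampler into parts, here is a minimal sketch in plain Python. It is not the ShardedIterator added in this PR: the function name `shard_batches`, its parameters, and the wrap-around padding used when `even_size` is set are all assumptions made for illustration.

```python
import math


def shard_batches(batches, num_parts, part_index, even_size=False):
    """Return the slice of `batches` that belongs to one worker.

    Hypothetical helper, not the gluonnlp implementation.
    """
    if not 0 <= part_index < num_parts:
        raise ValueError('part_index must be in [0, num_parts)')
    batches = list(batches)
    n = len(batches)

    if even_size:
        # Every part gets ceil(n / num_parts) items; short parts are
        # padded by wrapping around to the start (an assumption).
        part_len = math.ceil(n / num_parts)
        start = min(part_len * part_index, n)
        end = min(start + part_len, n)
        part = batches[start:end]
        i = 0
        while len(part) < part_len and n > 0:
            part.append(batches[i % n])
            i += 1
        return part

    # Otherwise spread the remainder over the first `rem` parts.
    base, rem = divmod(n, num_parts)
    start = base * part_index + min(part_index, rem)
    end = start + base + (1 if part_index < rem else 0)
    return batches[start:end]
```

Under Horovod, each worker would typically pass `hvd.size()` as `num_parts` and `hvd.rank()` as `part_index`, so that every worker iterates over a different part of the data.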

codecov bot commented Aug 17, 2020

Codecov Report

Merging #1301 into master will increase coverage by 0.31%.
The diff coverage is 97.14%.

@@            Coverage Diff             @@
##           master    #1301      +/-   ##
==========================================
+ Coverage   84.14%   84.45%   +0.31%     
==========================================
  Files          42       42              
  Lines        6397     6422      +25     
==========================================
+ Hits         5383     5424      +41     
+ Misses       1014      998      -16     
Impacted Files                  Coverage Δ
src/gluonnlp/data/sampler.py    96.55% <97.14%> (+0.32%) ⬆️
src/gluonnlp/utils/misc.py      52.53% <0.00%> (+0.63%) ⬆️
src/gluonnlp/data/loading.py    83.39% <0.00%> (+5.28%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 32e87d4...db8afdb.

@sxjscience

The diff check failure might be caused by the exception path never being triggered:

except Exception as e:
    retries -= 1
    if retries <= 0:
        raise e
    print('download failed due to {}, retrying, {} attempt{} left'
          .format(repr(e), retries, 's' if retries > 1 else ''))

Thus, we could add a test case that tries to download a non-existent file and uses pytest.raises.
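
A minimal sketch of such a test might look like the following. The helper `download_with_retries` and its `fetch` parameter are hypothetical stand-ins for the real download function in misc.py. A fetcher that always fails is injected so the retry branch and the final re-raise are both exercised without touching the network.

```python
def download_with_retries(url, fetch, retries=3):
    """Hypothetical helper that mirrors the retry loop quoted above."""
    if retries < 1:
        raise ValueError('retries must be at least 1')
    while True:
        try:
            return fetch(url)
        except Exception as e:
            retries -= 1
            if retries <= 0:
                raise e
            print('download failed due to {}, retrying, {} attempt{} left'
                  .format(repr(e), retries, 's' if retries > 1 else ''))


def always_missing(url):
    """Stand-in fetcher that behaves like a request for a non-existent file."""
    raise OSError('no such file: {}'.format(url))


def test_download_nonexistent_file():
    # pytest is imported here so the helper above stays usable without it.
    import pytest
    with pytest.raises(OSError):
        download_with_retries('https://example.invalid/missing.txt',
                              fetch=always_missing, retries=2)
```

Because the fetcher fails on every attempt, the `print` branch runs once and the `raise e` branch runs on the last attempt, so both lines are covered on every run.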

@sxjscience

I find that Codecov is not very stable, especially the diff hit count. @szha @leezu Do you know how we could improve it?

szha commented Aug 19, 2020

Assuming there is no race in the result upload, this can be a result of non-determinism in the tests: the code path taken within the same test can vary between runs.

If that's the case, we can mock the randomness to test each path deterministically. Or, if we still require the randomness, we can increase the number of trials.
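
As a rough sketch of the mocking approach, the example below pins the random draw with `unittest.mock` so that each branch is hit on every run. The function `pick_branch` is a hypothetical stand-in for whatever code in the tests depends on randomness.

```python
import random
from unittest import mock


def pick_branch(threshold=0.5):
    """Hypothetical function whose code path depends on a random draw."""
    if random.random() < threshold:
        return 'retry'
    return 'success'


def test_retry_branch():
    # Force the draw below the threshold so the 'retry' path always runs.
    with mock.patch.object(random, 'random', return_value=0.0):
        assert pick_branch() == 'retry'


def test_success_branch():
    # Force the draw above the threshold so the 'success' path always runs.
    with mock.patch.object(random, 'random', return_value=0.99):
        assert pick_branch() == 'success'
```

With one test per branch, the set of covered lines no longer depends on which values the random generator happens to produce.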

@sxjscience sxjscience left a comment

LGTM

@sxjscience sxjscience merged commit 6ae558e into dmlc:master Aug 20, 2020
zheyuye added a commit to zheyuye/gluon-nlp that referenced this pull request Aug 21, 2020
commit d8b68c6
Author: Xingjian Shi <[email protected]>
Date:   Thu Aug 20 08:47:56 2020 -0700

    [Numpy] Fix AWS Batch + Add Docker Support (dmlc#1302)

    * Update submit-job.py

    Add LICESE + Examples for batch

    Update docker image

    update

    Update README.md

    Update README.md

    Update ubuntu18.04-devel.Dockerfile

    Update ubuntu18.04-devel.Dockerfile

    Update ubuntu18.04-devel.Dockerfile

    update

    Update ubuntu18.04-devel-gpu.Dockerfile

    fix

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update submit-job.py

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    update

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    update

    update

    Update submit-job.py

    Update submit-job.py

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    try to fix

    fix batch

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update submit-job.py

    Update ubuntu18.04-devel-gpu.Dockerfile

    simplify bert test

    add files

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    Update ubuntu18.04-devel-gpu.Dockerfile

    fix

    Update ubuntu18.04-devel-gpu.Dockerfile

    * Update ubuntu18.04-devel-gpu.Dockerfile

    * try to add back mxnet support

    * Update ubuntu18.04-devel-gpu.Dockerfile

    * Update ubuntu18.04-devel-gpu.Dockerfile

    * update

    * Update ubuntu18.04-devel-gpu.Dockerfile

    * Update ubuntu18.04-devel-gpu.Dockerfile

    * Update ubuntu18.04-devel-gpu.Dockerfile

    * fix issues

    * update

commit 6ae558e
Author: ht <[email protected]>
Date:   Thu Aug 20 23:47:30 2020 +0800

    [FEATURE]Horovod support for training transformer (PART 2) (dmlc#1301)

    * set default shuffle=True for boundedbudgetsampler

    * fix

    * fix log condition

    * use horovod to train transformer

    * fix

    * add mirror wmt dataset

    * fix

    * rename wmt.txt to wmt.json and remove part of urls

    * fix

    * tuning params

    * use get_repo_url()

    * update average checkpoint cli

    * paste result of transformer large

    * fix

    * fix logging in train_transformer

    * fix

    * fix

    * fix

    * add transformer base config

    * fix

    * change to wmt14/full

    * print more sacrebleu info

    * fix

    * add test for num_parts and update behavior of boundedbudgetsampler with even_size

    * fix

    * fix

    * fix

    * fix logging when using horovd

    * udpate doc of train transformer

    * add test case for fail downloading

    * add a ShardedIterator

    * fix

    * fix

    * fix

    * change mpirun to horovodrun

    * make the horovod command complete

    * use print(sampler) to cover the codes of __repr__ func

    * empty commit

    * add test case test_sharded_iterator_even_size

    Co-authored-by: Hu <[email protected]>
zheyuye added a commit to zheyuye/gluon-nlp that referenced this pull request Aug 21, 2020
commit 7525618
Author: ZheyuYe <[email protected]>
Date:   Fri Aug 21 11:25:38 2020 +0800

    Squashed commit of the following:

    commit d8b68c6
    Author: Xingjian Shi <[email protected]>
    Date:   Thu Aug 20 08:47:56 2020 -0700

        [Numpy] Fix AWS Batch + Add Docker Support (dmlc#1302)

    commit 6ae558e
    Author: ht <[email protected]>
    Date:   Thu Aug 20 23:47:30 2020 +0800

        [FEATURE]Horovod support for training transformer (PART 2) (dmlc#1301)

commit 1403c6e
Author: ZheyuYe <[email protected]>
Date:   Fri Aug 21 11:15:44 2020 +0800

    update uncased_bert_large

commit 733a4b6
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 20:16:39 2020 +0800

    adjust uncased_bert_large

commit 770f079
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 15:10:57 2020 +0800

    Revert "merge xingjian's"

    This reverts commit ea1f1aa.

commit fe74dda
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 14:07:36 2020 +0800

    update electra small

commit 8972343
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 14:00:57 2020 +0800

    add command to readme

commit 8fcde49
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 12:30:47 2020 +0800

    revise

commit 7a625c4
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 12:21:58 2020 +0800

    update reamde

commit 071c6dd
Author: ZheyuYe <[email protected]>
Date:   Wed Aug 19 17:14:53 2020 +0800

    update bert squad command

commit ea1f1aa
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 18:07:01 2020 +0800

    merge xingjian's

commit 859ab4d
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 17:47:01 2020 +0800

    dummy example

commit 633e683
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 17:36:31 2020 +0800

    list_backbone_names

commit b4aac59
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 17:32:51 2020 +0800

    update readme

commit 54301d9
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 13:59:06 2020 +0800

    revise batch squad

commit e019e27
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 13:58:49 2020 +0800

    bash convert

commit e01eda0
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 11:10:51 2020 +0800

    update roberta

commit 1730ff7
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 10:15:27 2020 +0800

    revise submit

commit de0b4c9
Author: ZheyuYe <[email protected]>
Date:   Mon Aug 17 16:07:58 2020 +0800

    upload batch files

commit 175de01
Author: ZheyuYe <[email protected]>
Date:   Mon Aug 17 16:05:02 2020 +0800

    fix

commit 0460ed3
Author: ZheyuYe <[email protected]>
Date:   Mon Aug 17 15:48:52 2020 +0800

    upload commands
sxjscience pushed a commit that referenced this pull request Aug 22, 2020
* Squashed commit of the following:

commit 7525618
Author: ZheyuYe <[email protected]>
Date:   Fri Aug 21 11:25:38 2020 +0800

    Squashed commit of the following:

    commit d8b68c6
    Author: Xingjian Shi <[email protected]>
    Date:   Thu Aug 20 08:47:56 2020 -0700

        [Numpy] Fix AWS Batch + Add Docker Support (#1302)

    commit 6ae558e
    Author: ht <[email protected]>
    Date:   Thu Aug 20 23:47:30 2020 +0800

        [FEATURE]Horovod support for training transformer (PART 2) (#1301)

commit 1403c6e
Author: ZheyuYe <[email protected]>
Date:   Fri Aug 21 11:15:44 2020 +0800

    update uncased_bert_large

commit 733a4b6
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 20:16:39 2020 +0800

    adjust uncased_bert_large

commit 770f079
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 15:10:57 2020 +0800

    Revert "merge xingjian's"

    This reverts commit ea1f1aa.

commit fe74dda
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 14:07:36 2020 +0800

    update electra small

commit 8972343
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 14:00:57 2020 +0800

    add command to readme

commit 8fcde49
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 12:30:47 2020 +0800

    revise

commit 7a625c4
Author: ZheyuYe <[email protected]>
Date:   Thu Aug 20 12:21:58 2020 +0800

    update reamde

commit 071c6dd
Author: ZheyuYe <[email protected]>
Date:   Wed Aug 19 17:14:53 2020 +0800

    update bert squad command

commit ea1f1aa
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 18:07:01 2020 +0800

    merge xingjian's

commit 859ab4d
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 17:47:01 2020 +0800

    dummy example

commit 633e683
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 17:36:31 2020 +0800

    list_backbone_names

commit b4aac59
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 17:32:51 2020 +0800

    update readme

commit 54301d9
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 13:59:06 2020 +0800

    revise batch squad

commit e019e27
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 13:58:49 2020 +0800

    bash convert

commit e01eda0
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 11:10:51 2020 +0800

    update roberta

commit 1730ff7
Author: ZheyuYe <[email protected]>
Date:   Tue Aug 18 10:15:27 2020 +0800

    revise submit

commit de0b4c9
Author: ZheyuYe <[email protected]>
Date:   Mon Aug 17 16:07:58 2020 +0800

    upload batch files

commit 175de01
Author: ZheyuYe <[email protected]>
Date:   Mon Aug 17 16:05:02 2020 +0800

    fix

commit 0460ed3
Author: ZheyuYe <[email protected]>
Date:   Mon Aug 17 15:48:52 2020 +0800

    upload commands

* add mobilebert

* replace remote

* fix branch

* fix typo

Co-authored-by: Yuma1L <[email protected]>