This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Horovod support for pretraining and fine-tuning SQuAD #1276

Merged 62 commits on Aug 1, 2020

Conversation


@zheyuye zheyuye commented Jul 24, 2020

Description

Thanks @eric-haibin-lin for providing Horovod support in MXNet 2.0; see https://github.com/eric-haibin-lin/horovod/tree/mx2. Based on this, slight modifications were made to support the use of Horovod in the GluonNLP numpy version.
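As a rough illustration of the data-parallel pattern Horovod brings to these scripts (a pure-Python sketch with a hypothetical helper name; the real scripts rely on horovod.mxnet for init/broadcast and a distributed trainer), each worker processes a disjoint strided shard of the samples:

```python
# Minimal sketch of per-worker data sharding under Horovod-style data
# parallelism. `shard_indices` is a hypothetical helper, not the PR's API.

def shard_indices(num_samples, rank, size):
    """Worker `rank` of `size` takes every `size`-th sample, starting at `rank`."""
    return list(range(rank, num_samples, size))

# With 4 workers over 10 samples, the shards are disjoint and together
# cover every sample exactly once per epoch.
shards = [shard_indices(10, r, 4) for r in range(4)]
```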

Changes

  • Refactor the prepare_wikipedia.py script with multiprocessing and upload the 2020-06-20 version to S3
  • Ensure the structural consistency of the pre-processed corpus so that it can be loaded directly into the pre-training script as --data 'pretraining_data/pretraining_data/*.train,pretraining_data/prepared_wikipedia/*.txt'
  • Refactor the pretraining and SQuAD fine-tuning scripts to support Horovod
  • Move layer-wise learning rate decay and the parameter-freezing approach into the backbone model (ELECTRA)
  • Speed up ELECTRA using mx.npx.index_update, added in apache/mxnet#18545
  • Upload gluon_electra_small_owt to S3
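The layer-wise learning rate decay mentioned above can be sketched as follows (the helper name and values are illustrative assumptions, not the PR's actual API or defaults):

```python
def layerwise_lr_mult(num_layers, decay):
    # Layer 0 is the embedding; layer `num_layers` is the top layer.
    # Each layer's learning rate is scaled by decay**(distance from top),
    # so layers nearer the output adapt faster during fine-tuning while
    # lower layers stay close to their pretrained weights.
    return {layer: decay ** (num_layers - layer) for layer in range(num_layers + 1)}

mults = layerwise_lr_mult(num_layers=4, decay=0.8)  # top layer keeps lr * 1.0
```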

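For reference, npx.index_update performs an out-of-place scatter; a pure-Python stand-in for the semantics assumed here (in ELECTRA this kind of update is used, e.g., to write generator samples back into the masked positions):

```python
def index_update(data, ind, val):
    # Out-of-place scatter: copy `data`, then write val[k] at position ind[k].
    # Pure-Python sketch of the assumed semantics of mx.npx.index_update;
    # the original input is left untouched.
    out = list(data)
    for i, v in zip(ind, val):
        out[i] = v
    return out

original = [0, 0, 0, 0]
updated = index_update(original, [1, 3], [7, 9])  # → [0, 7, 0, 9]
```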
Comments

@sxjscience @hymzoque

commit 35a586676036f627bffd0d3c753c6cd0a70d63cf
Author: ZheyuYe <[email protected]>
Date:   Fri Jul 17 10:10:14 2020 +0800

    Squashed commit of the following:

    commit 673344d
    Author: ZheyuYe <[email protected]>
    Date:   Wed Jul 15 22:43:07 2020 +0800

        CharTokenizer

    commit 8dabfd6
    Author: ZheyuYe <[email protected]>
    Date:   Wed Jul 15 15:47:24 2020 +0800

        lowercase

    commit f5c94a6
    Author: ZheyuYe <[email protected]>
    Date:   Tue Jul 14 17:45:28 2020 +0800

        test

    commit dc55fc9
    Author: ZheyuYe <[email protected]>
    Date:   Tue Jul 14 05:45:01 2020 +0800

        tiny update on run_squad

    commit 4defc7a
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jul 13 23:18:08 2020 +0800

        update testings

    commit 2719e81
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jul 13 23:08:32 2020 +0800

        re-upload xlmr

    commit cd0509d
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jul 13 22:30:47 2020 +0800

        fix get_pretrained

    commit 8ed8a72
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jul 13 22:28:13 2020 +0800

        re-upload roberta

    commit 5811d40
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jul 13 18:27:23 2020 +0800

        update

    commit 44a09a3
    Author: ZheyuYe <[email protected]>
    Date:   Sat Jul 11 15:06:33 2020 +0800

        fix

    commit 4074a26
    Author: ZheyuYe <[email protected]>
    Date:   Fri Jul 10 16:08:49 2020 +0800

        inference without horovod

    commit 31cb953
    Author: ZheyuYe <[email protected]>
    Date:   Thu Jul 9 18:41:55 2020 +0800

        update

    commit 838be2a
    Author: ZheyuYe <[email protected]>
    Date:   Thu Jul 9 15:14:39 2020 +0800

        horovod for squad

    commit 1d374a2
    Author: ZheyuYe <[email protected]>
    Date:   Thu Jul 9 12:09:19 2020 +0800

        fix

    commit e4fba39
    Author: ZheyuYe <[email protected]>
    Date:   Thu Jul 9 10:35:08 2020 +0800

        remove multiply_grads

    commit 007f07e
    Author: ZheyuYe <[email protected]>
    Date:   Tue Jul 7 11:26:38 2020 +0800

        multiply_grads

    commit b8c85bb
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jul 6 12:28:56 2020 +0800

        fix ModelForQABasic

    commit 0e13a58
    Author: ZheyuYe <[email protected]>
    Date:   Sat Jul 4 18:42:12 2020 +0800

        clip_grad_global_norm with zeros max_grad_norm

    commit bd270f2
    Author: ZheyuYe <[email protected]>
    Date:   Fri Jul 3 20:21:31 2020 +0800

        fix roberta

    commit 4fc564c
    Author: ZheyuYe <[email protected]>
    Date:   Fri Jul 3 19:36:08 2020 +0800

        update hyper-parameters of adamw

    commit 59cffbf
    Author: ZheyuYe <[email protected]>
    Date:   Fri Jul 3 16:25:46 2020 +0800

        try

    commit a84f782
    Author: ZheyuYe <[email protected]>
    Date:   Thu Jul 2 20:39:03 2020 +0800

        fix mobilebert

    commit 4bc3a96
    Author: ZheyuYe <[email protected]>
    Date:   Thu Jul 2 11:14:39 2020 +0800

        layer-wise decay

    commit 07186d5
    Author: ZheyuYe <[email protected]>
    Date:   Thu Jul 2 02:14:43 2020 +0800

        revise

    commit a5a6475
    Author: ZheyuYe <[email protected]>
    Date:   Wed Jul 1 19:50:20 2020 +0800

        topk

    commit 34ee884
    Author: ZheyuYe <[email protected]>
    Date:   Wed Jul 1 19:25:09 2020 +0800

        index_update

    commit 74178e2
    Author: ZheyuYe <[email protected]>
    Date:   Wed Jul 1 00:48:32 2020 +0800

        rename

    commit fa011aa
    Author: ZheyuYe <[email protected]>
    Date:   Tue Jun 30 23:40:28 2020 +0800

        update

    commit 402d625
    Author: ZheyuYe <[email protected]>
    Date:   Tue Jun 30 21:40:30 2020 +0800

        multiprocessing for wiki

    commit ddbde75
    Author: ZheyuYe <[email protected]>
    Date:   Tue Jun 30 20:41:35 2020 +0800

        fix bookcorpus

    commit 6cc5ccd
    Author: ZheyuYe <[email protected]>
    Date:   Tue Jun 30 16:39:12 2020 +0800

        fix wiki

    commit 9773efd
    Author: ZheyuYe <[email protected]>
    Date:   Tue Jun 30 15:52:13 2020 +0800

        fix openwebtext

    commit 1fb8eb8
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jun 29 19:51:25 2020 +0800

        upload gluon_electra_small_owt

    commit ca83fac
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jun 29 18:09:48 2020 +0800

        revise train_transformer

    commit 1450f5c
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jun 29 18:07:04 2020 +0800

        revise

    commit b460bbe
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jun 29 17:24:00 2020 +0800

        repeat for pretraining

    commit 8ee381b
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jun 29 17:06:43 2020 +0800

        repeat

    commit aea936f
    Author: ZheyuYe <[email protected]>
    Date:   Mon Jun 29 16:39:22 2020 +0800

        fix mobilebert

    commit eead164
    Author: ZheyuYe <[email protected]>
    Date:   Sun Jun 28 18:44:28 2020 +0800

        fix

    commit 8645115
    Author: ZheyuYe <[email protected]>
    Date:   Sun Jun 28 17:27:43 2020 +0800

        update

    commit 2b7f7a3
    Author: ZheyuYe <[email protected]>
    Date:   Sun Jun 28 17:18:00 2020 +0800

        fix roberta

    commit 86702fe
    Author: ZheyuYe <[email protected]>
    Date:   Sun Jun 28 16:27:43 2020 +0800

        use_segmentation

    commit 6d03d7a
    Author: ZheyuYe <[email protected]>
    Date:   Sun Jun 28 15:52:40 2020 +0800

        fix

    commit 5c0ca43
    Author: ZheyuYe <[email protected]>
    Date:   Sun Jun 28 15:49:48 2020 +0800

        fix token_ids

    commit ff7aae8
    Author: ZheyuYe <[email protected]>
    Date:   Sun Jun 28 13:56:07 2020 +0800

        fix xlmr

    commit 2070b86
    Author: ZheyuYe <[email protected]>
    Date:   Sun Jun 28 13:54:26 2020 +0800

        fix roberta

commit 70a1887
Author: Leonard Lausen <[email protected]>
Date:   Fri Jul 17 00:07:08 2020 +0000

    Update for Block API (dmlc#1261)

    - Remove params and prefix arguments for MXNet 2 and update
      parameter sharing implementation
    - Remove Block.name_scope() for MXNet 2
    - Remove self.params.get() and self.params.get_constant()

commit ea9152b
Author: Xingjian Shi <[email protected]>
Date:   Thu Jul 16 15:42:04 2020 -0700

    Fixes to make the CI more stable (dmlc#1265)

    * Some fixes to make the CI more stable

    * add retries

    * Update tokenizers.py

commit a646c34
Author: ht <[email protected]>
Date:   Sun Jul 12 02:49:53 2020 +0800

    [FEATURE] update backtranslation and add multinomial sampler (dmlc#1259)

    * back translation bash

    * split "lang-pair" para in clean_tok_para_corpus

    * added clean_tok_mono_corpus

    * fix

    * add num_process para

    * fix

    * fix

    * add yml

    * rm yml

    * update cfg name

    * update evaluate

    * added max_update / save_interval_update params

    * fix

    * fix

    * multi gpu inference

    * fix

    * update

    * update multi gpu inference

    * fix

    * fix

    * split evaluate and parallel infer

    * fix

    * test

    * fix

    * update

    * add comments

    * fix

    * remove todo comment

    * revert remove todo comment

    * raw lines remove duplicated '\n'

    * update multinomaial sampler

    * fix

    * fix

    * fix

    * fix

    * sampling

    * update script

    * fix

    * add test_case with k > 1 in topk sampling

    * fix multinomial sampler

    * update docs

    * comments situation eos_id = None

    * fix

    Co-authored-by: Hu <[email protected]>

commit 83e1f13
Author: Leonard Lausen <[email protected]>
Date:   Thu Jul 9 20:57:55 2020 -0700

    Use Amazon S3 Transfer Acceleration (dmlc#1260)

commit cd48efd
Author: Leonard Lausen <[email protected]>
Date:   Tue Jul 7 17:39:42 2020 -0700

    Update codecov action to handle different OS and Python versions (dmlc#1254)

    codecov/codecov-action#80 (comment)

commit 689eba9
Author: Sheng Zha <[email protected]>
Date:   Tue Jul 7 09:55:34 2020 -0700

    [CI] AWS batch job tool for GluonNLP (Part I) (dmlc#1251)

    * AWS batch job tool for GluonNLP

    * limit range

    Co-authored-by: Xingjian Shi <[email protected]>

commit e06ff01
Author: Leonard Lausen <[email protected]>
Date:   Tue Jul 7 08:36:24 2020 -0700

    Pin mxnet version range on CI (dmlc#1257)
@codecov

codecov bot commented Jul 24, 2020

Codecov Report

Merging #1276 into numpy will decrease coverage by 0.82%.
The diff coverage is 16.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##            numpy    #1276      +/-   ##
==========================================
- Coverage   84.21%   83.38%   -0.83%     
==========================================
  Files          42       42              
  Lines        6316     6375      +59     
==========================================
- Hits         5319     5316       -3     
- Misses        997     1059      +62     
Impacted Files Coverage Δ
src/gluonnlp/models/albert.py 95.47% <ø> (ø)
src/gluonnlp/models/bert.py 94.86% <ø> (ø)
src/gluonnlp/models/mobilebert.py 87.94% <ø> (+0.22%) ⬆️
src/gluonnlp/op.py 60.00% <0.00%> (ø)
src/gluonnlp/utils/parameter.py 95.45% <ø> (ø)
src/gluonnlp/utils/misc.py 43.40% <5.71%> (-4.79%) ⬇️
src/gluonnlp/models/electra.py 77.09% <17.85%> (-3.69%) ⬇️
src/gluonnlp/data/tokenizers.py 77.69% <100.00%> (ø)
src/gluonnlp/data/loading.py 78.11% <0.00%> (-5.29%) ⬇️
... and 1 more

@sxjscience sxjscience merged commit 1f9ad44 into dmlc:numpy Aug 1, 2020
@sxjscience

@szhengac @ZiyueHuang This is merged and you may try pretraining with Electra + Horovod

@zheyuye zheyuye deleted the horovod branch August 12, 2020 10:43
@@ -496,13 +494,17 @@ def dynamic_masking(self, F, input_ids, valid_lengths):
valid_candidates = valid_candidates.astype(np.float32)
num_masked_position = F.np.maximum(
1, F.np.minimum(N, round(valid_lengths * self._mask_prob)))
# The categorical distribution takes normalized probabilities as input
# softmax is used here instead of log_softmax
Member Author


log(sample_probs) instead of sample_probs should be used for npx.topk
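The point of this comment: when masked positions are drawn via npx.topk over Gumbel-perturbed scores, the noise must be added to log-probabilities, not raw probabilities, for the draws to follow the categorical distribution. A small pure-Python check of the Gumbel-top-k trick (a standalone sketch, not the PR's code):

```python
import math
import random

def gumbel_topk(probs, k, rng):
    # Gumbel-top-k: perturb log(p_i) with Gumbel noise and take the k largest.
    # Adding the noise to raw probs instead of log-probs would bias the sample.
    scores = [math.log(p) - math.log(-math.log(rng.random())) for p in probs]
    return sorted(range(len(probs)), key=lambda i: -scores[i])[:k]

# Empirically, single draws match the target categorical distribution.
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(30000):
    counts[gumbel_topk([0.2, 0.3, 0.5], 1, rng)[0]] += 1
freqs = [c / 30000 for c in counts]
```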
