[MXNet] - [BERT] #690

Closed
araitats opened this issue May 2, 2019 · 9 comments · Fixed by apache/mxnet#14868

Comments


araitats commented May 2, 2019

There is a problem with custom BERT model training on later nightly builds of MXNet 1.5.0 (observed with cu90).
mlm_loss stalls around 7.2X and nsp_acc stops around 54.
The last viable mxnet-cu90 version is 1.5.0b20190425.
1.5.0b20190426 onward has this issue, so you cannot train a custom BERT model with the latest version of MXNet right now.

I assume there was a change in optimization between April 25th and 26th.

I used the latest version of gluonnlp for the following test, so I don't think the problem is with gluonnlp (0.6.0).
(i.e. pip install https://github.com/dmlc/gluon-nlp/tarball/master )
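
For anyone trying to reproduce this, the two nightly builds that bracket the regression can be pinned explicitly (assuming the nightly wheels are still available on the pip index):

pip install mxnet-cu90==1.5.0b20190425    # last known-good nightly
pip install mxnet-cu90==1.5.0b20190426    # first nightly showing the regression
pip install https://github.com/dmlc/gluon-nlp/tarball/master    # gluonnlp master, as above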

With mxnet-cu90==1.5.0b20190425 (This is working)

(mxnet_p36_updated_4) sh-4.2$ python run_pretraining.py --gpus 0,1,2,3,4,5,6,7 --batch_size 8 --lr 1e-3 --data "out_mcg_test-big/part-001.npz" --warmup_ratio 0.01 --num_steps 1000000 --log_interval=250 --data_eval "out_mcg_test-big/part-001.npz" --batch_size_eval 8 --ckpt_dir ckpt --ckpt_interval 25000 --accumulate 4 --num_buckets 10 --dtype float16
INFO:root:Namespace(accumulate=4, batch_size=8, batch_size_eval=8, by_token=False, ckpt_dir='ckpt', ckpt_interval=25000, data='out_mcg_test-big/part-001.npz', data_eval='out_mcg_test-big/part-001.npz', dataset_name='book_corpus_wiki_en_uncased', dtype='float16', eval_only=False, gpus='0,1,2,3,4,5,6,7', kvstore='device', log_interval=250, lr=0.001, model='bert_12_768_12', num_buckets=10, num_steps=1000000, pretrained=False, profile=False, seed=0, start_step=0, verbose=False, warmup_ratio=0.01)
INFO:root:Using training data at out_mcg_test-big/part-001.npz 
[20:42:24] src/kvstore/././comm_tree.h:356: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off 
[20:42:24] src/kvstore/././comm_tree.h:365: .vvvv... 
[20:42:24] src/kvstore/././comm_tree.h:365: v.vv.v.. 
[20:42:24] src/kvstore/././comm_tree.h:365: vv.v..v. 
[20:42:24] src/kvstore/././comm_tree.h:365: vvv....v 
... 
[20:42:24] src/kvstore/./././gpu_topology.h:216: cudaDeviceGetP2PAttribute incorrect. Falling back to cudaDeviceEnablePeerAccess for topology detection 
[20:42:24] src/kvstore/././comm_tree.h:380: Using Kernighan-Lin to generate trees 
[20:42:24] src/kvstore/././comm_tree.h:391: Using Tree 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 2 occurs 1 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 768 occurs 114 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 1536 occurs 2 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 3072 occurs 12 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 30522 occurs 1 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 393216 occurs 1 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 589824 occurs 50 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 2359296 occurs 24 times 
[20:42:25] src/kvstore/././comm_tree.h:488: Size 23440896 occurs 1 times 
INFO:root:[step 249] mlm_loss=8.02087 mlm_acc=6.88167 nsp_loss=0.69021 nsp_acc=53.343 throughput=24.2K tks/s lr=0.0000249 time=315.28 
INFO:root:[step 499] mlm_loss=6.85134 mlm_acc=11.51758 nsp_loss=0.65648 nsp_acc=60.298 throughput=57.7K tks/s lr=0.0000499 time=133.49 
INFO:root:[step 749] mlm_loss=6.60548 mlm_acc=13.98383 nsp_loss=0.58539 nsp_acc=67.169 throughput=57.7K tks/s lr=0.0000749 time=133.54 

With mxnet-cu90==1.5.0b20190426 (This is not working)

# (same command and Namespace output as above)
INFO:root:[step 249]    mlm_loss=nan    mlm_acc=4.56305 nsp_loss=nan    nsp_acc=54.454  throughput=23.7K tks/s  lr=0.0000249 time=321.78
INFO:root:[step 499]    mlm_loss=7.27492        mlm_acc=5.76089 nsp_loss=0.68847        nsp_acc=54.719  throughput=57.4K tks/s     lr=0.0000499 time=134.22
INFO:root:[step 749]    mlm_loss=7.26470        mlm_acc=5.82224 nsp_loss=0.68894        nsp_acc=54.428  throughput=57.3K tks/s     lr=0.0000749 time=134.40
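
Since the failing run uses --dtype float16, one untested sanity check would be to rerun the exact same command with --dtype float32 to see whether the NaN is specific to the fp16 path:

python run_pretraining.py --gpus 0,1,2,3,4,5,6,7 --batch_size 8 --lr 1e-3 --data "out_mcg_test-big/part-001.npz" --warmup_ratio 0.01 --num_steps 1000000 --log_interval=250 --data_eval "out_mcg_test-big/part-001.npz" --batch_size_eval 8 --ckpt_dir ckpt --ckpt_interval 25000 --accumulate 4 --num_buckets 10 --dtype float32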

araitats commented May 2, 2019

apache/mxnet#14864

szha added the bug label on May 2, 2019

araitats commented May 2, 2019

I used run_pretraining.py from...
http://gluon-nlp.mxnet.io/_downloads/bert.zip


tlby commented May 3, 2019

I believe this is the same issue I've been seeing; I bisected it to apache/mxnet@369b66d0f, from apache/mxnet#14785.
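
For reference, a bisection like this can be driven with git bisect, assuming a local apache/mxnet checkout and a hypothetical rebuild_and_test.sh script that rebuilds MXNet, runs the example, and exits non-zero when the regression appears:

git bisect start
git bisect bad <first-bad-commit>     # e.g. the commit behind the 20190426 nightly
git bisect good <last-good-commit>    # e.g. the commit behind the 20190425 nightly
git bisect run ./rebuild_and_test.sh  # hypothetical helper: rebuild + run the repro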

eric-haibin-lin (Member) commented

@tlby thanks for reporting. Was your issue also about running the pre-training script?


tlby commented May 4, 2019

I was testing with the finetuning example. It looks like MXNet reverted the PR in question so I think things are good now.

eric-haibin-lin (Member) commented

@tlby thanks!
BTW, I'm adding some enhancements to the finetuning script in #692; feel free to comment on it.

ZhennanQin commented

@eric-haibin-lin @tlby I'm trying to fix the problematic PR in MXNet, but I failed to set up the BERT example to reproduce the regression. The reported command line doesn't work for me because out_mcg_test-big/part-001.npz is not found. I haven't tried the gluon-nlp BERT example before; may I know the minimal command line to reproduce this issue on top of MXNet and gluon-nlp only? Thanks in advance.

eric-haibin-lin (Member) commented

@ZhennanQin are you able to reproduce apache/mxnet#14868 with

python run_pretraining.py --gpus 0 --batch_size 32 --lr 2e-5 --data 'out/*.npz' --warmup_ratio 0.5 --num_steps 20 --pretrained --log_interval=2 --data_eval 'out/*.npz' --batch_size_eval 8 --ckpt_dir ckpt

from http://gluon-nlp.mxnet.io/model_zoo/bert/index.html#bert-pre-training ?

ZhennanQin commented

@eric-haibin-lin This command works, thanks. The fixed PR has been filed again as apache/mxnet#14931; please take a look.
