This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Revert "Improve cached_op performance for static mode" #14868

Merged (1 commit) on May 3, 2019


anirudhacharya (Member) commented on May 3, 2019:

Reverts #14785

This revert fixes #14864 and fixes dmlc/gluon-nlp#690.

Commit 369b66d caused a regression in BERT model training.

As the logs below show, that commit caused the final evaluation nsp_acc to drop from 100.0 to 55.2, and mlm_acc to drop from 93.2 to 2.5.

Commit 369b66d0f10ba479ce96f78f7c838bd7bc41d951 (shows the regression):

INFO:root:[step 1]	mlm_loss=1.65551	mlm_acc=48.11490	nsp_loss=0.43527	nsp_acc=81.250	throughput=1.6K tks/s	lr=0.0000020 time=2.32, latency=1161.4 ms/batch
INFO:root:[step 3]	mlm_loss=7.84039	mlm_acc=2.22965	nsp_loss=0.73171	nsp_acc=43.103	throughput=2.2K tks/s	lr=0.0000060 time=2.73, latency=1364.9 ms/batch
INFO:root:[step 5]	mlm_loss=7.80324	mlm_acc=2.68692	nsp_loss=0.73161	nsp_acc=42.308	throughput=2.4K tks/s	lr=0.0000100 time=2.43, latency=1217.0 ms/batch
INFO:root:[step 7]	mlm_loss=7.69260	mlm_acc=1.55763	nsp_loss=0.71501	nsp_acc=46.552	throughput=2.4K tks/s	lr=0.0000140 time=2.64, latency=1320.8 ms/batch
INFO:root:[step 9]	mlm_loss=7.72376	mlm_acc=2.29167	nsp_loss=0.73156	nsp_acc=37.931	throughput=2.4K tks/s	lr=0.0000180 time=2.70, latency=1350.2 ms/batch
INFO:root:[step 11]	mlm_loss=7.62214	mlm_acc=2.19436	nsp_loss=0.69882	nsp_acc=51.724	throughput=2.4K tks/s	lr=0.0000090 time=2.65, latency=1322.5 ms/batch
INFO:root:[step 13]	mlm_loss=7.49625	mlm_acc=2.46781	nsp_loss=0.72365	nsp_acc=43.103	throughput=2.3K tks/s	lr=0.0000070 time=2.70, latency=1347.8 ms/batch
INFO:root:[step 15]	mlm_loss=7.47410	mlm_acc=2.18424	nsp_loss=0.71855	nsp_acc=39.062	throughput=2.4K tks/s	lr=0.0000050 time=2.92, latency=1458.7 ms/batch
INFO:root:[step 17]	mlm_loss=7.30681	mlm_acc=2.56674	nsp_loss=0.68619	nsp_acc=53.448	throughput=2.4K tks/s	lr=0.0000030 time=2.70, latency=1348.9 ms/batch
INFO:root:[step 19]	mlm_loss=7.61227	mlm_acc=1.75824	nsp_loss=0.71591	nsp_acc=44.828	throughput=2.2K tks/s	lr=0.0000010 time=2.76, latency=1380.5 ms/batch
INFO:root:[step 20] Saving checkpoints to ckpt/0000020.params, ckpt/0000020.states.
INFO:root:Train cost=45.5s
INFO:root:Using evaluation data at out/*.npz
INFO:root:[step 1]	mlm_loss=3.74667	mlm_acc=1.51515	nsp_loss=0.35332	nsp_acc=25.000	throughput=2.9K tks/s	lr=0.0000000 time=0.30, latency=149.5 ms/batch
INFO:root:[step 3]	mlm_loss=7.30128	mlm_acc=3.28467	nsp_loss=0.68692	nsp_acc=62.500	throughput=5.0K tks/s	lr=0.0000000 time=0.37, latency=185.7 ms/batch
INFO:root:[step 5]	mlm_loss=7.55211	mlm_acc=2.85714	nsp_loss=0.67706	nsp_acc=81.250	throughput=5.0K tks/s	lr=0.0000000 time=0.33, latency=162.8 ms/batch
INFO:root:[step 7]	mlm_loss=7.07615	mlm_acc=2.29008	nsp_loss=0.69678	nsp_acc=43.750	throughput=5.4K tks/s	lr=0.0000000 time=0.32, latency=161.2 ms/batch
INFO:root:mlm_loss=6.419	mlm_acc=2.5	nsp_loss=0.604	nsp_acc=55.2	
INFO:root:Eval cost=1.4s

Commit 5dd9fa27d8bdd2a8677b7c275a494d17082c0e1c (no regression):

INFO:root:[step 1]	mlm_loss=1.65551	mlm_acc=48.11490	nsp_loss=0.43527	nsp_acc=81.250	throughput=1.6K tks/s	lr=0.0000020 time=2.33, latency=1166.0 ms/batch
INFO:root:[step 3]	mlm_loss=3.35410	mlm_acc=47.38016	nsp_loss=0.70400	nsp_acc=84.483	throughput=2.2K tks/s	lr=0.0000060 time=2.76, latency=1379.8 ms/batch
INFO:root:[step 5]	mlm_loss=2.86958	mlm_acc=51.75234	nsp_loss=0.03236	nsp_acc=100.000	throughput=2.3K tks/s	lr=0.0000100 time=2.49, latency=1246.8 ms/batch
INFO:root:[step 7]	mlm_loss=2.53454	mlm_acc=57.21703	nsp_loss=0.14932	nsp_acc=94.828	throughput=2.3K tks/s	lr=0.0000140 time=2.76, latency=1380.5 ms/batch
INFO:root:[step 9]	mlm_loss=2.13252	mlm_acc=63.02083	nsp_loss=0.03085	nsp_acc=98.276	throughput=2.3K tks/s	lr=0.0000180 time=2.79, latency=1396.6 ms/batch
INFO:root:[step 11]	mlm_loss=1.36580	mlm_acc=74.39916	nsp_loss=0.00306	nsp_acc=100.000	throughput=2.3K tks/s	lr=0.0000090 time=2.75, latency=1372.9 ms/batch
INFO:root:[step 13]	mlm_loss=1.00501	mlm_acc=80.79399	nsp_loss=0.00274	nsp_acc=100.000	throughput=2.2K tks/s	lr=0.0000070 time=2.78, latency=1392.1 ms/batch
INFO:root:[step 15]	mlm_loss=0.82224	mlm_acc=83.28585	nsp_loss=0.00181	nsp_acc=100.000	throughput=2.3K tks/s	lr=0.0000050 time=3.04, latency=1520.9 ms/batch
INFO:root:[step 17]	mlm_loss=0.54528	mlm_acc=89.11704	nsp_loss=0.00083	nsp_acc=100.000	throughput=2.3K tks/s	lr=0.0000030 time=2.79, latency=1396.3 ms/batch
INFO:root:[step 19]	mlm_loss=0.53212	mlm_acc=88.90110	nsp_loss=0.00087	nsp_acc=100.000	throughput=2.2K tks/s	lr=0.0000010 time=2.76, latency=1379.5 ms/batch
INFO:root:[step 20] Saving checkpoints to ckpt/0000020.params, ckpt/0000020.states.
INFO:root:Train cost=46.3s
INFO:root:Using evaluation data at out/*.npz
INFO:root:[step 1]	mlm_loss=0.08297	mlm_acc=97.72727	nsp_loss=0.00008	nsp_acc=100.000	throughput=2.9K tks/s	lr=0.0000000 time=0.30, latency=150.7 ms/batch
INFO:root:[step 3]	mlm_loss=0.34548	mlm_acc=93.06569	nsp_loss=0.00016	nsp_acc=100.000	throughput=5.1K tks/s	lr=0.0000000 time=0.36, latency=180.2 ms/batch
INFO:root:[step 5]	mlm_loss=0.34622	mlm_acc=92.24490	nsp_loss=0.00068	nsp_acc=100.000	throughput=5.0K tks/s	lr=0.0000000 time=0.33, latency=162.9 ms/batch
INFO:root:[step 7]	mlm_loss=0.40680	mlm_acc=92.36641	nsp_loss=0.00018	nsp_acc=100.000	throughput=5.4K tks/s	lr=0.0000000 time=0.32, latency=161.7 ms/batch
INFO:root:mlm_loss=0.295	mlm_acc=93.2	nsp_loss=0.000	nsp_acc=100.0	
INFO:root:Eval cost=1.4s

I think it would be best to revert #14785 for now, then revisit the original PR and fix it.
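
For anyone who wants to check a run for this regression without reading the logs by eye, here is a minimal sketch that parses the final evaluation summary lines from the two runs above and flags accuracy metrics that dropped. The helper names (`parse_metrics`, `find_regressions`) and the 1-point threshold are hypothetical choices for illustration; they are not part of the BERT training script or of MXNet.

```python
import re

# Matches key=value pairs such as "nsp_acc=55.2" in a log line.
METRIC_RE = re.compile(r"(\w+)=([-+]?\d+(?:\.\d+)?)")


def parse_metrics(line):
    """Return a dict of metric name -> float for one log line (hypothetical helper)."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


def find_regressions(baseline, candidate, max_drop=1.0):
    """Return accuracy metrics that fell by more than max_drop points.

    The 1.0-point default is an arbitrary threshold chosen for this sketch.
    """
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name.endswith("_acc")
        and name in candidate
        and baseline[name] - candidate[name] > max_drop
    }


# Final evaluation summary lines copied from the two runs above.
good = parse_metrics("INFO:root:mlm_loss=0.295\tmlm_acc=93.2\tnsp_loss=0.000\tnsp_acc=100.0")
bad = parse_metrics("INFO:root:mlm_loss=6.419\tmlm_acc=2.5\tnsp_loss=0.604\tnsp_acc=55.2")

regressions = find_regressions(good, bad)
for name, (before, after) in regressions.items():
    print(f"{name}: {before} -> {after}")
```

The same comparison could be applied to any pair of commits by pasting in their summary lines, which may help when narrowing down which change in the original PR introduced the drop.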

eric-haibin-lin (Member) commented:

cc @ZhennanQin

szha merged commit 204f3f2 into apache:master on May 3, 2019
anirudhacharya deleted the revert-14785-static_cached_op branch on May 3, 2019 17:42
eric-haibin-lin (Member) commented:

@ZhennanQin could you look into why this PR introduces the regression?

access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this pull request May 14, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019