[1.x / 1.8] Regression in runtime fusion #19316

leezu · 2020-10-08T19:53:26Z

In MXNet 1.8, XLNet model fails to run due to bug in runtime fusion:

 Reshape_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_kernel.cu(1084): Error: Formal parameter space overflowed (4648 bytes required, max 4096 bytes allowed) in function

dmlc/gluon-nlp#1230 (comment)

The issue persists with the latest nightly builds of the 1.x branch, ie applies to the 1.8 release. See dmlc/gluon-nlp#1384

The text was updated successfully, but these errors were encountered:

ptrendx · 2020-10-08T20:16:29Z

Well, I kind of agree it is a fusion bug, but this specific failure should also be tackled on the Gluon-NLP side. Having so many expand_dims on 1 thing suggests at least a performance bug in there.

leezu · 2020-10-08T20:45:58Z

The concern here is that the regression breaks models that worked without problem.

ptrendx · 2020-10-08T20:46:45Z

But is it actually a regression? Do 1.6/1.7 work with the same model? I don't think they would work.

leezu · 2020-10-08T20:51:20Z

Yes, this works in 1.6. There are separate integration tests for 1.6 and nightly build in GluonNLP and the 1.6 tests pass. You can refer to dmlc/gluon-nlp#1237 (comment) which introduced a version of MXNet containing the regression. I confirmed that the regression is still present in latest 1.x branch via dmlc/gluon-nlp#1384

ptrendx · 2020-10-08T20:59:21Z

Ok, so this suggests it broke in 1.7 then. It is quite strange though since if you look at https://github.com/apache/incubator-mxnet/commits/1.8.0.rc1/src/operator/fusion there are only a few commits that went into fusion between 1.6 and 1.7 and none of them looks like something that could affect this.

I will look into this.

leezu · 2020-10-08T21:09:11Z

Thank you. You may be right that it's an edge case. Btw, more evidence that this works in 1.6 is that you actually fixed a performance bug based on the GluonNLP XLNet model in 1.6: #17105

ptrendx · 2020-10-08T22:08:15Z

Well, that is actually not really an evidence because I am using that exact model (from GluonNLP 0.9.0) for my performance experiments in the current PR (that said, I'm using NVIDIA container, but 1.8rc1 is merged there and I don't believe there are any differences in fusion between NVIDIA version and upstream) and it does not show that issue. So my assumption is that this model changed in the meantime.

leezu · 2020-10-08T22:26:56Z

Any change would be verified by the GluonNLP CI. I believe fusion is activated by default in 1.6, thus the CI should test fusion?

These are the outputs of the CI for MXNet 1.6 and 1.x:

[2020-09-11T00:38:06.555Z] scripts/tests/test_scripts.py::test_xlnet_finetune_glue[MRPC] 2020-09-11 00:38:06,516 - root - INFO - Namespace(accumulate=None, attention_dropout=0.1, batch_size=32, dataset='126gb', dev_batch_size=32, dropout=0.1, early_stop=None, epochs=1, epsilon=1e-06, gpu=1, log_interval=10, lr=3e-05, lr_decay='linear', max_len=32, model_name='xlnet_cased_l12_h768_a12', model_parameters=None, only_inference=False, optimizer='Adam', output_dir='./output_dir', round_to=None, seed=2, task_name='MRPC', training_steps=None, warmup_ratio=0)
[2020-09-11T00:38:07.115Z] 2020-09-11 00:38:06,961 - root - INFO - Model file not found. Downloading to tests/data/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.params.
[2020-09-11T00:38:29.018Z] 2020-09-11 00:38:26,798 - root - INFO - processing dataset...
[2020-09-11T00:38:29.275Z] Vocab file is not found. Downloading.
[2020-09-11T00:38:29.275Z] Downloading tests/data/models/4075451053579957634/4075451053579957634_xlnet_126gb-0d744903.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/xlnet_126gb-0d744903.zip...
[2020-09-11T00:38:29.275Z] Downloading tests/data/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.zip5526533c-632d-4b5b-9a75-fed8afc36bbc from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.zip...
[2020-09-11T00:38:29.275Z] Tokenizer file is not found. Downloading.
[2020-09-11T00:38:29.275Z] Downloading tests/data/models/1599784706.6524212xlnet_126gb-871f0b3c.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/xlnet_126gb-871f0b3c.zip...
[2020-09-11T00:38:29.275Z] Downloading tests/data/datasets/glue_mrpc/msr_paraphrase_train.txt from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt...
[2020-09-11T00:38:29.275Z] Downloading tests/data/datasets/glue_mrpc/dev_ids.tsv from https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc...
[2020-09-11T00:38:31.168Z] Downloading tests/data/datasets/glue_mrpc/msr_paraphrase_test.txt from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt...
[2020-09-11T00:38:31.168Z] 2020-09-11 00:38:30,826 - root - INFO - Now we are doing XLNet classification training on [gpu(0)]!
[2020-09-11T00:38:31.168Z] 2020-09-11 00:38:30,835 - root - INFO - training steps=114
[2020-09-11T00:38:37.715Z] 2020-09-11 00:38:36,772 - root - INFO - Time cost for the first forward-backward =5.93s
[2020-09-11T00:38:39.607Z] 2020-09-11 00:38:39,249 - root - INFO - [Epoch 1 Batch 10/121] loss=0.6671, lr=0.0000276
[2020-09-11T00:38:42.130Z] 2020-09-11 00:38:42,023 - root - INFO - [Epoch 1 Batch 20/121] loss=0.6284, lr=0.0000250
[2020-09-11T00:38:45.403Z] 2020-09-11 00:38:44,803 - root - INFO - [Epoch 1 Batch 30/121] loss=0.6076, lr=0.0000224
[2020-09-11T00:38:47.962Z] 2020-09-11 00:38:47,583 - root - INFO - [Epoch 1 Batch 40/121] loss=0.6318, lr=0.0000197
[2020-09-11T00:38:50.483Z] 2020-09-11 00:38:50,242 - root - INFO - [Epoch 1 Batch 50/121] loss=0.6222, lr=0.0000171
[2020-09-11T00:38:53.006Z] 2020-09-11 00:38:52,900 - root - INFO - [Epoch 1 Batch 60/121] loss=0.5814, lr=0.0000145
[2020-09-11T00:38:55.529Z] 2020-09-11 00:38:55,232 - root - INFO - [Epoch 1 Batch 70/121] loss=0.6682, lr=0.0000118
[2020-09-11T00:38:58.051Z] 2020-09-11 00:38:57,597 - root - INFO - [Epoch 1 Batch 80/121] loss=0.6621, lr=0.0000092
[2020-09-11T00:39:00.573Z] 2020-09-11 00:39:00,412 - root - INFO - [Epoch 1 Batch 90/121] loss=0.5737, lr=0.0000066
[2020-09-11T00:39:03.847Z] 2020-09-11 00:39:03,219 - root - INFO - [Epoch 1 Batch 100/121] loss=0.5999, lr=0.0000039
[2020-09-11T00:39:06.371Z] 2020-09-11 00:39:05,890 - root - INFO - [Epoch 1 Batch 110/121] loss=0.5980, lr=0.0000013
[2020-09-11T00:39:07.301Z] 2020-09-11 00:39:07,028 - root - INFO - Finish training step: 114
[2020-09-11T00:39:07.301Z] 2020-09-11 00:39:07,028 - root - INFO - Now we are doing evaluation on dev with [gpu(0)].
[2020-09-11T00:39:08.232Z] 2020-09-11 00:39:08,011 - root - INFO - [Batch 10/13] loss=0.5612
[2020-09-11T00:39:08.489Z] 2020-09-11 00:39:08,289 - root - INFO - validation metrics:accuracy:0.7108,f1:0.8168
[2020-09-11T00:39:08.489Z] 2020-09-11 00:39:08,289 - root - INFO - Time cost=1.26s, throughput=330.05 samples/s
[2020-09-11T00:39:09.419Z] 2020-09-11 00:39:09,069 - root - INFO - params saved in: ./output_dir/model_xlnet_MRPC_0.params
[2020-09-11T00:39:09.419Z] 2020-09-11 00:39:09,069 - root - INFO - Time cost=38.23s
[2020-09-11T00:39:09.419Z] 2020-09-11 00:39:09,358 - root - INFO - Best model at epoch 1. Validation metrics:accuracy:0.7108,f1:0.8168
[2020-09-11T00:39:09.419Z] 2020-09-11 00:39:09,358 - root - INFO - Now we are doing testing on test with [gpu(0)].
[2020-09-11T00:39:15.982Z] 2020-09-11 00:39:15,329 - root - INFO - Time cost=5.97s, throughput=289.43 samples/s
[2020-09-11T00:39:15.982Z] finish test!

[2020-09-11T00:31:37.908Z] scripts/tests/test_scripts.py::test_xlnet_finetune_glue[MRPC] 2020-09-11 00:31:37,879 - root - INFO - Namespace(accumulate=None, attention_dropout=0.1, batch_size=32, dataset='126gb', dev_batch_size=32, dropout=0.1, early_stop=None, epochs=1, epsilon=1e-06, gpu=1, log_interval=10, lr=3e-05, lr_decay='linear', max_len=32, model_name='xlnet_cased_l12_h768_a12', model_parameters=None, only_inference=False, optimizer='Adam', output_dir='./output_dir', round_to=None, seed=2, task_name='MRPC', training_steps=None, warmup_ratio=0)
[2020-09-11T00:31:38.471Z] 2020-09-11 00:31:38,371 - root - INFO - Model file not found. Downloading to tests/data/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.params.
[2020-09-11T00:32:00.375Z] 2020-09-11 00:31:58,703 - root - INFO - processing dataset...
[2020-09-11T00:32:01.304Z] Vocab file is not found. Downloading.
[2020-09-11T00:32:01.304Z] Downloading tests/data/models/7866557629997205756/7866557629997205756_xlnet_126gb-0d744903.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/xlnet_126gb-0d744903.zip...
[2020-09-11T00:32:01.304Z] Downloading tests/data/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.zip33c867d1-f83e-447d-8253-6504231ddccb from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.zip...
[2020-09-11T00:32:01.304Z] Tokenizer file is not found. Downloading.
[2020-09-11T00:32:01.304Z] Downloading tests/data/models/1599784318.5417857xlnet_126gb-871f0b3c.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/xlnet_126gb-871f0b3c.zip...
[2020-09-11T00:32:01.304Z] Downloading tests/data/datasets/glue_mrpc/msr_paraphrase_train.txt from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt...
[2020-09-11T00:32:01.304Z] Downloading tests/data/datasets/glue_mrpc/dev_ids.tsv from https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc...
[2020-09-11T00:32:02.675Z] Downloading tests/data/datasets/glue_mrpc/msr_paraphrase_test.txt from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt...
[2020-09-11T00:32:02.675Z] 2020-09-11 00:32:02,673 - root - INFO - Now we are doing XLNet classification training on [gpu(0)]!
[2020-09-11T00:32:02.932Z] 2020-09-11 00:32:02,682 - root - INFO - training steps=114
[2020-09-11T00:32:08.187Z] Traceback (most recent call last):
[2020-09-11T00:32:08.187Z]   File "./scripts/language_model/run_glue.py", line 658, in <module>
[2020-09-11T00:32:08.187Z]     train(task.metrics)
[2020-09-11T00:32:08.187Z]   File "./scripts/language_model/run_glue.py", line 551, in train
[2020-09-11T00:32:08.187Z]     nlp.utils.clip_grad_global_norm(params, 1)
[2020-09-11T00:32:08.187Z]   File "/var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master/src/gluonnlp/utils/parameter.py", line 147, in clip_grad_global_norm
[2020-09-11T00:32:08.187Z]     if is_finite != 1:
[2020-09-11T00:32:08.187Z]   File "/var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master/conda/gpu/py3-master/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 422, in __bool__
[2020-09-11T00:32:08.187Z]     return bool(self.asscalar())
[2020-09-11T00:32:08.187Z]   File "/var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master/conda/gpu/py3-master/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 2585, in asscalar
[2020-09-11T00:32:08.187Z]     return self.asnumpy()[0]
[2020-09-11T00:32:08.187Z]   File "/var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master/conda/gpu/py3-master/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 2566, in asnumpy
[2020-09-11T00:32:08.187Z]     ctypes.c_size_t(data.size)))
[2020-09-11T00:32:08.187Z]   File "/var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master/conda/gpu/py3-master/lib/python3.5/site-packages/mxnet/base.py", line 246, in check_call
[2020-09-11T00:32:08.187Z]     raise get_last_ffi_error()
[2020-09-11T00:32:08.187Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2020-09-11T00:32:08.187Z]   File "src/operator/fusion/fused_op.cu", line 647
[2020-09-11T00:32:08.187Z] MXNetError: Check failed: compileResult == NVRTC_SUCCESS (6 vs. 0) : NVRTC Compilation failed. Please set environment variable MXNET_USE_FUSION to 0.
[2020-09-11T00:32:08.187Z] Reshape_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_kernel.cu(792): warning: variable "ndim_output143" was declared but never referenced

ptrendx · 2020-11-03T17:05:27Z

@leezu I'm working on this now. There is something fishy going on here - I tried reproducing it by compiling v1.8.x branch and the test passed. I also tried the pip package that you tried in your PR in GluonNLP, and that one failed. I will try to figure out where does the difference come from.

ptrendx · 2020-11-05T21:48:21Z

Update: the code generation itself seems to be working fine - it is the subgraph it gets that is wrong. I'm not yet sure if this is fault of some bug in common subexpression elimination (since with export MXNET_ELIMINATE_COMMON_EXPR=0 the bug does not repro) or with the fusion graph pass.

ptrendx · 2020-11-06T17:00:56Z

Another update - the problem comes from the eliminate common expressions. In the model there seem to be 12 reshapes of the same input, each followed by 12 expand_dims. In the correct execution, they all get squashed to a single reshape followed by expand_dims, but in the wrong execution the reshapes get squashed into 1, but the (12*12=)144 expand_dims do not, resulting in a fused operator with 144 outputs.

I'm still trying to understand why there is this discrepancy in execution.

ptrendx · 2020-11-06T19:26:57Z

Ok, that was pretty stupid. The reason why expand_dims were not squashed in the official pip wheel for 1.7 and 1.8 is because common expression elimination does not currently try to squash operators that use any ResourceRequest (even though it could if the request is kTempSpace) and with MKLDNN expand_dims actually requests kTempSpace. 1.6 had mkldnn off by default (so worked), but 1.7 and 1.8 have it on by default in the official pip wheels (and it is off in our container, so when I compiled it in that environment I did not repro the problem).

I will make a PR shortly to allow eliminating expressions with tempspace to unblock upgrading GluonNLP to 1.x, then will think about the changes to code generation of fusion, so that this error is not triggered. Generally speaking the problem comes from the fact that we send all the inputs/outputs shapes to the fused kernel even if they are not needed, which greatly pollutes the argument space. One can only use 4096B of arguments for the kernel, so without shapes a fused kernel could have up to 512 inputs/outputs, but having shapes greatly reduces this number (to less than 144 as seen in this bug ;-) ).

leezu · 2020-11-07T03:30:22Z

Thank you @ptrendx!

leezu added the Bug label Oct 8, 2020

leezu assigned ptrendx Oct 8, 2020

This was referenced Nov 6, 2020

[FEATURE]Allow eliminating common subexpressions when temp space is used #19486

Merged

[1.x, FEATURE] Allow eliminating common subexpressions when temp space is used #19487

Merged

leezu closed this as completed Nov 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[1.x / 1.8] Regression in runtime fusion #19316

[1.x / 1.8] Regression in runtime fusion #19316

leezu commented Oct 8, 2020

ptrendx commented Oct 8, 2020

leezu commented Oct 8, 2020

ptrendx commented Oct 8, 2020 •

edited

Loading

leezu commented Oct 8, 2020

ptrendx commented Oct 8, 2020

leezu commented Oct 8, 2020

ptrendx commented Oct 8, 2020

leezu commented Oct 8, 2020

ptrendx commented Nov 3, 2020

ptrendx commented Nov 5, 2020

ptrendx commented Nov 6, 2020

ptrendx commented Nov 6, 2020 •

edited

Loading

leezu commented Nov 7, 2020

[1.x / 1.8] Regression in runtime fusion #19316

[1.x / 1.8] Regression in runtime fusion #19316

Comments

leezu commented Oct 8, 2020

ptrendx commented Oct 8, 2020

leezu commented Oct 8, 2020

ptrendx commented Oct 8, 2020 • edited Loading

leezu commented Oct 8, 2020

ptrendx commented Oct 8, 2020

leezu commented Oct 8, 2020

ptrendx commented Oct 8, 2020

leezu commented Oct 8, 2020

ptrendx commented Nov 3, 2020

ptrendx commented Nov 5, 2020

ptrendx commented Nov 6, 2020

ptrendx commented Nov 6, 2020 • edited Loading

leezu commented Nov 7, 2020

ptrendx commented Oct 8, 2020 •

edited

Loading

ptrendx commented Nov 6, 2020 •

edited

Loading