Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[1.x / 1.8] Regression in runtime fusion #19316

Closed
leezu opened this issue Oct 8, 2020 · 13 comments
Closed

[1.x / 1.8] Regression in runtime fusion #19316

leezu opened this issue Oct 8, 2020 · 13 comments
Assignees
Labels

Comments

@leezu
Copy link
Contributor

leezu commented Oct 8, 2020

In MXNet 1.8, XLNet model fails to run due to bug in runtime fusion:

 Reshape_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_kernel.cu(1084): Error: Formal parameter space overflowed (4648 bytes required, max 4096 bytes allowed) in function

dmlc/gluon-nlp#1230 (comment)

The issue persists with the latest nightly builds of the 1.x branch, ie applies to the 1.8 release. See dmlc/gluon-nlp#1384

@leezu leezu added the Bug label Oct 8, 2020
@ptrendx
Copy link
Member

ptrendx commented Oct 8, 2020

Well, I kind of agree it is a fusion bug, but this specific failure should also be tackled on the Gluon-NLP side. Having so many expand_dims on 1 thing suggests at least a performance bug in there.

@leezu
Copy link
Contributor Author

leezu commented Oct 8, 2020

The concern here is that the regression breaks models that worked without problem.

@ptrendx
Copy link
Member

ptrendx commented Oct 8, 2020

But is it actually a regression? Do 1.6/1.7 work with the same model? I don't think they would work.

@leezu
Copy link
Contributor Author

leezu commented Oct 8, 2020

Yes, this works in 1.6. There are separate integration tests for 1.6 and nightly build in GluonNLP and the 1.6 tests pass. You can refer to dmlc/gluon-nlp#1237 (comment) which introduced a version of MXNet containing the regression. I confirmed that the regression is still present in latest 1.x branch via dmlc/gluon-nlp#1384

@ptrendx
Copy link
Member

ptrendx commented Oct 8, 2020

Ok, so this suggests it broke in 1.7 then. It is quite strange though since if you look at https://github.com/apache/incubator-mxnet/commits/1.8.0.rc1/src/operator/fusion there are only a few commits that went into fusion between 1.6 and 1.7 and none of them looks like something that could affect this.

I will look into this.

@leezu
Copy link
Contributor Author

leezu commented Oct 8, 2020

Thank you. You may be right that it's an edge case. Btw, more evidence that this works in 1.6 is that you actually fixed a performance bug based on the GluonNLP XLNet model in 1.6: #17105

@ptrendx
Copy link
Member

ptrendx commented Oct 8, 2020

Well, that is actually not really an evidence because I am using that exact model (from GluonNLP 0.9.0) for my performance experiments in the current PR (that said, I'm using NVIDIA container, but 1.8rc1 is merged there and I don't believe there are any differences in fusion between NVIDIA version and upstream) and it does not show that issue. So my assumption is that this model changed in the meantime.

@leezu
Copy link
Contributor Author

leezu commented Oct 8, 2020

Any change would be verified by the GluonNLP CI. I believe fusion is activated by default in 1.6, thus the CI should test fusion?

These are the outputs of the CI for MXNet 1.6 and 1.x:

[2020-09-11T00:38:06.555Z] scripts/tests/test_scripts.py::test_xlnet_finetune_glue[MRPC] 2020-09-11 00:38:06,516 - root - INFO - Namespace(accumulate=None, attention_dropout=0.1, batch_size=32, dataset='126gb', dev_batch_size=32, dropout=0.1, early_stop=None, epochs=1, epsilon=1e-06, gpu=1, log_interval=10, lr=3e-05, lr_decay='linear', max_len=32, model_name='xlnet_cased_l12_h768_a12', model_parameters=None, only_inference=False, optimizer='Adam', output_dir='./output_dir', round_to=None, seed=2, task_name='MRPC', training_steps=None, warmup_ratio=0)
[2020-09-11T00:38:07.115Z] 2020-09-11 00:38:06,961 - root - INFO - Model file not found. Downloading to tests/data/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.params.
[2020-09-11T00:38:29.018Z] 2020-09-11 00:38:26,798 - root - INFO - processing dataset...
[2020-09-11T00:38:29.275Z] Vocab file is not found. Downloading.
[2020-09-11T00:38:29.275Z] Downloading tests/data/models/4075451053579957634/4075451053579957634_xlnet_126gb-0d744903.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/xlnet_126gb-0d744903.zip...
[2020-09-11T00:38:29.275Z] Downloading tests/data/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.zip5526533c-632d-4b5b-9a75-fed8afc36bbc from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.zip...
[2020-09-11T00:38:29.275Z] Tokenizer file is not found. Downloading.
[2020-09-11T00:38:29.275Z] Downloading tests/data/models/1599784706.6524212xlnet_126gb-871f0b3c.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/xlnet_126gb-871f0b3c.zip...
[2020-09-11T00:38:29.275Z] Downloading tests/data/datasets/glue_mrpc/msr_paraphrase_train.txt from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt...
[2020-09-11T00:38:29.275Z] Downloading tests/data/datasets/glue_mrpc/dev_ids.tsv from https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc...
[2020-09-11T00:38:31.168Z] Downloading tests/data/datasets/glue_mrpc/msr_paraphrase_test.txt from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt...
[2020-09-11T00:38:31.168Z] 2020-09-11 00:38:30,826 - root - INFO - Now we are doing XLNet classification training on [gpu(0)]!
[2020-09-11T00:38:31.168Z] 2020-09-11 00:38:30,835 - root - INFO - training steps=114
[2020-09-11T00:38:37.715Z] 2020-09-11 00:38:36,772 - root - INFO - Time cost for the first forward-backward =5.93s
[2020-09-11T00:38:39.607Z] 2020-09-11 00:38:39,249 - root - INFO - [Epoch 1 Batch 10/121] loss=0.6671, lr=0.0000276
[2020-09-11T00:38:42.130Z] 2020-09-11 00:38:42,023 - root - INFO - [Epoch 1 Batch 20/121] loss=0.6284, lr=0.0000250
[2020-09-11T00:38:45.403Z] 2020-09-11 00:38:44,803 - root - INFO - [Epoch 1 Batch 30/121] loss=0.6076, lr=0.0000224
[2020-09-11T00:38:47.962Z] 2020-09-11 00:38:47,583 - root - INFO - [Epoch 1 Batch 40/121] loss=0.6318, lr=0.0000197
[2020-09-11T00:38:50.483Z] 2020-09-11 00:38:50,242 - root - INFO - [Epoch 1 Batch 50/121] loss=0.6222, lr=0.0000171
[2020-09-11T00:38:53.006Z] 2020-09-11 00:38:52,900 - root - INFO - [Epoch 1 Batch 60/121] loss=0.5814, lr=0.0000145
[2020-09-11T00:38:55.529Z] 2020-09-11 00:38:55,232 - root - INFO - [Epoch 1 Batch 70/121] loss=0.6682, lr=0.0000118
[2020-09-11T00:38:58.051Z] 2020-09-11 00:38:57,597 - root - INFO - [Epoch 1 Batch 80/121] loss=0.6621, lr=0.0000092
[2020-09-11T00:39:00.573Z] 2020-09-11 00:39:00,412 - root - INFO - [Epoch 1 Batch 90/121] loss=0.5737, lr=0.0000066
[2020-09-11T00:39:03.847Z] 2020-09-11 00:39:03,219 - root - INFO - [Epoch 1 Batch 100/121] loss=0.5999, lr=0.0000039
[2020-09-11T00:39:06.371Z] 2020-09-11 00:39:05,890 - root - INFO - [Epoch 1 Batch 110/121] loss=0.5980, lr=0.0000013
[2020-09-11T00:39:07.301Z] 2020-09-11 00:39:07,028 - root - INFO - Finish training step: 114
[2020-09-11T00:39:07.301Z] 2020-09-11 00:39:07,028 - root - INFO - Now we are doing evaluation on dev with [gpu(0)].
[2020-09-11T00:39:08.232Z] 2020-09-11 00:39:08,011 - root - INFO - [Batch 10/13] loss=0.5612
[2020-09-11T00:39:08.489Z] 2020-09-11 00:39:08,289 - root - INFO - validation metrics:accuracy:0.7108,f1:0.8168
[2020-09-11T00:39:08.489Z] 2020-09-11 00:39:08,289 - root - INFO - Time cost=1.26s, throughput=330.05 samples/s
[2020-09-11T00:39:09.419Z] 2020-09-11 00:39:09,069 - root - INFO - params saved in: ./output_dir/model_xlnet_MRPC_0.params
[2020-09-11T00:39:09.419Z] 2020-09-11 00:39:09,069 - root - INFO - Time cost=38.23s
[2020-09-11T00:39:09.419Z] 2020-09-11 00:39:09,358 - root - INFO - Best model at epoch 1. Validation metrics:accuracy:0.7108,f1:0.8168
[2020-09-11T00:39:09.419Z] 2020-09-11 00:39:09,358 - root - INFO - Now we are doing testing on test with [gpu(0)].
[2020-09-11T00:39:15.982Z] 2020-09-11 00:39:15,329 - root - INFO - Time cost=5.97s, throughput=289.43 samples/s
[2020-09-11T00:39:15.982Z] finish test!
[2020-09-11T00:31:37.908Z] scripts/tests/test_scripts.py::test_xlnet_finetune_glue[MRPC] 2020-09-11 00:31:37,879 - root - INFO - Namespace(accumulate=None, attention_dropout=0.1, batch_size=32, dataset='126gb', dev_batch_size=32, dropout=0.1, early_stop=None, epochs=1, epsilon=1e-06, gpu=1, log_interval=10, lr=3e-05, lr_decay='linear', max_len=32, model_name='xlnet_cased_l12_h768_a12', model_parameters=None, only_inference=False, optimizer='Adam', output_dir='./output_dir', round_to=None, seed=2, task_name='MRPC', training_steps=None, warmup_ratio=0)
[2020-09-11T00:31:38.471Z] 2020-09-11 00:31:38,371 - root - INFO - Model file not found. Downloading to tests/data/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.params.
[2020-09-11T00:32:00.375Z] 2020-09-11 00:31:58,703 - root - INFO - processing dataset...
[2020-09-11T00:32:01.304Z] Vocab file is not found. Downloading.
[2020-09-11T00:32:01.304Z] Downloading tests/data/models/7866557629997205756/7866557629997205756_xlnet_126gb-0d744903.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/xlnet_126gb-0d744903.zip...
[2020-09-11T00:32:01.304Z] Downloading tests/data/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.zip33c867d1-f83e-447d-8253-6504231ddccb from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/xlnet_cased_l12_h768_a12_126gb-ca7a0921.zip...
[2020-09-11T00:32:01.304Z] Tokenizer file is not found. Downloading.
[2020-09-11T00:32:01.304Z] Downloading tests/data/models/1599784318.5417857xlnet_126gb-871f0b3c.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/xlnet_126gb-871f0b3c.zip...
[2020-09-11T00:32:01.304Z] Downloading tests/data/datasets/glue_mrpc/msr_paraphrase_train.txt from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt...
[2020-09-11T00:32:01.304Z] Downloading tests/data/datasets/glue_mrpc/dev_ids.tsv from https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc...
[2020-09-11T00:32:02.675Z] Downloading tests/data/datasets/glue_mrpc/msr_paraphrase_test.txt from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt...
[2020-09-11T00:32:02.675Z] 2020-09-11 00:32:02,673 - root - INFO - Now we are doing XLNet classification training on [gpu(0)]!
[2020-09-11T00:32:02.932Z] 2020-09-11 00:32:02,682 - root - INFO - training steps=114
[2020-09-11T00:32:08.187Z] Traceback (most recent call last):
[2020-09-11T00:32:08.187Z]   File "./scripts/language_model/run_glue.py", line 658, in <module>
[2020-09-11T00:32:08.187Z]     train(task.metrics)
[2020-09-11T00:32:08.187Z]   File "./scripts/language_model/run_glue.py", line 551, in train
[2020-09-11T00:32:08.187Z]     nlp.utils.clip_grad_global_norm(params, 1)
[2020-09-11T00:32:08.187Z]   File "/var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master/src/gluonnlp/utils/parameter.py", line 147, in clip_grad_global_norm
[2020-09-11T00:32:08.187Z]     if is_finite != 1:
[2020-09-11T00:32:08.187Z]   File "/var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master/conda/gpu/py3-master/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 422, in __bool__
[2020-09-11T00:32:08.187Z]     return bool(self.asscalar())
[2020-09-11T00:32:08.187Z]   File "/var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master/conda/gpu/py3-master/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 2585, in asscalar
[2020-09-11T00:32:08.187Z]     return self.asnumpy()[0]
[2020-09-11T00:32:08.187Z]   File "/var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master/conda/gpu/py3-master/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 2566, in asnumpy
[2020-09-11T00:32:08.187Z]     ctypes.c_size_t(data.size)))
[2020-09-11T00:32:08.187Z]   File "/var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master/conda/gpu/py3-master/lib/python3.5/site-packages/mxnet/base.py", line 246, in check_call
[2020-09-11T00:32:08.187Z]     raise get_last_ffi_error()
[2020-09-11T00:32:08.187Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2020-09-11T00:32:08.187Z]   File "src/operator/fusion/fused_op.cu", line 647
[2020-09-11T00:32:08.187Z] MXNetError: Check failed: compileResult == NVRTC_SUCCESS (6 vs. 0) : NVRTC Compilation failed. Please set environment variable MXNET_USE_FUSION to 0.
[2020-09-11T00:32:08.187Z] Reshape_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_expand_dims_kernel.cu(792): warning: variable "ndim_output143" was declared but never referenced

@ptrendx
Copy link
Member

ptrendx commented Nov 3, 2020

@leezu I'm working on this now. There is something fishy going on here - I tried reproducing it by compiling v1.8.x branch and the test passed. I also tried the pip package that you tried in your PR in GluonNLP, and that one failed. I will try to figure out where does the difference come from.

@ptrendx
Copy link
Member

ptrendx commented Nov 5, 2020

Update: the code generation itself seems to be working fine - it is the subgraph it gets that is wrong. I'm not yet sure if this is fault of some bug in common subexpression elimination (since with export MXNET_ELIMINATE_COMMON_EXPR=0 the bug does not repro) or with the fusion graph pass.

@ptrendx
Copy link
Member

ptrendx commented Nov 6, 2020

Another update - the problem comes from the eliminate common expressions. In the model there seem to be 12 reshapes of the same input, each followed by 12 expand_dims. In the correct execution, they all get squashed to a single reshape followed by expand_dims, but in the wrong execution the reshapes get squashed into 1, but the (12*12=)144 expand_dims do not, resulting in a fused operator with 144 outputs.

I'm still trying to understand why there is this discrepancy in execution.

@ptrendx
Copy link
Member

ptrendx commented Nov 6, 2020

Ok, that was pretty stupid. The reason why expand_dims were not squashed in the official pip wheel for 1.7 and 1.8 is because common expression elimination does not currently try to squash operators that use any ResourceRequest (even though it could if the request is kTempSpace) and with MKLDNN expand_dims actually requests kTempSpace. 1.6 had mkldnn off by default (so worked), but 1.7 and 1.8 have it on by default in the official pip wheels (and it is off in our container, so when I compiled it in that environment I did not repro the problem).

I will make a PR shortly to allow eliminating expressions with tempspace to unblock upgrading GluonNLP to 1.x, then will think about the changes to code generation of fusion, so that this error is not triggered. Generally speaking the problem comes from the fact that we send all the inputs/outputs shapes to the fused kernel even if they are not needed, which greatly pollutes the argument space. One can only use 4096B of arguments for the kernel, so without shapes a fused kernel could have up to 512 inputs/outputs, but having shapes greatly reduces this number (to less than 144 as seen in this bug ;-) ).

@leezu
Copy link
Contributor Author

leezu commented Nov 7, 2020

Thank you @ptrendx!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

2 participants