[1.x / 1.8] Regression in runtime fusion #19316
Comments
Well, I kind of agree it is a fusion bug, but this specific failure should also be tackled on the Gluon-NLP side. Having so many inputs and outputs feeding a single fused node is problematic in any case.
The concern here is that the regression breaks models that previously worked without problems.
But is it actually a regression? Do 1.6/1.7 work with the same model? I don't think they would.
Yes, this works in 1.6. There are separate integration tests for 1.6 and the nightly build in GluonNLP, and the 1.6 tests pass. You can refer to dmlc/gluon-nlp#1237 (comment), which introduced a version of MXNet containing the regression. I confirmed that the regression is still present in the latest 1.x branch via dmlc/gluon-nlp#1384.
Ok, so this suggests it broke in 1.7 then. It is quite strange though: if you look at https://github.com/apache/incubator-mxnet/commits/1.8.0.rc1/src/operator/fusion there are only a few commits that went into fusion between 1.6 and 1.7, and none of them looks like something that could affect this. I will look into it.
Thank you. You may be right that it's an edge case. Btw, more evidence that this works in 1.6 is that you actually fixed a performance bug based on the GluonNLP XLNet model in 1.6: #17105
Well, that is actually not really evidence, because I am using that exact model (from GluonNLP 0.9.0) for my performance experiments in the current PR and it does not show this issue (that said, I'm using the NVIDIA container, but 1.8rc1 is merged there and I don't believe there are any differences in fusion between the NVIDIA version and upstream). So my assumption is that this model changed in the meantime.
Any change would have been caught by the GluonNLP CI. I believe fusion is enabled by default in 1.6, so the CI should be exercising fusion? These are the outputs of the CI for MXNet 1.6 and 1.x:
[CI output for the MXNet 1.6 and 1.x jobs]
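As an aside, a minimal sketch of the knob one could use to rule fusion in or out while chasing a failure like this, assuming the MXNET_USE_FUSION environment variable that gates runtime pointwise fusion in MXNet 1.6+; the tiny hybridized block below is only a stand-in for the real GluonNLP model:

```python
import os

# Toggle runtime (pointwise) fusion; setting it before importing mxnet is the
# safe order. "1" (the default) enables fusion, "0" disables it.
os.environ["MXNET_USE_FUSION"] = "0"

import mxnet as mx
from mxnet.gluon import nn


class PointwiseChain(nn.HybridBlock):
    """Chain of pointwise ops -- the pattern the fusion pass targets on GPU."""

    def hybrid_forward(self, F, x):
        return F.tanh(F.exp(x) + F.sqrt(F.relu(x)))


net = PointwiseChain()
net.hybridize()
print(net(mx.nd.random.uniform(shape=(4, 8))).shape)
```

Running the actual failing script once with the variable set to 0 and once with 1 would show whether the fused subgraph, rather than the model itself, is responsible.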
@leezu I'm working on this now. There is something fishy going on here - I tried reproducing it by compiling the latest 1.x branch.
Update: the code generation itself seems to be working fine - it is the subgraph it gets that is wrong. I'm not yet sure if this is the fault of some bug in common subexpression elimination (since with MXNET_ELIMINATE_COMMON_EXPR=0 the failure does not reproduce) or of how fusion consumes the resulting subgraph.
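A similar sketch for isolating the elimination pass, assuming the MXNET_ELIMINATE_COMMON_EXPR environment variable; the deliberately duplicated subexpression below is only an illustration, not the XLNet graph:

```python
import os

# Keep fusion on but switch off common (sub)expression elimination.
os.environ["MXNET_ELIMINATE_COMMON_EXPR"] = "0"

import mxnet as mx
from mxnet.gluon import nn


class DuplicatedExpr(nn.HybridBlock):
    """Computes exp(x) twice; with the pass enabled the two exp nodes can be merged."""

    def hybrid_forward(self, F, x):
        return F.exp(x) * F.relu(x) + F.exp(x) * F.sigmoid(x)


net = DuplicatedExpr()
net.hybridize()
print(net(mx.nd.ones((2, 3))).shape)
```

If the failure disappears with the pass off but fusion still on, the bad subgraph is produced by the elimination step rather than by the fusion code generation.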
Another update - the problem comes from the common expression elimination pass. In the model there seem to be 12 copies of the same expression. I'm still trying to understand why there is this discrepancy in execution.
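For readers who have not looked at that pass, a toy sketch (explicitly not MXNet's implementation) of what common expression elimination does: nodes keyed on the same operator and already-deduplicated inputs collapse into one node that all former consumers share.

```python
def eliminate_common_expr(nodes):
    """Toy CSE over a graph given as (name, op, input_names) triples."""
    seen = {}      # (op, resolved input names) -> canonical node name
    replace = {}   # duplicate node name -> canonical node name
    result = []
    for name, op, inputs in nodes:
        key = (op, tuple(replace.get(i, i) for i in inputs))
        if key in seen:
            replace[name] = seen[key]          # duplicate: reuse earlier node
        else:
            seen[key] = name
            result.append((name, op, key[1]))  # keep, with inputs rewritten
    return result


graph = [
    ("a", "var", ()),
    ("e1", "exp", ("a",)),
    ("e2", "exp", ("a",)),      # structurally identical to e1
    ("s", "add", ("e1", "e2")),
]
print(eliminate_common_expr(graph))
# [('a', 'var', ()), ('e1', 'exp', ('a',)), ('s', 'add', ('e1', 'e1'))]
```

As the next comment explains, expressions that request temporary workspace are not eligible for this merging, which is where the discrepancy came from.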
Ok, that was pretty stupid. The reason for the discrepancy is that expressions requesting temporary workspace are currently not eligible for elimination. I will make a PR shortly to allow eliminating expressions with tempspace, to unblock upgrading GluonNLP to 1.x, and then will think about changes to the fusion code generation so that this error is not triggered. Generally speaking, the problem comes from the fact that we send all the input/output shapes to the fused kernel even if they are not needed, which greatly pollutes the argument space. A kernel can only take 4096 bytes of arguments, so without shapes a fused kernel could have up to 512 inputs/outputs, but having shapes greatly reduces this number (to less than 144, as seen in this bug ;-) ).
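To make the argument-budget arithmetic concrete, here is a back-of-the-envelope sketch; the 4096-byte limit and the 512 figure come from the comment above (which implies 8-byte pointers), while the per-tensor rank and 64-bit shape entries are assumptions chosen only for illustration:

```python
# CUDA caps the total size of a kernel's arguments at 4096 bytes.
PARAM_LIMIT_BYTES = 4096
POINTER_BYTES = 8      # one data pointer per tensor input/output
NDIM = 3               # assumed rank of each tensor (illustration only)
INDEX_BYTES = 8        # assumed 64-bit entries per shape dimension

# Pointers only: 4096 / 8 = 512 tensors fit.
max_without_shapes = PARAM_LIMIT_BYTES // POINTER_BYTES

# Pointer plus full shape per tensor: 4096 / (8 + 3 * 8) = 128 tensors fit.
max_with_shapes = PARAM_LIMIT_BYTES // (POINTER_BYTES + NDIM * INDEX_BYTES)

print(max_without_shapes, max_with_shapes)  # 512 128
```

Under these assumptions the budget drops from 512 tensors to 128, in line with the "less than 144" observed in this bug.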
Thank you @ptrendx! |
In MXNet 1.8, the XLNet model fails to run due to a bug in runtime fusion:
dmlc/gluon-nlp#1230 (comment)
The issue persists with the latest nightly builds of the 1.x branch, i.e. it applies to the 1.8 release. See dmlc/gluon-nlp#1384.