Revert "Subgraph API for integrating accelerators with MXNet (#12157)" #12443
Conversation
FYI, we rely heavily on this PR for our further work and our code will be submitted soon. @zheng-da @reminisce could we have a hotfix for the build issue? Our team can also work together with you on this, @ZhennanQin.
Does master fail consistently?
No, this was a one-time failure; the build is back to normal for now.
I don't see the subgraph API unit tests failing in the linked page. Why do you think it's that PR that caused the build failure?
The failure might be related to a bug that has been hidden in MXNet for a long time; the subgraph API might just expose the bug in the CI. I observed that the state of a stateful operator cannot be destructed in the threaded engine (the naive engine is fine). I suggest we fix that bug instead of reverting the subgraph API PR and see the effect.
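To make the comparison concrete, here is a rough sketch of what I mean (the repro script name is just a placeholder, and the engine names are the usual MXNET_ENGINE_TYPE values; this is not a test from the repo):

```python
# Rough sketch: run the same repro under the naive and the threaded engine to see
# whether the stateful-operator teardown only misbehaves in the threaded engine.
# MXNET_ENGINE_TYPE must be set before mxnet is imported, hence the subprocess.
import os
import subprocess
import sys

for engine in ("NaiveEngine", "ThreadedEnginePerDevice"):
    env = dict(os.environ, MXNET_ENGINE_TYPE=engine)
    # "repro_stateful_op.py" is a placeholder for whatever script reproduces the issue.
    ret = subprocess.call([sys.executable, "repro_stateful_op.py"], env=env)
    print("%s -> exit code %d" % (engine, ret))
```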
My thought was the following: the Git history for tests/python/gpu/test_operator_gpu.py shows that "Subgraph API for integrating accelerators with MXNet (#12157)" was the last commit touching this file.
The build that failed was from 03-Sep-2018 06:00. It showed multiple errors not seen before, probably related to an inconsistent state of CUDA memory:
I looked at what test was executed before:
All of this made me think the issue might be related to the mentioned PR #12157.
Btw: I really support the approach of fixing bugs instead of reverting changes. The verification pipeline is currently back to normal, so this revert doesn't need an urgent merge. I will monitor it to see if there will be consecutive failures.
@lebeg I don't understand why you think the error is caused by the subgraph API; it seems to be based on speculation. Does the error happen consistently (even if it's only once every few hundred runs)? Even if the error is caused by the subgraph PR, it's still possible that there is a bug in MXNet that is triggered by the tests in the subgraph PR. Let's have many more test runs before reaching any conclusion.
@zheng-da is your point that the data I provided above is not enough to come to such a conclusion?
We didn't have that much data, so I started to collect it here #12445 and here http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/activity?branch=PR-12445 The job is triggered every hour. It currently broke because the SHA was pointing to a tvm commit that no longer exists; the fix is in #12448.
@lebeg I might have missed something; I don't know what you are referring to. My understanding is that the failure has happened only once so far and the subgraph tests happen to run before the failed tests.
@zheng-da I'm really sorry! I forgot to provide a link, it's this file: |
@lebeg I see. My experience with the CI is that this CUDA memory error has happened before (at least quite a few times). I originally thought this error happened because the GPUs have too much workload and somehow run out of memory or have resource conflicts. Maybe the CI has improved and can prevent this kind of thing from happening?
I had a similar experience when doing the check for a release vote, before this PR was merged. If I executed all the unit test files with nosetests, the GPU test would fail like this. If I executed them one by one, all the tests passed.
@reminisce Indeed! We encountered this issue and were stuck on it for a long time in PR #10921. In that PR, the new tests fail in the continuous run but all pass when run separately under the GPU context. Under the CPU context, all tests pass with either approach. Anyway, this GPU issue needs to be resolved, and the observed issue may NOT be related to the subgraph PR.
Did running the tests with valgrind and cuda-memcheck give any leads? It's usually just an indicator that some test caused memory corruption that causes other ones to fail downstream.
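Something along these lines would be a starting point (a rough sketch using the standard cuda-memcheck and valgrind invocations, not a script from this repo; the test path is the one discussed above):

```python
# Rough sketch: run the GPU unit tests under cuda-memcheck and valgrind to look
# for the memory corruption suspected above.
import subprocess
import sys

test_file = "tests/python/gpu/test_operator_gpu.py"

# cuda-memcheck flags out-of-bounds and misaligned device memory accesses.
subprocess.call(["cuda-memcheck", sys.executable, "-m", "nose", "-v", test_file])

# valgrind catches host-side corruption; CUDA and Python usually need suppressions
# to keep the report readable.
subprocess.call(["valgrind", "--tool=memcheck", sys.executable, "-m", "nose", "-v", test_file])
```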
I didn't know that. I'm looking at the build history of the #12157 PR and see only failures related to test_mkldnn.test_activation (tracked in #12377) and test_operator_gpu.test_l2_normalization (tracked in #12417).
But I did find a very similar error for #10921 here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10921/23/pipeline/ which is clearly from before the subgraph API was introduced. Thank you for providing this information. I have opened issue #12453 about this, and it can be referenced in any future cases of such failures. I'm closing this PR and the related issue.
This reverts commit a64cf7d.
Description
Master build failed: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1550/pipeline
Tracking issue: #12442
It is hard to say exactly which test failed the build, so I propose reverting the whole merge for now.