This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Revert "Subgraph API for integrating accelerators with MXNet (#12157)" #12443

Closed
wants to merge 1 commit into from


lebeg
Contributor

@lebeg lebeg commented Sep 3, 2018

This reverts commit a64cf7d.

Description

Master build failed: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1550/pipeline

Tracking issue: #12442

It is hard to say exactly which test failed the build, so I propose reverting the whole merge for now.

@pengzhao-intel
Contributor

FYI, we rely heavily on this PR for our further work, and our code will be submitted soon.
The revert would delay our progress a lot.

@zheng-da @reminisce could we have a hotfix for the build issue? Our team can work with you on it too, @ZhennanQin.

@marcoabreu
Contributor

Does master fail consistently?

@lebeg
Contributor Author

lebeg commented Sep 3, 2018

No, this was a one-time failure; the build is back to normal for now.

@reminisce
Contributor

I don't see the subgraph API unit tests failing on the linked page. Why do you think that PR caused the build failure?

@reminisce
Contributor

The failure might be related to a bug that has been hidden in MXNet for a long time. The subgraph API might just expose that bug in the CI. I observed that the state of a stateful operator cannot be destructed in the threaded engine (the naive engine is fine). I suggest we fix that bug and see the effect, instead of reverting the subgraph API PR.
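One way to check the threaded-versus-naive observation is to run the same test file once per engine. The following is a minimal sketch, assuming the `MXNET_ENGINE_TYPE` environment variable selects the engine and that `nosetests-2.7` is the runner, as in the CI log. The helper names `engine_env` and `engine_commands` are hypothetical and not part of MXNet.

```python
import os


def engine_env(engine_type, base_env=None):
    """Return a copy of the environment with MXNET_ENGINE_TYPE set.

    The variable must be set before the process imports mxnet,
    so it is passed to a fresh subprocess rather than set in-process.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_ENGINE_TYPE"] = engine_type
    return env


def engine_commands(test_path,
                    engines=("NaiveEngine", "ThreadedEnginePerDevice"),
                    base_env=None):
    """Build one (command, env) pair per engine for the same test file."""
    command = ["nosetests-2.7", "--verbose", test_path]
    return [(list(command), engine_env(name, base_env)) for name in engines]
```

Each pair can be passed to `subprocess.run(command, env=env)`. If the illegal memory access shows up only under the threaded engine, that would point toward the destruction bug described above rather than toward the subgraph tests.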

@lebeg
Contributor Author

lebeg commented Sep 3, 2018

My thought was the following:

Git history for tests/python/gpu/test_operator_gpu.py shows that Subgraph API for integrating accelerators with MXNet (#12157) was the last commit for this file.

commit a64cf7d9c8c1c473e201b5bd68ab9af6bf7365ba
Author: reminisce <[email protected]>
Date:   Thu Aug 30 19:13:33 2018 -0700

    Subgraph API for integrating accelerators with MXNet (#12157)

commit 2193819d40792d0526118819b991111e7ac4162d
Author: Sam Skalicky <[email protected]>
Date:   Sun Aug 12 12:43:19 2018 -0700

    [MXNET-788] Fix for issue #11733 pooling op test (#12067)

The build that failed was from 03-Sep-2018 06:00.

It showed multiple errors that had not been seen before and are probably related to an inconsistent state of CUDA memory:

test_operator_gpu.test_countsketch ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=104987558 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_basic ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2134146737 to reproduce.
ERROR
test_operator_gpu.test_exc_multiple_waits ... ok
test_operator_gpu.test_lstm_bidirectional ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=200476953 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_setitem ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2082345391 to reproduce.
ERROR
test_operator_gpu.test_exc_post_fail ... ok
test_operator_gpu.test_gru_sym ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1532640391 to reproduce.
ERROR
test_operator_gpu.test_exc_mutable_var_fail ... ok
test_operator_gpu.test_sparse_nd_slice ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1828661033 to reproduce.
ERROR
test_operator_gpu.test_ndarray_elementwise ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1460065938 to reproduce.
ERROR
test_operator_gpu.test_gru_bidirectional ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=16762643 to reproduce.
ERROR
test_operator_gpu.test_ndarray_elementwisesum ... [06:59:47] src/operator/tensor/./.././../common/../operator/mxnet_op.h:622: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 639:     8 Aborted                 (core dumped) nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
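The seeds printed in this log can be collected so that each failing test can be rerun with the same `MXNET_TEST_SEED`. The following is a minimal sketch, assuming the log layout shown above (a `name ... message` line followed by a bare `ERROR` line). The function name `failed_tests_with_seeds` is hypothetical.

```python
import re

# A nose result line looks like "module.test_name ... <rest>".
TEST_LINE = re.compile(r"^(?P<name>\S+) \.\.\. (?P<rest>.*)$")
SEED = re.compile(r"MXNET_TEST_SEED=(\d+)")
FAIL_WORDS = ("ERROR", "FAIL")


def failed_tests_with_seeds(log_text):
    """Return [(test_name, seed_or_None), ...] for tests marked ERROR or FAIL."""
    failures = []
    pending = None  # most recent test whose status has not been seen yet
    for line in log_text.splitlines():
        match = TEST_LINE.match(line)
        if match:
            rest = match.group("rest").strip()
            seed = SEED.search(rest)
            pending = (match.group("name"), int(seed.group(1)) if seed else None)
            if rest == "ok":
                pending = None
            elif rest in FAIL_WORDS:
                failures.append(pending)
                pending = None
        elif pending is not None and line.strip() in FAIL_WORDS:
            # The status arrives on its own line after the seed message.
            failures.append(pending)
            pending = None
    return failures
```

A test that aborts the whole process, like `test_ndarray_elementwisesum` above, never prints a status line and is therefore not reported by this sketch.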

I looked at which test was executed just before:

test_operator_gpu.test_exc_imperative ... ok
test_operator_gpu.test_subgraph_exe ... [06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:741: The graph has no attribute of subgraph_property attached. The original graph is returned.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:741: The graph has no attribute of subgraph_property attached. The original graph is returned.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node sin3. Excluding nodes _plus3, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node sin3. Excluding nodes _plus3, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node sin3. Excluding nodes _plus3, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node sin3. Excluding nodes _plus3, and retrying

All of this made me think the issue might be related to the mentioned PR #12157.

@lebeg
Contributor Author

lebeg commented Sep 3, 2018

Btw: I really support the approach of fixing bugs instead of reverting changes. The verification pipeline is currently back to normal, so this revert doesn't need an urgent merge. I will monitor it to see if there will be consecutive failures.

@zheng-da
Contributor

zheng-da commented Sep 3, 2018

@lebeg I don't understand why you think the error is caused by the subgraph API? It seems it's based on your speculation. Does the error happen consistently (even if it's once every few hundred runs)?

Even if the error is caused by the subgraph PR, it's still possible that there is a bug in MXNet that is triggered by the tests in the subgraph PR. Let's run many more tests before reaching any conclusion.
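To put a number on "many more tests": if a failure occurs independently with probability p per run, the chance of seeing it at least once in n runs is 1 - (1 - p)^n. The following is a minimal sketch of that arithmetic, assuming independent runs. The function name `runs_needed` is hypothetical.

```python
import math


def runs_needed(failure_rate, confidence):
    """Smallest n such that 1 - (1 - failure_rate)**n >= confidence.

    Assumes every run fails independently with the same probability.
    """
    if not (0.0 < failure_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))
```

Under this assumption, a failure that occurs once in a hundred runs needs `runs_needed(0.01, 0.95)`, that is 299 runs, to be observed with 95% probability, so a single clean or failed build says little either way.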

@lebeg
Contributor Author

lebeg commented Sep 3, 2018

@zheng-da is your point that the data I provided above is not enough to come to such a conclusion?

> Does the error happen consistently (even if it's once every few hundred runs)?

We don't have that much data, so I started to collect it in #12445 and here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/activity?branch=PR-12445

The job is triggered every hour.

It is currently broken because the SHA was pointing to a tvm commit that no longer exists. The fix is in #12448.

@zheng-da
Contributor

zheng-da commented Sep 3, 2018

@lebeg I might be missing something. I don't know what "this file" refers to in #12443 (comment).

My understanding is that the failure has happened once so far and the subgraph tests happened to run before the failed tests.

@lebeg
Contributor Author

lebeg commented Sep 3, 2018

@zheng-da I'm really sorry! I forgot to provide a link, it's this file: tests/python/gpu/test_operator_gpu.py
Will update the description in the comment.

@zheng-da
Contributor

zheng-da commented Sep 3, 2018

@lebeg I see. In my experience with the CI, this CUDA memory error has happened before (at least quite a few times). I originally thought it happened because the GPUs had too much workload and somehow ran out of memory or had resource conflicts. Maybe the CI has improved and can now prevent this kind of thing from happening?
Anyway, let's wait for your test results and see if this error happens consistently.

@reminisce
Contributor

I had a similar experience when doing the check for a release vote before this PR was merged. If I executed all the unit test files with nosetests, the GPU test would fail like this. If I executed them one by one, all the tests passed.
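The two ways of running the tests can be compared side by side. The following is a minimal sketch, assuming `nosetests-2.7 --verbose` as in the CI log. The helper names `combined_command` and `per_file_commands` are hypothetical.

```python
def combined_command(test_dir, runner="nosetests-2.7"):
    """One process runs every test file, so GPU state is shared across files."""
    return [runner, "--verbose", test_dir]


def per_file_commands(test_files, runner="nosetests-2.7"):
    """One process per test file, so each file starts with a fresh process."""
    return [[runner, "--verbose", path] for path in sorted(test_files)]
```

Each command can be passed to `subprocess.run`. If the per-file runs pass while the combined run fails, the cause is more likely state left behind by an earlier test than a defect in the test that reports the error.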

@pengzhao-intel
Contributor

pengzhao-intel commented Sep 4, 2018

@reminisce Indeed! We encountered this issue and were stuck on it for a long time in PR #10921.

In PR #10921, the new tests failed when run together in one continuous run but all passed when run separately under the GPU context. Under the CPU context, on the other hand, they passed with either approach.

Anyway, this GPU issue needs to be resolved, and the observed issue may NOT be related to the subgraph PR.

> I had a similar experience when doing the check for a release vote before this PR was merged. If I executed all the unit test files with nosetests, the gpu test would fail like this. If I executed them one by one, all the tests can pass.

@marcoabreu
Contributor

Did running the tests with valgrind and cuda memcheck give any leads? It's usually just an indicator that some test caused memory corruption that causes other ones to fail downstream.

@lebeg
Contributor Author

lebeg commented Sep 4, 2018

@reminisce

> I had a similar experience when doing the check for a release vote before this PR was merged.

I didn't know that. I'm looking at the build history of PR #12157 and see only failures related to test_mkldnn.test_activation (tracked in #12377) and test_operator_gpu.test_l2_normalization (tracked in #12417).

@pengzhao-intel

> We encountered and stuck by this issue for a long time in PR #10921

But I did find a very similar error for #10921 here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10921/23/pipeline/

That is clearly from before the subgraph API was introduced.

Thank you for providing this information. I have opened issue #12453 about this, and it can be referenced in any future cases of such failures.

I'm closing this PR and the related issue.

@lebeg lebeg closed this Sep 4, 2018