Revert "Subgraph API for integrating accelerators with MXNet (#12157)" #12443

lebeg · 2018-09-03T10:17:59Z

This reverts commit a64cf7d.

Description

Master build failed: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1550/pipeline

Tracking issue: #12442

It is hard to say what test exactly failed the build, so I propose to revert the whole merge for now.

…12157)" This reverts commit a64cf7d.

pengzhao-intel · 2018-09-03T11:05:15Z

FYI, we highly rely on this PR for our further works and our code will be submitted soon.
The revert will delay our progress a lot.

@zheng-da @reminisce could we have a hotfix for the building issue? And our team can co-work with you too @ZhennanQin.

marcoabreu · 2018-09-03T13:26:51Z

Does master fail consistently?

lebeg · 2018-09-03T14:09:35Z

No this was a 1 time failure, build back to normal for now.

reminisce · 2018-09-03T14:17:53Z

I don't see the subgraph API unit tests fail in the linked page. Why do you think it's that PR caused the build failure?

reminisce · 2018-09-03T14:38:41Z

The failure might be related to a bug that has been hidden in MXNet for long time. The subgraph API just might just expose the bug in the CI. I observed that the state of a stateful operator cannot be destructed in threaded engine (naive engine is fine). I suggest we fix that bug instead of reverting the subgraph API PR to see the effect.

lebeg · 2018-09-03T14:58:12Z

My thought was the following:

Git history for tests/python/gpu/test_operator_gpu.py shows that Subgraph API for integrating accelerators with MXNet (#12157) was the last commit for this file.

commit a64cf7d9c8c1c473e201b5bd68ab9af6bf7365ba
Author: reminisce <[email protected]>
Date:   Thu Aug 30 19:13:33 2018 -0700

    Subgraph API for integrating accelerators with MXNet (#12157)

commit 2193819d40792d0526118819b991111e7ac4162d
Author: Sam Skalicky <[email protected]>
Date:   Sun Aug 12 12:43:19 2018 -0700

    [MXNET-788] Fix for issue #11733 pooling op test (#12067)

The build that failed was from 03-Sep-2018 06:00.

Based on multiple errors not seen before and probably related to inconsistent state of CUDA memory:

test_operator_gpu.test_countsketch ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=104987558 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_basic ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2134146737 to reproduce.
ERROR
test_operator_gpu.test_exc_multiple_waits ... ok
test_operator_gpu.test_lstm_bidirectional ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=200476953 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_setitem ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2082345391 to reproduce.
ERROR
test_operator_gpu.test_exc_post_fail ... ok
test_operator_gpu.test_gru_sym ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1532640391 to reproduce.
ERROR
test_operator_gpu.test_exc_mutable_var_fail ... ok
test_operator_gpu.test_sparse_nd_slice ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1828661033 to reproduce.
ERROR
test_operator_gpu.test_ndarray_elementwise ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1460065938 to reproduce.
ERROR
test_operator_gpu.test_gru_bidirectional ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=16762643 to reproduce.
ERROR
test_operator_gpu.test_ndarray_elementwisesum ... [06:59:47] src/operator/tensor/./.././../common/../operator/mxnet_op.h:622: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 639:     8 Aborted                 (core dumped) nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu

I looked at what test was executed before:

test_operator_gpu.test_exc_imperative ... ok
test_operator_gpu.test_subgraph_exe ... [06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp0. Excluding nodes _plus0, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node exp1. Excluding nodes _plus1, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:741: The graph has no attribute of subgraph_property attached. The original graph is returned.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:741: The graph has no attribute of subgraph_property attached. The original graph is returned.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node sin3. Excluding nodes _plus3, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node sin3. Excluding nodes _plus3, and retrying
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node sin3. Excluding nodes _plus3, and retrying
[06:59:45] src/executor/graph_executor.cc:1486: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[06:59:45] src/operator/subgraph/partition_graph.cc:335: Found a cycle when BFS from node sin3. Excluding nodes _plus3, and retrying

All of this made me think the issue might be related to the mentioned PR #12157.

lebeg · 2018-09-03T15:03:27Z

Btw: I really support the approach of fixing bugs instead of reverting changes. The verification pipeline is currently back to normal, so this revert doesn't need an urgent merge. I will monitor it to see if there will be consecutive failures.

zheng-da · 2018-09-03T20:00:35Z

@lebeg I don't understand why you think the error is caused by the subgraph API? It seems it's based on your speculation. Does the error happen consistently (even if it's once every few hundred runs)?

Even if the error is caused by the subgraph PR, it's still possible that there is a bug in mxnet that is triggered by the tests in the subgraph PR. let's have many more tests before reaching any conclusion.

lebeg · 2018-09-03T20:22:17Z

@zheng-da is your point that data I provided above is not enough to come to such conclusion?

Does the error happen consistently (even if it's once every few hundred runs)?

We didn't have that much data, so I started to collect it here #12445 and here http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/activity?branch=PR-12445

The job is triggered every hour.

Currently it did break due to the SHA was pointing to an not anymore existing tvm commit. Fix is in #12448

zheng-da · 2018-09-03T20:26:33Z

@lebeg I might miss something. I don't know what this file refers to in #12443 (comment).

My understanding is that the failure happens once so far and the subgraph tests happen to occur before the failed tests.

lebeg · 2018-09-03T20:31:41Z

@zheng-da I'm really sorry! I forgot to provide a link, it's this file: tests/python/gpu/test_operator_gpu.py
Will update the description in the comment.

zheng-da · 2018-09-03T20:47:17Z

@lebeg i see. my experience with the CI is that this CUDA memory error happened before (at least quite a few times). I originally thought this error happened because GPUs have too much workload and somehow run out of memory or have resource conflicts. Maybe CI has improved and can prevent this kind of things from happening?
Anyway, let's wait for your test results and see if this error happens consistently.

reminisce · 2018-09-04T02:43:02Z

I had a similar experience when doing the check for a release vote before this PR was merged. If I executed all the unit test files with nosetests, the gpu test would fail like this. If I executed them one by one, all the tests can pass.

pengzhao-intel · 2018-09-04T07:31:38Z

@reminisce Indeed! We encountered and stuck by this issue for a long time in PR #10921

In PR #10921, the new tests will fail in the continuously run but all passed in the separate run under the GPU context. On the other hand, all CPU context will pass either two approaches.

Anyway, this GPU issue needs to be resolved. And the observed issue may be NOT related with subgraph PR.

I had a similar experience when doing the check for a release vote before this PR was merged. If I executed all the unit test files with nosetests, the gpu test would fail like this. If I executed them one by one, all the tests can pass.

marcoabreu · 2018-09-04T08:10:41Z

Did running the tests with valgrind and cuda memcheck give any leads? It's usually just an indicator that some test caused memory corruption that causes other ones to fail downstream.

lebeg · 2018-09-04T08:37:38Z

@reminisce

I had a similar experience when doing the check for a release vote before this PR was merged.

I didn't know that. I'm looking at the build history of the #12157 PR and see only failures related to test_mkldnn.test_activation issue tracked #12377 and test_operator_gpu.test_l2_normalization issue tracked #12417.

@pengzhao-intel

We encountered and stuck by this issue for a long time in PR #10921

But I did find an very similar error for #10921 here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10921/23/pipeline/

Which is clearly before the subgraph API was introduced.

Thank you for providing this information. I have opened an issue #12453 about this and it can be referenced in any future cases of such failures.

I'm closing this PR and the related issue.

Revert "Subgraph API for integrating accelerators with MXNet (apache#…

ceaac8c

…12157)" This reverts commit a64cf7d.

lebeg requested a review from anirudh2290 as a code owner September 3, 2018 10:17

lebeg mentioned this pull request Sep 3, 2018

Master build fails due to Subgraph API commit #12442

Closed

lebeg closed this Sep 4, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert "Subgraph API for integrating accelerators with MXNet (#12157)" #12443

Revert "Subgraph API for integrating accelerators with MXNet (#12157)" #12443

lebeg commented Sep 3, 2018 •

edited

Loading

pengzhao-intel commented Sep 3, 2018

marcoabreu commented Sep 3, 2018

lebeg commented Sep 3, 2018

reminisce commented Sep 3, 2018

reminisce commented Sep 3, 2018

lebeg commented Sep 3, 2018 •

edited

Loading

lebeg commented Sep 3, 2018

zheng-da commented Sep 3, 2018

lebeg commented Sep 3, 2018

zheng-da commented Sep 3, 2018

lebeg commented Sep 3, 2018

zheng-da commented Sep 3, 2018

reminisce commented Sep 4, 2018

pengzhao-intel commented Sep 4, 2018 •

edited

Loading

marcoabreu commented Sep 4, 2018

lebeg commented Sep 4, 2018

Revert "Subgraph API for integrating accelerators with MXNet (#12157)" #12443

Revert "Subgraph API for integrating accelerators with MXNet (#12157)" #12443

Conversation

lebeg commented Sep 3, 2018 • edited Loading

Description

pengzhao-intel commented Sep 3, 2018

marcoabreu commented Sep 3, 2018

lebeg commented Sep 3, 2018

reminisce commented Sep 3, 2018

reminisce commented Sep 3, 2018

lebeg commented Sep 3, 2018 • edited Loading

lebeg commented Sep 3, 2018

zheng-da commented Sep 3, 2018

lebeg commented Sep 3, 2018

zheng-da commented Sep 3, 2018

lebeg commented Sep 3, 2018

zheng-da commented Sep 3, 2018

reminisce commented Sep 4, 2018

pengzhao-intel commented Sep 4, 2018 • edited Loading

marcoabreu commented Sep 4, 2018

lebeg commented Sep 4, 2018

lebeg commented Sep 3, 2018 •

edited

Loading

lebeg commented Sep 3, 2018 •

edited

Loading

pengzhao-intel commented Sep 4, 2018 •

edited

Loading