This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

test_numpy_op.py::test_np_empty_like hangs #18144

Open
szha opened this issue Apr 23, 2020 · 5 comments

Comments

@szha
Member

szha commented Apr 23, 2020

Description

test_numpy_op.py::test_np_empty_like hangs on unix-gpu

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18025/59/pipeline/425

@leezu
Contributor

leezu commented Apr 23, 2020

In the linked CI run, test_numpy_op.py::test_np_empty_like is not run, so it can't be responsible for the hang. There must be other triggers besides test_np_empty_like.

Related #18090

@haojin2
Contributor

haojin2 commented Apr 24, 2020

Wow, this issue is VERY INTERESTING: in the first link given in the issue description I'm not even seeing test_np_empty_like being run at all, and the last test run before the process was killed by the final timeout was test_np_bincount. Also, as @leezu pointed out in the above comment, even removing test_np_empty_like does not solve the issue. So to conclude, so far I'm not seeing any solid evidence that test_np_empty_like is the root cause of the hang.
To be clear, I'm not saying we shouldn't re-implement empty_like with a native implementation in the future; I simply want to suggest that you may be attacking the wrong target at the moment.

@leezu
Contributor

leezu commented Apr 24, 2020

@haojin2 you can check #18090 for the evidence. The problem with the above commit is that it disables only empty_like, not the other numpy operators that rely on CustomOp. With those disabled as well in #18151, CI has passed without a hang 2 times in a row so far. You're right that this doesn't fix the root cause. The objective here is to restore CI stability.

@haojin2
Contributor

haojin2 commented Apr 24, 2020

@leezu I understand the goal, but my point is that we should avoid providing unrelated info in the issue's description (the hang in the first provided link is not related at all), shouldn't we? It would have been better if a link to #18090 had been provided in the first place to avoid such confusion, don't you agree?

@leezu
Contributor

leezu commented Apr 24, 2020

I agree. #18090 should have been linked, but it may have been missed unintentionally.

leezu added a commit that referenced this issue Apr 24, 2020
These tests are prone to triggering a deadlock. See #18090 #18144
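
For readers who want to see what disabling a deadlock-prone test can look like, here is a minimal sketch. It is not the content of the referenced commit: it uses the standard-library `unittest.skip` so it runs without extra dependencies, and `custom_op_deadlock`, `DEADLOCK_REASON`, `NumpyOpTests`, `test_unaffected`, and `run_suite` are hypothetical names introduced only for illustration.

```python
import unittest

# Hypothetical reason string; only the issue numbers come from the thread.
DEADLOCK_REASON = "CustomOp-based test may deadlock CI, see #18090 and #18144"


def custom_op_deadlock(test_fn):
    """Hypothetical helper: mark a CustomOp-dependent test as skipped."""
    return unittest.skip(DEADLOCK_REASON)(test_fn)


class NumpyOpTests(unittest.TestCase):
    @custom_op_deadlock
    def test_np_empty_like(self):
        # Placeholder body; it is never executed while the skip is in place.
        raise AssertionError("should not run")

    def test_unaffected(self):
        # Stand-in for a test that does not rely on CustomOp.
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the example test case and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NumpyOpTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Using one shared helper keeps the reason and issue references in a single place, so every CustomOp-dependent test can be re-enabled together once the root cause is fixed.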
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this issue Jul 6, 2020