[CI] unix cpu validation Timeout #15880
Comments
Hey, this is the MXNet Label Bot.
@mxnet-label-bot add [CI]
test_random.py:test_shuffle is taking a long time to run. I've seen CPU runtimes between 10 and 50 minutes for that test alone. I've developed a fix and piggy-backed it onto a pending PR of mine: #15882.
Another PR, #15541: the Python 3 CPU job ran for 4 hours before terminating!
This is interesting. We need to figure out whether the increased computation is what causes the problem.
@pengzhao-intel I've seen CPU runtimes of more than 10 minutes by testing
Thanks @zixuanweeei. Could we collect the runtimes of all test cases on the CPU side (CPU, CPU+MKL, CPU+MKLDNN) and sort them?
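One possible way to gather and sort per-test runtimes is sketched below. It assumes the CI job writes an xUnit-style XML report in which each `testcase` element carries `classname`, `name`, and `time` attributes. The helper name `slowest_tests` and the sample report are hypothetical and only for illustration; the timings are not real CI data.

```python
import xml.etree.ElementTree as ET


def slowest_tests(xunit_xml, top=10):
    """Return (seconds, test id) pairs, slowest first, from an xUnit-style report."""
    root = ET.fromstring(xunit_xml)
    timings = []
    for case in root.iter("testcase"):
        # A missing or empty "time" attribute is treated as 0 seconds.
        seconds = float(case.get("time", "0") or 0)
        test_id = "{}:{}".format(case.get("classname", "?"), case.get("name", "?"))
        timings.append((seconds, test_id))
    timings.sort(reverse=True)
    return timings[:top]


# Hypothetical report contents, for illustration only (not real CI data).
SAMPLE = """<testsuite name="nosetests" tests="3">
  <testcase classname="test_random" name="test_shuffle" time="612.4"/>
  <testcase classname="test_random" name="test_normal" time="3.1"/>
  <testcase classname="test_operator" name="test_dot" time="41.0"/>
</testsuite>"""

for seconds, test_id in slowest_tests(SAMPLE):
    print("{:10.1f}s  {}".format(seconds, test_id))
```

Running the same helper over the reports from each CPU configuration (CPU, CPU+MKL, CPU+MKLDNN) would give lists that can be compared side by side.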
Sure, @pengzhao-intel. BTW, I disabled the MKLDNN subgraph backend to see whether it affects the efficiency of the shuffle operator. The results showed that the shuffle operator has the same time cost with and without the MKLDNN subgraph backend.
Some fixes from PR #15882 and PR #15922 (they have the same fixes on
From the last comment by @ChaiBapchya, we also found that
4 hr timeout on the python3 mkldnn-mkl-cpu test. Why is this test still active? It causes a lot of issues with getting PRs through the pipeline.
4 hr timeout again on MKL CPU! #16336 is a step towards getting conclusive evidence about the perennially slow unit tests. Hopefully we'll get clarity on it once that PR is merged. I am leaning towards disabling this test until the timeout issue for mkldnn is fixed! @aaronmarkham
Python 3 MKL CPU times out after more than 3 hours.
The shell script runs for 3 hours.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15794/2/pipeline/281/
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15794/1/pipeline/283
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15794/6/pipeline
But what's the cause?
PR #15794 doesn't make any change to C API.