Switch to GCC 8 for distribution build #19185
Conversation
Hey @leezu, thanks for submitting the PR.
CI supported jobs: [windows-cpu, unix-cpu, windows-gpu, website, clang, centos-gpu, unix-gpu, centos-cpu, edge, sanity, miscellaneous] Note:
@leezu Sure, we will take a look at the issue. Do you have steps for us to reproduce it?
I'm seeing this error in my PR too. See https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-cpu/detail/PR-19175/5/pipeline
I have reproduced this locally on CentOS.
I was able to reduce the number of test cases in the reproduction sequence to two cases:
Interestingly, introducing another test in the middle causes the sequence to pass:
I edited the test test_contrib_intgemm_multiply, eliminating calls to test_contrib_intgemm_multiply, which fixed the issue.
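For anyone repeating this kind of reduction, here is a minimal sketch of shrinking an order-dependent failing test sequence. It is not the procedure used above. `reduce_failing_sequence`, `toy_fails`, and the test names are hypothetical; in practice the `fails` callback would run the real test runner on the given order, and that wiring is left out here.

```python
from typing import Callable, List, Sequence


def reduce_failing_sequence(
    tests: Sequence[str],
    fails: Callable[[List[str]], bool],
) -> List[str]:
    """Greedily drop one test at a time while the remaining order still fails."""
    current = list(tests)
    if not fails(current):
        raise ValueError("the full sequence does not fail; nothing to reduce")
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            # Keep the shorter sequence only if it still reproduces the failure.
            if candidate and fails(candidate):
                current = candidate
                changed = True
                break
    return current


def toy_fails(seq: List[str]) -> bool:
    # Hypothetical stand-in for a real test run: the failure shows up only
    # when "test_a" runs immediately before "test_b".
    return any(x == "test_a" and y == "test_b" for x, y in zip(seq, seq[1:]))


reduced = reduce_failing_sequence(
    ["test_x", "test_a", "test_b", "test_y"], toy_fails
)
print(reduced)
```

With the toy predicate, a test placed between the two culprits makes the sequence pass, which is the same shape of behavior described above. The result is only minimal in the sense that no single remaining test can be removed.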
Hi,
Today, at the request of the oneDNN team, I came up with an easy reproduction procedure to make it easier to find the culprit and fix it.
I've got an engineering fix that resolves the issue.
Thank you @pioy. Can you provide more information about the impact of the bug and the conditions in which it is triggered? For example, the upcoming 1.8 release includes oneDNN 1.6.3 and presumably is also affected by the bug. In that case we may need to include your fix in the 1.8 release as well, to ensure correctness of mxnet-cpu package which now includes oneDNN by default. cc @samskalicky
The bug occurs in oneDNN GEMM calculations with small dimensions (1 < n < 16). As a result, NaNs propagate into the result array, which may terminate the execution of operators. The fix has been merged into master/1.8/1.7 and is ready for customer testing.
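To check whether a given build is affected, something along these lines could scan small output widths for NaNs. This is only a sketch: it assumes `n` is the number of output columns, and every name in it is hypothetical. `simulated_buggy_gemm` merely imitates the symptom and is not oneDNN code; a real check would pass in the GEMM-backed operator under test instead.

```python
import math
from typing import Callable, Iterable, List

Matrix = List[List[float]]


def reference_gemm(a: Matrix, b: Matrix) -> Matrix:
    """Naive triple-loop matrix multiply used as a trusted baseline."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def has_nan(m: Matrix) -> bool:
    """Return True if any entry of the matrix is NaN."""
    return any(math.isnan(v) for row in m for v in row)


def scan_small_n(
    gemm: Callable[[Matrix, Matrix], Matrix],
    m: int = 4,
    k: int = 4,
    n_values: Iterable[int] = range(1, 20),
) -> List[int]:
    """Return the n values for which gemm(A[m x k], B[k x n]) contains NaN."""
    bad = []
    for n in n_values:
        # Finite, deterministic inputs, so any NaN comes from the gemm itself.
        a = [[float(i + j + 1) for j in range(k)] for i in range(m)]
        b = [[float(i - j) for j in range(n)] for i in range(k)]
        if has_nan(gemm(a, b)):
            bad.append(n)
    return bad


def simulated_buggy_gemm(a: Matrix, b: Matrix) -> Matrix:
    # Hypothetical stand-in that imitates the reported symptom:
    # a NaN appears in the output whenever 1 < n < 16.
    out = reference_gemm(a, b)
    if 1 < len(b[0]) < 16:
        out[0][0] = float("nan")
    return out


print(scan_small_n(reference_gemm))
print(scan_small_n(simulated_buggy_gemm))
```

The baseline reports no affected widths, while the simulated faulty version reports every n from 2 through 15.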
@mxnet-bot run ci [unix-gpu]
Jenkins CI successfully triggered : [unix-gpu]
@mxnet-bot run ci [unix-cpu]
Jenkins CI successfully triggered : [unix-cpu]
OneDNN v1.6.4 has been released: https://github.com/oneapi-src/oneDNN/releases/tag/v1.6.4.
Description
Resubmit #19034 which was temporarily reverted due to oneDNN issues with GCC 8.
@TaoLv can your team help debug / fix the oneDNN issues?
When both GCC 8 and oneDNN 1.6.3 are present, we get the following NaN bugs:
Reverting either to gcc7 (which was done) or reverting the oneDNN update (#19180) fixes the issue.