This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Switch to GCC 8 for distribution build #19185

Merged
merged 4 commits into from
Oct 5, 2020

Conversation

leezu
Contributor

@leezu leezu commented Sep 19, 2020

Description

Resubmit #19034 which was temporarily reverted due to oneDNN issues with GCC 8.

@TaoLv can your team help debug / fix the oneDNN issues?

When both GCC 8 and oneDNN 1.6.3 are present, we get the following NaN failure:

[2020-09-17T17:48:04.979Z] ______________________ test_dc_hybridblock_deferred_init _______________________
[2020-09-17T17:48:04.979Z] [gw0] linux -- Python 3.6.9 /opt/rh/rh-python36/root/usr/bin/python3
[2020-09-17T17:48:04.979Z] 
[2020-09-17T17:48:04.979Z]     def test_dc_hybridblock_deferred_init():
[2020-09-17T17:48:04.979Z]         class MyBlock(mx.gluon.HybridBlock):
[2020-09-17T17:48:04.979Z]             def __init__(self):
[2020-09-17T17:48:04.979Z]                 super().__init__()
[2020-09-17T17:48:04.979Z]                 self.dense = mx.gluon.nn.Dense(units=10)
[2020-09-17T17:48:04.979Z]                 self.weight = mx.gluon.Parameter('weight', allow_deferred_init=True)
[2020-09-17T17:48:04.979Z]     
[2020-09-17T17:48:04.979Z]             def infer_shape(self, x):
[2020-09-17T17:48:04.979Z]                 self.weight.shape = (x.shape[1], )
[2020-09-17T17:48:04.979Z]     
[2020-09-17T17:48:04.979Z]             def forward(self, x):
[2020-09-17T17:48:04.979Z]                 return self.dense(x) + self.weight.data(x.context)
[2020-09-17T17:48:04.979Z]     
[2020-09-17T17:48:04.979Z]         net = MyBlock()
[2020-09-17T17:48:04.979Z]         net.initialize()
[2020-09-17T17:48:04.979Z] >       _assert_dc_gluon(_dc_gluon_simple_setup, net, numpy=False)
[2020-09-17T17:48:04.979Z] 
[2020-09-17T17:48:04.979Z] tests/python/unittest/test_deferred_compute.py:504: 
[2020-09-17T17:48:04.979Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2020-09-17T17:48:04.979Z] tests/python/unittest/test_deferred_compute.py:421: in _assert_dc_gluon
[2020-09-17T17:48:04.979Z]     _all_same(ys_np, ys_hybrid_np)
[2020-09-17T17:48:04.979Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2020-09-17T17:48:04.979Z] 
[2020-09-17T17:48:04.979Z] arrays1 = [array([        nan,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z]         0.07460972, -0.08127148, -0.32424796,...33878, -0.10624887,
[2020-09-17T17:48:04.979Z]         0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z]       dtype=float32), ...]
[2020-09-17T17:48:04.979Z] arrays2 = [array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z]         0.07460972, -0.08127148, -0.32424796,...33878, -0.10624887,
[2020-09-17T17:48:04.979Z]         0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z]       dtype=float32), ...]
[2020-09-17T17:48:04.979Z] message = ''
[2020-09-17T17:48:04.979Z] 
[2020-09-17T17:48:04.979Z]     def _all_same(arrays1, arrays2, message=''):
[2020-09-17T17:48:04.979Z]         same = all(np.array_equal(a1, a2) for a1, a2 in zip(arrays1, arrays2))
[2020-09-17T17:48:04.979Z]         if not same:
[2020-09-17T17:48:04.979Z] >           raise AssertionError('Arrays not equal ({}):\n{}\n\n{}'.format(message, arrays1, arrays2))
[2020-09-17T17:48:04.979Z] E           AssertionError: Arrays not equal ():
[2020-09-17T17:48:04.979Z] E           [array([        nan,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([        nan,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([        nan,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32)]
[2020-09-17T17:48:04.979Z] E           
[2020-09-17T17:48:04.979Z] E           [array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458,  0.2107217 , -0.06851891,  0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32)]

Reverting either to gcc7 (which was done) or reverting the oneDNN update (#19180) fixes the issue.

@mxnet-bot

Hey @leezu , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-cpu, unix-cpu, windows-gpu, website, clang, centos-gpu, unix-gpu, centos-cpu, edge, sanity, miscellaneous]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@leezu
Contributor Author

leezu commented Sep 19, 2020

Here is a screenshot showing that the build passes when the oneDNN update is reverted:

[image: screenshot of the passing build]

@TaoLv
Member

TaoLv commented Sep 19, 2020

@leezu Sure, we will take a look at the issue. Do you have steps for us to reproduce it?

@DickJC123
Contributor

@pioy

pioy commented Sep 22, 2020

I have reproduced the failure locally on CentOS.
Interestingly, the failing test passes when I run it on its own, and it also passes when I run all tests from test_deferred_compute.py.
It fails only when it runs together with all the other unit tests.
WIP.

@pioy

pioy commented Sep 23, 2020

I was able to reduce the number of test cases in the reproduction sequence to two:

OMP_NUM_THREADS=1 pytest --maxfail=1 -s -m 'not serial' -n 0 --durations=50 --verbose \
'tests/python/unittest/test_contrib_intgemm.py::test_contrib_intgemm_multiply[api1-16-192-4]' \
tests/python/unittest/test_deferred_compute.py::test_dc_hybridblock_deferred_init

Interestingly, inserting another test in the middle makes the sequence pass:

OMP_NUM_THREADS=1 pytest --maxfail=1 -s -m 'not serial' -n 0 --durations=50 --verbose \
'tests/python/unittest/test_contrib_intgemm.py::test_contrib_intgemm_multiply[api1-16-192-4]' \
'tests/python/unittest/test_contrib_intgemm.py::test_contrib_intgemm_multiply[api1-16-192-2]' \
tests/python/unittest/test_deferred_compute.py::test_dc_hybridblock_deferred_init

I edited the test test_contrib_intgemm_multiply, eliminating calls to test_contrib_intgemm_multiply, which fixed the issue.
It looks like something may be wrong with the contrib_intgemm_fully_connected op.
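
The reduction described above (dropping tests until only those needed to trigger the failure remain) can be sketched as a greedy loop. This is only an illustration under stated assumptions, not the procedure actually used in this thread: `minimize_prefix`, `simulated_fails`, and the placeholder test names in `suite` are hypothetical, and the predicate simulates the observed behaviour instead of invoking pytest.

```python
from typing import Callable, List, Sequence

INTGEMM = "test_contrib_intgemm.py::test_contrib_intgemm_multiply[api1-16-192-4]"
TARGET = "test_deferred_compute.py::test_dc_hybridblock_deferred_init"


def minimize_prefix(
    candidates: Sequence[str],
    target: str,
    fails: Callable[[List[str]], bool],
) -> List[str]:
    """Greedily drop every test that is not needed for `target` to fail.

    Assumes that running `candidates` followed by `target` fails to begin with.
    """
    needed = list(candidates)
    i = 0
    while i < len(needed):
        trial = needed[:i] + needed[i + 1:]
        if fails(trial + [target]):
            needed = trial  # still fails without test i, so drop it
        else:
            i += 1  # test i is required to trigger the failure
    return needed


def simulated_fails(sequence: List[str]) -> bool:
    # Hypothetical stand-in for a real pytest run. Assumption: the target
    # fails only when the intgemm case runs immediately before it.
    return len(sequence) >= 2 and sequence[-1] == TARGET and sequence[-2] == INTGEMM


# Placeholder suite; the first two names are made up for illustration.
suite = ["test_a.py::test_one", "test_b.py::test_two", INTGEMM]
reduced = minimize_prefix(suite, TARGET, simulated_fails)
print(reduced)
```

In a real setting the predicate would launch pytest with the candidate sequence and report whether the target test failed; each probe is a full test run, so the greedy loop costs one run per candidate in the best case.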

@pioy

pioy commented Sep 24, 2020

Hi,
I have pinned the issue down to the oneDNN module. I will contact the team to resolve it.

@pioy

pioy commented Sep 25, 2020

Today, at the request of the oneDNN team, I came up with a simple reproduction procedure to make it easier to find and fix the culprit.

@pioy

pioy commented Sep 28, 2020

I now have an engineering fix that resolves the issue.
We need to wait until this fix appears in the oneDNN repository.

@leezu
Contributor Author

leezu commented Sep 28, 2020

Thank you @pioy. Can you provide more information about the impact of the bug and the conditions in which it is triggered?

For example, the upcoming 1.8 release includes oneDNN 1.6.3 and is presumably also affected by the bug. In that case we may need to include your fix in the 1.8 release as well, to ensure the correctness of the mxnet-cpu package, which now includes oneDNN by default.

cc @samskalicky

@pioy

pioy commented Sep 29, 2020

The bug occurs in oneDNN GEMM calculations with small dimensions (1<n<16).
It is triggered when one of the zmm registers (zmm24, zmm25, zmm26, or zmm27) holds values that can be interpreted as NaN just before the GEMM kernel is called.
Those values can be left over from other calculations, likely integer operations (as in this PR).
I assume that float kernels do not return NaN in a properly configured pipeline.
The NaN values can persist for a long time if they are not overwritten by other AVX-512 kernels, so they may come from operations that were executed much earlier in the pipeline.

As a result of the bug, NaNs propagate to the result array, which may terminate the execution of operators.
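
To illustrate how leftover integer data can be "interpreted as NaN", here is a small sketch using only the Python standard library. It is not taken from oneDNN: the helper names `int_bits_as_float32` and `is_float32_nan_pattern` are hypothetical, and the example only shows the IEEE 754 float32 encoding rule (all exponent bits set, non-zero mantissa) plus the way a single NaN contaminates a sum.

```python
import math
import struct


def int_bits_as_float32(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer bit pattern as an IEEE 754 float32."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def is_float32_nan_pattern(bits: int) -> bool:
    """True if the 32-bit pattern encodes a NaN: exponent all ones, mantissa non-zero."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return exponent == 0xFF and mantissa != 0


# The 32-bit integer -1 is stored as 0xFFFFFFFF, which is a NaN bit pattern
# when the same bits are read as a float32.
minus_one_bits = 0xFFFFFFFF
stale = int_bits_as_float32(minus_one_bits)
print(is_float32_nan_pattern(minus_one_bits), math.isnan(stale))

# Infinity (0x7F800000) has a zero mantissa, so it is not a NaN pattern.
print(is_float32_nan_pattern(0x7F800000))

# One stale NaN in an accumulator contaminates the whole result.
accumulated = stale * 0.0 + 0.01286458
print(math.isnan(accumulated))
```

Note that even multiplying the stale value by zero does not clear it, which is consistent with a NaN surviving into the output when a kernel reads a register it did not initialize itself.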

The fix has been merged into master/1.8/1.7. It's ready for customer testing.
See oneapi-src/oneDNN@5ce95ef.
There are some other fixes to be merged into 1.6 branch. The tag v1.6.4 will be added after those fixes get merged.

@leezu
Contributor Author

leezu commented Sep 29, 2020

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@leezu
Contributor Author

leezu commented Sep 29, 2020

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@leezu leezu added the pr-awaiting-review PR is waiting for code review label Sep 29, 2020
@pioy

pioy commented Oct 1, 2020

@szha szha merged commit 3c5beb3 into apache:master Oct 5, 2020
@leezu leezu deleted the gcc8 branch October 5, 2020 18:42
@access2rohit access2rohit mentioned this pull request Feb 17, 2021
13 tasks
Labels
pr-awaiting-review PR is waiting for code review
6 participants