
Fix flaky test test_deconvolution #11630

Merged
merged 2 commits into apache:master on Jul 28, 2018

Conversation

@anirudh2290 (Member) commented on Jul 10, 2018

Description

This is to fix the flaky test: #10973. Prereq PR: #11470.
cublasSgemm and cublasDgemm are currently used for linear algebra operations when cudnn is disabled for the Convolution and Deconvolution operators. The test_deconvolution test on GPU fails on the input-gradient comparison for the conv and deconv operators. Looking at the failures, the differences in the gradients are as large as 0.1. Replacing cublasSgemm with cublasSgemmEx while keeping all other code the same makes the failure go away. Also, the failure doesn't happen on CPU with openblas or MKLDNN, or on GPU with cudnn enabled. If both APIs worked as documented, both cases should have seen the failure. My guess is that cublasSgemm is converting float32 inputs to float16 before doing the computation, causing the precision issues. Digging through the cublas documentation, I couldn't find enough information on how cublasSgemm performs the computation.
Please see: http://lutgw1.lunet.edu/cuda/pdf/CUBLAS_Library.pdf, section 2.8.11 (cublasSgemmEx) and section 2.7.1 (cublasSgemm).

@DickJC123 @ptrendx any input on whether there should be any difference in behavior between the two APIs when all tensors have float32 dtype?

The important changes are in the file: src/operator/linalg_impl.h
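
To make the change concrete, here is a minimal sketch of the substitution. It is illustrative only: the wrapper name gemm_f32 is hypothetical (it is not the actual macro in src/operator/linalg_impl.h), and it uses the CUDA 8+ enum name CUDA_R_32F for simplicity, while the PR itself enables the Ex path for CUDA >= 7.5.

```cpp
// Illustrative sketch only; gemm_f32 and the exact fallback structure are
// hypothetical, not MXNet's actual linalg_impl.h code.
#include <cuda.h>
#include <cublas_v2.h>

cublasStatus_t gemm_f32(cublasHandle_t handle,
                        cublasOperation_t ta, cublasOperation_t tb,
                        int m, int n, int k,
                        const float* alpha,
                        const float* A, int lda,
                        const float* B, int ldb,
                        const float* beta,
                        float* C, int ldc) {
#if CUDA_VERSION >= 8000
  // cublasSgemmEx takes explicit storage types for A, B and C. Passing
  // CUDA_R_32F for all three keeps every operand in float32; per the cuBLAS
  // documentation this should behave exactly like cublasSgemm.
  return cublasSgemmEx(handle, ta, tb, m, n, k,
                       alpha,
                       A, CUDA_R_32F, lda,
                       B, CUDA_R_32F, ldb,
                       beta,
                       C, CUDA_R_32F, ldc);
#else
  // Older toolkits: keep the original cublasSgemm call.
  return cublasSgemm(handle, ta, tb, m, n, k,
                     alpha, A, lda, B, ldb, beta, C, ldc);
#endif
}
```

For float32 operands the two calls are documented to be equivalent; the sketch only makes the storage type of A, B and C explicit.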

@anirudh2290 changed the title from "Fix flaky test test_deconvolution" to "[WIP] Fix flaky test test_deconvolution" on Jul 10, 2018
@marcoabreu (Contributor) left a comment:

Sorry if I missed something, I'm on my phone right now.

Jenkinsfile Outdated
'Python3: MKLDNN-GPU-NOCUDNN': {
  node('mxnetlinux-gpu') {
    ws('workspace/ut-python3-mkldnn-gpu-nocudnn') {
      withEnv(['CUDNN_DISABLED=ON']) {
Contributor

Please make a new script to call the unit tests in the runtime functions (you might just nest the calls to prevent copy and paste). Otherwise, the calls will not be reproducible; also, the environment variables will not be available in the Docker container.
I'm on my phone, so please excuse me if the question is answered below: why do we have to set that variable if we already compile without cudnn?

Member Author

Thanks for the info. I have fixed it in PR: #11470. We are removing the C API in that PR so there is no way of finding out whether cudnn is enabled.

Contributor

Couldn't we just always catch the error and skip it if it contains the message about cudnn not being available?

Member Author

Well, that approach would also work, but this is cleaner. For example, someone adding a new op which doesn't support running with cudnn off may use different wording for the error message.

Contributor

Well then the error would be made visible in CI and the person would have to adjust the error message, right? Ideally, we'd have standardized error messages (e.g. error codes).

Contributor

I don't know if this is cleaner, because that way you are catching all errors regardless of whether they are actually related to cudnn. Imagine somebody makes a change to the cudnn behaviour but there is still a bug: the error then gets swallowed and we don't notice it. I'd rather have detailed and specific error handling than a global try-catch. What do you think?

Member Author

Well, this test will only be ignored if the decorator is added, so only when cudnn is off and only for operators that don't support that configuration. People making changes to cudnn behavior should not be affected by this. It's a global try-catch only when cudnn is disabled, for ops that don't support the configuration.

@make_decorator(orig_test)
def test_new(*args, **kwargs):
    cudnn_disabled = (os.getenv('CUDNN_DISABLED') == "ON")
    if not cudnn_disabled or mx.context.current_context().device_type == 'cpu':
Contributor

Ah that's the environment variable. Could we give it a name that clearly shows that this environment variable is only for test purposes?

Member Author

I have renamed it to CUDNN_OFF_TEST_ONLY in the PR: #11470. Let us move the discussion for cudnn off tests to that PR.

if not cudnn_disabled or mx.context.current_context().device_type == 'cpu':
    orig_test(*args, **kwargs)
else:
    if assertion_error:
Contributor

When do we want to differentiate between assertion errors? I think we should always handle them, no matter what.

Member Author

This is to differentiate between real failures and ops that don't support running with cudnn disabled. For example, some ops, when used with cudnn disabled, would just throw an exception saying the configuration is not supported; others would proceed but compute wrong results and throw an assertion error.

@marcoabreu (Contributor) commented on Jul 11, 2018

I see, thanks for elaborating. It seems like you revealed a bug in our backend in the handling of missing cudnn.

I don't think this is what should happen. If a user uses an operator that is not supported, it should not produce a wrong calculation but rather throw the error properly. This method would mask a problem that a user would silently encounter. We should instead investigate why the error is not being thrown and fix that.

Contributor

Also, is there no CPU implementation of these operators we could fall back to instead?

Member Author

Yep, the issue was uncovered as part of the PR and is documented here: #11568. As I mentioned in #11470, this is not something we should block this PR on. The issue could have been there for a long time. It's possible that there is user code depending on the buggy behavior, and raising an exception would be a breaking change for those users. It's better to fix the issue with cudnn disabled rather than throwing an exception saying cudnn is not supported, at least for 1.3.

Member Author

We still need to test it when cudnn is enabled.

@marcoabreu (Contributor) commented on Jul 11, 2018

We currently have a zero-tolerance policy for faulty tests and deactivate them temporarily, taking reduced test coverage as a pill we have to swallow. It is a release requirement that all tests have to be enabled, so don't worry: it won't stay disabled for long.

For Amazon employees: I have received information that faulty and flaky tests have to be handled as a sev2.5.

Member Author

Why disable the test when that can be avoided? If it is for visibility, we already have an open issue. We can also call it a disabled test, since it is disabled for the cudnn-off case.

Contributor

Sure, as you prefer. Just wanted to avoid having this hack in there, but I don't mind. Could you please add a comment in that test, linking to the issue and stating that the test is disabled when running without cudnn?

Member Author

will do

@marcoabreu (Contributor) left a comment:

Please add the comment about the assertion_error being temporary and link the associated issue.

Please don't forget to close the issues associated with the disabled tests after merging.

@anirudh2290 changed the title from "[WIP] Fix flaky test test_deconvolution" to "Fix flaky test test_deconvolution" on Jul 16, 2018
@lupesko (Contributor) left a comment:

Left a small comment.
Also, please update the issue description - the checklist is left unchanged from the template; you should either tick the relevant boxes or remove them altogether if they are not relevant. Thanks for fixing the issue!

A.stride_, &beta, C.dptr_, C.stride_)) \
}

#if CUDA_VERSION >= 7050
Contributor

A comment here explaining the #if directive would be helpful for others (and yourself) in the future. Can we add it, please?

Member Author

I have added the comment and also removed the checklist for the current PR.

@piiswrong
Contributor

@asmushetzel @DickJC123

@asmushetzel
Contributor

While the changes all seem valid and fine, this still remains a mystery, as there should be no difference between the two APIs if all operands are float32 (as anirudh stated above). A bit confused, though, as the original PR description mentions float16 (why?).
We should definitely try to get more info from Nvidia in order to figure out what happened there.

@anirudh2290
Member Author

@asmushetzel sorry about the confusion. To clarify, the documentation states that both APIs should behave exactly the same when all tensors have float32 dtype. I am guessing that the difference in behavior could be because cublasSgemm converts tensors to float16 dtype (a bug?) for the computation, which would explain the precision issues. I have asked the Nvidia folks for input but haven't heard back from them yet.
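
For scale, a rough back-of-the-envelope estimate (not something stated in the cublas docs): float16 has a 10-bit mantissa, so its machine epsilon is 2^-10 ≈ 9.8e-4, and a single float32-to-float16 rounding of a value of magnitude around 100 can already shift it by roughly 0.05. An intermediate float16 conversion inside the gemm would therefore plausibly account for gradient differences of the observed 0.1 order.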

@anirudh2290
Member Author

I will wait until the end of this week to hear back from the Nvidia folks before merging this PR.

@anirudh2290 merged commit bd3fc88 into apache:master on Jul 28, 2018
aaronmarkham pushed a commit to aaronmarkham/incubator-mxnet that referenced this pull request Aug 7, 2018
* Replace cublassgemm with cublassgemmex for >= 7.5

* Add comment for cublassgemmex

Remove fixed seed for test_sparse_nd_save_load (apache#11920)

* Remove fixed seed for test_sparse_nd_save_load

* Add comments related to the commit

Corrections to profiling tutorial (apache#11887)

Corrected a race condition with stopping profiling. Added mx.nd.waitall to ensure all operations have completed, including GPU operations that might otherwise be missing.

Also added alternative code for context selection GPU vs CPU, that had error before on machines with nvidia-smi.

Fix image classification scripts and Improve Fp16 tutorial (apache#11533)

* fix bugs and improve tutorial

* improve logging

* update benchmark_score

* Update float16.md

* update link to dmlc web data

* fix train cifar and add random mirroring

* set aug defaults

* fix whitespace

* fix typo
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* Replace cublassgemm with cublassgemmex for >= 7.5

* Add comment for cublassgemmex