
Fix flaky test test_deconvolution #11630

Merged
merged 2 commits into apache:master on Jul 28, 2018

Conversation

@anirudh2290 (Member) commented on Jul 10, 2018

Description

This is to fix the flaky test: #10973. Prereq PR: #11470.
cublasSgemm and cublasDgemm are currently used for linear algebra operations when cudnn is disabled for the Convolution and Deconvolution operators. The test_deconvolution test on GPU fails on the input-gradient comparison for the conv and deconv operators. Looking at the failures, the differences in the gradients are as large as 0.1. Replacing cublasSgemm with cublasSgemmEx while keeping all other code the same makes the failure go away. Also, the failure doesn't happen on CPU with openblas or MKLDNN, or on GPU with cudnn enabled. If both APIs worked as documented, both cases should have seen the failure. My guess is that cublasSgemm is converting float32 inputs to float16 before doing the computation, causing the precision issues. Digging through the cublas documentation, I couldn't find enough information on how cublasSgemm performs the computation.
Please see: http://lutgw1.lunet.edu/cuda/pdf/CUBLAS_Library.pdf, section 2.8.11 (cublasSgemmEx) and section 2.7.1 (cublasSgemm).

@DickJC123 @ptrendx any input on whether there should be any difference in behavior between the two APIs when all tensors have float32 dtype?

The important changes are in the file: src/operator/linalg_impl.h
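
To make the change concrete, here is a minimal sketch of the substitution. It is illustrative only: the wrapper name gemm_f32 is hypothetical (it is not the actual macro in src/operator/linalg_impl.h), and it uses the CUDA 8+ enum name CUDA_R_32F for simplicity, while the PR itself enables the Ex path for CUDA >= 7.5.

```cpp
// Illustrative sketch only; gemm_f32 and the exact fallback structure are
// hypothetical, not MXNet's actual linalg_impl.h code.
#include <cuda.h>
#include <cublas_v2.h>

cublasStatus_t gemm_f32(cublasHandle_t handle,
                        cublasOperation_t ta, cublasOperation_t tb,
                        int m, int n, int k,
                        const float* alpha,
                        const float* A, int lda,
                        const float* B, int ldb,
                        const float* beta,
                        float* C, int ldc) {
#if CUDA_VERSION >= 8000
  // cublasSgemmEx takes explicit storage types for A, B and C. Passing
  // CUDA_R_32F for all three keeps every operand in float32; per the cuBLAS
  // documentation this should behave exactly like cublasSgemm.
  return cublasSgemmEx(handle, ta, tb, m, n, k,
                       alpha,
                       A, CUDA_R_32F, lda,
                       B, CUDA_R_32F, ldb,
                       beta,
                       C, CUDA_R_32F, ldc);
#else
  // Older toolkits: keep the original cublasSgemm call.
  return cublasSgemm(handle, ta, tb, m, n, k,
                     alpha, A, lda, B, ldb, beta, C, ldc);
#endif
}
```

For float32 operands the two calls are documented to be equivalent; the sketch only makes the storage type of A, B and C explicit.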

@anirudh2290 changed the title from "Fix flaky test test_deconvolution" to "[WIP] Fix flaky test test_deconvolution" on Jul 10, 2018
@marcoabreu (Contributor) left a comment:

Sorry if I missed something, I'm on my phone right now.

Jenkinsfile Outdated
'Python3: MKLDNN-GPU-NOCUDNN': {
  node('mxnetlinux-gpu') {
    ws('workspace/ut-python3-mkldnn-gpu-nocudnn') {
      withEnv(['CUDNN_DISABLED=ON']) {
Contributor

Please make a new script to call the unit tests in the runtime functions (you might just nest the calls to prevent copy and paste). Otherwise, the calls will not be reproducible; also, the environment variables will not be available in the Docker container.
I'm on my phone, so please excuse me if the question is answered below: why do we have to set that variable if we already compile without cudnn?

Member Author

Thanks for the info. I have fixed it in PR: #11470. We are removing the C API in that PR so there is no way of finding out whether cudnn is enabled.

Contributor

Couldn't we just always catch the error and skip it if it contains the message about cudnn not being available?

Member Author

Well, that approach would also work, but this is cleaner. For example, someone adding a new op which doesn't support running with cudnn off may use different wording for the error message.

Contributor

Well then the error would be made visible in CI and the person would have to adjust the error message, right? Ideally, we'd have standardized error messages (e.g. error codes).

Contributor

I don't know if this is cleaner, because that way you are catching all errors regardless of whether they are actually related to cudnn. Imagine somebody makes a change to the cudnn behaviour but there is still a bug: the error then gets swallowed and we don't notice it. I'd rather have detailed and specific error handling than a global try-catch. What do you think?

Member Author

Well, this test will only be ignored if the decorator is added, so only when cudnn is off and only for operators that don't support that configuration. People making changes to cudnn behavior should not be affected by this. It's a global try-catch only when cudnn is disabled, for ops that don't support the configuration.

@make_decorator(orig_test)
def test_new(*args, **kwargs):
    cudnn_disabled = (os.getenv('CUDNN_DISABLED') == "ON")
    if not cudnn_disabled or mx.context.current_context().device_type == 'cpu':
Contributor

Ah that's the environment variable. Could we give it a name that clearly shows that this environment variable is only for test purposes?

Member Author

I have renamed it to CUDNN_OFF_TEST_ONLY in the PR: #11470. Let us move the discussion for cudnn off tests to that PR.

if not cudnn_disabled or mx.context.current_context().device_type == 'cpu':
    orig_test(*args, **kwargs)
else:
    if assertion_error:
Contributor

When do we want to differentiate between assertion errors? I think we should always handle them, no matter what.

Member Author

This is to differentiate between real failures and ops that don't support running with cudnn disabled. For example, some ops, when used with cudnn disabled, would just throw an exception saying the configuration is not supported; others would proceed but compute wrong results and throw an assertion error.

@marcoabreu (Contributor) commented on Jul 11, 2018

I see, thanks for elaborating. It seems like you revealed a bug in our backend in the handling of missing cudnn.

I don't think this is what should happen. If a user uses an operator that is not supported, it should not produce a wrong calculation but rather throw the error properly. This method would mask a problem that a user would silently encounter. We should instead investigate why the error is not being thrown and fix that.

Contributor

Also, is there no CPU implementation of these operators we could fall back to instead?

Member Author

Yep, the issue was uncovered as part of the PR and is documented here: #11568. As I mentioned in #11470, this is not something we should block this PR on. The issue could have been there for a long time. It's possible that there is user code depending on the buggy behavior, and raising an exception would be a breaking change for those users. It's better to fix the issue with cudnn disabled rather than throwing an exception saying cudnn is not supported, at least for 1.3.

Member Author

We still need to test it when cudnn is enabled.

@marcoabreu (Contributor) commented on Jul 11, 2018

We currently have a zero-tolerance policy for faulty tests and deactivate them temporarily, taking reduced test coverage as a pill we have to swallow. It is a release requirement that all tests have to be enabled, so don't worry: it won't stay disabled for long.

For Amazon employees: I have received information that faulty and flaky tests have to be handled as a sev2.5.

Member Author

Why disable the test when that can be avoided? If it is for visibility, we already have an open issue. We can also call it a disabled test, since it is disabled for the cudnn-off case.

Contributor

Sure, as you prefer. Just wanted to avoid having this hack in there, but I don't mind. Could you please add a comment in that test, linking to the issue and stating that the test is disabled when running without cudnn?

Member Author

will do

@marcoabreu (Contributor) left a comment:

Please add the comment about the assertion_error being temporary and link the associated issue.

Please don't forget to close the issues associated with the disabled tests after merging.

@anirudh2290 changed the title from "[WIP] Fix flaky test test_deconvolution" to "Fix flaky test test_deconvolution" on Jul 16, 2018
@lupesko (Contributor) left a comment:

Left a small comment.
Also, please update the issue description - the checklist is left unchanged from the template; you should either tick the relevant boxes or remove them altogether if they are not relevant. Thanks for fixing the issue!

A.stride_, &beta, C.dptr_, C.stride_)) \
}

#if CUDA_VERSION >= 7050
Contributor

A comment here explaining the #if directive would be helpful for others (and yourself) in the future. Can we add it, please?

Member Author

I have added the comment and also removed the checklist for the current PR.

@piiswrong
Contributor

@asmushetzel @DickJC123

@asmushetzel
Contributor

While the changes all seem valid and fine, this still remains a mystery, as there should be no difference between the two APIs if all operands are float32 (as anirudh stated above). A bit confused, though, as the original PR description mentions float16 (why?).
We should definitely try to get more info from Nvidia in order to figure out what happened there.

@anirudh2290
Member Author

@asmushetzel sorry about the confusion. To clarify, the documentation states that both APIs should behave exactly the same when all tensors have float32 dtype. I am guessing that the difference in behavior could be because cublasSgemm converts tensors to float16 dtype (a bug?) for the computation, which would explain the precision issues. I have asked the Nvidia folks for input but haven't heard back from them yet.
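
For scale, a rough back-of-the-envelope estimate (not something stated in the cublas docs): float16 has a 10-bit mantissa, so its machine epsilon is 2^-10 ≈ 9.8e-4, and a single float32-to-float16 rounding of a value of magnitude around 100 can already shift it by roughly 0.05. An intermediate float16 conversion inside the gemm would therefore plausibly account for gradient differences of the observed 0.1 order.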

@anirudh2290
Member Author

I will wait until the end of this week to hear back from the Nvidia folks before merging this PR.

@anirudh2290 merged commit bd3fc88 into apache:master on Jul 28, 2018
aaronmarkham pushed a commit to aaronmarkham/incubator-mxnet that referenced this pull request Aug 7, 2018
* Replace cublassgemm with cublassgemmex for >= 7.5

* Add comment for cublassgemmex

Remove fixed seed for test_sparse_nd_save_load (apache#11920)

* Remove fixed seed for test_sparse_nd_save_load

* Add comments related to the commit

Corrections to profiling tutorial (apache#11887)

Corrected a race condition with stopping profiling. Added mx.nd.waitall to ensure all operations have completed, including GPU operations that might otherwise be missing.

Also added alternative code for context selection GPU vs CPU, that had error before on machines with nvidia-smi.

Fix image classification scripts and Improve Fp16 tutorial (apache#11533)

* fix bugs and improve tutorial

* improve logging

* update benchmark_score

* Update float16.md

* update link to dmlc web data

* fix train cifar and add random mirroring

* set aug defaults

* fix whitespace

* fix typo
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* Replace cublassgemm with cublassgemmex for >= 7.5

* Add comment for cublassgemmex