Cudnn conv dgrad algo filtering #14310

DickJC123 · 2019-03-03T23:46:41Z

Description

I've learned that for cudnn versions in the range [7.3.1,7.5.0), the Convolution dgrad algorithm 3 (CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING) may produce incorrect results for some strided convolutions. This is not something that would generally appear in the current CI, which builds against cuda9.1 and cudnn7.1(so before the problem was introduced).

I've created a test to expose the problem, as well as a fix. I've noticed that there is a CI build for tensorrt that uses cuda10.0 and cudnn7.4, which should exhibit the problem. Thus, the plan is:

Do an initial PR submission that includes the problem-exposing test, and with a temporary addition of that test to the tensorrt CI test suite.
If hopefully this demonstrates the problem, then push the fix (which involves filtering-out algo 3 from the results of cudnnFind when necessary).
When the fix is shown to work, remove the explicit running of the test from the tensorrt CI, allowing the normal running of the test as part of test_operator_gpu.py to catch any return of the problem.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

…nn7.4.2)

anirudhacharya · 2019-03-04T00:31:44Z

@mxnet-label-bot add [pr-awaiting-review]

DickJC123 · 2019-03-04T19:22:53Z

Finished first step of this PR: 'Do an initial PR submission that includes the problem-exposing test, and with a temporary addition of that test to the tensorrt CI test suite.' CI passed, except for the one newly-introduced test (as expected):

======================================================================

FAIL: test_operator_gpu.test_conv_deconv_guards

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/usr/local/lib/python3.6/dist-packages/nose/case.py", line 198, in runTest

    self.test(*self.arg)

  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 173, in test_new

    orig_test(*args, **kwargs)

  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 543, in test_conv_deconv_guards

    check_consistency([sym, sym_no_cudnn], [ctx, ctx], tol=tol)

  File "/work/mxnet/python/mxnet/test_utils.py", line 1379, in check_consistency

    raise e

  File "/work/mxnet/python/mxnet/test_utils.py", line 1374, in check_consistency

    equal_nan=equal_nan)

  File "/work/mxnet/python/mxnet/test_utils.py", line 495, in assert_almost_equal

    raise AssertionError(msg)

AssertionError: 

Items are not equal:

Error 19129.703125 exceeds tolerance rtol=0.100000, atol=0.100000.  Location of maximum error:(1, 21, 49, 24), a=-2059.578857, b=-0.076599

 a: array([[[[ -200.87648   ,   365.06775   ,   -56.88517   , ...,

            100.96652   ,  -134.89548   ,   156.62936   ],

         [ -239.07794   ,   498.35898   ,   138.86926   , ...,...

 b: array([[[[ -200.87619   ,   365.0675    ,   -56.885128  , ...,

            100.96669   ,  -134.89523   ,   156.62917   ],

         [ -239.07797   ,   498.35883   ,   138.86902   , ...,...

-------------------- >> begin captured logging << --------------------

common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1727291009 to reproduce.

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=781016024 to reproduce.

``

DickJC123 · 2019-03-04T22:17:08Z

I've now finished the second step of the PR: 'push the fix (which involves filtering-out algo 3 from the results of cudnnFind when necessary).' From the CI logs for the tensorrt build (the only one that previously failed due to its use of cudnn 7.4), now we have:

----------------------------------------------------------------------

XML: /work/mxnet/nosetests_trt_gpu.xml

[success] 100.00% test_operator_gpu.test_conv_deconv_guards: 21.4772s

----------------------------------------------------------------------

Ran 1 test in 21.479s


OK

…a10, cudnn7.4.2)" This reverts commit 1cb743b.

DickJC123 · 2019-03-05T21:40:59Z

This PR is in a good state to review. I hope you guys like the test-driven-development ;-)
@szha @eric-haibin-lin @marcoabreu

…o_filtering

DickJC123 · 2019-03-12T20:41:24Z

This PR as been ready to go for a week. Should I be concerned there are no reviewers/assignees yet? @eric-haibin-lin @szha

The idea with this is to protect MXNet users from issues present in more advanced cudnn versions than what is used in the standard MXNet builds and CI.

DickJC123 · 2019-03-13T16:26:34Z

Still looking for reviewers of this fairly small PR. @marcoabreu @larroy ?

marcoabreu · 2019-03-13T16:31:42Z

I have pinged a few reviewers, you should get a response shortly. Please excuse the delay

wkcn

I am sorry that we review late.
I have reviewed this PR, LGTM.
Thanks for your contribution!

I have a question: will the test case trigger algo_exclusion=True?

DickJC123 · 2019-03-13T17:58:54Z

Yes, the test case was designed to trigger the failure (see earlier post). With the fix in place, the algo_exclusion==True path is activated to ensure the problematic algo is bypassed.

* Add test exposing issue with conv dgrad algo 3 for some cudnn's. * Add test temporarily to tests run with tensorrt CI build (cuda10, cudnn7.4.2) * Relax tol of new test. * Fix for problematic conv dgrad algo 3 for some cuDNNs. * Add algo exclusion term to cudnnFind result processing. * Revert "Add test temporarily to tests run with tensorrt CI build (cuda10, cudnn7.4.2)" This reverts commit 1cb743b. * Trigger CI. * Add link to cuDNN release notes. * Trigger CI.

DickJC123 added 2 commits March 3, 2019 15:03

Add test exposing issue with conv dgrad algo 3 for some cudnn's.

817bad0

Add test temporarily to tests run with tensorrt CI build (cuda10, cud…

1cb743b

…nn7.4.2)

marcoabreu added the pr-awaiting-review PR is waiting for code review label Mar 4, 2019

Relax tol of new test.

0cbd0c7

DickJC123 added 2 commits March 4, 2019 12:00

Fix for problematic conv dgrad algo 3 for some cuDNNs.

8c82498

Add algo exclusion term to cudnnFind result processing.

5516b64

DickJC123 added 2 commits March 4, 2019 14:22

Revert "Add test temporarily to tests run with tensorrt CI build (cud…

989802d

…a10, cudnn7.4.2)" This reverts commit 1cb743b.

Trigger CI.

32d733d

DickJC123 added 5 commits March 5, 2019 17:09

Add link to cuDNN release notes.

95cc6ae

Merge remote-tracking branch 'mxnet/master' into cudnn_conv_dgrad_alg…

959b84f

…o_filtering

Merge remote-tracking branch 'mxnet/master' into cudnn_conv_dgrad_alg…

c8ba6e6

…o_filtering

Merge remote-tracking branch 'mxnet/master' into cudnn_conv_dgrad_alg…

02b55e6

…o_filtering

Trigger CI.

85866b8

wkcn approved these changes Mar 13, 2019

View reviewed changes

szha approved these changes Mar 13, 2019

View reviewed changes

marcoabreu merged commit ce99e49 into apache:master Mar 13, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cudnn conv dgrad algo filtering #14310

Cudnn conv dgrad algo filtering #14310

DickJC123 commented Mar 3, 2019

anirudhacharya commented Mar 4, 2019

DickJC123 commented Mar 4, 2019

DickJC123 commented Mar 4, 2019

DickJC123 commented Mar 5, 2019

DickJC123 commented Mar 12, 2019

DickJC123 commented Mar 13, 2019

marcoabreu commented Mar 13, 2019

wkcn left a comment

DickJC123 commented Mar 13, 2019

Cudnn conv dgrad algo filtering #14310

Cudnn conv dgrad algo filtering #14310

Conversation

DickJC123 commented Mar 3, 2019

Description

Checklist

Essentials

Changes

Comments

anirudhacharya commented Mar 4, 2019

DickJC123 commented Mar 4, 2019

DickJC123 commented Mar 4, 2019

DickJC123 commented Mar 5, 2019

DickJC123 commented Mar 12, 2019

DickJC123 commented Mar 13, 2019

marcoabreu commented Mar 13, 2019

wkcn left a comment

Choose a reason for hiding this comment

DickJC123 commented Mar 13, 2019