Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Cudnn conv dgrad algo filtering #14310

Merged
merged 12 commits into from
Mar 13, 2019

Conversation

DickJC123
Copy link
Contributor

Description

I've learned that for cudnn versions in the range [7.3.1,7.5.0), the Convolution dgrad algorithm 3 (CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING) may produce incorrect results for some strided convolutions. This is not something that would generally appear in the current CI, which builds against cuda9.1 and cudnn7.1(so before the problem was introduced).

I've created a test to expose the problem, as well as a fix. I've noticed that there is a CI build for tensorrt that uses cuda10.0 and cudnn7.4, which should exhibit the problem. Thus, the plan is:

  • Do an initial PR submission that includes the problem-exposing test, and with a temporary addition of that test to the tensorrt CI test suite.
  • If hopefully this demonstrates the problem, then push the fix (which involves filtering-out algo 3 from the results of cudnnFind when necessary).
  • When the fix is shown to work, remove the explicit running of the test from the tensorrt CI, allowing the normal running of the test as part of test_operator_gpu.py to catch any return of the problem.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@anirudhacharya
Copy link
Member

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Mar 4, 2019
@DickJC123
Copy link
Contributor Author

Finished first step of this PR: 'Do an initial PR submission that includes the problem-exposing test, and with a temporary addition of that test to the tensorrt CI test suite.' CI passed, except for the one newly-introduced test (as expected):

======================================================================

FAIL: test_operator_gpu.test_conv_deconv_guards

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/usr/local/lib/python3.6/dist-packages/nose/case.py", line 198, in runTest

    self.test(*self.arg)

  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 173, in test_new

    orig_test(*args, **kwargs)

  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 543, in test_conv_deconv_guards

    check_consistency([sym, sym_no_cudnn], [ctx, ctx], tol=tol)

  File "/work/mxnet/python/mxnet/test_utils.py", line 1379, in check_consistency

    raise e

  File "/work/mxnet/python/mxnet/test_utils.py", line 1374, in check_consistency

    equal_nan=equal_nan)

  File "/work/mxnet/python/mxnet/test_utils.py", line 495, in assert_almost_equal

    raise AssertionError(msg)

AssertionError: 

Items are not equal:

Error 19129.703125 exceeds tolerance rtol=0.100000, atol=0.100000.  Location of maximum error:(1, 21, 49, 24), a=-2059.578857, b=-0.076599

 a: array([[[[ -200.87648   ,   365.06775   ,   -56.88517   , ...,

            100.96652   ,  -134.89548   ,   156.62936   ],

         [ -239.07794   ,   498.35898   ,   138.86926   , ...,...

 b: array([[[[ -200.87619   ,   365.0675    ,   -56.885128  , ...,

            100.96669   ,  -134.89523   ,   156.62917   ],

         [ -239.07797   ,   498.35883   ,   138.86902   , ...,...

-------------------- >> begin captured logging << --------------------

common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1727291009 to reproduce.

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=781016024 to reproduce.

``

@DickJC123
Copy link
Contributor Author

I've now finished the second step of the PR: 'push the fix (which involves filtering-out algo 3 from the results of cudnnFind when necessary).' From the CI logs for the tensorrt build (the only one that previously failed due to its use of cudnn 7.4), now we have:

----------------------------------------------------------------------

XML: /work/mxnet/nosetests_trt_gpu.xml

[success] 100.00% test_operator_gpu.test_conv_deconv_guards: 21.4772s

----------------------------------------------------------------------

Ran 1 test in 21.479s


OK

@DickJC123
Copy link
Contributor Author

This PR is in a good state to review. I hope you guys like the test-driven-development ;-)
@szha @eric-haibin-lin @marcoabreu

@DickJC123
Copy link
Contributor Author

This PR as been ready to go for a week. Should I be concerned there are no reviewers/assignees yet? @eric-haibin-lin @szha

The idea with this is to protect MXNet users from issues present in more advanced cudnn versions than what is used in the standard MXNet builds and CI.

@DickJC123
Copy link
Contributor Author

Still looking for reviewers of this fairly small PR. @marcoabreu @larroy ?

@marcoabreu
Copy link
Contributor

I have pinged a few reviewers, you should get a response shortly. Please excuse the delay

Copy link
Member

@wkcn wkcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry that we review late.
I have reviewed this PR, LGTM.
Thanks for your contribution!

I have a question: will the test case trigger algo_exclusion=True?

@DickJC123
Copy link
Contributor Author

Yes, the test case was designed to trigger the failure (see earlier post). With the fix in place, the algo_exclusion==True path is activated to ensure the problematic algo is bypassed.

@marcoabreu marcoabreu merged commit ce99e49 into apache:master Mar 13, 2019
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
* Add test exposing issue with conv dgrad algo 3 for some cudnn's.

* Add test temporarily to tests run with tensorrt CI build (cuda10, cudnn7.4.2)

* Relax tol of new test.

* Fix for problematic conv dgrad algo 3 for some cuDNNs.

* Add algo exclusion term to cudnnFind result processing.

* Revert "Add test temporarily to tests run with tensorrt CI build (cuda10, cudnn7.4.2)"

This reverts commit 1cb743b.

* Trigger CI.

* Add link to cuDNN release notes.

* Trigger CI.
nswamy pushed a commit that referenced this pull request Apr 5, 2019
* Add test exposing issue with conv dgrad algo 3 for some cudnn's.

* Add test temporarily to tests run with tensorrt CI build (cuda10, cudnn7.4.2)

* Relax tol of new test.

* Fix for problematic conv dgrad algo 3 for some cuDNNs.

* Add algo exclusion term to cudnnFind result processing.

* Revert "Add test temporarily to tests run with tensorrt CI build (cuda10, cudnn7.4.2)"

This reverts commit 1cb743b.

* Trigger CI.

* Add link to cuDNN release notes.

* Trigger CI.
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Add test exposing issue with conv dgrad algo 3 for some cudnn's.

* Add test temporarily to tests run with tensorrt CI build (cuda10, cudnn7.4.2)

* Relax tol of new test.

* Fix for problematic conv dgrad algo 3 for some cuDNNs.

* Add algo exclusion term to cudnnFind result processing.

* Revert "Add test temporarily to tests run with tensorrt CI build (cuda10, cudnn7.4.2)"

This reverts commit 1cb743b.

* Trigger CI.

* Add link to cuDNN release notes.

* Trigger CI.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants