This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

cuDNN support cleanup #15812

Merged 3 commits into apache:master on Aug 13, 2019

Conversation

DickJC123 (Contributor)

Description

Background: Now that MXNet 1.5 no longer compiles against cuDNN v6 and earlier, any cuDNN code that special-cases those earlier versions is needlessly cluttering the codebase. PR #15449 introduced the STATIC_ASSERT_CUDNN_VERSION_GE() macro as a way to document and enforce the requirements of the codebase locally in the files that create those dependencies.

This PR removes code that needlessly adapts operators to cuDNN v6 (and earlier) APIs and uses STATIC_ASSERT_CUDNN_VERSION_GE() to enforce the restriction to operate only against cuDNN v7 and later. This cleanup helps create a concise and understandable codebase able to accept further feature improvements.
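For context, the guard introduced in #15449 boils down to a compile-time check against the CUDNN_VERSION constant from cudnn.h. A minimal sketch of such a macro follows; the actual definition in the codebase may word its error message differently:

#include <cudnn.h>  // defines CUDNN_VERSION, e.g. 7600 for cuDNN 7.6.0

// Fails compilation outright if the cuDNN headers are older than required.
#define STATIC_ASSERT_CUDNN_VERSION_GE(min_version) \
  static_assert(CUDNN_VERSION >= (min_version), \
                "This file requires cuDNN version " #min_version " or later.")

// Typical usage at the top of a file with a cuDNN v7 dependency:
// STATIC_ASSERT_CUDNN_VERSION_GE(7000);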

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • [x] All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • [x] Code is well-documented:
  • For user-facing API changes, the API doc string has been updated.
  • For new C++ functions in header files, their functionality and arguments are documented.
  • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Removed code paths that adapt operators to cuDNN v6 and earlier APIs
  • Added STATIC_ASSERT_CUDNN_VERSION_GE() checks to the files that depend on cuDNN v7 features
  • Simplified convolution implementation selection

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@DickJC123 changed the title from "Cudnn support cleanup" to "cuDNN support cleanup" on Aug 9, 2019
@marcoabreu (Contributor)

Great to see this cleanup!

While I understand the intention of the assert, I feel it's a bit obsolete and makes the code a bit "cluttered", if you understand what I mean. Couldn't we have the assertion in a central place (aka on-init) instead of re-asserting the same thing over and over? The minimum cuDNN version is 7, and thus we treat it as the default. That might make the code a bit cleaner. But I don't have strong feelings about that, tbh.

@DickJC123 (Contributor, Author)

DickJC123 commented Aug 9, 2019

I did consider the 'single assert' approach you mentioned. I ultimately advocated sprinkling a handful of asserts within the files having the actual dependencies. See #15449 (comment). We now have 17 assert lines. I don't think every cuDNN operator impl needs to state its dependency, since some are fine with very old cuDNN versions. How about as we advance the cuDNN support level, we clean out the old asserts that mention earlier versions that aren't out 'in the wild' anymore?

Since you say you don't feel strongly, I prefer to leave the asserts as is. We can always move toward fewer asserts going forward. I notice now the stats on this PR: a net of about 750 lines cleaned out!

@@ -44,6 +44,10 @@ namespace op {
*/
template <typename CuDNNAlgoType>
class CuDNNAlgo {
// Compile guard needed to get past non-nvcc compilation of cudnn_algoreg.cc.
Member:
Why?

DickJC123 (Contributor, Author):

I've pushed a fix that removes the compile guard. Before the fix and without the guard, the following step of the compile failed:

g++ ... -DMSHADOW_USE_CUDNN=1 ... -c cudnn_algoreg.cc

The problem is that the cudnn_algoreg-inl.h and cudnn_algoreg.cc files look at MXNET_USE_CUDNN, not MSHADOW_USE_CUDNN. The header file ./include/mxnet/libinfo.h has the vital missing line that equates the two:

#define MXNET_USE_CUDNN MSHADOW_USE_CUDNN

So this was a case of not adhering to the rule 'include what you use.' The header file ./src/common/cuda_utils.h was referring to MXNET_USE_CUDNN, but not including any header file that would ensure libinfo.h was included. Now fixed.

Now that MSHADOW is part of MXNET and no longer a submodule, is there a benefit to consolidating the dual naming as in {MSHADOW,MXNET}_USE_CUDNN?
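To make the fix concrete, here is a sketch of the dependency described above. Only the libinfo.h define is quoted from the actual code; treat the exact include line as an assumption about the shape of the fix:

// include/mxnet/libinfo.h (actual line, quoted above):
#define MXNET_USE_CUDNN MSHADOW_USE_CUDNN

// src/common/cuda_utils.h -- 'include what you use': this header tests
// MXNET_USE_CUDNN, so it must include the header that defines it rather
// than relying on the .cc file's include order (hypothetical form of the fix):
#include <mxnet/libinfo.h>

#if MXNET_USE_CUDNN == 1
#include <cudnn.h>
// ... cuDNN helper declarations ...
#endif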

@@ -44,6 +44,9 @@ enum CuDNNBatchNormOpAuxiliary {kMovingMean, kMovingInvVar};
#if defined(__CUDACC__)
template<typename DType>
class CuDNNBatchNormOp {
// This file was formerly included in batch_norm.cu only when CUDNN_MAJOR >= 5
Member:

This comment may be useful for review of this PR, but I don't think it makes sense to have it going forward.

DickJC123 (Contributor, Author):

I can take this out, along with the explanations of which cuDNN features are being used that require the assert, as you mention below.

@ptrendx (Member)

ptrendx commented Aug 9, 2019

I'm not sure how I feel about comments like "This file uses X and Y feature of cudnn 7"; I don't think they are actually useful for maintenance.

@marcoabreu (Contributor)

Sounds good to me, thanks for the explanation!

Yeah it's certainly a great cleanup, no doubt about that :)

#if CUDNN_VERSION >= 7200
    if (GetEnvAllowTensorCore() && GetEnvAllowTensorCoreConversion() &&
        (DataType<DType>::kFlag != kFloat16))
      math_type = CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION;
Contributor:

Non-blocking for this PR, but this change reminded me that I believe there's a bug here. IIRC the problem was that math_type can be reset by auto-tuning. Issue I opened a while ago that I haven't had time to follow up on:

#14684

"Auto-tuning is overwriting the math mode at convolution tuning time. Probably the right thing to do when implementing TCs but it's preventing the conversion math type from being used. We'll have to think about the long-term fix for this, but I've currently commented out the math type reset locally and I'm trying to verify this cudnn feature provides a significant speedup before moving forward. "

@KellenSunderland (Contributor) left a comment:

LGTM. Nit that we could remove the comment already highlighted, but otherwise the PR LGTM.

@KellenSunderland KellenSunderland merged commit c81535c into apache:master Aug 13, 2019
anirudhacharya pushed a commit to anirudhacharya/mxnet that referenced this pull request Aug 20, 2019
* Remove old cudnn support (<v7).

* Simplify conv impl selection.

* Remove comments justifying STATIC_ASSERT_CUDNN_VERSION_GE.