This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Bypass cuda/cudnn checks if no driver. #15551

Merged

Conversation

DickJC123
Contributor

Description

This provides the fix for the issue described in #15548. Thank you @vrakesh for detecting and reporting this issue.

The problem is that libmxnet.so can be built with MXNET_USE_CUDA and MXNET_USE_CUDNN, yet end up deployed on a system with no installed driver (or GPUs). The absence of a driver can be detected by the 0 driver_version set by cudaDriverGetVersion(&driver_version). This fix adds code that bypasses the checks introduced by my recent PR #15449 when there is no driver.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • [x] Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change or have been fixed to be compatible with this change


```cpp
CUDA_CALL(cudaDriverGetVersion(&cuda_driver_version));
// Also, don't bother with checks if there are no GPUs visible (e.g. with CUDA_VISIBLE_DEVICES="")
if (dmlc::GetEnv("MXNET_CUDA_VERSION_CHECKING", true) && cuda_driver_version > 0
    && Context::GetGPUCount() > 0) {
```
Contributor

Could you make Context::GetGPUCount properly return 0 if cuda_driver_version == 0 instead of adding this additional check?

Contributor Author

Per your suggestion, I have reworked the PR and now have GetGPUCount return 0 if cuda_driver_version == 0.

Also, I now feel the best way to avoid impacting non-GPU platforms is to perform the cuda/cudnn checks at the point where the user creates a GPU context (as opposed to the current approach, which uses dynamic initialization of libmxnet.so).

Since context creation is defined in ./include/mxnet/base.h, and since I need a non-header file to ensure only one lib-version warning is emitted, I've moved my prior work in ./src/common/cuda_utils.cc to a new file, ./src/base.cc. This follows the code placement of (for example) resource.h/resource.cc.

@karan6181
Contributor

@mxnet-label-bot add [CUDA, pr-awaiting-response]

@marcoabreu marcoabreu added CUDA pr-awaiting-response PR is reviewed and waiting for contributor to respond labels Jul 16, 2019
@KellenSunderland
Contributor

LGTM so far. There's a minor whitespace lint issue picked up by CI.

@DickJC123
Contributor Author

I went back and confirmed that the warnings are properly generated, although they now appear on first use of a GPU context:

```
$ python
>>> import mxnet as mx
>>> ctx = mx.gpu(0)
>>> x = mx.nd.array([1,2,3], ctx=ctx)
[19:46:59] src/base.cc:51: Upgrade advisory: this mxnet has been built against cuda library version 9000, which is older than the oldest version tested by CI (10000).  Set MXNET_CUDA_LIB_CHECKING=0 to quiet this warning.
[19:46:59] src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7102, which is older than the oldest version tested by CI (7600).  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
>>> y = mx.nd.array([4,5,6], ctx=mx.gpu(1))
>>> exit()
```

Moving the libmxnet.so to a system with a different cuDNN version generates an additional warning at the same point as shown above:

```
[19:50:10] src/base.cc:80: cuDNN lib mismatch: linked-against version 7201 != compiled-against version 7102.  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
```

I've also verified that the env vars can be used to eliminate these warnings.
I feel this PR should be merged when it passes CI.

```cpp
@@ -417,8 +434,21 @@ inline Context Context::GPU(int32_t dev_id) {
  return Create(kGPU, dev_id);
}

inline bool Context::GPUDriverPresent() {
```
Contributor

😀

Contributor @larroy left a comment

LGTM

@ptrendx ptrendx merged commit cb0697f into apache:master Jul 17, 2019
Contributor @KellenSunderland left a comment

LGTM

@vrakesh
Contributor

vrakesh commented Jul 17, 2019

Nice, thanks a lot.
