This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Bypass cuda/cudnn checks if no driver. #15551

Merged

Conversation

DickJC123
Contributor

Description

This provides the fix for the issue described in #15548. Thank you @vrakesh for detecting and reporting this issue.

The problem is that libmxnet.so can be built with MXNET_USE_CUDA and MXNET_USE_CUDNN, yet end up deployed on a system with no installed driver (or GPUs). The absence of a driver can be detected by the 0 driver_version set by cudaDriverGetVersion(&driver_version). This fix adds code that bypasses the checks introduced by my recent PR #15449 when there is no driver.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • [x] Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change or have been fixed to be compatible with this change


```cpp
CUDA_CALL(cudaDriverGetVersion(&cuda_driver_version));
// Also, don't bother with checks if there are no GPUs visible (e.g. with CUDA_VISIBLE_DEVICES="")
if (dmlc::GetEnv("MXNET_CUDA_VERSION_CHECKING", true) && cuda_driver_version > 0
    && Context::GetGPUCount() > 0) {
```
Contributor

Could you make Context::GetGPUCount properly return 0 if cuda_driver_version == 0 instead of adding this additional check?

Contributor Author

Per your suggestion, I have reworked the PR and now have GetGPUCount return 0 if cuda_driver_version == 0.

Also, I now feel the best way to avoid impacting non-GPU platforms is to perform the cuda/cudnn checks at the point where the user creates a GPU context (as opposed to the current approach, which uses dynamic initialization of libmxnet.so).

Since context creation is defined in ./include/mxnet/base.h, and since I need a non-header file to ensure only one lib-version warning is emitted, I've moved my prior work in ./src/common/cuda_utils.cc to a new file, ./src/base.cc. This follows the code placement of (for example) resource.h/resource.cc.

@karan6181
Contributor

@mxnet-label-bot add [CUDA, pr-awaiting-response]

@marcoabreu marcoabreu added CUDA pr-awaiting-response PR is reviewed and waiting for contributor to respond labels Jul 16, 2019
@KellenSunderland
Contributor

LGTM so far. There's a minor whitespace lint issue picked up by CI.

@DickJC123
Contributor Author

I went back and confirmed that the warnings are properly generated, although they now appear on first use of a GPU context:

```
$ python
>>> import mxnet as mx
>>> ctx = mx.gpu(0)
>>> x = mx.nd.array([1,2,3], ctx=ctx)
[19:46:59] src/base.cc:51: Upgrade advisory: this mxnet has been built against cuda library version 9000, which is older than the oldest version tested by CI (10000).  Set MXNET_CUDA_LIB_CHECKING=0 to quiet this warning.
[19:46:59] src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7102, which is older than the oldest version tested by CI (7600).  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
>>> y = mx.nd.array([4,5,6], ctx=mx.gpu(1))
>>> exit()
```

Moving the libmxnet.so to a system with a different cuDNN version generates an additional warning at the same point as shown above:

```
[19:50:10] src/base.cc:80: cuDNN lib mismatch: linked-against version 7201 != compiled-against version 7102.  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
```

I've also verified that the env vars can be used to eliminate these warnings.
I feel this PR should be merged when it passes CI.

```cpp
@@ -417,8 +434,21 @@ inline Context Context::GPU(int32_t dev_id) {
  return Create(kGPU, dev_id);
}

inline bool Context::GPUDriverPresent() {
```
Contributor

😀

Contributor @larroy left a comment

LGTM

@ptrendx ptrendx merged commit cb0697f into apache:master Jul 17, 2019
Contributor @KellenSunderland left a comment

LGTM

@vrakesh
Contributor

vrakesh commented Jul 17, 2019

Nice, thanks a lot.
