This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Bypass cuda/cudnn checks if no driver. #15551

Merged
22 changes: 18 additions & 4 deletions src/common/cuda_utils.cc
@@ -44,8 +44,15 @@ namespace cuda {
 // Dynamic init here will emit a warning if runtime and compile-time cuda lib versions mismatch.
 // Also if the user has recompiled their source to a version no longer tested by upstream CI.
 bool cuda_version_check_performed = []() {
-// Don't bother with checks if there are no GPUs visible (e.g. with CUDA_VISIBLE_DEVICES="")
-if (dmlc::GetEnv("MXNET_CUDA_VERSION_CHECKING", true) && Context::GetGPUCount() > 0) {
+// MXNet might be built on a machine with a cuda toolkit, but no GPUs or GPU driver.
+// To allow that machine to execute say: python -c 'import mxnet; print(mxnet.__version__)',
+// we won't perform a check if there is no driver. Any actual attempt to use the cuda API's
+// will yield the desired message: CUDA driver version is insufficient for CUDA runtime version.
+int cuda_driver_version = 0;
+CUDA_CALL(cudaDriverGetVersion(&cuda_driver_version));
+// Also, don't bother with checks if there are no GPUs visible (e.g. with CUDA_VISIBLE_DEVICES="")
+if (dmlc::GetEnv("MXNET_CUDA_VERSION_CHECKING", true) && cuda_driver_version > 0
+&& Context::GetGPUCount() > 0) {
Contributor

Could you make Context::GetGPUCount properly return 0 if cuda_driver_version == 0 instead of adding this additional check?

Contributor Author

Per your suggestion, I have reworked the PR and now have GetGPUCount return 0 if cuda_driver_version == 0.
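
The behavior described above can be sketched roughly as follows. This is not the code from the PR: `GpuQueries` and `GetGPUCountSketch` are hypothetical names, and the two callables are stand-ins for `cudaDriverGetVersion` and `cudaGetDeviceCount` so that the sketch compiles without a CUDA toolkit.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for the CUDA runtime queries (assumption: in real
// code these would wrap cudaDriverGetVersion and cudaGetDeviceCount).
struct GpuQueries {
  std::function<int()> driver_version;  // 0 means "no driver installed"
  std::function<int()> device_count;    // number of visible devices
};

// Returns 0 without touching the device-count query when there is no driver,
// so callers only need to test the count.
int GetGPUCountSketch(const GpuQueries& q) {
  assert(q.driver_version && q.device_count);  // both callables must be set
  if (q.driver_version() <= 0) {
    return 0;  // toolkit present at build time, but no driver at run time
  }
  int count = q.device_count();
  return count > 0 ? count : 0;  // treat error-like negative values as "none"
}
```

With this shape, a caller such as the version check can keep the single condition `GetGPUCount() > 0` and does not need a separate `cuda_driver_version > 0` test.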

Also, I now think the best way to avoid impacting non-GPU platforms is to perform the cuda/cudnn checks at the point where the user creates a GPU context (as opposed to the current approach, which relies on dynamic initialization of libmxnet.so).

Since the context creation is defined in ./include/mxnet/base.h, and since I need a non-header file to ensure only one lib version warning will be emitted, I've moved my prior work in ./src/common/cuda_utils.cc to a new file ./src/base.cc. This follows the code placement of (for example) resource.h/resource.cc.
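
A minimal sketch of that idea, assuming a function-local static is used to make the check run once per process. All names here (`ContextSketch`, `CheckLibVersionOnce`, `WarningLog`) are hypothetical and are not taken from the reworked PR; the version numbers are passed in as plain integers instead of being read from the CUDA or cuDNN libraries.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical log sink standing in for LOG(WARNING).
std::vector<std::string>& WarningLog() {
  static std::vector<std::string> log;
  return log;
}

// Compares versions the first time it is called and does nothing afterwards.
// In a real project this definition would live in a single .cc file, so that
// there is exactly one copy of the static flag in the library.
void CheckLibVersionOnce(int compiled_version, int linked_version) {
  static const bool checked = [&]() {
    if (compiled_version != linked_version) {
      WarningLog().push_back("library mismatch: compiled against " +
                             std::to_string(compiled_version) + ", linked against " +
                             std::to_string(linked_version));
    }
    return true;
  }();
  (void)checked;  // the value only exists to trigger the one-time initializer
}

struct ContextSketch {
  enum DeviceType { kCPU, kGPU };
  DeviceType dev_type;
  int dev_id;

  // CPU contexts never trigger the check, so CPU-only use stays unaffected.
  static ContextSketch CPU() { return {kCPU, 0}; }

  // The check runs when a GPU context is requested, not at library load.
  static ContextSketch GPU(int dev_id, int compiled_version, int linked_version) {
    assert(dev_id >= 0);
    CheckLibVersionOnce(compiled_version, linked_version);
    return {kGPU, dev_id};
  }
};
```

Creating several GPU contexts in this sketch records at most one warning, and importing the library or creating only CPU contexts records none.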

// Not currently performing a runtime check of linked-against vs. compiled-against
// cuda runtime library, as major.minor must match for libmxnet.so to even load, per:
// https://docs.nvidia.com/deploy/cuda-compatibility/#binary-compatibility
@@ -82,8 +89,15 @@ namespace cudnn {
 // Dynamic init here will emit a warning if runtime and compile-time cudnn lib versions mismatch.
 // Also if the user has recompiled their source to a version no longer tested by upstream CI.
 bool cudnn_version_check_performed = []() {
-// Don't bother with checks if there are no GPUs visible (e.g. with CUDA_VISIBLE_DEVICES="")
-if (dmlc::GetEnv("MXNET_CUDNN_VERSION_CHECKING", true) && Context::GetGPUCount() > 0) {
+// MXNet might be built on a machine with a cuda toolkit, but no GPUs or GPU driver.
+// To allow that machine to execute say: python -c 'import mxnet; print(mxnet.__version__)',
+// we won't perform a check if there is no driver. Any actual attempt to use the cuda API's
+// will yield the desired message: CUDA driver version is insufficient for CUDA runtime version.
+int cuda_driver_version = 0;
+CUDA_CALL(cudaDriverGetVersion(&cuda_driver_version));
+// Also, don't bother with checks if there are no GPUs visible (e.g. with CUDA_VISIBLE_DEVICES="")
+if (dmlc::GetEnv("MXNET_CUDNN_VERSION_CHECKING", true) && cuda_driver_version > 0
+&& Context::GetGPUCount() > 0) {
size_t linkedAgainstCudnnVersion = cudnnGetVersion();
if (linkedAgainstCudnnVersion != CUDNN_VERSION)
LOG(WARNING) << "cuDNN library mismatch: linked-against version " << linkedAgainstCudnnVersion