Bypass cuda/cudnn checks if no driver. #15551
src/common/cuda_utils.cc
Outdated
CUDA_CALL(cudaDriverGetVersion(&cuda_driver_version));
// Also, don't bother with checks if there are no GPUs visible (e.g. with CUDA_VISIBLE_DEVICES="")
if (dmlc::GetEnv("MXNET_CUDA_VERSION_CHECKING", true) && cuda_driver_version > 0
    && Context::GetGPUCount() > 0) {
Could you make Context::GetGPUCount properly return 0 if cuda_driver_version == 0 instead of adding this additional check?
Per your suggestion, I have reworked the PR so that GetGPUCount now returns 0 if cuda_driver_version == 0.
Also, I now feel the best way to avoid impacting non-GPU platforms is to perform the cuda/cudnn checks at the point where the user creates a GPU context (as opposed to the current approach, which runs them during dynamic initialization of libmxnet.so).
Since the context creation is defined in ./include/mxnet/base.h, and since I need a non-header file to ensure only one lib version warning will be emitted, I've moved my prior work in ./src/common/cuda_utils.cc to a new file ./src/base.cc. This follows the code placement of (for example) resource.h/resource.cc.
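To illustrate the placement described above, here is a minimal sketch of running the version checks once, at the first creation of a GPU context. It is not the PR's code: `CheckGpuLibVersionsOnce`, `CreateGpuContext`, `GpuContext`, and `g_version_check_runs` are hypothetical names, and the real cuda/cudnn checks are replaced by a placeholder message.

```cpp
#include <iostream>

// Hypothetical counter, present only so the behavior of the sketch can be observed.
int g_version_check_runs = 0;

// Assumed to live in a single non-header source file, as the comment above describes.
// The function-local static is initialized on the first call only, so the
// placeholder check (and its warning) runs at most once per process.
void CheckGpuLibVersionsOnce() {
  static const bool done = [] {
    ++g_version_check_runs;
    std::cerr << "(placeholder) cuda/cudnn version checks would run here\n";
    return true;
  }();
  (void)done;  // silence unused-variable warnings
}

// Hypothetical stand-in for a GPU context object.
struct GpuContext {
  int dev_id;
};

// Context creation triggers the checks lazily instead of at library load time,
// so a process that never creates a GPU context never runs them.
GpuContext CreateGpuContext(int dev_id) {
  CheckGpuLibVersionsOnce();
  return GpuContext{dev_id};
}
```

In this sketch, creating contexts for device 0 and then device 1 prints the placeholder message a single time.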
@mxnet-label-bot add [CUDA, pr-awaiting-response]
LGTM so far. There's a minor whitespace lint issue picked up by CI.
I went back and confirmed that the warnings are properly generated, although now they appear at the time of first use of a gpu context:
Moving the libmxnet.so to a system with a different cudnn version generates the additional warning at the same point as shown above:
I've also verified that the env vars can be used to eliminate these warnings.
@@ -417,8 +434,21 @@ inline Context Context::GPU(int32_t dev_id) {
  return Create(kGPU, dev_id);
}

inline bool Context::GPUDriverPresent() {
😀
LGTM
LGTM
Nice, thanks a lot.
Description
This provides the fix for the issue described in #15548. Thank you @vrakesh for your efforts detecting and posting this issue.
The problem is that libmxnet.so can be built with MXNET_USE_CUDA and MXNET_USE_CUDNN, but be available on a system with no installed driver (or GPUs). The absence of a driver can be detected by a 0 driver_version set by cudaDriverGetVersion(&driver_version). This fix includes code that bypasses the checks introduced by my recent PR #15449 for the case when there is no driver.
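The bypass described above can be sketched as follows. This is a hedged illustration rather than the merged implementation: the two `std::function` parameters are hypothetical stand-ins for the CUDA runtime queries (the driver-version one plays the role of `cudaDriverGetVersion`), which keeps the sketch buildable on a machine without CUDA. `GetGpuCount` and `ShouldRunVersionChecks` are illustrative names, and `checking_enabled` stands for the value of the `MXNET_CUDA_VERSION_CHECKING` setting.

```cpp
#include <functional>

// Hypothetical stand-ins for the CUDA runtime queries.
using DriverVersionQuery = std::function<int()>;
using DeviceCountQuery = std::function<int()>;

// Report zero GPUs when the driver version is 0 (no driver installed),
// without ever calling the device-count query.
int GetGpuCount(const DriverVersionQuery& driver_version,
                const DeviceCountQuery& device_count) {
  if (driver_version() == 0) {
    return 0;
  }
  return device_count();
}

// The version checks run only when checking is enabled and at least one GPU
// is visible; with no driver the count is 0, so the checks are bypassed.
bool ShouldRunVersionChecks(bool checking_enabled, int gpu_count) {
  return checking_enabled && gpu_count > 0;
}
```

Folding the driver test into the GPU count means callers need only the one `gpu_count > 0` condition, which is the simplification requested in the review.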