[cudnn][nccl] Don't download bits if not found #16031
jacobkahn wants to merge 1 commit into microsoft:master
Automatically downloading cuDNN or NCCL packages leads to some bad behavior, especially given that the minimum CUDA version required for the vcpkg port is 10.1 but CUDA 11.x is the newest release from NVIDIA.

The auto-download is a problem because the cuDNN and NCCL versions used with a particular CUDA version have to be built with that version of CUDA. One can't use a cuDNN version that was built with CUDA 10.1 with a CUDA 11 installation: things won't link (and they shouldn't; this is by design).

Not finding NCCL or cuDNN results in downloading the bits, and those bits are frozen at CUDA 10.1 versions. So using CUDA 11 with vcpkg, not having NCCL or cuDNN installed, and running `./vcpkg install cudnn nccl` results in broken behavior for projects that link downstream: linking to cuDNN or NCCL as downloaded by the port implicitly adds a runtime dependency on a CUDA 10.1 shared lib, which might not be on the system if that's not the installed version (common these days given CUDA 11 is out).

This PR guts that download behavior entirely so that we don't give people cuDNN or NCCL versions that aren't compatible with their CUDA installations.
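To make the mismatch concrete, here is a minimal Python sketch (not code from this PR or from the port) that reads the toolkit version out of `nvcc --version` text and compares it with the CUDA version the downloaded bits were built for. The names `BUNDLED_CUDA`, `parse_nvcc_release`, and `bundled_bits_usable` are hypothetical, and treating an exact major.minor match as the requirement is a simplifying assumption.

```python
import re

# Assumption: the archives the port downloads were built against CUDA 10.1,
# as stated in the description above.
BUNDLED_CUDA = (10, 1)


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def bundled_bits_usable(nvcc_output):
    """True only if the installed toolkit matches the bundled build's CUDA version."""
    return parse_nvcc_release(nvcc_output) == BUNDLED_CUDA


# Sample line in the format nvcc prints; a CUDA 11 toolkit is rejected.
sample = "Cuda compilation tools, release 11.0, V11.0.194"
print(bundled_bits_usable(sample))  # False
```

With a check like this, a CUDA 11 machine would be refused the 10.1 bits instead of silently receiving libraries that fail later at link or run time.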
|
IMHO, that's not a good approach, partly because you're removing a nice feature upon which someone like me might have built a solution. What do you think about this approach instead: both ports trigger a FindCUDA and detect the installed version. If an external version is present, accept it; otherwise download the most appropriate version for the installed CUDA. If the CUDA version is unknown to the portfile, so it doesn't know the appropriate version to download, and no external version is present, then a fatal error is acceptable. |
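The flow proposed in this comment can be sketched as a small decision function. This is an illustrative Python sketch only, not portfile code: `Action`, `KNOWN_DOWNLOADS`, `choose_action`, and the archive name in the table are all hypothetical placeholders.

```python
from enum import Enum


class Action(Enum):
    USE_EXTERNAL = "use the cuDNN/NCCL already installed on the system"
    DOWNLOAD = "download the build matching the detected CUDA version"
    FAIL = "stop with a fatal error"


# Hypothetical table mapping a CUDA (major, minor) version to a known archive.
# The archive name is a placeholder, not a real download.
KNOWN_DOWNLOADS = {(10, 1): "archive-built-for-cuda-10.1"}


def choose_action(cuda_version, external_found, known=KNOWN_DOWNLOADS):
    """Return (Action, archive_or_None) following the proposal above."""
    if external_found:
        # An externally installed version always wins.
        return (Action.USE_EXTERNAL, None)
    if cuda_version in known:
        # No external install, but the portfile knows a matching archive.
        return (Action.DOWNLOAD, known[cuda_version])
    # Unknown CUDA version and nothing installed: a fatal error is acceptable.
    return (Action.FAIL, None)


print(choose_action((11, 0), external_found=False)[0].name)  # FAIL
```

Under this scheme the existing auto-download would keep working for CUDA 10.1 users, while CUDA 11 users without a local cuDNN or NCCL would get an explicit error rather than incompatible bits.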
|
@cenit, it's definitely a nice feature and I'd want to do something like this. The problem is that NVIDIA isn't officially supporting Conda packages for cuDNN and NCCL (I asked them), so we're unlikely to ever get CUDA > 10.1-compatible cuDNN and NCCL packages on there. NCCL only has a package for 10.1 based on the source I'm using. It's also not immediately clear that it's possible to call
@JackBoosY, these failures aren't unexpected; CI machines will need to be updated given this approach. But I'll circle back about that. |
|
Oh that's a bummer. I mean, the fact that you asked and they explicitly said it's unsupported. About cuda version, I was thinking about doing a find_package(CUDA) directly, but it was just in my mind, and maybe not the cleanest vcpkg approach... |
|
@cenit — vcpkg portfiles are run in script mode ( |
@JackBoosY — we'd need to add cuDNN (and NCCL?) to the CI machines that have this failure since we're no longer auto-downloading. I'm happy to help point you to the correct version depending on which CUDA version is installed on these machines (I'm guessing 10.1 since the cuDNN autodownload works in CI as it is, and it wouldn't link if it were any other CUDA version...) |
It's CUDA 10.something. Each time we've tried to update it, the install fails, but we don't know why; we've been looking for someone familiar with CUDA to help diagnose. See the install script here: The prevailing theory I've heard is that newer CUDA tries to install the NVIDIA driver, which fails because our servers don't have NVIDIA GPUs. |
|
@mtmd might you have any wisdom on the above or be able to cc/connect us to others who might be able to resolve this? This affects a bunch of CUDA/cuDNN-dependent projects that are built with |
@jacobkahn Thank you for pointing this out, Jacob. Let me raise the issue internally to find someone to address it. |
|
Superseded by #16413 |
No changes to supported triplets.
Yes.