Skip to content

[cudnn][nccl] Don't download bits if not found#16031

Closed
jacobkahn wants to merge 1 commit intomicrosoft:masterfrom
jacobkahn:nodownload_cudnn_nccl
Closed

[cudnn][nccl] Don't download bits if not found#16031
jacobkahn wants to merge 1 commit intomicrosoft:masterfrom
jacobkahn:nodownload_cudnn_nccl

Conversation

@jacobkahn
Copy link
Contributor

Automatically downloading cuDNN or NCCL packages leads to some bad behavior,
especially given that the minumum CUDA version required for the vcpkg port
is 10.1 but CUDA 11.x is the newest release from NVIDIA.

The auto-download is a problem because cuDNN and NCCL versions that are
used with a particular CUDA version have to be built with that version
of CUDA -- one can't use a cuDNN version that was built with CUDA 10.1
with a CUDA 11 installation - things won't link (and they shouldn't -
this is by design).

Since not finding NCCL or cuDNN results in downloading the bits and
those bits are frozen at CUDA 10.1 versions, using CUDA 11 with
vcpkg but not having NCCL or cuCUDNN installed and trying to
./vcpkg install cudnn nccl results in broken behavior for projects
that are attempting to link downstream since linking to cuDNN or NCCL
as downloaded by the port implicitly adds a runtime dependency to a
CUDA 10.1 shared lib, which might not be on the system if that's not
the installed version (common these days given CUDA 11 is out).

This PR guts that download behavior entirely so that we don't give
people cuDNN or NCCL versions that aren't compatible with their
CUDA installations.

  • Which triplets are supported/not supported? Have you updated the CI baseline?

No changes to supported triplets.

Yes.

Automatically downloading cuDNN or NCCL packages leads to some bad behavior,
especially given that the minumum CUDA version required for the vcpkg port
is 10.1 but CUDA 11.x is the newest release from NVIDIA.

The auto-download is a problem because cuDNN and NCCL versions that are
used with a particular CUDA version have to be built with that version
of CUDA -- one can't use a cuDNN version that was built with CUDA 10.1
with a CUDA 11 installation - things won't link (and they shouldn't -
this is by design).

Since not finding NCCL or cuDNN results in downloading the bits and
those bits are frozen at CUDA 10.1 versions, using CUDA 11 with
vcpkg but not having NCCL or cuCUDNN installed and trying to
`./vcpkg install cudnn nccl` results in broken behavior for projects
that are attempting to link downstream since linking to cuDNN or NCCL
as downloaded by the port implicitly adds a runtime dependency to a
CUDA 10.1 shared lib, which might not be on the system if that's not
the installed version (common these days given CUDA 11 is out).

This PR guts that download behavior entirely so that we don't give
people cuDNN or NCCL versions that aren't compatible with their
CUDA installations.
@jacobkahn jacobkahn force-pushed the nodownload_cudnn_nccl branch from 33e1ab0 to 8738819 Compare February 3, 2021 22:08
@JonLiu1993 JonLiu1993 assigned JonLiu1993 and JackBoosY and unassigned JonLiu1993 Feb 4, 2021
@JackBoosY JackBoosY added the category:port-bug The issue is with a library, which is something the port should already support label Feb 4, 2021
@JackBoosY
Copy link
Contributor

Starting package 2/2: cudnn:x64-windows
Building package cudnn[core]:x64-windows...
CMake Error at ports/cudnn/portfile.cmake:63 (message):
  Could not find cuDNN.  Before continuing, please download and install a
  cuDNN version

  that matches your CUDA version from:

      https://developer.nvidia.com/cuDNN

Call Stack (most recent call first):
  scripts/ports.cmake:128 (include)


Error: Building package cudnn:x64-windows failed with: BUILD_FAILED
Elapsed time for package cudnn:x64-windows: 147.1 ms

@cenit
Copy link
Contributor

cenit commented Feb 4, 2021

IMHO, that's not a good approach. Also because you're removing a nice feature upon which someone like me might have built a solution.

What do you think about doing this approach: both ports should trigger a FindCUDA and detect installed version. Based on installed version, if the external version is present, accept it, otherwise download the most appropriate version. If the CUDA version is unknown to the portfile and so it doesn't know the appropriate version to download and no external version is present, than a fatal error is acceptable.

@jacobkahn
Copy link
Contributor Author

@cenit — it's definitely a nice feature and I'd want to do something like this — the problem is that NVIDIA isn't officially supporting Conda packages for cuDNN and NCCL (I asked them), so we're unlikely to ever get CUDA > 10.1-compatible cuDNN and NCCL packages on there. NCCL only has a package for 10.1 based on the source I'm using. It's also not immediately clear that it's possible to call vcpkg_find_cuda from another portfile and extract the version (I gave it a try), cc @JackBoosY @BillyONeal.

@JackBoosY — these failures aren't unexpected/CI machines will need to be updated given this approach. But I'll circle back about that.

@cenit
Copy link
Contributor

cenit commented Feb 4, 2021

Oh that's a bummer. I mean, the fact that you asked and they explicitly said it's unsupported.
So we have to abandon auto download feature of cudnn?

About cuda version, I was thinking about doing a find_package(CUDA) directly, but it was just in my mind, and maybe not the cleanest vcpkg approach...

@jacobkahn
Copy link
Contributor Author

@cenit — vcpkg portfiles are run in script mode (cmake -P), so calls to find_package that use features that aren't available in script mode don't work — I ran into this when dropping find_package(CUDA) into these portfiles. It's really annoying that there isn't good support that would enable this... at the same time, auto-download only worked for CUDA 10.1, which a lot of people aren't using anymore. Probably best to gut this to avoid the breakage/confusion I mentioned in the PR summary for people using later versions.

@jacobkahn
Copy link
Contributor Author

jacobkahn commented Feb 4, 2021

Starting package 2/2: cudnn:x64-windows
Building package cudnn[core]:x64-windows...
CMake Error at ports/cudnn/portfile.cmake:63 (message):
  Could not find cuDNN.  Before continuing, please download and install a
  cuDNN version

  that matches your CUDA version from:

      https://developer.nvidia.com/cuDNN

Call Stack (most recent call first):
  scripts/ports.cmake:128 (include)


Error: Building package cudnn:x64-windows failed with: BUILD_FAILED
Elapsed time for package cudnn:x64-windows: 147.1 ms

@JackBoosY — we'd need to add cuDNN (and NCCL?) to the CI machines that have this failure since we're no longer auto-downloading. I'm happy to help point you to the correct version depending on which CUDA version is installed on these machines (I'm guessing 10.1 since the cuDNN autodownload works in CI as it is, and it wouldn't link if it were any other CUDA version...)

@BillyONeal
Copy link
Member

@JackBoosY — we'd need to add cuDNN (and NCCL?) to the CI machines that have this failure since we're no longer auto-downloading. I'm happy to help point you to the correct version depending on which CUDA version is installed on these machines (I'm guessing 10.1 since the cuDNN autodownload works in CI as it is, and it wouldn't link if it were any other CUDA version...)

It's CUDA 10.something. Each time we've tried to update it the install fails but we don't know why; we've been looking for someone familiar with CUDA to help diagnose' see install script here:

The prevailing theory I've heard is that newer CUDA tries to install the NVidia driver which fails because our servers don't have NVidia GPUs.

@jacobkahn
Copy link
Contributor Author

@mtmd might you have any wisdom on the above or be able to cc/connect us to others who might be able to resolve this? This affects a bunch of CUDA/cuDNN-dependent projects that are built with vcpkg.

@mtmd
Copy link

mtmd commented Feb 5, 2021

@mtmd might you have any wisdom on the above or be able to cc/connect us to others who might be able to resolve this? This affects a bunch of CUDA/cuDNN-dependent projects that are built with vcpkg.

@jacobkahn Thank you for pointing this out, Jacob. Let me raise the issue internally to find someone to address it.

@jacobkahn
Copy link
Contributor Author

Superseded by #16413

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

category:port-bug The issue is with a library, which is something the port should already support

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants