[cudnn][nccl] Don't download bits if not found by jacobkahn · Pull Request #16031 · microsoft/vcpkg

jacobkahn · 2021-02-03T22:06:33Z

Automatically downloading cuDNN or NCCL packages leads to some bad behavior,
especially given that the minumum CUDA version required for the vcpkg port
is 10.1 but CUDA 11.x is the newest release from NVIDIA.

The auto-download is a problem because cuDNN and NCCL versions that are
used with a particular CUDA version have to be built with that version
of CUDA -- one can't use a cuDNN version that was built with CUDA 10.1
with a CUDA 11 installation - things won't link (and they shouldn't -
this is by design).

Since not finding NCCL or cuDNN results in downloading the bits and
those bits are frozen at CUDA 10.1 versions, using CUDA 11 with
vcpkg but not having NCCL or cuCUDNN installed and trying to
./vcpkg install cudnn nccl results in broken behavior for projects
that are attempting to link downstream since linking to cuDNN or NCCL
as downloaded by the port implicitly adds a runtime dependency to a
CUDA 10.1 shared lib, which might not be on the system if that's not
the installed version (common these days given CUDA 11 is out).

This PR guts that download behavior entirely so that we don't give
people cuDNN or NCCL versions that aren't compatible with their
CUDA installations.

Which triplets are supported/not supported? Have you updated the CI baseline?

No changes to supported triplets.

Does your PR follow the maintainer guide?

Yes.

Automatically downloading cuDNN or NCCL packages leads to some bad behavior, especially given that the minumum CUDA version required for the vcpkg port is 10.1 but CUDA 11.x is the newest release from NVIDIA. The auto-download is a problem because cuDNN and NCCL versions that are used with a particular CUDA version have to be built with that version of CUDA -- one can't use a cuDNN version that was built with CUDA 10.1 with a CUDA 11 installation - things won't link (and they shouldn't - this is by design). Since not finding NCCL or cuDNN results in downloading the bits and those bits are frozen at CUDA 10.1 versions, using CUDA 11 with vcpkg but not having NCCL or cuCUDNN installed and trying to `./vcpkg install cudnn nccl` results in broken behavior for projects that are attempting to link downstream since linking to cuDNN or NCCL as downloaded by the port implicitly adds a runtime dependency to a CUDA 10.1 shared lib, which might not be on the system if that's not the installed version (common these days given CUDA 11 is out). This PR guts that download behavior entirely so that we don't give people cuDNN or NCCL versions that aren't compatible with their CUDA installations.

JackBoosY · 2021-02-04T03:01:15Z

Starting package 2/2: cudnn:x64-windows
Building package cudnn[core]:x64-windows...
CMake Error at ports/cudnn/portfile.cmake:63 (message):
  Could not find cuDNN.  Before continuing, please download and install a
  cuDNN version

  that matches your CUDA version from:

      https://developer.nvidia.com/cuDNN

Call Stack (most recent call first):
  scripts/ports.cmake:128 (include)


Error: Building package cudnn:x64-windows failed with: BUILD_FAILED
Elapsed time for package cudnn:x64-windows: 147.1 ms

cenit · 2021-02-04T15:51:30Z

IMHO, that's not a good approach. Also because you're removing a nice feature upon which someone like me might have built a solution.

What do you think about doing this approach: both ports should trigger a FindCUDA and detect installed version. Based on installed version, if the external version is present, accept it, otherwise download the most appropriate version. If the CUDA version is unknown to the portfile and so it doesn't know the appropriate version to download and no external version is present, than a fatal error is acceptable.

jacobkahn · 2021-02-04T16:27:11Z

@cenit — it's definitely a nice feature and I'd want to do something like this — the problem is that NVIDIA isn't officially supporting Conda packages for cuDNN and NCCL (I asked them), so we're unlikely to ever get CUDA > 10.1-compatible cuDNN and NCCL packages on there. NCCL only has a package for 10.1 based on the source I'm using. It's also not immediately clear that it's possible to call vcpkg_find_cuda from another portfile and extract the version (I gave it a try), cc @JackBoosY @BillyONeal.

@JackBoosY — these failures aren't unexpected/CI machines will need to be updated given this approach. But I'll circle back about that.

cenit · 2021-02-04T16:30:28Z

Oh that's a bummer. I mean, the fact that you asked and they explicitly said it's unsupported.
So we have to abandon auto download feature of cudnn?

About cuda version, I was thinking about doing a find_package(CUDA) directly, but it was just in my mind, and maybe not the cleanest vcpkg approach...

jacobkahn · 2021-02-04T17:44:00Z

@cenit — vcpkg portfiles are run in script mode (cmake -P), so calls to find_package that use features that aren't available in script mode don't work — I ran into this when dropping find_package(CUDA) into these portfiles. It's really annoying that there isn't good support that would enable this... at the same time, auto-download only worked for CUDA 10.1, which a lot of people aren't using anymore. Probably best to gut this to avoid the breakage/confusion I mentioned in the PR summary for people using later versions.

jacobkahn · 2021-02-04T17:45:26Z

Starting package 2/2: cudnn:x64-windows
Building package cudnn[core]:x64-windows...
CMake Error at ports/cudnn/portfile.cmake:63 (message):
  Could not find cuDNN.  Before continuing, please download and install a
  cuDNN version

  that matches your CUDA version from:

      https://developer.nvidia.com/cuDNN

Call Stack (most recent call first):
  scripts/ports.cmake:128 (include)


Error: Building package cudnn:x64-windows failed with: BUILD_FAILED
Elapsed time for package cudnn:x64-windows: 147.1 ms

@JackBoosY — we'd need to add cuDNN (and NCCL?) to the CI machines that have this failure since we're no longer auto-downloading. I'm happy to help point you to the correct version depending on which CUDA version is installed on these machines (I'm guessing 10.1 since the cuDNN autodownload works in CI as it is, and it wouldn't link if it were any other CUDA version...)

BillyONeal · 2021-02-04T17:53:33Z

@JackBoosY — we'd need to add cuDNN (and NCCL?) to the CI machines that have this failure since we're no longer auto-downloading. I'm happy to help point you to the correct version depending on which CUDA version is installed on these machines (I'm guessing 10.1 since the cuDNN autodownload works in CI as it is, and it wouldn't link if it were any other CUDA version...)

It's CUDA 10.something. Each time we've tried to update it the install fails but we don't know why; we've been looking for someone familiar with CUDA to help diagnose' see install script here:

vcpkg/scripts/azure-pipelines/windows/provision-image.txt

Line 373 in f226416

<#

The prevailing theory I've heard is that newer CUDA tries to install the NVidia driver which fails because our servers don't have NVidia GPUs.

jacobkahn · 2021-02-04T20:14:23Z

@mtmd might you have any wisdom on the above or be able to cc/connect us to others who might be able to resolve this? This affects a bunch of CUDA/cuDNN-dependent projects that are built with vcpkg.

mtmd · 2021-02-05T02:19:19Z

@mtmd might you have any wisdom on the above or be able to cc/connect us to others who might be able to resolve this? This affects a bunch of CUDA/cuDNN-dependent projects that are built with vcpkg.

@jacobkahn Thank you for pointing this out, Jacob. Let me raise the issue internally to find someone to address it.

jacobkahn · 2021-02-27T04:18:32Z

Superseded by #16413

jacobkahn force-pushed the nodownload_cudnn_nccl branch from 33e1ab0 to 8738819 Compare February 3, 2021 22:08

JonLiu1993 assigned JonLiu1993 and JackBoosY and unassigned JonLiu1993 Feb 4, 2021

JackBoosY added the category:port-bug The issue is with a library, which is something the port should already support label Feb 4, 2021

JackBoosY added the requires:author-response label Feb 4, 2021

cenit mentioned this pull request Feb 25, 2021

[cudnn] auto-download only if cuda version matches #16413

Merged

jacobkahn closed this Feb 27, 2021

jacobkahn deleted the nodownload_cudnn_nccl branch February 27, 2021 04:18

jacobkahn mentioned this pull request Apr 21, 2021

[nccl] Remove NCCL auto-download due to licensing issues #17431

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cudnn][nccl] Don't download bits if not found#16031

[cudnn][nccl] Don't download bits if not found#16031
jacobkahn wants to merge 1 commit intomicrosoft:masterfrom
jacobkahn:nodownload_cudnn_nccl

jacobkahn commented Feb 3, 2021

Uh oh!

JackBoosY commented Feb 4, 2021

Uh oh!

cenit commented Feb 4, 2021 •

edited

Loading

Uh oh!

jacobkahn commented Feb 4, 2021

Uh oh!

cenit commented Feb 4, 2021

Uh oh!

jacobkahn commented Feb 4, 2021

Uh oh!

jacobkahn commented Feb 4, 2021 •

edited

Loading

Uh oh!

BillyONeal commented Feb 4, 2021

Uh oh!

jacobkahn commented Feb 4, 2021

Uh oh!

mtmd commented Feb 5, 2021

Uh oh!

jacobkahn commented Feb 27, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

jacobkahn commented Feb 3, 2021

Uh oh!

JackBoosY commented Feb 4, 2021

Uh oh!

cenit commented Feb 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jacobkahn commented Feb 4, 2021

Uh oh!

cenit commented Feb 4, 2021

Uh oh!

jacobkahn commented Feb 4, 2021

Uh oh!

jacobkahn commented Feb 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BillyONeal commented Feb 4, 2021

Uh oh!

jacobkahn commented Feb 4, 2021

Uh oh!

mtmd commented Feb 5, 2021

Uh oh!

jacobkahn commented Feb 27, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

cenit commented Feb 4, 2021 •

edited

Loading

jacobkahn commented Feb 4, 2021 •

edited

Loading