
fix pytorchWithCuda, fix cupy, upgrade cudnn #166784

Merged · 13 commits · Apr 3, 2022

Conversation

@samuela (Member) commented Apr 1, 2022

Description of changes

This PR accomplishes a few separate but related tasks: fixing pytorchWithCuda, fixing cupy, and upgrading cudnn.

Although I try to keep PRs as small as possible, in this case it was much more straightforward to make these changes together than separately. I understand that this increases the burden on reviewers to some extent, but I tried to keep each commit as small as possible to ease the process.

This PR is the equivalent of #166533 but targeting master instead of staging.

According to nixpkgs-review, this affects the packages pytorchWithCuda, Theano, and cupy. I've confirmed that all of them build successfully with nix-build. For some reason nixpkgs-review wip doesn't work in this case, but everything seems to build just fine.

Fixes #161843, #166403, and possibly others.

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.05 Release Notes (or backporting 21.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

Comment on lines +87 to +104
cudnn_8_3_cudatoolkit_10_2 = generic rec {
  version = "8.3.2";
  cudatoolkit = cudatoolkit_10_2;
  # See https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-832/support-matrix/index.html#cudnn-cuda-hardware-versions.
  minCudaVersion = "10.2.00000";
  maxCudaVersion = "11.5.99999";
  mkSrc = cudatoolkit:
    let v = if lib.versions.majorMinor cudatoolkit.version == "10.2" then "10.2" else "11.5"; in
    fetchurl {
      # Starting at version 8.3.1 there's a new directory layout including
      # a subdirectory `local_installers`.
      url = "https://developer.download.nvidia.com/compute/redist/cudnn/v${version}/local_installers/${v}/cudnn-linux-x86_64-8.3.2.44_cuda${v}-archive.tar.xz";
      hash = {
        "10.2" = "sha256-1vVu+cqM+PketzIQumw9ykm6REbBZhv6/lXB7EC2aaw=";
        "11.5" = "sha256-VQCVPAjF5dHd3P2iNPnvvdzb5DpTsm3AqCxyP6FwxFc=";
      }."${v}";
    };
};
@samuela (Member, Author):

This change is purely a matter of formatting.

Comment on lines +8 to +10
assert cudnn.cudatoolkit == cudatoolkit;
assert cutensor.cudatoolkit == cudatoolkit;
assert nccl.cudatoolkit == cudatoolkit;
@samuela (Member, Author):

It was discovered here that cupy was accidentally pulling in multiple cudatoolkit versions. These asserts should prevent that issue going forward.
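
As a rough illustration (a simplified, standalone sketch rather than the actual cupy expression), the pattern turns a toolkit mismatch into an evaluation-time error:

# sketch.nix -- toy stand-ins for the real derivations
let
  cudatoolkit_10_2 = { name = "cudatoolkit-10.2"; };
  cudatoolkit_11_5 = { name = "cudatoolkit-11.5"; };
  cudnn = { name = "cudnn-8.3.2"; cudatoolkit = cudatoolkit_10_2; };

  # Analogue of the asserts above: refuse to combine a cudnn that was
  # built against a different cudatoolkit than the one passed in.
  mkCupy = { cudatoolkit, cudnn }:
    assert cudnn.cudatoolkit == cudatoolkit;
    { name = "cupy"; inherit cudatoolkit cudnn; };
in
mkCupy { cudatoolkit = cudatoolkit_11_5; inherit cudnn; }

Evaluating this with nix-instantiate --eval --strict sketch.nix aborts with an assertion failure, which is the behaviour the new asserts give cupy when someone overrides only cudatoolkit.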

Comment on lines +41 to +43
passthru = {
inherit cudatoolkit;
};
@samuela (Member, Author):

This is necessary for the assert in cupy/default.nix.
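
Put differently, passthru is what makes the toolkit reachable from outside the derivation. A minimal sketch (toy names, not the actual cudnn expression):

with import <nixpkgs> { };

# Toy stand-in for cudnn: passthru attaches extra attributes to the
# resulting derivation without affecting the build itself.
(stdenv.mkDerivation {
  name = "cudnn-example";
  dontUnpack = true;
  installPhase = "touch $out";
  passthru = { cudatoolkit = hello; };  # placeholder for a real cudatoolkit
}).cudatoolkit.name
# => name of the placeholder package; this attribute is what the cupy assert reads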

Contributor:

👍🏻

@SomeoneSerge (Contributor)

> It might be an idea to use passthru
> ... another option is to make a cudaPackages for each cuda version

Indeed, we've been considering both: the first as a temporary fix, and the latter as a longer-term solution.
In this PR the passthru is used for asserts, which should already somewhat improve the experience.
Note that just dropping the cudatoolkit argument in e.g. pytorch/default.nix and using the passthru does not eliminate the need for asserts, because magma, nccl, and cudnn all bring in their own cudatoolkit and these need to be consistent.
The cudaPackages set is probably the best UX we can get at the moment.

> For example, having cudnn in the passthru of cudatoolkit

Hmm, I see. Currently it's the other way around (cudnn has cudatoolkit in its passthru). Also, there are several releases of cudnn in nixpkgs. Again, cudaPackages seems to be the solution.

@samuela (Member, Author) commented Apr 2, 2022

> there are old related problems (overriding cuda at the user site and forgetting to override cudnn) which we need to address, but don't need to address in this PR

> It might be an idea to use passthru. For example, having cudnn in the passthru of cudatoolkit and add cudatoolkit to the passthru of packages using it. Another option is to make a cudaPackages set for each cuda version that contains cudnn and others and add that in the passthru of the cudatoolkit package. Similar to what we do with say Python.

On the topic of a cudaPackages package set, check out the proposal here: #163704
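
For reference, a rough sketch of what such a per-version set could look like (hypothetical attribute names and paths, written against the usual all-packages.nix scope; the actual design is discussed in #163704):

# Hypothetical shape of a cudaPackages set, not what nixpkgs ships today:
# one scope per CUDA version, so cudnn/nccl/etc. are guaranteed to agree
# on the toolkit they were built against.
cudaPackages_11_4 = lib.makeScope newScope (self: {
  cudatoolkit = cudatoolkit_11_4;
  cudnn = self.callPackage ../development/libraries/science/math/cudnn { };
  nccl = self.callPackage ../development/libraries/science/math/nccl { };
});

# Consumers would then take a single cudaPackages argument:
#   { cudaPackages, ... }: ... cudaPackages.cudatoolkit ... cudaPackages.cudnn ...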

Re passthru: I think that's absolutely the way to go for the time being. This PR is already rather beefy, so I'd prefer to accomplish that separately, but I totally agree that we ought to do so.

@samuela (Member, Author) commented Apr 2, 2022

@FRidh @SomeoneSerge Created an issue to track the torchvision cudatoolkit discrepancy: #166948

@samuela (Member, Author) commented Apr 2, 2022

If no one objects, I'll go ahead and merge tomorrow

@SomeoneSerge (Contributor) commented Apr 3, 2022

python39Packages.caffe used to succeed and now fails with

caffe> -- Found cuDNN: ver. ??? found (include: /nix/store/jrj8x3764iq0iz2gqa1pdv24l08ihl6k-cudatoolkit-10.2-cudnn-8.3.2/include, library: /nix/store/jrj8x3764iq0iz2gqa1pdv24l08ihl6k-cudatoolkit-10.2-cudnn-8.3.2/lib/libcudnn.so)
caffe> CMake Error at cmake/Cuda.cmake:221 (message):
caffe>   cuDNN version >3 is required.

EDIT: tested like below

with (import ./. {
  config.allowUnfree = true;
  config.cudaSupport = true;
});

mkShell {
  buildInputs = [
    python310Packages.caffe
  ];
}

or like this:

$ nix build github:SomeoneSerge/nixpkgs-unfree/e244a94864998f2b3edd927a6b4b4680718f406b#python39Packages.caffe

@samuela (Member, Author) commented Apr 3, 2022

> python39Packages.caffe used to succeed and now fails with

Hmm I'm not seeing this:

ubuntu in threedoo in 172.30.0.108 nixpkgs on  samuela/cudnn3 took 16s
❯ nix-build --max-jobs 64 -A python3Packages.caffe
/nix/store/bbca44xgdj1nl19lzk70kni0hjdppagf-caffe-1.0-bin

@SomeoneSerge (Contributor)

Relaying a conversation from #cuda:matrix.org, there's an alternative pathway.
Instead of keeping cudatoolkit = cudatoolkit_10 and passing custom cuda, cudnn, magma and nccl to pytorch, we can do:

  • Major updates for both cuda and cudnn in this PR, fixing pytorch and cupy

    In particular:

    • cudatoolkit = cudatoolkit_11 = cudatoolkit_11_4
    • cudnn = cudnn_8_3_cudatoolkit_11

    There's a jobset running for this combination: 143. But I've already verified that the current pytorch (1.10) builds with cuda 11.3 and 11.4 (reminder: it has been failing with 11.5 and, apparently, 11.1; the EmbeddingBag.cu error)

    We revert passing additional arguments in the call to the pytorch derivation (cudatoolkit, ..., nccl), simplifying the PR and keeping the simple interface for overlays. We would otherwise still do this later anyway, because pytorch 1.11 wouldn't require overriding.

  • When pytorch 1.11 (now in staging) reaches master, we set cudatoolkit_11 = cudatoolkit_11_5, which is a minor update from 11.4 to 11.5 (context: the new pytorch doesn't have trouble building with cuda 11.5)

I think in this case there will be exactly one package left that still uses cuda 10 (seems to be libtensorflow-bin)

I'm not opposed to either solution. I prefer this alternative

@SomeoneSerge

This comment was marked as resolved.

@SomeoneSerge

This comment was marked as resolved.

The current version of caffe, v1.0, does not build with any later version of cuDNN, and CUDA 10.1 is the latest CUDA supported by cuDNN 7.6.

@samuela (Member, Author) commented Apr 3, 2022

@SomeoneSerge I added a caffeWithCuda alias in a60ab2a and it's building successfully now. I didn't change anything with caffe2 as that is marked as broken right now. So now everything is building successfully as far as I'm aware.
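
For readers following along, the alias presumably looks something like the following (a guess at the shape; see commit a60ab2a for what was actually added, and note the attribute names here are hypothetical):

# all-packages.nix-style alias (sketch only):
caffeWithCuda = caffe.override {
  cudaSupport = true;
  # caffe 1.0 needs an older cuDNN/CUDA pairing (cuDNN 7.6 / CUDA 10.1),
  # per the discussion above.
  cudnn = cudnn_7_6_cudatoolkit_10_1;
  cudatoolkit = cudatoolkit_10_1;
};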

@samuela (Member, Author) commented Apr 3, 2022

> Major updates for both cuda and cudnn in this PR, fixing pytorch and cupy

I'm in favor of doing this, although I hesitate to include those changes in this PR as it has gotten so huge already. What if we merged this one and then followed up immediately after with a PR that bumps the cudatoolkit and cudatoolkit_11 versions? I believe that would get us the same end result but with smaller steps along the way.
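
Concretely, that follow-up would mostly be a change to the top-level aliases, along the lines of (hypothetical sketch based on the attribute names discussed above, not part of this PR):

# Possible follow-up bump in all-packages.nix:
cudatoolkit_11 = cudatoolkit_11_4;   # later cudatoolkit_11_5, once pytorch 1.11 lands
cudatoolkit = cudatoolkit_11;        # default CUDA moves off the 10.x series
cudnn = cudnn_8_3_cudatoolkit_11;    # matching cuDNN major update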

@SomeoneSerge (Contributor)

I thought the major bumps would allow us to drop some of the changes and reduce the size of the PR... but actually, merging the asserts and fixing pytorch+cupy before cleaning up sounds tempting.

@samuela (Member, Author) commented Apr 3, 2022

> I thought the major bumps would allow us to drop some of the changes and reduce the size of the PR... but actually, merging the asserts and fixing pytorch+cupy before cleaning up sounds tempting.

Ah I see what you mean. Perhaps that is true. But I suspect that until we get pytorch v1.11 we won't be able to get out of specifying custom cudatoolkit versions since I don't think v1.10.2 supports cudatoolkit = cudatoolkit_11 = cudatoolkit_11_5. I'm not entirely sure though.

@SomeoneSerge (Contributor)

> since I don't think v1.10.2 supports cudatoolkit = cudatoolkit_11 = cudatoolkit_11_5. I'm not entirely sure though.

This is why I was talking about a downgrade to 11.4

@samuela (Member, Author) commented Apr 3, 2022

Ohh, I see. Sorry, I misunderstood!

@SomeoneSerge (Contributor) left a comment

Overall, I feel OK merging this now and immediately following up with a clean-up PR that adjusts the cudnn and cudatoolkit attributes so as to minimize the total number of ad hoc parameters passed to callPackages: the first step makes the nixpkgs attributes buildable, the second fixes the overlays UX. That seems like a reasonable allocation of priorities.

@@ -21,7 +21,7 @@ stdenv.mkDerivation {

   src = fetchurl {
     url = "https://developer.download.nvidia.com/compute/cutensor/${mostOfVersion}/local_installers/libcutensor-${stdenv.hostPlatform.parsed.kernel.name}-${stdenv.hostPlatform.parsed.cpu.name}-${version}.tar.gz";
-    inherit sha256;
+    inherit hash;
Contributor:

Why hash vs sha256?

@samuela (Member, Author):

IIUC hash is preferred to sha256 as it is more future-proof, but I don't remember the reference for that.

@SomeoneSerge (Contributor) commented Apr 3, 2022

# fetchurl/default.nix

, # SRI hash.
  hash ? ""

, # Legacy ways of specifying the hash.
  outputHash ? ""
, outputHashAlgo ? ""
, md5 ? ""
, sha1 ? ""
, sha256 ? ""
, sha512 ? ""

Hmm, I guess

Member:

Yep, we are switching everywhere (though slowly) to SRI hashes.
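
For anyone wondering what the switch looks like in practice, the same fetch can be written either way; the SRI form carries the algorithm inside the value, which is why it is considered more future-proof (sketch with a placeholder URL and fake hashes):

with import <nixpkgs> { };

{
  # Legacy style: the hash algorithm is implied by the attribute name.
  legacy = fetchurl {
    url = "https://example.org/archive.tar.xz";  # placeholder URL
    sha256 = lib.fakeSha256;                     # placeholder value
  };

  # SRI style: a single `hash` attribute whose value starts with "sha256-".
  sri = fetchurl {
    url = "https://example.org/archive.tar.xz";  # placeholder URL
    hash = lib.fakeHash;                         # placeholder value
  };
}

(The fake hashes mean this only evaluates; an actual build would of course fail until real hashes are substituted.)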


Successfully merging this pull request may close these issues.

python3Packages.pytorchWithCuda build failure on x86_64-linux as of 3058e76a
4 participants