
fix pytorchWithCuda, fix cupy, upgrade cudnn #166784

Merged · 13 commits · Apr 3, 2022

Conversation

@samuela (Member) commented Apr 1, 2022

Description of changes

This PR accomplishes a few separate but related tasks: fixing pytorchWithCuda, fixing cupy, and upgrading cudnn.

Although I try to keep PRs as small as possible, in this case it was much more straightforward to make these changes together than separately. I understand that this increases the burden on reviewers to some extent, but I tried to keep each commit as small as possible to ease the process.

This PR is the equivalent of #166533 but targeting master instead of staging.

According to nixpkgs-review, this affects the packages pytorchWithCuda, Theano, and cupy. I've confirmed that all of them build successfully with nix-build. For some reason nixpkgs-review wip doesn't work in this case, but everything seems to build just fine.

Fixes #161843, #166403, and possibly others.

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.05 Release Notes (or backporting 21.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

Comment on lines +87 to +104
cudnn_8_3_cudatoolkit_10_2 = generic rec {
  version = "8.3.2";
  cudatoolkit = cudatoolkit_10_2;
  # See https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-832/support-matrix/index.html#cudnn-cuda-hardware-versions.
  minCudaVersion = "10.2.00000";
  maxCudaVersion = "11.5.99999";
  mkSrc = cudatoolkit:
    let v = if lib.versions.majorMinor cudatoolkit.version == "10.2" then "10.2" else "11.5"; in
    fetchurl {
      # Starting at version 8.3.1 there's a new directory layout including
      # a subdirectory `local_installers`.
      url = "https://developer.download.nvidia.com/compute/redist/cudnn/v${version}/local_installers/${v}/cudnn-linux-x86_64-8.3.2.44_cuda${v}-archive.tar.xz";
      hash = {
        "10.2" = "sha256-1vVu+cqM+PketzIQumw9ykm6REbBZhv6/lXB7EC2aaw=";
        "11.5" = "sha256-VQCVPAjF5dHd3P2iNPnvvdzb5DpTsm3AqCxyP6FwxFc=";
      }."${v}";
    };
};
@samuela (Member, Author):

This change is purely a matter of formatting.

Comment on lines +8 to +10
assert cudnn.cudatoolkit == cudatoolkit;
assert cutensor.cudatoolkit == cudatoolkit;
assert nccl.cudatoolkit == cudatoolkit;
@samuela (Member, Author):

It was discovered here that cupy was accidentally pulling in multiple cudatoolkit versions. These asserts should prevent that issue going forward.
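
As a rough illustration (a simplified, standalone sketch rather than the actual cupy expression), the pattern turns a toolkit mismatch into an evaluation-time error:

# sketch.nix -- toy stand-ins for the real derivations
let
  cudatoolkit_10_2 = { name = "cudatoolkit-10.2"; };
  cudatoolkit_11_5 = { name = "cudatoolkit-11.5"; };
  cudnn = { name = "cudnn-8.3.2"; cudatoolkit = cudatoolkit_10_2; };

  # Analogue of the asserts above: refuse to combine a cudnn that was
  # built against a different cudatoolkit than the one passed in.
  mkCupy = { cudatoolkit, cudnn }:
    assert cudnn.cudatoolkit == cudatoolkit;
    { name = "cupy"; inherit cudatoolkit cudnn; };
in
mkCupy { cudatoolkit = cudatoolkit_11_5; inherit cudnn; }

Evaluating this with nix-instantiate --eval --strict sketch.nix aborts with an assertion failure, which is the behaviour the new asserts give cupy when someone overrides only cudatoolkit.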

Comment on lines +41 to +43
passthru = {
inherit cudatoolkit;
};
@samuela (Member, Author):

This is necessary for the assert in cupy/default.nix.
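
Put differently, passthru is what makes the toolkit reachable from outside the derivation. A minimal sketch (toy names, not the actual cudnn expression):

with import <nixpkgs> { };

# Toy stand-in for cudnn: passthru attaches extra attributes to the
# resulting derivation without affecting the build itself.
(stdenv.mkDerivation {
  name = "cudnn-example";
  dontUnpack = true;
  installPhase = "touch $out";
  passthru = { cudatoolkit = hello; };  # placeholder for a real cudatoolkit
}).cudatoolkit.name
# => name of the placeholder package; this attribute is what the cupy assert reads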

Contributor:

👍🏻

@SomeoneSerge (Contributor)

> It might be an idea to use passthru
> ... another option is to make a cudaPackages for each cuda version

Indeed, we've been considering both: the first as a temporary fix, and the latter as a longer-term solution.
In this PR the passthru is used for asserts, which should already somewhat improve the experience.
Note that just dropping the cudatoolkit argument in e.g. pytorch/default.nix and using the passthru does not eliminate the need for asserts, because magma, nccl, and cudnn all bring in their own cudatoolkit and these need to be consistent.
The cudaPackages set is probably the best UX we can get at the moment.

> For example, having cudnn in the passthru of cudatoolkit

Hmm, I see. Currently it's the other way around (cudnn has cudatoolkit in its passthru). Also, there are several releases of cudnn in nixpkgs. Again, cudaPackages seems to be the solution.

@samuela (Member, Author) commented Apr 2, 2022

> there are old related problems (overriding cuda at the user site and forgetting to override cudnn) which we need to address, but don't need to address in this PR

> It might be an idea to use passthru. For example, having cudnn in the passthru of cudatoolkit and add cudatoolkit to the passthru of packages using it. Another option is to make a cudaPackages set for each cuda version that contains cudnn and others and add that in the passthru of the cudatoolkit package. Similar to what we do with say Python.

On the topic of a cudaPackages package set, check out the proposal here: #163704
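
For reference, a rough sketch of what such a per-version set could look like (hypothetical attribute names and paths, written against the usual all-packages.nix scope; the actual design is discussed in #163704):

# Hypothetical shape of a cudaPackages set, not what nixpkgs ships today:
# one scope per CUDA version, so cudnn/nccl/etc. are guaranteed to agree
# on the toolkit they were built against.
cudaPackages_11_4 = lib.makeScope newScope (self: {
  cudatoolkit = cudatoolkit_11_4;
  cudnn = self.callPackage ../development/libraries/science/math/cudnn { };
  nccl = self.callPackage ../development/libraries/science/math/nccl { };
});

# Consumers would then take a single cudaPackages argument:
#   { cudaPackages, ... }: ... cudaPackages.cudatoolkit ... cudaPackages.cudnn ...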

Re passthru: I think that's absolutely the way to go for the time being. This PR is already rather beefy, so I'd prefer to accomplish that separately, but I totally agree that we ought to do so.

@samuela (Member, Author) commented Apr 2, 2022

@FRidh @SomeoneSerge Created an issue to track the torchvision cudatoolkit discrepancy: #166948

@samuela (Member, Author) commented Apr 2, 2022

If no one objects, I'll go ahead and merge tomorrow

@SomeoneSerge (Contributor) commented Apr 3, 2022

python39Packages.caffe used to succeed and now fails with

caffe> -- Found cuDNN: ver. ??? found (include: /nix/store/jrj8x3764iq0iz2gqa1pdv24l08ihl6k-cudatoolkit-10.2-cudnn-8.3.2/include, library: /nix/store/jrj8x3764iq0iz2gqa1pdv24l08ihl6k-cudatoolkit-10.2-cudnn-8.3.2/lib/libcudnn.so)
caffe> CMake Error at cmake/Cuda.cmake:221 (message):
caffe>   cuDNN version >3 is required.

EDIT: tested like below

with (import ./. {
  config.allowUnfree = true;
  config.cudaSupport = true;
});

mkShell {
  buildInputs = [
    python310Packages.caffe
  ];
}

or like this:

$ nix build github:SomeoneSerge/nixpkgs-unfree/e244a94864998f2b3edd927a6b4b4680718f406b#python39Packages.caffe

@samuela (Member, Author) commented Apr 3, 2022

> python39Packages.caffe used to succeed and now fails with

Hmm I'm not seeing this:

ubuntu in threedoo in 172.30.0.108 nixpkgs on  samuela/cudnn3 took 16s
❯ nix-build --max-jobs 64 -A python3Packages.caffe
/nix/store/bbca44xgdj1nl19lzk70kni0hjdppagf-caffe-1.0-bin

@SomeoneSerge (Contributor)

Relaying a conversation from #cuda:matrix.org, there's an alternative pathway.
Instead of keeping cudatoolkit = cudatoolkit_10 and passing custom cuda, cudnn, magma and nccl to pytorch, we can do:

  • Major updates for both cuda and cudnn in this PR, fixing pytorch and cupy

    In particular:

    • cudatoolkit = cudatoolkit_11 = cudatoolkit_11_4
    • cudnn = cudnn_8_3_cudatoolkit_11

    There's a jobset running for this combination: 143. But I've already verified that the current pytorch (1.10) builds with cuda 11.3 and 11.4 (reminder: it has been failing with 11.5 and, apparently, 11.1; the EmbeddingBag.cu error)

    We revert passing additional arguments in the call to the pytorch derivation (cudatoolkit, ..., nccl), simplifying the PR and keeping the simple interface for overlays. We would otherwise still do this later anyway, because pytorch 1.11 wouldn't require overriding.

  • When pytorch 1.11 (now in staging) reaches master, we set cudatoolkit_11 = cudatoolkit_11_5, which is a minor update from 11.4 to 11.5 (context: the new pytorch doesn't have trouble building with cuda 11.5)

I think in this case there will be exactly one package left that still uses cuda 10 (seems to be libtensorflow-bin)

I'm not opposed to either solution. I prefer this alternative

@SomeoneSerge

This comment was marked as resolved.

@SomeoneSerge

This comment was marked as resolved.

The current version of caffe, v1.0, does not build with any later version of cuDNN, and CUDA 10.1 is the latest CUDA supported by cuDNN 7.6.

@samuela (Member, Author) commented Apr 3, 2022

@SomeoneSerge I added a caffeWithCuda alias in a60ab2a and it's building successfully now. I didn't change anything with caffe2 as that is marked as broken right now. So now everything is building successfully as far as I'm aware.
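
For readers following along, the alias presumably looks something like the following (a guess at the shape; see commit a60ab2a for what was actually added, and note the attribute names here are hypothetical):

# all-packages.nix-style alias (sketch only):
caffeWithCuda = caffe.override {
  cudaSupport = true;
  # caffe 1.0 needs an older cuDNN/CUDA pairing (cuDNN 7.6 / CUDA 10.1),
  # per the discussion above.
  cudnn = cudnn_7_6_cudatoolkit_10_1;
  cudatoolkit = cudatoolkit_10_1;
};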

@samuela (Member, Author) commented Apr 3, 2022

> Major updates for both cuda and cudnn in this PR, fixing pytorch and cupy

I'm in favor of doing this, although I hesitate to include those changes in this PR as it has gotten so huge already. What if we merged this one and then followed up immediately after with a PR that bumps the cudatoolkit and cudatoolkit_11 versions? I believe that would get us the same end result but with smaller steps along the way.
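
Concretely, that follow-up would mostly be a change to the top-level aliases, along the lines of (hypothetical sketch based on the attribute names discussed above, not part of this PR):

# Possible follow-up bump in all-packages.nix:
cudatoolkit_11 = cudatoolkit_11_4;   # later cudatoolkit_11_5, once pytorch 1.11 lands
cudatoolkit = cudatoolkit_11;        # default CUDA moves off the 10.x series
cudnn = cudnn_8_3_cudatoolkit_11;    # matching cuDNN major update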

@SomeoneSerge (Contributor)

I thought the major bumps would allow us to drop some of the changes and reduce the size of the PR... but actually, merging the asserts and fixing pytorch+cupy before cleaning up sounds tempting.

@samuela (Member, Author) commented Apr 3, 2022

> I thought the major bumps would allow us to drop some of the changes and reduce the size of the PR... but actually, merging the asserts and fixing pytorch+cupy before cleaning up sounds tempting.

Ah I see what you mean. Perhaps that is true. But I suspect that until we get pytorch v1.11 we won't be able to get out of specifying custom cudatoolkit versions since I don't think v1.10.2 supports cudatoolkit = cudatoolkit_11 = cudatoolkit_11_5. I'm not entirely sure though.

@SomeoneSerge (Contributor)

> since I don't think v1.10.2 supports cudatoolkit = cudatoolkit_11 = cudatoolkit_11_5. I'm not entirely sure though.

This is why I was talking about a downgrade to 11.4

@samuela (Member, Author) commented Apr 3, 2022

Ohh, I see. Sorry, I misunderstood!

@SomeoneSerge (Contributor) left a comment

Overall, I feel OK merging this now and immediately following up with a clean-up PR that adjusts the cudnn and cudatoolkit attributes so as to minimize the total number of ad hoc parameters passed to callPackages: the first step makes the nixpkgs attributes buildable, the second fixes the overlays UX. That seems like a reasonable allocation of priorities.

@@ -21,7 +21,7 @@ stdenv.mkDerivation {

   src = fetchurl {
     url = "https://developer.download.nvidia.com/compute/cutensor/${mostOfVersion}/local_installers/libcutensor-${stdenv.hostPlatform.parsed.kernel.name}-${stdenv.hostPlatform.parsed.cpu.name}-${version}.tar.gz";
-    inherit sha256;
+    inherit hash;
Contributor:

Why hash vs sha256?

@samuela (Member, Author):

IIUC hash is preferred to sha256 as it is more future-proof, but I don't remember the reference for that.

@SomeoneSerge (Contributor) commented Apr 3, 2022

# fetchurl/default.nix

, # SRI hash.
  hash ? ""

, # Legacy ways of specifying the hash.
  outputHash ? ""
, outputHashAlgo ? ""
, md5 ? ""
, sha1 ? ""
, sha256 ? ""
, sha512 ? ""

Hmm, I guess

Member:

Yep, we are switching everywhere (though slowly) to SRI hashes.
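
For anyone wondering what the switch looks like in practice, the same fetch can be written either way; the SRI form carries the algorithm inside the value, which is why it is considered more future-proof (sketch with a placeholder URL and fake hashes):

with import <nixpkgs> { };

{
  # Legacy style: the hash algorithm is implied by the attribute name.
  legacy = fetchurl {
    url = "https://example.org/archive.tar.xz";  # placeholder URL
    sha256 = lib.fakeSha256;                     # placeholder value
  };

  # SRI style: a single `hash` attribute whose value starts with "sha256-".
  sri = fetchurl {
    url = "https://example.org/archive.tar.xz";  # placeholder URL
    hash = lib.fakeHash;                         # placeholder value
  };
}

(The fake hashes mean this only evaluates; an actual build would of course fail until real hashes are substituted.)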


Successfully merging this pull request may close these issues.

python3Packages.pytorchWithCuda build failure on x86_64-linux as of 3058e76a
4 participants