
python3Packages.pytorchWithCuda: consistent cuda #166734

Closed

Conversation

@SomeoneSerge (Contributor) commented Apr 1, 2022

Pytorch has been using (through unversioned attributes) cudatoolkit=10.2 (10.1 in nixpkgs-unstable) and cudnn_7_6_cudatoolkit_10_1; the older 10.1 took priority. Pytorch 1.10 requires at least 10.2, which led to a build failure: #166403

Description of changes
  • Calling pytorch with consistent cudnn and cudatoolkit versions; a minimal sketch follows this list.
  • Choosing to use cudnn_8_3_cudatoolkit_11 (cuda 11.5, at the moment) because this is what I expect the unversioned cuda* attributes to point at after cudnn: 7.6.5 -> 8.3.2 #166533 lands (we might be able to re-use the cache later).
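
A minimal sketch of the pinning (attribute names taken from the diff reviewed below; the remaining callPackage arguments are elided, and rec here only illustrates the recursive binding at the real call site):

  pytorch = callPackage ../development/python-modules/pytorch rec {
    # Pin cudnn first, then derive cudatoolkit from it, so the two
    # can no longer disagree:
    cudnn = pkgs.cudnn_8_3_cudatoolkit_11;
    cudatoolkit = cudnn.cudatoolkit; # cudatoolkit 11.5, at the moment
  };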

This is a hotfix to keep pytorch buildable on nixpkgs-unstable.
The longer-term solution is #166533.
This causes an additional magma build, but (a) we're going to provide a cache, and (b) once #166533 lands, the global magma will use the same version of cuda, so we can drop the ad hoc fix introduced here.
In the future, we'll want to remove these custom arguments, because they make overriding (e.g. the global cudnn/cudatoolkit) harder; see the illustrative override below.
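
For illustration only (a hypothetical user override, not part of this PR): because the call site now pins its own cudnn/cudatoolkit arguments, an overlay that changes the global pkgs.cudnn no longer reaches pytorch, and one has to override the arguments directly:

  # Hypothetical: the global pkgs.cudnn is ignored here, because the
  # call site pins its own argument; it must be overridden explicitly.
  pytorchCustom = python3Packages.pytorch.override {
    cudnn = pkgs.cudnn_8_3_cudatoolkit_11;
    cudatoolkit = pkgs.cudnn_8_3_cudatoolkit_11.cudatoolkit;
  };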

Note that this change is "local" in the sense that it should only affect the "cuda" world, which is not built by hydra but is tested by @NixOS/cuda-maintainers. It might still affect cuda-enabled pytorch on darwin, because we currently have no capability to test on darwin.

This can be merged if 113 has fewer failures than 97, in particular for pytorch. This should fix the configurePhase, but the later build, apparently, is still going to fail: #161843; that should be addressed in a separate PR. Or maybe I should call this off and wait for the cudnn PR to get merged - I'll sleep on this thought and wait for the build.

Things done
  • Built on platform(s)
    • x86_64-linux (waiting)
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.05 Release Notes (or backporting 21.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
  • Fits CONTRIBUTING.md.

cc @NixOS/cuda-maintainers
cc @tscholak about cuda-enabled pytorch on darwin

@samuela (Member) commented Apr 1, 2022

I was just about to submit a PR with

  pytorch = callPackage ../development/python-modules/pytorch {
    cudaSupport = pkgs.config.cudaSupport or false;

    # TODO: next time pytorch is updated (to 1.11.0, currently in staging as of
    # 2022-03-31), make the following changes:

    # -> cudatoolkit_11
    cudatoolkit = pkgs.cudatoolkit_10;

    # -> cudnn_8_3_cudatoolkit_11
    cudnn = pkgs.cudnn_8_1_cudatoolkit_10;

    # -> cutensor_cudatoolkit_11 (cutensor is a new dependency in v1.11.0)
    # cutensor = pkgs.cutensor_cudatoolkit_11;

    # -> setting a custom magma should be unnecessary with v1.11.0
    magma = pkgs.magma.override { cudatoolkit = pkgs.cudatoolkit_10; };

    # -> nccl_cudatoolkit_11
    nccl = pkgs.nccl.override { cudatoolkit = pkgs.cudatoolkit_10; };
  };

which also builds successfully and has the advantage of only relying on a single cudatoolkit version. I'm happy to merge either solution as this will all be obviated once staging gets merged into master.
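
One way to check that only a single cudatoolkit ends up in the runtime closure (assuming the build result is at ./result):

  $ nix-store -qR ./result | grep -i cudatoolkit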

Comment on lines +8365 to +8366
cudnn = pkgs.cudnn_8_3_cudatoolkit_11;
cudatoolkit = cudnn.cudatoolkit;
Member commented:
IIUC pkgs.cudnn_8_3_cudatoolkit_11 will pull in cudatoolkit_11_5 and cudnn.cudatoolkit will pull in cudatoolkit_10_1. Just as a heads up.

This is still strictly better than broken, but worth noting that it still pulls in multiple CUDA versions, AFAIU.

@SomeoneSerge (Contributor, Author) commented:

Wait, what o_0

Member commented:
IIUC cudnn currently points to cudnn_7_6_cudatoolkit_10_1

Member commented:
Check out the diff in #166784

Member commented:
oh wait... maybe I'm being an idiot. I guess this is like a let-rec binding so cudnn refers to the definition just prior. ok in that case nvm!

@SomeoneSerge (Contributor, Author) commented:
😓 I thought the same for a moment, but:

nix-repl> pkgs.cudnn_8_3_cudatoolkit_11.cudatoolkit.version
"11.5.0"

@SomeoneSerge (Contributor, Author) commented:

Fixed with #167016
