Skip to content

python3Packages.pytorch: migrate to cudaPackages#168745

Closed
SomeoneSerge wants to merge 8 commits intoNixOS:masterfrom
SomeoneSerge:pytorch-redist-cuda
Closed

python3Packages.pytorch: migrate to cudaPackages#168745
SomeoneSerge wants to merge 8 commits intoNixOS:masterfrom
SomeoneSerge:pytorch-redist-cuda

Conversation

@SomeoneSerge
Copy link
Contributor

@SomeoneSerge SomeoneSerge commented Apr 15, 2022

Use the redistributable cuda packages, instead of runfile-based
cudatoolkit

Build log: https://gist.github.com/SomeoneSerge/4e5b5e9d9ee82c410eebad60d0cdc8f7

All went awry, it used cudatoolkit through cudnn. Reworking

EDIT later on 2022-04-15: Now it actually builds with cuda-redist! Rebased on the cudnn PR with removed propagatedBuildInputs, adjusted, and pushed

CC @NixOS/cuda-maintainers

Description of changes
Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.05 Release Notes (or backporting 21.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

@github-actions github-actions bot added the 6.topic: python Python is a high-level, general-purpose programming language. label Apr 15, 2022
@SomeoneSerge SomeoneSerge added the 6.topic: cuda Parallel computing platform and API label Apr 15, 2022
@ofborg ofborg bot requested review from teh, thoughtpolice and tscholak April 15, 2022 03:17
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. labels Apr 15, 2022
Copy link
Member

@samuela samuela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hell yeah! this looks dope. does it build and everything?

@FRidh
Copy link
Member

FRidh commented Apr 15, 2022

cudnn using libcublas instead of cudatoolkit if possible #168755

@SomeoneSerge SomeoneSerge marked this pull request as ready for review April 15, 2022 18:13
@SomeoneSerge
Copy link
Contributor Author

2022-04-15_21-12

Hooray!

@SomeoneSerge
Copy link
Contributor Author

@FRidh same point arises: this assumes redist packages are present unconditionally, i.e. it only works for cuda>=11. I think we don't have to support cuda10 in downstream derivations

@ofborg ofborg bot added 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. and removed 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. labels Apr 15, 2022
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. and removed 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. labels Apr 19, 2022
@SomeoneSerge
Copy link
Contributor Author

I'm not sure I'm ready for this to be merged:

@SomeoneSerge
Copy link
Contributor Author

Now magma uses redist packages too, reducing the reported closure size of pytorch from 12.75GiB to 10.15GiB (which is still too much, imo)
Kineto/cupti issue resolved just with adding libcupti to the closure

@SomeoneSerge SomeoneSerge marked this pull request as ready for review April 23, 2022 11:51
@SomeoneSerge
Copy link
Contributor Author

...note that this change doesn't support building magma with cuda10 (pre-cuda11.4)

We could account for those as well, but I thought it's ok to ignore because magma is outside cudaPackages namespace

@ofborg ofborg bot requested a review from tbenst April 23, 2022 12:09
@ofborg ofborg bot added 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-darwin: 1 This PR causes 1 package to rebuild on Darwin. and removed 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. labels Apr 23, 2022
@SomeoneSerge
Copy link
Contributor Author

Interesting that magma has platforms = unix, but neither cudatoolkit nor redist cuda packages are marked with darwin support

@samuela
Copy link
Member

samuela commented Apr 24, 2022

Interesting that magma has platforms = unix, but neither cudatoolkit nor redist cuda packages are marked with darwin support

Although I think it may technically be possible to hook up NVIDIA hardware to a macOS system, I've never heard of anyone actually doing this... except maybe for kicks?

@SomeoneSerge
Copy link
Contributor Author

RE: magma.cudaPackages == cudaPackages

I've actually just noticed a bug with the original expression.
Evaluating passthru attributes of pytorchWithCuda on master1 triggers evaluation of cudatoolkit, resulting in an error:

❯ nix eval --impure --expr '((builtins.getFlake github:NixOS/nixpkgs/master).legacyPackages.x86_64-linux.python3Packages.pytorch).cudaSupport'
false
❯ nix eval --impure --expr '((builtins.getFlake github:NixOS/nixpkgs/master).legacyPackages.x86_64-linux.python3Packages.pytorchWithCuda).cudaSupport'
error: Package ‘cudatoolkit-11.6.1’ ... has an unfree license (‘unfree’), refusing to evaluate

This happens because of the following line in master:

assert !cudaSupport || magma.cudatoolkit == cudatoolkit;

So, just to verify, in this PR we don't seem to be repeating that same error, although the whole thing kind of smells

❯ nix eval --impure --expr '((builtins.getFlake (toString ./.)).legacyPackages.x86_64-linux.python3Packages.pytorch).cudaSupport'
false
❯ nix eval --impure --expr '((builtins.getFlake (toString ./.)).legacyPackages.x86_64-linux.python3Packages.pytorchWithCuda).cudaSupport'
true

Footnotes

  1. master is at 97eb40e0f8a9ec4e7306645dfc446f3cb4b9eed4 at the moment

@SomeoneSerge
Copy link
Contributor Author

I suspect we might still run into similar evaluation issues if we make an overlay resulting in lack of pointer-equality 🤔

@SomeoneSerge
Copy link
Contributor Author

pytorch = callPackage ../development/python-modules/pytorch {
    # ...
    cudaPackages = pkgs.cudaPackages.overrideScope' (final: prev: { });
}

Then pytorchWithCuda.cudaSupport results in

error: assertion '(cudaSupport -> ((magma).cudaPackages == cudaPackages))' failed

...and pytorch.cudaSupport evaluates into false

@FRidh
Copy link
Member

FRidh commented Apr 24, 2022

Might want to write a function that takes a list of sets and a list of packages and tests for every package whether it is equal in every set.

@SomeoneSerge
Copy link
Contributor Author

Just one thing I don't like about this is that the user cannot disable those asserts downstream, but I can live with that

@SomeoneSerge
Copy link
Contributor Author

@FRidh

Might want to write a function that takes a list of sets and a list of packages and tests for every package whether it is equal in every set.

  1. I have a feeling that referential equality is somehow more appropriate? Except we're relying on nix's internals
  2. If we compare the packages, we'd be at risk of getting that refusing to evaluate error again 🤔

@SomeoneSerge
Copy link
Contributor Author

We definitely should write our own equality function just for isolation, regardless of whether it falls back to == or tests only packages. This is out of scope of the PR, however

assert !cudaSupport || magma.cudatoolkit == cudatoolkit;
# We expect referential equality of all cudaPackages used to ensure consistency
# You can make an overlay and pass the same cudaPackages to pytorch, mpi, and magma
# TODO: `==` is an implementation detail; move comparison logic to cudaPackages
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Oct 30, 2022
@ConnorBaker
Copy link
Contributor

@SomeoneSerge if you don't mind, I'd like to take this up and continue where you left off.

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Feb 24, 2023
@SomeoneSerge
Copy link
Contributor Author

@ConnorBaker this would be just great, please do!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

6.topic: cuda Parallel computing platform and API 6.topic: python Python is a high-level, general-purpose programming language. 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-darwin: 1 This PR causes 1 package to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants