python3Packages.torch: migrate to CUDA redist from CUDA Toolkit#249259
python3Packages.torch: migrate to CUDA redist from CUDA Toolkit#249259ConnorBaker merged 1 commit intoNixOS:masterfrom
Conversation
ada378d to
f81594e
Compare
f81594e to
99b4235
Compare
|
Rebased and force-pushed. |
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/cuda-team-roadmap-update-2023-08-29/32379/1 |
0b1103d to
f9ceb36
Compare
I see what you did there :) |
|
@samuela @SomeoneSerge if either of you have any ideas on how to fix some of the to-dos, I’m all ears! |
|
Thanks to @jmillerpdt for pointing out Gloo's NCCL detection being a non-issue because of how limited Gloo/MPI's support for GPUs is! That's also going to save me some headaches on #239291. |
f9ceb36 to
ab915e5
Compare
|
Rebased, updated, and force-pushed. Regenerated the path-info outputs for the rebased branch and including the change to Gloo/MPI. |
|
Split one TODO into a new issue: #253867 |
ab915e5 to
1e9c682
Compare
|
Re kineto: does |
samuela
left a comment
There was a problem hiding this comment.
Thanks for putting this together @ConnorBaker ! This is a big step forward for our pytorch packaging imho
just one small request on my side
bb23778 to
2e5eca4
Compare
There was a problem hiding this comment.
Hm. Why does it need the LD_LIBRARY_PATH anymore?
There was a problem hiding this comment.
Setting doCheck = true;, it still works 🤷♂️
Should it not?
There was a problem hiding this comment.
I guess this makes sense, I'm not sure why the stub was ever needed (maybe some of the binaries had libcuda.so in DT_NEEDED, and now they only do dlopen?)
Good riddance then
|
@ConnorBaker @samuela sooo much appreciated, I can't wait to see this merged! @ConnorBaker I just posted a bunch of stupid questions that popped up as I skimmed through the diff, but I don't see any blockers |
2e5eca4 to
ea939b9
Compare
|
Alrighty! @samuela @SomeoneSerge would you mind taking one last look? @samuela it still shows as merging being blocked due to requested changes, but the requested changes don't show up. I imagine that's because I force-pushed them? |
samuela
left a comment
There was a problem hiding this comment.
besides resolving this comment, i think this is ready to go
|
Well this is disappointing: I wanted something that would allow us to keep the messages produced by our asserts without necessarily breaking evaluation of the derivation, but it appears that's what Consider this: {
changelog = "https://github.com/pytorch/pytorch/releases/tag/v${version}";
# keep PyTorch in the description so the package can be found under that name on search.nixos.org
description = "PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration";
homepage = "https://pytorch.org/";
license = licenses.bsd3;
maintainers = with maintainers; [ teh thoughtpolice tscholak ]; # tscholak esp. for darwin-related builds
platforms = with platforms; linux ++ lib.optionals (!cudaSupport && !rocmSupport) darwin;
broken =
let
# A better variant of trivial.warnIf which takes a boolean, prints a diagnostic if the boolean is true, and returns the boolean.
# NOTE: A smarter person than I can figure out a way to use this as the predicate function instead of trivial.id.
# Look at all those `warnIf`, just waiting to be factored out! Unfortuantely, I couldn't find an easy way to curry.
# warnIf :: String -> Bool -> Bool
warnIf = message: condition: trivial.warnIf condition message condition;
# conditions :: AttrSet String Bool
conditions = {
"CUDA and ROCm are mutually exclusive" = cudaSupport && rocmSupport;
"CUDA is only supported on Linux" = cudaSupport && !stdenv.isLinux;
"Only CUDA 11 is currently supported" = cudaSupport && (cudaPackages.cudaMajorVersion != "11");
"MPI cudatoolkit does not match cudaPackages.cudatoolkit" = MPISupport && cudaSupport && (mpi.cudatoolkit != cudaPackages.cudatoolkit);
"Magma cudaPackages does not match cudaPackages" = cudaSupport && (magma.cudaPackages != cudaPackages);
};
# checked :: List Bool
checked = attrsets.mapAttrsToList warnIf conditions;
# isBroken :: Bool
isBroken = builtins.any trivial.id checked;
in
isBroken;
}I thought it would print warnings stating what the problem is and mark the derivation as broken. It does not. All we see is the usual I think it's due to the way Anyway I'll drop the messages and commit shortly. |
|
@SomeoneSerge when I tried {
changelog = "https://github.com/pytorch/pytorch/releases/tag/v${version}";
# keep PyTorch in the description so the package can be found under that name on search.nixos.org
description = "PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration";
homepage = "https://pytorch.org/";
license = licenses.bsd3;
maintainers = with maintainers; [ teh thoughtpolice tscholak ]; # tscholak esp. for darwin-related builds
platforms = with platforms; linux ++ lib.optionals (!cudaSupport && !rocmSupport) darwin;
broken = builtins.any trivial.id [
# CUDA and ROCm are mutually exclusive
(cudaSupport && rocmSupport)
# CUDA is only supported on Linux
(cudaSupport && !stdenv.isLinux)
# Only CUDA 11 is currently supported
(cudaSupport && (cudaPackages.cudaMajorVersion != "11"))
# MPI cudatoolkit does not match cudaPackages.cudatoolkit
(MPISupport && cudaSupport && (mpi.cudatoolkit != cudaPackages.cudatoolkit))
# Magma cudaPackages does not match cudaPackages
(cudaSupport && (magma.cudaPackages != cudaPackages))
];
}and set changed magma-cuda-static = magma-cuda.override {
static = true;
cudaPackages = cudaPackages_12;
};the PyTorch derivation was correctly marked as broken. I'm not sure whether it's doing a deep comparison, but for me that's good enough for now. |
ea939b9 to
b0bd194
Compare
|
Amazing work, Connor!
It does, but I suspect this is an internal detail, and I think it's fast for diverging sets and slow (linear in # attrs) for identical sets. Regardless, I think that these ad hoc tests based on ad hoc passthru attributes aren't very useful, because they won't help noticing a potential new dependency propagating a diverging version of the toolkit. I think we want systematic tests that overlook the entire build environment/the whole closure, and one way to do this today is using setup hooks.
Actually, your code does work, only the error message is a bit hard to notice (the very first line of the output): |
|
That's odd, I'm not seeing it in my output: Details[connorbaker@nixos-desktop:~/nixpkgs]$ nom build --impure .#python3Packages.torch
warning: Git tree '/home/connorbaker/nixpkgs' is dirty
error:
… while evaluating the attribute 'drvPath'
at /nix/store/mynv1irinq3kimslb75sm4kl88q5jah7-source/lib/customisation.nix:222:7:
221| in commonAttrs // {
222| drvPath = assert condition; drv.drvPath;
| ^
223| outPath = assert condition; drv.outPath;
… while evaluating the attribute 'drvPath'
at /nix/store/mynv1irinq3kimslb75sm4kl88q5jah7-source/lib/customisation.nix:222:7:
221| in commonAttrs // {
222| drvPath = assert condition; drv.drvPath;
| ^
223| outPath = assert condition; drv.outPath;
(stack trace truncated; use '--show-trace' to show the full trace)
error: Package ‘python3.10-torch-2.0.1’ in /nix/store/mynv1irinq3kimslb75sm4kl88q5jah7-source/pkgs/development/python-modules/torch/default.nix:439 is marked as broken, refusing to evaluate.
a) To temporarily allow broken packages, you can use an environment variable
for a single invocation of the nix tools.
$ export NIXPKGS_ALLOW_BROKEN=1
Note: For `nix shell`, `nix build`, `nix develop` or any other Nix 2.4+
(Flake) command, `--impure` must be passed in order to read this
environment variable.
b) For `nixos-rebuild` you can set
{ nixpkgs.config.allowBroken = true; }
in configuration.nix to override this.
c) For `nix-env`, `nix-build`, `nix-shell` or any other Nix command you can add
{ allowBroken = true; }
to ~/.config/nixpkgs/config.nix.
┏━ 1 Errors:
┃ error:
┃ … while evaluating the attribute 'drvPath'
┃
┃ at /nix/store/mynv1irinq3kimslb75sm4kl88q5jah7-source/lib/customisation.nix:222:7:
┃
┃ 221| in commonAttrs // {
┃ 222| drvPath = assert condition; drv.drvPath;
┃ | ^
┃ 223| outPath = assert condition; drv.outPath;
┃
┃ … while evaluating the attribute 'drvPath'
┃
┃ at /nix/store/mynv1irinq3kimslb75sm4kl88q5jah7-source/lib/customisation.nix:222:7:
┃
┃ 221| in commonAttrs // {
┃ 222| drvPath = assert condition; drv.drvPath;
┃ | ^
┃ 223| outPath = assert condition; drv.outPath;
┃
┃ (stack trace truncated; use '--show-trace' to show the full trace)
┃
┃ error: Package ‘python3.10-torch-2.0.1’ in /nix/store/mynv1irinq3kimslb75sm4kl88q5jah7-source/pkgs/development/python-modules/torch/default.nix:439 is marke…
┃
┃ a) To temporarily allow broken packages, you can use an environment variable
┃ for a single invocation of the nix tools.
┃
┃ $ export NIXPKGS_ALLOW_BROKEN=1
┃
┃ Note: For `nix shell`, `nix build`, `nix develop` or any other Nix 2.4+
┃ (Flake) command, `--impure` must be passed in order to read this
┃ environment variable.
┃
┃ b) For `nixos-rebuild` you can set
┃ { nixpkgs.config.allowBroken = true; }
┃ in configuration.nix to override this.
┃
┃ c) For `nix-env`, `nix-build`, `nix-shell` or any other Nix command you can add
┃ { allowBroken = true; }
┃ to ~/.config/nixpkgs/config.nix.
┣━━━
┗━ ∑︎ ⚠︎ Exited with 1 errors reported by nix at 14:02:48 after 3sEDIT: 🤦♂️ It was [connorbaker@nixos-desktop:~/nixpkgs]$ nix build --impure .#python3Packages.torch
warning: Git tree '/home/connorbaker/nixpkgs' is dirty
trace: warning: Magma cudaPackages does not match cudaPackages
error:
… while evaluating the attribute 'drvPath'
at /nix/store/mynv1irinq3kimslb75sm4kl88q5jah7-source/lib/customisation.nix:222:7:
221| in commonAttrs // {
222| drvPath = assert condition; drv.drvPath;
| ^
223| outPath = assert condition; drv.outPath;
… while evaluating the attribute 'drvPath'
at /nix/store/mynv1irinq3kimslb75sm4kl88q5jah7-source/lib/customisation.nix:222:7:
221| in commonAttrs // {
222| drvPath = assert condition; drv.drvPath;
| ^
223| outPath = assert condition; drv.outPath;
(stack trace truncated; use '--show-trace' to show the full trace)
error: Package ‘python3.10-torch-2.0.1’ in /nix/store/mynv1irinq3kimslb75sm4kl88q5jah7-source/pkgs/development/python-modules/torch/default.nix:439 is marked as broken, refusing to evaluate.
a) To temporarily allow broken packages, you can use an environment variable
for a single invocation of the nix tools.
$ export NIXPKGS_ALLOW_BROKEN=1
Note: For `nix shell`, `nix build`, `nix develop` or any other Nix 2.4+
(Flake) command, `--impure` must be passed in order to read this
environment variable.
b) For `nixos-rebuild` you can set
{ nixpkgs.config.allowBroken = true; }
in configuration.nix to override this.
c) For `nix-env`, `nix-build`, `nix-shell` or any other Nix command you can add
{ allowBroken = true; }
to ~/.config/nixpkgs/config.nix. |
| # For more, see https://github.com/open-mpi/ompi/issues/7733#issuecomment-629806195. | ||
| preConfigure = lib.optionalString cudaSupport '' | ||
| export TORCH_CUDA_ARCH_LIST="${gpuTargetString}" | ||
| export CC=${cudatoolkit.cc}/bin/gcc CXX=${cudatoolkit.cc}/bin/g++ |
There was a problem hiding this comment.
Do we refer to backendStdenv.cc somehow else now, or did we just drop it? In the latter case, I suspect things are going to break again next time nixpkgs bumps glibc
…URCE_DIR Limitation discovered in NixOS/nixpkgs#249259.
Description of changes
Carrying on the torch lit by #168745.
Numbers!
See Full details for setup and full output of
nix path-info -rsSh.Before PR (baseline):
After PR:
Full numbers
Shared
~/.config/nixpkgs/config.nixacross both:Baseline values:
PR values:
Sadly, looks like no reduction in build times. For that, we'll likely need something like #239291.
Here's my favorite part though -- if we were to throw this in a Nix binary cache using ZSTD with compression level 19:
It compresses down to 1.3G!
For reference, the download size of the latest stable copy of the equivalent PyTorch environment from Anaconda: 3GB.
Although, to be fair, the Anaconda distribution does include triton, which we package separately.
Here's how I got that number.
Unrelated, but I cannot recommend micromamba highly enough. It's very fast.
Things done
sandbox = trueset innix.conf? (See Nix manual)nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/)