python3Packages.torch: gate NCCL with cudaSupport #273594

Merged
ConnorBaker merged 1 commit into NixOS:master from ConnorBaker:fix/pytorch-fix-gate-nccl-with-cudaSupport
Dec 11, 2023

@ConnorBaker ConnorBaker commented Dec 11, 2023

Description of changes

This fixes an evaluation error on systems without CUDA support by gating NCCL behind cudaSupport. The error was originally discovered in #256324 (comment).
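
As a rough illustration of what gating a dependency on cudaSupport means, here is a minimal, self-contained sketch. It is not the diff from this PR: `fakeCudaPackages`, `mkBuildInputs`, and the `"openblas"` placeholder are hypothetical names, and the local `optionals` helper only mirrors the behaviour of nixpkgs' `lib.optionals`. The idea is that Nix evaluates lazily, so an attribute that fails to evaluate causes no error as long as nothing forces it.

```nix
# Hypothetical sketch of gating a dependency on cudaSupport.
# Evaluate with: nix-instantiate --eval --strict sketch.nix
let
  # Stand-in for a package set whose `nccl` attribute cannot be
  # evaluated on platforms without CUDA support (assumption for
  # illustration; the real failure mode may differ).
  fakeCudaPackages = {
    nccl = throw "nccl is not available on this platform";
  };

  # Local helper mirroring lib.optionals: return the list only when
  # the condition holds, otherwise an empty list.
  optionals = cond: xs: if cond then xs else [ ];

  # Build a dependency list for a given cudaSupport setting.
  mkBuildInputs = cudaSupport:
    [ "openblas" ]
    ++ optionals cudaSupport [ fakeCudaPackages.nccl ];
in
# With cudaSupport = false the throwing attribute is never forced,
# so evaluation succeeds. Strictly evaluating `mkBuildInputs true`
# would hit the throw instead.
mkBuildInputs false
```

Without the guard, referencing the failing attribute unconditionally would abort evaluation as soon as the list is forced, even on platforms that never use NCCL.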

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@ConnorBaker ConnorBaker added the 6.topic: darwin (Running or building packages on Darwin), 6.topic: cuda (Parallel computing platform and API), and backport release-23.11 labels Dec 11, 2023
@ConnorBaker ConnorBaker self-assigned this Dec 11, 2023
@github-actions github-actions bot added the 6.topic: python (Python is a high-level, general-purpose programming language.) label Dec 11, 2023
@ConnorBaker ConnorBaker requested review from Madouura, SomeoneSerge and natsukium and removed request for SomeoneSerge December 11, 2023 17:04
@ConnorBaker

@natsukium this should fix the error you were running into -- I'm able to eval now on my MacBook. Can you confirm the same?

ConnorBaker commented Dec 11, 2023

Note

Template nixpkgs-review command:

PR=273594; \
SYSTEM="aarch64-linux"; \
CUDA_SUPPORT="true"; \
CUDA_CAPABILITIES='[ "7.5" ]'; \
nixpkgs-review pr "$PR" \
  --system "$SYSTEM" \
  --no-shell \
  --checkout commit \
  --allow aliases \
  --build-args "--max-jobs 1" \
  --extra-nixpkgs-config "{
    allowUnfree = true;
    allowBroken = false;
    cudaSupport = ${CUDA_SUPPORT:-false};
    cudaCapabilities = ${CUDA_CAPABILITIES:-[]};
  }"
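
For the non-CUDA runs, note that `${CUDA_SUPPORT:-false}` and `${CUDA_CAPABILITIES:-[]}` are shell parameter expansions that fall back to the given default when the variable is unset or empty. Assuming both variables are left unset, the config passed to `--extra-nixpkgs-config` should render as the following attribute set:

```nix
# Rendered value of --extra-nixpkgs-config when CUDA_SUPPORT and
# CUDA_CAPABILITIES are unset or empty (shell defaults applied).
{
  allowUnfree = true;
  allowBroken = false;
  cudaSupport = false;
  cudaCapabilities = [];
}
```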

aarch64-darwin

In progress.

x86_64-darwin

In progress.

aarch64-linux

Jetson

Result of nixpkgs-review pr 273594 --extra-nixpkgs-config '{ allowUnfree = true; allowBroken = false; cudaSupport = true; cudaCapabilities = [ "7.2" ]; }' run on aarch64-linux

Non-Jetson

Result of nixpkgs-review pr 273594 --extra-nixpkgs-config '{ allowUnfree = true; allowBroken = false; cudaSupport = true; cudaCapabilities = [ "7.5" ]; }' run on aarch64-linux

10 packages built:
  • python310Packages.torchWithoutCuda (python310Packages.pytorchWithoutCuda)
  • python310Packages.torchWithoutCuda.cxxdev (python310Packages.pytorchWithoutCuda.cxxdev)
  • python310Packages.torchWithoutCuda.dev (python310Packages.pytorchWithoutCuda.dev)
  • python310Packages.torchWithoutCuda.dist (python310Packages.pytorchWithoutCuda.dist)
  • python310Packages.torchWithoutCuda.lib (python310Packages.pytorchWithoutCuda.lib)
  • python311Packages.torchWithoutCuda (python311Packages.pytorchWithoutCuda)
  • python311Packages.torchWithoutCuda.cxxdev (python311Packages.pytorchWithoutCuda.cxxdev)
  • python311Packages.torchWithoutCuda.dev (python311Packages.pytorchWithoutCuda.dev)
  • python311Packages.torchWithoutCuda.dist (python311Packages.pytorchWithoutCuda.dist)
  • python311Packages.torchWithoutCuda.lib (python311Packages.pytorchWithoutCuda.lib)

Non-CUDA

In progress.

x86_64-linux

Non-Jetson

In progress.

Non-CUDA

In progress.

@SomeoneSerge SomeoneSerge left a comment

I don't think we have to wait for nixpkgs-review to merge this:

❯ nix eval github:ConnorBaker/nixpkgs/fix/pytorch-fix-gate-nccl-with-cudaSupport#legacyPackages.x86_64-darwin.python3Packages.torch.outPath
"/nix/store/qwsayvz14v0x7f06g2hmd6hsdssfgbrz-python3.11-torch-2.1.1"
❯ nix eval github:ConnorBaker/nixpkgs/fix/pytorch-fix-gate-nccl-with-cudaSupport#legacyPackages.aarch64-darwin.python3Packages.torch.outPath
"/nix/store/9g147agjr26wdc2g1a6b1c1477vz5qy0-python3.11-torch-2.1.1"

Obviously, the deeper issues with cudaPackages evaluation would still have to be diagnosed and addressed.

@delroth delroth added the 12.approvals: 1 (This PR was reviewed and approved by one person.) label Dec 11, 2023
@ConnorBaker ConnorBaker merged commit 3c1873e into NixOS:master Dec 11, 2023
@ConnorBaker ConnorBaker deleted the fix/pytorch-fix-gate-nccl-with-cudaSupport branch December 11, 2023 20:24
@github-actions

Successfully created backport PR for release-23.11:

@ofborg ofborg bot added the 8.has: package (new) (This PR adds a new package) label Dec 11, 2023
@ofborg ofborg bot requested review from teh, thoughtpolice and tscholak December 11, 2023 20:36
@ofborg ofborg bot added the 10.rebuild-darwin: 101-500 (This PR causes between 101 and 500 packages to rebuild on Darwin.) and 10.rebuild-linux: 101-500 (This PR causes between 101 and 500 packages to rebuild on Linux.) labels Dec 11, 2023
@github-actions

Git push to origin failed for release-23.11 with exitcode 1

2 similar comments
