cudaPackages: build redists from manifests and add CUDA 13#437723

Merged
ConnorBaker merged 99 commits into NixOS:master from ConnorBaker:feat/cuda-packaging-refactor
Oct 27, 2025

@ConnorBaker
Contributor

@ConnorBaker ConnorBaker commented Aug 28, 2025

Important

The addition of CUDA 13 does not mean packages will suddenly work with CUDA 13. Expect breakages.

Largely based on work done in https://github.com/ConnorBaker/cuda-packages.

Big changes:

  • CUDA 13 support
  • Introduced manifests for many redistributables in _cuda.manifests
    • cublasmp
    • cuda
    • cudnn
    • cudss
    • cuquantum
    • cusolvermp
    • cusparselt
    • cutensor
    • nppplus
    • nvcomp
    • nvjpeg2000
    • nvpl
    • nvshmem
    • nvtiff
    • tensorrt (NOTE: TensorRT doesn't have manifests but there is documentation and a helper script for generating them)
  • Turned _cuda.fixups into full package expressions which exist as files, in-tree
  • Introduction of additional NVIDIA licenses in _cuda.lib.licenses
  • Updated _cuda.lib.allowUnfreeCudaPredicate to use _cuda.lib.licenses
  • Updated CUDA docs showing how to use _cuda.lib.allowUnfreeCudaPredicate by having config take an attribute set containing pkgs as an argument
  • Enabling config.cudaSupport automatically adds NVIDIA's licenses to config.allowlistedLicenses, avoiding the need to set config.allowUnfree
  • Module system evaluation as part of CUDA package set creation has been removed
  • Consistent attribute sets across CUDA package set versions as a result of using package expressions for redistributables
  • Removal of "feature" manifests (this information is now encoded directly in redistributable package expressions)
  • Introduced buildRedist helper function
  • Removed usage of overrideAttrs
  • Introduced tests.redists-unpacked and tests.redists-installed to verify downloading all sources and unpacking and installing all redistributables work as intended
  • _cuda.lib.{_mkMetaBroken,_mkMetaBadPlatforms} have been updated to use builtins.traceVerbose to avoid polluting output when doing whole-tree evaluations (e.g., during nixpkgs-review runs) and now take one argument instead of two
  • Introduced cudaPackages.tests.{cmake,cudnn-frontend,onnx-tensorrt}
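The item above about `config` taking an attribute set containing `pkgs` can be sketched as follows. This is a minimal illustration based only on the description, not a copy of the updated docs; the `<nixpkgs>` path and the `8.9` capability are assumptions chosen for the example.

```nix
# Sketch only: assumes <nixpkgs> points at a checkout that includes this PR.
import <nixpkgs> {
  # `config` is written as a function so it can refer to `pkgs`.
  config =
    { pkgs }:
    {
      # Permit unfree packages only when the CUDA predicate accepts them,
      # instead of setting a blanket `allowUnfree = true`.
      allowUnfreePredicate = pkgs._cuda.lib.allowUnfreeCudaPredicate;
      cudaSupport = true;
      # Illustrative value; pick the capabilities of the target GPU.
      cudaCapabilities = [ "8.9" ];
    };
}
```

Compared with `allowUnfree = true`, a predicate keeps unrelated unfree packages blocked, which is the motivation the description gives for the license-based helper.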

In the long run, I absolutely think that a database-backed approach like #406740 is the right way to handle package creation and generation. Unfortunately, it's not quite ready. In the interim, an expression per-redistributable package enables easily finding the relevant file in-tree (no flattening of deeply nested JSON), consistent package set members across versions (instead of adding attributes on the fly for supported systems), and avoiding runtime generation of package expressions.

Things done

  • Built on platform:
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • Tested, as applicable:
  • Ran nixpkgs-review on this PR. See nixpkgs-review usage.
  • Tested basic functionality of all binary files, usually in ./result/bin/.
  • Nixpkgs Release Notes
    • Package update: when the change is major or breaking.
  • NixOS Release Notes
    • Module addition: when adding a new NixOS module.
    • Module update: when the change is significant.
  • Fits CONTRIBUTING.md, pkgs/README.md, maintainers/README.md and other READMEs.

As always with nixpkgs-review, part of the problem is the unfree packages aren't built or cached and so they're typically rebuilt. Using allowUnfree enables many more packages to be built than just the CUDA packages.


@ConnorBaker ConnorBaker changed the title wip cudaPackages: refactor packing to build from manifests Aug 28, 2025
@ConnorBaker ConnorBaker self-assigned this Aug 28, 2025
@ConnorBaker ConnorBaker added the 6.topic: cuda Parallel computing platform and API label Aug 28, 2025
@ConnorBaker ConnorBaker moved this from New to 🏗 In progress in CUDA Team Aug 28, 2025
@nixpkgs-ci nixpkgs-ci bot added 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. 10.rebuild-darwin: 501+ This PR causes many rebuilds on Darwin and should normally target the staging branches. 10.rebuild-linux: 501-1000 This PR causes many rebuilds on Linux and should normally target the staging branches. 10.rebuild-darwin: 501-1000 This PR causes many rebuilds on Darwin and should normally target the staging branches. labels Aug 28, 2025
@nixpkgs-ci nixpkgs-ci bot added the 6.topic: python Python is a high-level, general-purpose programming language. label Aug 28, 2025
@nixpkgs-ci nixpkgs-ci bot added 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-darwin: 1 This PR causes 1 package to rebuild on Darwin. and removed 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. 10.rebuild-darwin: 501+ This PR causes many rebuilds on Darwin and should normally target the staging branches. 10.rebuild-linux: 501-1000 This PR causes many rebuilds on Linux and should normally target the staging branches. 10.rebuild-darwin: 501-1000 This PR causes many rebuilds on Darwin and should normally target the staging branches. labels Aug 28, 2025
@ConnorBaker ConnorBaker force-pushed the feat/cuda-packaging-refactor branch from b6dcba3 to c49efbc Compare August 28, 2025 23:20
@cpcloud
Contributor

cpcloud commented Aug 29, 2025

I can't seem to build anything in this PR without allowing broken packages. I have cudaSupport enabled system-wide, but I keep getting a warning about an assertion failing because of needing cudaSupport to be enabled.

@ConnorBaker ConnorBaker force-pushed the feat/cuda-packaging-refactor branch from 8569221 to f0ec09c Compare August 29, 2025 21:20
@ConnorBaker
Contributor Author

What do you mean by "system-wide"?

@ConnorBaker ConnorBaker force-pushed the feat/cuda-packaging-refactor branch from f0ec09c to 217ba0f Compare August 29, 2025 21:56
@cpcloud
Contributor

cpcloud commented Aug 29, 2025

I've set config.cudaSupport to true in my flake's nixpkgs instance.

@ConnorBaker
Contributor Author

Do you have an example flake you can post? Also, have you set config.allowUnfree to true? See https://nixos.org/manual/nixpkgs/stable/#cuda-configuring-nixpkgs-for-cuda for an example (though it uses the more specific allowUnfreePredicate).

With this PR, I can do

NIXPKGS_CONFIG=~/.config/nixpkgs/config-sm89.nix nix build --impure .#cudaPackages.saxpy

where ~/.config/nixpkgs/config-sm89.nix is

{
  allowUnfree = true;
  cudaCapabilities = [ "8.9" ];
  cudaSupport = true;
}

@ConnorBaker ConnorBaker force-pushed the feat/cuda-packaging-refactor branch 2 times, most recently from 3175ff8 to 78e5800 Compare August 31, 2025 16:02
@nixpkgs-ci nixpkgs-ci bot removed the 10.rebuild-darwin: 1 This PR causes 1 package to rebuild on Darwin. label Aug 31, 2025
@emilazy
Member

emilazy commented Oct 31, 2025

I’m worried about the number of non‐trivial breaking changes that seem to have been caused by this PR merged ten days after the freeze, that have led to proposals for further breaking changes like #457120 or last‐minute implementations of complex features in core Nixpkgs components like #456908. I think we should reconsider shipping a change this drastic at this stage in the release unless the regressions can be fixed in a more simple manner. cc @leona-ya @jopejoe1

Edit: Actually #457120 isn’t a breaking change in itself, just a way to work around the config breaking change introduced here (at the expense of compatibility across stable and unstable), so I shouldn’t double‐count it.

@ConnorBaker
Contributor Author

@emilazy have you considered #457038?

While I would like to have #456908, I don’t view it as a release blocker.

Comment on lines -67 to +68
- allowlist = config.allowlistedLicenses or config.whitelistedLicenses or [ ];
- blocklist = config.blocklistedLicenses or config.blacklistedLicenses or [ ];
+ allowlist = config.allowlistedLicenses;
+ blocklist = config.blocklistedLicenses;
Member

@emilazy emilazy Oct 31, 2025

This is a breaking change, right? Is there any error message for someone who previously set config.blacklistedLicenses? I don’t mind removing it with a throw, but it seems like this will silently start permitting licences a user was specifically trying to opt out of.

To be honest I am not sure why release managers were not pinged in advance for a 99‐commit rework PR with many breaking changes merged this late after the freeze…
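
The "removing it with a throw" option raised in this comment could look roughly like the sketch below. It is a standalone illustration, not the code in check-meta; the `config` value and the error message are made up for the example.

```nix
let
  # Hypothetical user config that still sets the old attribute name.
  config = {
    blacklistedLicenses = [ "unfree" ];
  };

  # Fail loudly instead of silently ignoring the old attribute, so a user
  # who opted out of certain licenses notices that the name has changed.
  blocklist =
    if config ? blacklistedLicenses then
      throw "config.blacklistedLicenses has been removed; use config.blocklistedLicenses instead"
    else
      config.blocklistedLicenses or [ ];
in
blocklist
```

Evaluating this expression aborts with the message, because the sample `config` still uses the old name; renaming the attribute to `blocklistedLicenses` makes it return the list.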

Member

Yes, that's indeed breaking. The more I look at it, the more I think we should just revert it and reintroduce it after branch-off.

Contributor

reintroduce it after branch off

With an mkRenamedOption for {black,white}listedLicenses
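
For reference, the module-system helper in `lib` is `mkRenamedOptionModule`; assuming that is what is meant here, a self-contained sketch of the rename looks like this. The option declarations below are stand-ins written for the example, not the real Nixpkgs config options, and whether the actual config evaluation accepts the helper unchanged is not verified here.

```nix
let
  lib = (import <nixpkgs> { }).lib;

  eval = lib.evalModules {
    modules = [
      {
        # Stand-in declaration for the new option name.
        options.blocklistedLicenses = lib.mkOption {
          type = lib.types.listOf lib.types.str;
          default = [ ];
        };
        # Declared so the rename helper has somewhere to put its warning.
        options.warnings = lib.mkOption {
          type = lib.types.listOf lib.types.str;
          default = [ ];
        };
      }
      # Forward definitions of the old name to the new one.
      (lib.mkRenamedOptionModule [ "blacklistedLicenses" ] [ "blocklistedLicenses" ])
      # Hypothetical user config that still uses the old name.
      { blacklistedLicenses = [ "unfree" ]; }
    ];
  };
in
{
  inherit (eval.config) blocklistedLicenses warnings;
}
```

With this approach the old attribute keeps taking effect and produces a rename warning, rather than being dropped without notice.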

Contributor Author

In hindsight, I should have used mkRenamedOption. I did not do so because they are undocumented attributes.

Contributor

Is there any error message for someone who previously set config.blacklistedLicenses? I don’t mind removing it with a throw, but it seems like this will silently start permitting licences a user was specifically trying to opt out of.

Since this possibly has legal consequences for those using this option, this should be dealt with ASAP.

Contributor Author

Fixed by reverting in #457338.

@jopejoe1
Member

This indeed looks to me like a PR that should not have been merged this late into the release cycle.
It also feels a bit under-reviewed for a PR of this size and, in my opinion, should have been split into smaller PRs for easier review.

@SomeoneSerge
Contributor

  1. The safe and conservative path forward here is to revert changes to check-meta and drop these three lines:

    {
    message = "CUDA support is enabled by config.cudaSupport";
    assertion = config.cudaSupport;
    }
    .

  2. After reflection, I'm inclined to support sneaking in the "global cuda or no cuda" change, adjusting the remediation message (check-meta: custom remediation messages #456908), and reverting the allowlist change.

For the rest of the PR, my review is still in progress, but I feel fairly confident there are no other "breakers" or other impact on other parts of Nixpkgs. The CUDA-specific change is somewhat inherently massive, due to schema updates upstream and due to our tech debt. OTOH it was necessitated by the python stuff. Splitting it up into smaller PRs would've been possible, but surely at the cost of another burnout. I suggest that we choose between options 1 and 2.

@ConnorBaker
Contributor Author

Right now I don't really care any more, sorry for any trouble this caused. In particular I should have been more careful with the changes to config.

@SomeoneSerge you have my blessing to revert or change whatever you see fit. I see fundamental issues with the way several things are architected in Nixpkgs and have been carrying out the fool's errand of trying to fix them in a localized manner to avoid the absurd cost of addressing the root problem. I apologize for the work that created for you.

@SomeoneSerge
Contributor

@ConnorBaker, I haven't been making it easier for you lately either, and didn't keep myself from being annoying in the few reviews that I sent. I can't quite put my finger on why, but I find just not being toxic is already hard enough.

fundamental issues with the way several things are architected in Nixpkgs

We know, we all do...

root problem

If I had to name just one, I couldn't.

I hope you can soon care again

@me-and
Member

me-and commented Oct 31, 2025

Actually #457120 isn’t a breaking change in itself, just a way to work around the config breaking change introduced here (at the expense of compatibility across stable and unstable), so I shouldn’t double‐count it.

@emilazy In case it helps at all, I've just updated the release note for that PR to suggest a change to users' configuration that allows the same configuration to work in both stable and unstable.

For my particular concern, #456994, I don't have any particular preference between backing out the relevant changes from this PR and revisiting them in 26.05, taking #457038 as a safe fix and considering #457120 in 26.05, or taking #457120 now. I'll leave that decision to folks much better versed in the release processes and cycles and preferences!

@wolfgangwalther
Contributor

Collecting everything that was reported as a breaking change or negative impact:

This is a huge PR: 99 commits, 226 files changed, 9k lines changed.

It was merged without approvals, while review was still ongoing. It was merged 10 days after the feature freeze.

The main addition is CUDA 13, but the description mentions:

The addition of CUDA 13 does not mean packages will suddenly work with CUDA 13. Expect breakages.

That means that CUDA 13 will possibly stay unusable on the stable release? I guess the CUDA 13 part is not required at this time.

Everything else looks mostly like refactoring. I assume the biggest problem down the road will be, that this massive changeset will make backports near impossible, if this thing is only merged after branch-off?

This was not given as a reason, though, the reason for merge was:

we need to get this in so we can iterate on it

Frankly, merging this massive thing at once is not iterating. Splitting this up, especially the changes to non-cuda packages (by-name etc.), looking at the impact for much smaller parts of this is the right way to do it, I think.


Unfortunately, due to the sheer size and me being entirely unfamiliar with all the CUDA packaging, I can't judge whether it's reasonable to fix this quickly.

The sensible option from my perspective is to revert now and reapply after branch-off. The longer we wait with the revert, the more likely we are going to have to deal with conflicts on the revert itself.

@SomeoneSerge
Contributor

@wolfgangwalther I can later prepare a more detailed post-mortem explaining the context behind the PR, in particular the back-and-forth we've been having in the weekly meetings that complement the review process. My bottom line is this: the process was justified enough for the CUDA-internal changes; the other changes are to SciComp leaf packages and to check-meta.nix and top-level/config.nix. The first two categories do not, or are not meant to, interfere with the release process. The last category is what I suggest we revert in the "conservative" path in #437723.

@ConnorBaker
Contributor Author

The @NixOS/cuda-maintainers have decided to revert the changes surrounding config:

@emilazy @wolfgangwalther @jopejoe1 if you have any questions or concerns about the CUDA packaging, we’re more than happy to schedule time to hear them!

@wolfgangwalther
Contributor

Thank you!

@trofi
Contributor

trofi commented Nov 5, 2025

I suspect 7122833 caused this minor eval failure of an obscure attribute:

$ nix-instantiate -A cudaPackages.tests.onnx-tensorrt.long
error:
       … while calling the 'abort' builtin
         at /home/slyfox/dev/git/nixpkgs-master/lib/customisation.nix:323:7:
          322|     else
          323|       abort "lib.customisation.callPackageWith: ${error}";
             |       ^
          324|

       error: evaluation aborted with the following error message: 'lib.customisation.callPackageWith: Function called without required argument "onnx-tensorrt" at /home/slyfox/dev/git/nixpkgs-master/pkgs/development/cuda-modules/packages/tests/onnx-tensorrt/long.nix:4'

@SuperSandro2000
Member

I could not find any reference to onnx-tensorrt 😅

Labels

  • 6.topic: cuda Parallel computing platform and API
  • 6.topic: python Python is a high-level, general-purpose programming language.
  • 6.topic: stdenv Standard environment
  • 8.has: documentation This PR adds or changes documentation
  • 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin.
  • 10.rebuild-linux: 101-500 This PR causes between 101 and 500 packages to rebuild on Linux.

Projects

Status: ✅ Done