cudaPackages.buildRedist: automatically remove runpath entries for stubs #459416

Merged
SomeoneSerge merged 11 commits into NixOS:master from
ConnorBaker:feat/cuda-stub-setup-hook
Nov 26, 2025

@ConnorBaker
Contributor

@ConnorBaker ConnorBaker commented Nov 7, 2025

Sometimes we have to link against stubs because the actual libraries are only available at runtime (like those provided by NVIDIA's drivers). Leaving the stubs in the RUNPATH is a mistake because the loader will search them and load the stubs rather than the real libraries provided at runtime if the stubs are discovered first.

This PR introduces a hook which replaces stub entries with a reference to /run/opengl-driver/lib. If any stub entries are found, the hook adds at most one reference to the driver link lib directory. It does not attempt to deduplicate existing entries, so it may introduce a duplicate driver link lib directory reference if a stub entry occurs before an existing one.

CUDA sample which links against the stub before:

$ patchelf --print-rpath ./result/0_Introduction/matrixMul_nvrtc/matrixMul_nvrtc 
/nix/store/61cdmzcp5y82aapa1dg3d403rxpqi8h8-cuda12.8-cuda_nvrtc-12.8.93-lib/lib:/nix/store/daamdpmaz2vjvna55ccrc30qw3qb8h6d-glibc-2.40-66/lib:/nix/store/z7a34j3xnp66rpddayyxrxwsahxccbip-gcc-14.3.0-lib/lib:/nix/store/60bccal8rk5zm3nsxszvfvv6754imwcl-cuda12.8-cuda_cudart-12.8.90/lib/stubs

and after:

$ patchelf --print-rpath ./result/0_Introduction/matrixMul_nvrtc/matrixMul_nvrtc 
/nix/store/8d75zh50j3j0yv20pghnh6p67fvjf60j-cuda12.8-cuda_nvrtc-12.8.93-lib/lib:/nix/store/daamdpmaz2vjvna55ccrc30qw3qb8h6d-glibc-2.40-66/lib:/nix/store/z7a34j3xnp66rpddayyxrxwsahxccbip-gcc-14.3.0-lib/lib:/run/opengl-driver/lib
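
The replacement rule described in the PR text can be sketched in plain bash. This is a minimal illustration, not the hook merged here: the function name `replace_stub_entries` is hypothetical, and it assumes a stub entry is any RUNPATH entry whose last path component is `stubs`, which may differ from how the real hook detects stubs.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; not the hook from this PR.
# Assumption: a "stub entry" is any entry ending in /stubs.
replace_stub_entries() {
  local runpath="$1"
  local driver_lib="${2:-/run/opengl-driver/lib}"
  local -a entries=() result=()
  local entry
  local added=0

  # Split the colon-separated RUNPATH into an array.
  IFS=':' read -r -a entries <<<"$runpath"

  for entry in "${entries[@]}"; do
    case "$entry" in
      */stubs | */stubs/)
        # The first stub entry becomes the driver lib directory;
        # any later stub entries are dropped.
        if (( added == 0 )); then
          result+=("$driver_lib")
          added=1
        fi
        ;;
      *)
        # Non-stub entries are kept as-is; no deduplication is attempted.
        result+=("$entry")
        ;;
    esac
  done

  # Join the entries back together with ':'.
  local IFS=':'
  printf '%s\n' "${result[*]}"
}
```

Under these assumptions, a trailing `.../lib/stubs` entry like the one in the "before" output would become `/run/opengl-driver/lib`. Because existing entries are left untouched, a RUNPATH that already contains the driver directory after a stub entry ends up with it twice, which matches the caveat in the description.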

This hook is automatically enabled for all redistributables with a stubs output and can be manually enabled or disabled by providing buildRedist the includeRemoveStubsFromRunpathHook boolean.

With these changes, I'm also able to build UCX and UCC with

{
  # We cannot disallow cuda_cudart's stubs since they are in the same output as the libs.
  disallowedReferences = lib.optionals enableCuda [
    (lib.getOutput "stubs" cudaPackages.cuda_nvml_dev)
  ];
}

Things done

  • Built on platform:
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • Tested, as applicable:
  • Ran nixpkgs-review on this PR. See nixpkgs-review usage.
  • Tested basic functionality of all binary files, usually in ./result/bin/.
  • Nixpkgs Release Notes
    • Package update: when the change is major or breaking.
  • NixOS Release Notes
    • Module addition: when adding a new NixOS module.
    • Module update: when the change is significant.
  • Fits CONTRIBUTING.md, pkgs/README.md, maintainers/README.md and other READMEs.

Add a 👍 reaction to pull requests you find important.

@ConnorBaker ConnorBaker self-assigned this Nov 7, 2025
@ConnorBaker ConnorBaker added the 6.topic: cuda Parallel computing platform and API label Nov 7, 2025
@nixpkgs-ci nixpkgs-ci bot added 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 11.by: package-maintainer This PR was created by a maintainer of all the package it changes. labels Nov 7, 2025
@daniel-fahey

This comment was marked as resolved.

Contributor

@daniel-fahey daniel-fahey left a comment

Code looks nice running on my Mk1 eyeballs. I did notice your directory and file names don't follow the Nixpkgs convention of using kebab-case. (No idea where that is documented, if at all. It seems to be the convention inside pkgs/build-support.)

@ConnorBaker ConnorBaker force-pushed the feat/cuda-stub-setup-hook branch from 1a9ea71 to 0763bfc Compare November 7, 2025 23:33
@SomeoneSerge

This comment was marked as outdated.

@SomeoneSerge
Contributor

> This will mean users don't have to set LD_LIBRARY_PATH before running e.g. vLLM like

Why? Does vllm suffer from the hook ordering issue, ordering stubs left of driverLink in DT_RUNPATH?

@daniel-fahey
Contributor

daniel-fahey commented Nov 10, 2025

> > This will mean users don't have to set LD_LIBRARY_PATH before running e.g. vLLM like
>
> Why? Does vllm suffer from the hook ordering issue, ordering stubs left of driverLink in DT_RUNPATH?

It doesn't. I was shooting from the hip. vLLM doesn't suffer from the hook ordering issue. The vLLM derivation uses autoAddDriverRunpath and all of its C/C++ extensions have /run/opengl-driver/lib as the first entry in their RUNPATHs 😳. Looks like the vLLM gotcha is due to it using ctypes. Thanks for pointing me in the right direction.
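
The ordering question raised here (whether a stub entry sits to the left of the driver link directory in DT_RUNPATH) can be checked mechanically. The following is a rough sketch: `stub_shadows_driver` is a hypothetical name, and it again assumes that stub entries are those ending in `/stubs`.

```shell
#!/usr/bin/env bash
# Hypothetical check for illustration only.
# Returns 0 (true) if a stub entry appears before the driver lib directory
# in a colon-separated RUNPATH, and 1 otherwise.
stub_shadows_driver() {
  local runpath="$1"
  local driver_lib="${2:-/run/opengl-driver/lib}"
  local -a entries=()
  local entry

  IFS=':' read -r -a entries <<<"$runpath"

  for entry in "${entries[@]}"; do
    # The driver directory comes first: the loader would search it first.
    if [[ "$entry" == "$driver_lib" ]]; then
      return 1
    fi
    # A stub directory comes first: the stub could be found first.
    if [[ "$entry" == */stubs ]]; then
      return 0
    fi
  done

  # Neither a stub entry nor the driver directory was found.
  return 1
}
```

It could be fed the output of `patchelf --print-rpath` for a given extension module, for example `stub_shadows_driver "$(patchelf --print-rpath ./some-extension.so)"`, where the file name is a placeholder.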

@ConnorBaker
Contributor Author

> Mirroring from matrix: my preliminary assessment is that keeping stub paths is rather desirable than undesirable

Why? All I can think of is the error I (sometimes) get when loading the stub library, which tells me that I’ve loaded a stub library. Other than that, having additional RUNPATH entries means more lookups while searching for other libraries, which I don’t enjoy.

> ... and that controlling the hook order (or commutativity & associativity & idempotence of hook execution) was the right approach

I absolutely agree that hooks should have those properties wherever possible (so nearly all of them).

But that’s a very heavy lift, which is why I looked into hook re-ordering. That’s the origin of my arrayUtilities PRs (#385960) -- I saw I needed this logic for array handling in a bunch of places and wanted to have it in a single place to reduce the likelihood I messed it up.

Unfortunately, not only was that PR not accepted as it was, requiring me to remove most of the functionality I needed from it, I also ran into an old bug (I tried to resolve it in #388767) which meant that I couldn’t use any of those utility functions (which are provided by setup hooks) until after phases started running. As such, I had to re-order hooks in later phases within a very early phase, which feels incredibly brittle and also depends on things like hook names not changing.

And as I was doing that, I encountered another problem: sometimes phases are just strings, not arrays! So I’d need to be very, very careful with how I parsed and interacted with them. I generally just don’t have the patience for that kind of surgery when we absolutely should be using structured attributes and arrays everywhere, full stop.

While I agree hooks should have these properties, there’s so much in the way of a clean, principled implementation.

Can we just remove the stub entries for now and make an issue to fix everything else later?
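
To illustrate the string-versus-array problem mentioned in this comment, here is a small bash sketch of what defensive handling can look like. The helper names `is_indexed_array` and `list_words` are hypothetical and are not taken from Nixpkgs or from the arrayUtilities PRs; the sketch only shows why code that touches such variables needs two code paths.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Succeeds if the named variable is declared as an indexed array.
is_indexed_array() {
  local declaration
  local pattern='^declare -[a-zA-Z]*a'
  declaration="$(declare -p "$1" 2>/dev/null)" || return 1
  [[ "$declaration" =~ $pattern ]]
}

# Prints one word per line for a variable that may be either an indexed
# array or a whitespace-separated string.
# Note: an empty value still prints a single blank line.
list_words() {
  local var_name="$1"
  if is_indexed_array "$var_name"; then
    # Indirect expansion of all array elements.
    local array_ref="${var_name}[@]"
    printf '%s\n' "${!array_ref}"
  else
    local -a words=()
    # Split on whitespace (including newlines) without globbing.
    # read returns non-zero at end of input, hence the `|| true`.
    read -r -d '' -a words <<<"${!var_name:-}" || true
    printf '%s\n' "${words[@]}"
  fi
}
```

With `phases_string="unpackPhase buildPhase"` and `phases_array=(unpackPhase buildPhase)`, both `list_words phases_string` and `list_words phases_array` print the same two lines. Anything that re-orders such values would also have to write them back in the right shape, which is part of why this was described as delicate.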

@ConnorBaker ConnorBaker marked this pull request as ready for review November 12, 2025 19:06
@SomeoneSerge
Contributor

To future readers: Connor notified us on #cuda:nixos.org matrix that he's unlikely to be available any time soon, due to dealing with the grimmer aspects of life.

> All I can think of is the error I (sometimes) get when loading the stub library

Yeah we just need to rewrite the message 😂 Well, to ask for permission to rewrite the message

> But that’s a very heavy lift, which is why I looked into hook re-ordering. That’s the origin of my arrayUtilities PRs [...] While I agree hooks should have these properties, there’s so much in the way of a clean, principled implementation.

I know, I've been there: https://github.com/NixOS/nixpkgs/pull/297590/files#diff-ec85191a2f968fa96c76a95ed2755151e530eebeb9852acb6a96e07e6eee9f81R10-R36

> Can we just remove the stub entries for now and make an issue to fix everything else later?

If we do, I'd probably go with modifying mkDerivation with disallowedReferences and removeReferencesTo. I guess let's ponder next time we chat, @GaetanLepage?

@SomeoneSerge
Contributor

Turns out the error messages aren't even defined in the stubs, but in e.g. cudart instead, so all my previous arguments are irrelevant

@ConnorBaker ConnorBaker force-pushed the feat/cuda-stub-setup-hook branch from 0763bfc to 0896087 Compare November 24, 2025 03:10
@nixpkgs-ci nixpkgs-ci bot added 10.rebuild-linux: 101-500 This PR causes between 101 and 500 packages to rebuild on Linux. and removed 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. labels Nov 24, 2025
@ConnorBaker ConnorBaker force-pushed the feat/cuda-stub-setup-hook branch from 0896087 to 3721b48 Compare November 24, 2025 03:47
@ConnorBaker ConnorBaker force-pushed the feat/cuda-stub-setup-hook branch from e5f409d to 9d38d18 Compare November 25, 2025 01:22
@ConnorBaker ConnorBaker moved this from 📋 The forgotten to 👀 Awaits reviews in CUDA Team Nov 25, 2025
Contributor

@SomeoneSerge SomeoneSerge Nov 25, 2025

Not a blocker: In future we probably want to replace this with a more general patchelf-structuredAttrs adapter, although patchelf is far from the only tool that suffers from this issue

Contributor Author

Agreed... :(

…fsets

Signed-off-by: Connor Baker <ConnorBaker01@gmail.com>
…d-build-inputs

Signed-off-by: Connor Baker <ConnorBaker01@gmail.com>
Signed-off-by: Connor Baker <ConnorBaker01@gmail.com>
Contributor

@SomeoneSerge SomeoneSerge left a comment

Aside from the out of date comment, ready?

local runpath
# Files that are not dynamically linked cause patchelf to exit with a non-zero status and print to stderr.
# If patchelf fails to print the rpath, we assume the file is not dynamically linked.
runpath="$(patchelf --print-rpath "$path" 2>/dev/null)" || return 1
Contributor

/me wonders if stdenv sets -o pipefail

Contributor Author

setup.sh sets -euo pipefail when it begins and then at the end of being sourced restores e and u. (Interestingly, it does not unset pipefail!)

But no, unless someone sets it explicitly in a phase they're not set.
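
Two general bash behaviours are relevant to the snippet under review, and the sketch below demonstrates both. It does not reproduce the hook: `failing_probe` is a hypothetical stand-in for a command that exits non-zero, such as `patchelf --print-rpath` on a file that is not dynamically linked. The first pair of functions shows why `local runpath` is declared on its own line before the assignment; the last function shows what `pipefail` changes.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a command that fails.
failing_probe() { return 3; }

# `local var="$(cmd)"` reports the exit status of `local`, which is 0,
# so the `|| return 1` never triggers and the failure is masked.
masked() {
  local value="$(failing_probe)" || return 1
  return 0
}

# Declaring first and assigning afterwards preserves the exit status of
# the command substitution, so the failure is detected.
checked() {
  local value
  value="$(failing_probe)" || return 1
  return 0
}

# Runs `false | true` in a subshell with pipefail switched on or off and
# returns the pipeline's exit status.
pipeline_status() {
  (
    if [[ "$1" == on ]]; then
      set -o pipefail
    else
      set +o pipefail
    fi
    false | true
  )
}
```

Here `masked` succeeds while `checked` fails, and `pipeline_status off` succeeds while `pipeline_status on` fails. The reviewed line is a plain command substitution rather than a pipeline, so its `|| return 1` behaves the same whether or not `pipefail` is set.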

…athHook

Signed-off-by: Connor Baker <ConnorBaker01@gmail.com>
Signed-off-by: Connor Baker <ConnorBaker01@gmail.com>
…thHook

Signed-off-by: Connor Baker <ConnorBaker01@gmail.com>
@SomeoneSerge SomeoneSerge added this pull request to the merge queue Nov 26, 2025
Merged via the queue into NixOS:master with commit 3be2630 Nov 26, 2025
31 of 33 checks passed
@SomeoneSerge SomeoneSerge deleted the feat/cuda-stub-setup-hook branch November 26, 2025 08:25
@github-project-automation github-project-automation bot moved this from 👀 Awaits reviews to ✅ Done in CUDA Team Nov 26, 2025
@GaetanLepage
Contributor

Good job guys!

+ optionalString finalAttrs.includeRemoveStubsFromRunpathHook ''
nixLog "installing stub removal runpath hook"
mkdir -p "''${!outputStubs:?}/nix-support"
printWords >>"''${!outputStubs:?}/nix-support/propagated-build-inputs" \
Member

Do we really need to propagate this to runtime? Why no dev output?

Contributor

Damn github email integration broke again, I had replied to this days ago but the message doesn't appear. TLDR: stubs is a dev output
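
For readers unfamiliar with the `''${!outputStubs:?}` expression in the snippet this thread is attached to: inside a Nix indented string, `''${` is an escaped `${`, so bash sees `${!outputStubs:?}`. That is an indirect expansion (the value of the variable whose name is stored in `outputStubs`) combined with `:?`, which aborts with an error if the result is unset or empty. A minimal sketch, using a made-up path and a hypothetical function name:

```shell
#!/usr/bin/env bash
# outputStubs holds the NAME of the variable that contains the output path.
outputStubs=stubs
stubs=/tmp/example-stubs-output   # hypothetical path, for illustration only

# The function body is a subshell (parentheses instead of braces), so a
# failed `:?` check ends only the subshell, not the calling script.
resolve_stubs_output() (
  printf '%s\n' "${!outputStubs:?}"
)
```

While `stubs` is set, `resolve_stubs_output` prints its value. After `unset stubs`, the same call fails with a non-zero status instead of silently expanding to an empty string, which would otherwise turn the target path into `/nix-support`.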

@SuperSandro2000
Member

I really want to get rid of patchelf from my final system as that is usually a sign that somewhere something gets propagated too much.

@SomeoneSerge
Contributor

SomeoneSerge commented Nov 30, 2025 via email

@ConnorBaker
Contributor Author

@SuperSandro2000 there are a few ways to make that happen:

  1. Remove patchelf as a dependency from arrayUtilities.getRunpathEntries, which will break its usage in any environment which does not provide patchelf.
  2. Ensure all CUDA redists which have stubs have a stubs output and do not propagate them by default. This would ensure that the dependency on the stub removal hook is only pulled in when the stubs are explicitly requested. I may have to do that for an unrelated thing.

@SuperSandro2000
Member

The second option sounds like the best choice here

@SomeoneSerge
Contributor

> Ensure all CUDA redists which have stubs have a stubs output and do not propagate them by default.

s/do not propagate by default/only propagate them in dev, not in out/

Labels

  • 6.topic: cuda (Parallel computing platform and API)
  • 10.rebuild-darwin: 1-10 (This PR causes between 1 and 10 packages to rebuild on Darwin.)
  • 10.rebuild-linux: 101-500 (This PR causes between 101 and 500 packages to rebuild on Linux.)
  • 11.by: package-maintainer (This PR was created by a maintainer of all the package it changes.)

Projects

Status: ✅ Done

5 participants