Skip to content

cudaPackages: fix regression, propagating too much#457424

Merged
GaetanLepage merged 1 commit intoNixOS:masterfrom
SomeoneSerge:fix/cuda13/firefox
Nov 3, 2025
Merged

cudaPackages: fix regression, propagating too much#457424
GaetanLepage merged 1 commit intoNixOS:masterfrom
SomeoneSerge:fix/cuda13/firefox

Conversation

@SomeoneSerge
Copy link
Contributor

@SomeoneSerge SomeoneSerge commented Nov 1, 2025

Quite likely fixes #457391 #457406, but I'm waiting for the tests to make sure. Even if it does fix, I'm still analyzing how this could have lead to the extraneous reference in onnxrutime.

Regardless of the bearing on onnxruntime and firefox, this is the pre-refactoring logic that probably shouldn't have been changed (we've been propagating strictly less, and that seemed to have been sufficient).

Things done

  • Built on platform:
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • Tested, as applicable:
  • Ran nixpkgs-review on this PR. See nixpkgs-review usage.
  • Tested basic functionality of all binary files, usually in ./result/bin/.
  • Nixpkgs Release Notes
    • Package update: when the change is major or breaking.
  • NixOS Release Notes
    • Module addition: when adding a new NixOS module.
    • Module update: when the change is significant.
  • Fits CONTRIBUTING.md, pkgs/README.md, maintainers/README.md and other READMEs.

Add a 👍 reaction to pull requests you find important.

@Janrupf
Copy link
Contributor

Janrupf commented Nov 1, 2025

Even if it does fix, I'm still analyzing how this could have lead to the extraneous reference in onnxrutime.

Not sure if you have seen that I edited my comment, this is my guess as to how they ended up there

@nixpkgs-ci nixpkgs-ci bot added 10.rebuild-linux: 101-500 This PR causes between 101 and 500 packages to rebuild on Linux. 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 11.by: package-maintainer This PR was created by a maintainer of all the package it changes. 6.topic: cuda Parallel computing platform and API labels Nov 1, 2025
@SomeoneSerge SomeoneSerge changed the title cudaPackages: (Inshallah!) fix regression, propagating too much cudaPackages: fix regression, propagating too much Nov 1, 2025
@SomeoneSerge
Copy link
Contributor Author

Good news: with this build, I see no reference to nvcc in libonnxruntime_providers_cuda.so.
Bad news: even with this build, there's a requisite onnxruntime -> cudart -> nvcc, and I think the last edge didn't used to be there either

@Janrupf
Copy link
Contributor

Janrupf commented Nov 1, 2025

What exactly were you grepping for in libonnxruntime_providers_cuda.so? Keep in mind that not the full store path was in there, but only a prefix (grepping for the hash part of nvcc did yield it for me, so ygd3s9zm1pf77n3q3ac63v58www5scbc - not the full path and without the name). I'd be really confused if this change removed the reference magically from the binary, because it was embedded in the .nv_fatbin section in the ELF. Context was ! -L /nix/store/ygd3s9zm1pf77n3q3ac63v58www5scbc-9

@SomeoneSerge
Copy link
Contributor Author

@Janrupf for the hash part:)

$ onnxruntime=$(nix build --no-link -f "<nixpkgs>" -I nixpkgs=flake:github:SomeoneSerge/nixpkgs/fix/cuda13/firefox --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.9" ]; allowUnfree = true; }' onnxruntime --print-out-paths)
$ nix-store --query --references "$onnxruntime" | grep nvcc
$ nix-store --query --requisites "$onnxruntime" | grep nvcc
/nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-cuda12.8-cuda_nvcc-12.8.93

@Janrupf
Copy link
Contributor

Janrupf commented Nov 1, 2025

Ok now thats weird:

❯ nix why-depends --all --precise /nix/store/9bwh9xx4y08fw2384jkz3s3ai3fzg0np-onnxruntime-1.22.2 /nix/store/ygd3s9zm1pf77n3q3ac63v58www5scbc-cuda12.8-cuda_nvcc-12.8.93
/nix/store/9bwh9xx4y08fw2384jkz3s3ai3fzg0np-onnxruntime-1.22.2
├───lib/libonnxruntime_providers_cuda.so: …,rt,@.3dev......! -L /nix/store/ygd3s9zm1pf77n3q3ac63v58www5scbc-9.......c_nvcc-..../bin/..//lib…
│   → /nix/store/ygd3s9zm1pf77n3q3ac63v58www5scbc-cuda12.8-cuda_nvcc-12.8.93
├───lib/libonnxruntime_providers_cuda.so: …nn-9.13.0.50-lib/lib:/nix/store/80x699lyc99dahf85iqdv6z1f0vv6vz2-cuda12.8-cuda_cudart-12.8.90/li…
│   → /nix/store/80x699lyc99dahf85iqdv6z1f0vv6vz2-cuda12.8-cuda_cudart-12.8.90
│   └───nix-support/propagated-build-inputs: …fhjm-setup-cuda-hook /nix/store/ygd3s9zm1pf77n3q3ac63v58www5scbc-cuda12.8-cuda_nvcc-12.8.93 /nix…
│       → /nix/store/ygd3s9zm1pf77n3q3ac63v58www5scbc-cuda12.8-cuda_nvcc-12.8.93
└───lib/libonnxruntime.so.1.22.2: …y9-protobuf-32.1/lib:/nix/store/j55nsg143a22krndzhnr9f8ji78vlrdw-cuda12.8-nccl-2.27.6-1/lib:/nix…
    → /nix/store/j55nsg143a22krndzhnr9f8ji78vlrdw-cuda12.8-nccl-2.27.6-1
    └───lib/libnccl.so.2.27.6: …-m 64 -l:..&devrt -L /nix/store/ygd3s9zm1pf77n3q3ac63v58www5scbc-9..l....c_nvcc-..../bin/..//lib…
        → /nix/store/ygd3s9zm1pf77n3q3ac63v58www5scbc-cuda12.8-cuda_nvcc-12.8.93

Screenshot because of colors: screenshot of the terminal.

Just figured out --precise exists. To be clear, this is pre your PR, but I don't understand the behavior of the compiler here

@SomeoneSerge
Copy link
Contributor Author

SomeoneSerge commented Nov 1, 2025

requisite onnxruntime -> cudart -> nvcc, and I think the last edge didn't used to be there either

getOutput "include" v. getInclude 🙃

EDIT: THERE'S STILL MORE
EDIT2: OH! Nvcc's outputs are no longer split 🫠
EDIT3: Oh, but actually we only ever split out static. I think we failed to split out include because nvcc itself was assuming the merged layout, and we were running into circular references. So the real new behaviour here is cudart trying to propagate crt/host_config.h automatically

@Janrupf
Copy link
Contributor

Janrupf commented Nov 1, 2025

Might also be worth testing with multiple compute capabilities enabled - the problem with the hashes occurred in fatbin sections, I think those are only present when multiple compute capabilities are enabled?

EDIT: If I read the compiler documentation at https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/, section 5.6.2 this indeed sounds like fatbin is only emitted when multiple compute capabilities are enabled

Code moved around in CUDA13 PR also changed actual behaviour
@SomeoneSerge
Copy link
Contributor Author

SomeoneSerge commented Nov 1, 2025

$ onnxruntime=$(nix build --refresh --no-link -f "<nixpkgs>" -I nixpkgs=flake:github:SomeoneSerge/nixpkgs/fix/cuda13/firefox-x --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" "8.9" ]; allowUnfree = true; }' onnxruntime --print-out-paths)
$ nix-store --query --requisites "$onnxruntime" | grep nvcc

EDIT: Running a firefox build to make sure I'm not hallucinating

@SomeoneSerge
Copy link
Contributor Author

I'd be really confused if this change removed the reference magically from the binary, because it was embedded in the .nv_fatbin section in the ELF

In Nixpkgs we sometimes have to open ourselves up to extreme possibilities

image

@Janrupf
Copy link
Contributor

Janrupf commented Nov 1, 2025

I'm happy as long as the reference is gone, lets hope this doesn't magically re-appear when building the full config as is done for cache lmao

@SomeoneSerge
Copy link
Contributor Author

Good point, I'll start a full build as well

@SomeoneSerge
Copy link
Contributor Author

Alright, I guess we're watching this one: https://hydra.nixos-cuda.org/build/2901 (@GaetanLepage, added a one-off jobset... damn it's nice to just immediately have access to hardware)

@SomeoneSerge SomeoneSerge marked this pull request as ready for review November 1, 2025 04:35
@GaetanLepage
Copy link
Contributor

Still fails :'(

error: output '/nix/store/hms9xa16i406x9zxd9aci2i3cx5mg9zi-firefox-144.0.2' is not allowed to refer to the following paths:
         /nix/store/x8mydcgbry214s802nzvy7fdljx404ym-gcc-wrapper-14.3.0
error: 1 dependencies of derivation '/nix/store/h1jw7zmdwblqg5gih8bsyyhycm8xvqpg-review-shell.drv' failed to build

GaetanLepage added a commit to GaetanLepage/nix-config that referenced this pull request Nov 1, 2025
GaetanLepage added a commit to GaetanLepage/nix-config that referenced this pull request Nov 1, 2025
glepage-bot added a commit to GaetanLepage/nix-config that referenced this pull request Nov 1, 2025
* flake.lock: Update

* home/programs: du-dust was renamed to dust

* home/firefox: override firefox package to fix build with cudaSupport

NixOS/nixpkgs#457424

---------

Co-authored-by: GaetanLepage <33058747+GaetanLepage@users.noreply.github.com>
Co-authored-by: Gaetan Lepage <gaetan@glepage.com>
@SomeoneSerge
Copy link
Contributor Author

SomeoneSerge commented Nov 1, 2025

$ firefox=$(nix build --refresh --no-link -f "<nixpkgs>" -I nixpkgs=flake:github:SomeoneSerge/nixpkgs/fix/cuda13/firefox --arg config '{ cudaSupport = true; allowUnfree = true; }' firefox-unwrapped --print-out-paths)
$ nvcc=$(nix-store --query --requisites "$firefox" | grep nvcc)
$ nix why-depends --precise "$firefox" "$nvcc"
/nix/store/7bhgsk9ck0vpbzyhy0zlvywnl0wzlisv-firefox-unwrapped-144.0.2
└───lib/firefox/libonnxruntime.so: …y9-protobuf-32.1/lib:/nix/store/d5y203f75h6ygm1cr3fbnwh5sn8fa668-cuda12.8-nccl-2.28.7-1/lib:/nix…
    → /nix/store/d5y203f75h6ygm1cr3fbnwh5sn8fa668-cuda12.8-nccl-2.28.7-1
    └───lib/libnccl.so.2.28.7: …-m 64 -l:..&devrt -L /nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-9..l....c_nvcc-..../bin/..//lib…
        → /nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-cuda12.8-cuda_nvcc-12.8.93

@GaetanLepage
Copy link
Contributor

I bisected the presence of cuda_nvcc in nccl's runtime dependencies.
It happened in e4a66e9 (between 24.11 and 25.05).

@GaetanLepage
Copy link
Contributor

On current master:

gaetan in 🌐 build02 in nixpkgs on  master
❮ nix-store --query --requisites "$(nix-build --arg config '{ allowUnfree = true; cudaSupport = true; }' -A cudaPackages_12_6.nccl)" | grep nvcc

gaetan in 🌐 build02 in nixpkgs on  master
❮ nix-store --query --requisites "$(nix-build --arg config '{ allowUnfree = true; cudaSupport = true; }' -A cudaPackages_12_8.nccl)" | grep nvcc
/nix/store/ygd3s9zm1pf77n3q3ac63v58www5scbc-cuda12.8-cuda_nvcc-12.8.93

FKouhai added a commit to FKouhai/nix-dots that referenced this pull request Nov 2, 2025
@GaetanLepage
Copy link
Contributor

Combining this PR and #457803 works!

❯ nix why-depends --precise $(nom-build --arg config '{ allowUnfree = true; cudaSupport = true; }' -A firefox-unwrapped) $(nom-build --arg config '{ allowUnfree = true; cudaSupport = true; }' -A cudaPackages.cuda_nvcc)
Finished at 20:17:12 after 1s
Finished at 20:17:13 after 0s
'/nix/store/vib4xmivlpznx50y6jr7smyd0d1l0yir-firefox-unwrapped-144.0.2' does not depend on '/nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-cuda12.8-cuda_nvcc-12.8.93'

@Janrupf
Copy link
Contributor

Janrupf commented Nov 2, 2025

I wonder if #457803 approach would be preferable? As in, is it actually a good idea to propagate stdenv.cc in nvcc and then strip out references after the fact? I'm not entirely sure what the reasoning behind the cuda hook propagation is, but at least in my mind it makes sense that a package which depends on nvcc also depends on gcc

@GaetanLepage
Copy link
Contributor

I wonder if #457803 approach would be preferable? As in, is it actually a good idea to propagate stdenv.cc in nvcc and then strip out references after the fact? I'm not entirely sure what the reasoning behind the cuda hook propagation is, but at least in my mind it makes sense that a package which depends on nvcc also depends on gcc

Well, #457803 fixes the illegitimate cuda_nvcc path leaks (in nccl's libnccl.so and onnxruntime's libonnxruntime_providers_cuda.so), but does not prevent libonnxruntime from depending on cuda_nvcc at runtime through cuda_cudart.

Hence, we really need both PRs to fix the firefox build with cudaSupport enabled.

@Janrupf
Copy link
Contributor

Janrupf commented Nov 2, 2025

What I mean is whether this PR should actually also be changed to strip out the references. The problem is that I don't understand why the changes in this PR prevent the references from showing up in the first place. As far as I can tell they are meta/debug information placed there by the compiler, and I'm missing the connection between removing the propagated inputs and the references going missing.

As far as I can tell a CUDA runtime library should never reference NVCC, and thus the issue may be lurking in other packages as well where it was just not found yet.

@SomeoneSerge
Copy link
Contributor Author

As far as I can tell they are meta/debug information placed there by the compiler, and I'm missing the connection between removing the propagated inputs and the references going missing.

My hypothesis was NVCC might have started recording flags it picks up from NVCC_{PREPEND,APPEND}_FLAGS, CUDAToolkit_INCLUDE_DIRS, etc c. I didn't have the time to debug this further, but the fact that the direct reference did disappear corroborates this version.

also be changed to strip out the references.

Yes, I think we merge this and #457803?

@nixpkgs-ci nixpkgs-ci bot added 12.approvals: 1 This PR was reviewed and approved by one person. 12.approved-by: package-maintainer This PR was reviewed and approved by a maintainer listed in any of the changed packages. labels Nov 3, 2025
@GaetanLepage GaetanLepage added this pull request to the merge queue Nov 3, 2025
Merged via the queue into NixOS:master with commit e617b8c Nov 3, 2025
32 of 34 checks passed
@github-project-automation github-project-automation bot moved this from New to ✅ Done in CUDA Team Nov 3, 2025
@UlyssesZh
Copy link
Member

This PR breaks building ucx. Error log.

@SomeoneSerge
Copy link
Contributor Author

@UlyssesZh probably the strictDeps comment from 1fceb96; try adding cuda_nvcc to buildInputs

@UlyssesZh
Copy link
Member

Could you please open a PR for that?

@SomeoneSerge
Copy link
Contributor Author

@UlyssesZh I'll try tomorrow

@Janrupf you were so much more on-point than I realized:

$ onnxruntime=$(nix build --refresh --no-link -f "<nixpkgs>" -I nixpkgs=flake:github:SomeoneSerge/nixpkgs/7aebac99c040df73d15198784a6da075e971e4e1 --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" "8.9" ]; allowUnfree = true; }' onnxruntime --print-out-paths)
$ echo $onnxruntime
/nix/store/6id3nj2y9knjgsw4qiz3dlsw7gzwa2c7-onnxruntime-1.22.2
$ nix-store --query --requisites "$onnxruntime" | grep nvcc
$ onnxruntime=$(nix build --refresh --no-link -f "<nixpkgs>" -I nixpkgs=flake:github:SomeoneSerge/nixpkgs/7aebac99c040df73d15198784a6da075e971e4e1 --arg config '{ cudaSupport = true; allowUnfree = true; }' onnxruntime --print-out-paths)
$ nvcc=$(nix-store --query --requisites "$onnxruntime" | grep nvcc)
$ nix why-depends --precise "$onnxruntime" "$nvcc"
/nix/store/hmz27nmhys5pm2f86d0c5793234r8d4r-onnxruntime-1.22.2
└───lib/libonnxruntime_providers_cuda.so: …,rt,@.3dev......! -L /nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-9.......c_nvcc-..../bin/..//lib…
    → /nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-cuda12.8-cuda_nvcc-12.8.93

I.e. this PR disappears the direct reference otherwise present in onnxruntime, but only for cudaCapabilities = [ "8.6" "8.9" ], not for the default larger set

@Janrupf
Copy link
Contributor

Janrupf commented Nov 5, 2025

I honestly think the entire "propagating gcc" think is actually correct for nvcc. Best is probably to find some way to strip out the references since it seems to be just metadata - I'm not sure where there is some case where someone would want to have an nvcc string in the binary. Maybe there is a way to specifically strip the references from the fatbin sections?

sliedes added a commit to sliedes/nixpkgs that referenced this pull request Nov 8, 2025
This is necessitated by the change made in NixOS#457424.
@SomeoneSerge SomeoneSerge mentioned this pull request Nov 10, 2025
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

6.topic: cuda Parallel computing platform and API 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 101-500 This PR causes between 101 and 500 packages to rebuild on Linux. 11.by: package-maintainer This PR was created by a maintainer of all the package it changes. 12.approvals: 1 This PR was reviewed and approved by one person. 12.approved-by: package-maintainer This PR was reviewed and approved by a maintainer listed in any of the changed packages.

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

4 participants