cudaPackages: fix regression, propagating too much #457424
GaetanLepage merged 1 commit into NixOS:master
Not sure if you have seen that I edited my comment, this is my guess as to how they ended up there |
|
Good news: with this build, I see no reference to nvcc in |
|
What exactly were you grepping for in |
|
@Janrupf for the hash part :)
$ onnxruntime=$(nix build --no-link -f "<nixpkgs>" -I nixpkgs=flake:github:SomeoneSerge/nixpkgs/fix/cuda13/firefox --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.9" ]; allowUnfree = true; }' onnxruntime --print-out-paths)
$ nix-store --query --references "$onnxruntime" | grep nvcc
$ nix-store --query --requisites "$onnxruntime" | grep nvcc
/nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-cuda12.8-cuda_nvcc-12.8.93 |
EDIT: THERE'S STILL MORE |
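The two queries above differ in scope: `--references` lists only the direct references of a path, while `--requisites` lists its whole closure. An empty result from the first and a hit from the second means nvcc is pulled in through some other dependency. A minimal sketch of that check, where `classify_dep` is a hypothetical helper name (not part of Nix) that only compares two already-collected path lists:

```shell
# classify_dep DIRECT_LIST CLOSURE_LIST PATTERN
# Hypothetical helper: reports whether PATTERN matches a direct reference,
# only a transitive one, or nothing at all. Both lists are newline-separated
# store paths, e.g. the output of `nix-store --query --references` and
# `nix-store --query --requisites`.
classify_dep() {
  direct=$1
  closure=$2
  pattern=$3
  if printf '%s\n' "$direct" | grep -q -- "$pattern"; then
    echo direct
  elif printf '%s\n' "$closure" | grep -q -- "$pattern"; then
    echo transitive
  else
    echo absent
  fi
}
```

With Nix available it could be called as `classify_dep "$(nix-store --query --references "$onnxruntime")" "$(nix-store --query --requisites "$onnxruntime")" nvcc`.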
|
Might also be worth testing with multiple compute capabilities enabled: the problem with the hashes occurred in fatbin sections, and I think those are only present when multiple compute capabilities are enabled? EDIT: If I read section 5.6.2 of the compiler documentation at https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/ correctly, it does sound like a fatbin is only emitted when multiple compute capabilities are enabled |
Code moved around in CUDA13 PR also changed actual behaviour
6311cc4 to 7aebac9
$ onnxruntime=$(nix build --refresh --no-link -f "<nixpkgs>" -I nixpkgs=flake:github:SomeoneSerge/nixpkgs/fix/cuda13/firefox-x --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" "8.9" ]; allowUnfree = true; }' onnxruntime --print-out-paths)
$ nix-store --query --requisites "$onnxruntime" | grep nvcc
EDIT: Running a firefox build to make sure I'm not hallucinating |
|
I'm happy as long as the reference is gone, let's hope this doesn't magically reappear when building the full config, as is done for the cache lmao |
|
Good point, I'll start a full build as well |
|
Alright, I guess we're watching this one: https://hydra.nixos-cuda.org/build/2901 (@GaetanLepage, added a one-off jobset... damn it's nice to just immediately have access to hardware) |
|
Still fails :'( |
* flake.lock: Update
* home/programs: du-dust was renamed to dust
* home/firefox: override firefox package to fix build with cudaSupport
  NixOS/nixpkgs#457424
---------
Co-authored-by: GaetanLepage <33058747+GaetanLepage@users.noreply.github.com>
Co-authored-by: Gaetan Lepage <gaetan@glepage.com>
$ firefox=$(nix build --refresh --no-link -f "<nixpkgs>" -I nixpkgs=flake:github:SomeoneSerge/nixpkgs/fix/cuda13/firefox --arg config '{ cudaSupport = true; allowUnfree = true; }' firefox-unwrapped --print-out-paths)
$ nvcc=$(nix-store --query --requisites "$firefox" | grep nvcc)
$ nix why-depends --precise "$firefox" "$nvcc"
/nix/store/7bhgsk9ck0vpbzyhy0zlvywnl0wzlisv-firefox-unwrapped-144.0.2
└───lib/firefox/libonnxruntime.so: …y9-protobuf-32.1/lib:/nix/store/d5y203f75h6ygm1cr3fbnwh5sn8fa668-cuda12.8-nccl-2.28.7-1/lib:/nix…
→ /nix/store/d5y203f75h6ygm1cr3fbnwh5sn8fa668-cuda12.8-nccl-2.28.7-1
└───lib/libnccl.so.2.28.7: …-m 64 -l:..&devrt -L /nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-9..l....c_nvcc-..../bin/..//lib…
→ /nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-cuda12.8-cuda_nvcc-12.8.93 |
|
I bisected the presence of |
|
On current |
|
Combining this PR and #457803 works! |
|
I wonder if #457803 approach would be preferable? As in, is it actually a good idea to propagate |
Well, #457803 fixes the illegitimate Hence, we really need both PRs to fix the |
|
What I mean is whether this PR should actually also be changed to strip out the references. The problem is that I don't understand why the changes in this PR prevent the references from showing up in the first place. As far as I can tell they are meta/debug information placed there by the compiler, and I'm missing the connection between removing the propagated inputs and the references going missing. As far as I can tell a CUDA runtime library should never reference NVCC, and thus the issue may be lurking in other packages as well where it was just not found yet. |
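For context on why such metadata matters at all: Nix registers a runtime reference whenever the 32-character hash part of a store path appears anywhere in an output file, whether or not the rest of the path is intact. That matches the `why-depends` output in this thread, where the name after the nvcc hash is garbled but the hash itself is still readable. A minimal sketch of that kind of scan, with `store_hash` and `mentions_hash` as hypothetical helper names and assuming a `grep` that supports `-a` (GNU or BSD):

```shell
# store_hash PATH: print the 32-character hash part of a /nix/store path.
store_hash() {
  basename "$1" | cut -c1-32
}

# mentions_hash FILE PATH: succeed if FILE contains the hash part of PATH
# anywhere, treating binary content as text (-a) and the hash as a fixed
# string (-F). The name following the hash is deliberately ignored.
mentions_hash() {
  grep -a -q -F -- "$(store_hash "$2")" "$1"
}
```

For example, `mentions_hash "$onnxruntime/lib/libonnxruntime_providers_cuda.so" "$nvcc"` would succeed for the build shown later in this thread, assuming the variables are set as in those transcripts.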
My hypothesis was NVCC might have started recording flags it picks up from
Yes, I think we merge this and #457803? |
|
This PR breaks building |
|
@UlyssesZh probably the |
|
Could you please open a PR for that? |
|
@UlyssesZh I'll try tomorrow
@Janrupf you were so much more on-point than I realized:
$ onnxruntime=$(nix build --refresh --no-link -f "<nixpkgs>" -I nixpkgs=flake:github:SomeoneSerge/nixpkgs/7aebac99c040df73d15198784a6da075e971e4e1 --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" "8.9" ]; allowUnfree = true; }' onnxruntime --print-out-paths)
$ echo $onnxruntime
/nix/store/6id3nj2y9knjgsw4qiz3dlsw7gzwa2c7-onnxruntime-1.22.2
$ nix-store --query --requisites "$onnxruntime" | grep nvcc
$ onnxruntime=$(nix build --refresh --no-link -f "<nixpkgs>" -I nixpkgs=flake:github:SomeoneSerge/nixpkgs/7aebac99c040df73d15198784a6da075e971e4e1 --arg config '{ cudaSupport = true; allowUnfree = true; }' onnxruntime --print-out-paths)
$ nvcc=$(nix-store --query --requisites "$onnxruntime" | grep nvcc)
$ nix why-depends --precise "$onnxruntime" "$nvcc"
/nix/store/hmz27nmhys5pm2f86d0c5793234r8d4r-onnxruntime-1.22.2
└───lib/libonnxruntime_providers_cuda.so: …,rt,@.3dev......! -L /nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-9.......c_nvcc-..../bin/..//lib…
→ /nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-cuda12.8-cuda_nvcc-12.8.93
I.e. this PR disappears the direct reference otherwise present in |
|
I honestly think the entire "propagating gcc" thing is actually correct for nvcc. Best is probably to find some way to strip out the references, since they seem to be just metadata. I'm not sure there is any case where someone would want to have an nvcc string in the binary. Maybe there is a way to specifically strip the references from the fatbin sections? |
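One common way to strip a reference without disturbing the file layout is to overwrite the hash with a same-length dummy string, so offsets and sizes stay unchanged; nixpkgs' `remove-references-to` helper works along these lines. The sketch below only illustrates the idea: `scrub_hash` is a hypothetical name, it assumes GNU `sed` (for `-i`), and whether this is safe for fatbin sections specifically has not been verified here.

```shell
# scrub_hash FILE PATH: overwrite every occurrence of the 32-character hash
# part of PATH in FILE with 32 'e' characters. The replacement has the same
# length as the hash, so the file size and all offsets are preserved.
# Assumption: GNU sed; LC_ALL=C keeps sed from choking on non-UTF-8 bytes.
scrub_hash() {
  hash=$(basename "$2" | cut -c1-32)
  LC_ALL=C sed -i "s|$hash|eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee|g" "$1"
}
```

After scrubbing, Nix's reference scan no longer finds the hash in that file, so the path drops out of the closure as long as nothing else refers to it.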
This is necessitated by the change made in NixOS#457424.


Quite likely fixes #457391 #457406, but I'm waiting for the tests to make sure. Even if it does, I'm still analyzing how this could have led to the extraneous reference in onnxruntime.
Regardless of the bearing on onnxruntime and firefox, this is the pre-refactoring logic that probably shouldn't have been changed (we had been propagating strictly less, and that seemed to be sufficient).