Skip to content

python3Packages.torch: allow lazy loading libnvrtc#297590

Closed
SomeoneSerge wants to merge 5 commits intoNixOS:masterfrom
SomeoneSerge:fix/python3Packages.torch/libnvrtc
Closed

python3Packages.torch: allow lazy loading libnvrtc#297590
SomeoneSerge wants to merge 5 commits intoNixOS:masterfrom
SomeoneSerge:fix/python3Packages.torch/libnvrtc

Conversation

@SomeoneSerge
Copy link
Contributor

@SomeoneSerge SomeoneSerge commented Mar 21, 2024

Description of changes

Fixes #296179, refactors autoAddDriverRunapth and autoAddCudaCompatHook

CC @yannham

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@SomeoneSerge SomeoneSerge added the 6.topic: cuda Parallel computing platform and API label Mar 21, 2024
@github-actions github-actions bot added the 6.topic: python Python is a high-level, general-purpose programming language. label Mar 21, 2024
Comment on lines +2 to +10
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ofborg ofborg bot requested review from samuela, teh, thoughtpolice and tscholak March 21, 2024 01:18
@ofborg ofborg bot added 10.rebuild-darwin: 101-500 This PR causes between 101 and 500 packages to rebuild on Darwin. 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. 10.rebuild-linux: 501-1000 This PR causes many rebuilds on Linux and should normally target the staging branches. labels Mar 21, 2024
@ryanswrt
Copy link

I tried this against my setup using torch-bin and the issue persists; do the fixes in torch/default.nix need to be applied to bin.nix?

Copy link
Contributor Author

@SomeoneSerge SomeoneSerge Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this against my setup using torch-bin and the issue persists; do the fixes in torch/default.nix need to be applied to bin.nix?

I didn't look at torch-bin although it's only natural it should be broken the same way.
I also noticed a typo in the setup-hook because of which the libnvrtc path got lost in torch too.

Suggested change
newPath="${newPath}${elfAddRunpathSuffix:+:}${elfAddRunpathSuffix}"
newPath="${newPath}${elfAddRunpathsSuffix:+:}${elfAddRunpathsSuffix}"

Copy link
Contributor

@yannham yannham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood correctly, instead of patching elf files for each path, we accumulate all the paths in an array and patch everything at once in the end, letting us change the order.

I really like this approach. We can only specify "insert before" currently, which isn't a complete way to control the path ordering, but I guess it's amply sufficient in practice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if autoFixElfFiles is only used there, I wonder if it's not simpler to inline it. As much as I liked my higher-order solution 😛 (or, another solution if we want to keep the code nicely modular in those different bash functions, would be to hardcode elfAddRunpathsAction instead of fixAction). Unless you envision autoFixElfFiles to be useful for other things?

@wegank wegank added the 12.approvals: 1 This PR was reviewed and approved by one person. label Mar 26, 2024
@SomeoneSerge SomeoneSerge force-pushed the fix/python3Packages.torch/libnvrtc branch from 9294ff8 to 5ca0617 Compare March 26, 2024 17:54
@SomeoneSerge SomeoneSerge force-pushed the fix/python3Packages.torch/libnvrtc branch from 5ca0617 to 026df8f Compare March 26, 2024 21:24
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

omg, I recovered the string-splitting in the last force-push, but I broke cuda_compat again because I do the string-splitting too late 🙈

@SomeoneSerge SomeoneSerge marked this pull request as draft March 26, 2024 21:51
@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label Apr 5, 2024
@breakds
Copy link
Contributor

breakds commented May 28, 2024

Wondering is this PR mainly pending resolving the conflicts? I see the comments from #296179 about jetson being affected - is that still a concern?

Thanks!

@SomeoneSerge
Copy link
Contributor Author

SomeoneSerge commented May 28, 2024

Wondering is this PR mainly pending resolving the conflicts?
@breakds

Pretty much, the PR stalled after the hooks were moved from cudaPackages to the top-level. We'd have to start over. But also, the fix to the pytorch problem specifically amounts to adding "${lib.getLib cudaPackages.cuda_nvrtc}/lib" to the RUNPATHs, that can be temporarily hacked in without touching autoFixElfFiles or anything...

EDIT: I'm also getting more skeptical about the bash approach, maybe we should've gone with e.g. python3Minimal (or perl, whatever is cheaper to bootstrap)

@breakds
Copy link
Contributor

breakds commented May 28, 2024

@SomeoneSerge Thanks for the clarification!

Pretty much, the PR stalled after the hooks were moved from cudaPackages to the top-level.

Is this related to the following warnings that I saw recently?

trace: warning: cudaPackages.autoAddDriverRunpath is deprecated, use pkgs.autoAddDriverRunpath instead
trace: warning: cudaPackages.autoAddDriverRunpath is deprecated, use pkgs.autoAddDriverRunpath instead
trace: warning: cudaPackages.autoFixElfFiles is deprecated, use pkgs.autoFixElfFiles instead
trace: warning: cudaPackages.autoAddOpenGLRunpathHook is deprecated, use pkgs.autoAddDriverRunpath instead

the fix to the pytorch problem specifically amounts to adding "${lib.getLib cudaPackages.cuda_nvrtc}/lib" to the RUNPATHs

I saw your fix mentioned in another thread which seems like a good temporary solution. Is it not acceptable to patch it like that at this moment?

I'm also getting more skeptical about the bash approach

I also find that bash scripts are not the easiest to read and to maintain ... but it is probably my problem :(

@SomeoneSerge
Copy link
Contributor Author

I saw your fix mentioned in another thread which seems like a good temporary solution. Is it not acceptable to patch it like that at this moment?

Yes and you're welcome to go ahead and implement it!

@CMCDragonkai
Copy link
Member

Would these be relevant to tensorflow?

@SomeoneSerge
Copy link
Contributor Author

Would these be relevant to tensorflow?

Does it dlopen libnvrtc?

@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 2, 2025
Comment on lines +10 to +36
arrayInsertBefore() {
local -n arrayRef="$1" # Namerefs, bash >= 4.3:
local pattern="$2"
local item="$3"
shift 3

if [[ $# -eq 0 ]]; then
echo "addCudaCompatRunpath: no library path provided" >&2
exit 1
elif [[ $# -gt 1 ]]; then
echo "addCudaCompatRunpath: too many arguments" >&2
exit 1
elif [[ "$1" == "" ]]; then
echo "addCudaCompatRunpath: empty library path" >&2
exit 1
else
libPath="$1"
fi
local i
local foundMatch=

origRpath="$(patchelf --print-rpath "$libPath")"
patchelf --set-rpath "@libcudaPath@:$origRpath" "$libPath"
local -a newArray
for i in "${arrayRef[@]}" ; do
if [[ "$i" == "$pattern" ]] ; then
newArray+=( "$item" )
foundMatch=1
fi
newArray+=( "$i" )
done
if [[ -z "$foundMatch" ]] ; then
newArray+=( "$item" )
fi
arrayRef=( "${newArray[@]}" )
}

postFixupHooks+=("autoFixElfFiles addCudaCompatRunpath")

if [[ -n "@libcudaPath@" ]] ; then
arrayInsertBefore elfPrependRunpaths "@driverLink@/lib" "@libcudaPath@"
fi
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related: #385960

@SomeoneSerge
Copy link
Contributor Author

SomeoneSerge commented Nov 13, 2025

Stale for a long time, need to revisit the motivation and the situation, etc, etc.

EDIT: Maybe closed prematurely? #461334

@github-project-automation github-project-automation bot moved this from New to ✅ Done in CUDA Team Nov 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2.status: merge conflict This PR has merge conflicts with the target branch 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md 6.topic: cuda Parallel computing platform and API 6.topic: python Python is a high-level, general-purpose programming language. 10.rebuild-darwin: 101-500 This PR causes between 101 and 500 packages to rebuild on Darwin. 10.rebuild-linux: 501-1000 This PR causes many rebuilds on Linux and should normally target the staging branches. 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. 12.approvals: 1 This PR was reviewed and approved by one person.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

python3Packages.torchWithCuda: lazy loading nvrtc

6 participants