cudaPackages: point nvcc at a compatible -ccbin #218265
samuela merged 13 commits into NixOS:master
Conversation
pkgs/development/compilers/cudatoolkit/redist/build-cuda-redist-package.nix
Result of nixpkgs-review:
3 packages marked as broken and skipped
5 packages failed to build
69 packages built
Would be more interesting to test [ "8.6" "7.5" ]. Should this preserve the order? Should this print a warning?
It's my opinion that capabilities should be sorted, so I would want the order of the output to be invariant with respect to the order of the input (which should already be sorted). Although, I'd love to hear other views!
The way we handle this parameter now, the order is significant. It's our semi-implicit convention that the last element goes into PTX. Maybe the takeaway is rather that we don't want this to be implicit :)
Hm, I think you're right there -- the last capability in the list shouldn't be the one which gets turned into a virtual architecture.
Although, I do like the idea of having them ordered so packages can decide what to build for. For example, Magma doesn't support 8.6/8.9, so I can imagine at some point in the future Magma iterating over the list of cuda capabilities to find the greatest lower bound (in Magma's case, 8.0) and building for that architecture.
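For illustration, a minimal sketch of how the "last element goes into PTX" convention could be made explicit instead of implicit; the cudaCapabilities input and the flag shapes here are assumptions for this sketch, not the PR's actual code:

```nix
# Hypothetical sketch: derive -gencode flags from a capability list,
# computing the PTX (virtual) target explicitly instead of relying on
# the order of the input list.
let
  lib = (import <nixpkgs> {}).lib;
  cudaCapabilities = [ "7.5" "8.6" ]; # assumed input
  dropDot = builtins.replaceStrings [ "." ] [ "" ];
  # Explicitly pick the highest capability for the PTX target.
  maxCap = lib.last (lib.sort lib.versionOlder cudaCapabilities);
  real = map
    (cap: "-gencode=arch=compute_${dropDot cap},code=sm_${dropDot cap}")
    cudaCapabilities;
  ptx = [ "-gencode=arch=compute_${dropDot maxCap},code=compute_${dropDot maxCap}" ];
in
lib.concatStringsSep " " (real ++ ptx)
```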
The gcc/stdenv distinction here is subtle enough that I believe it deserves a comment
How about this one?
I guess my confusion was rather why reference gcc directly instead of accessing it through stdenv
Oh, right. So, I should explain that gcc here is what we override in extension.nix based on versions.toml.
Actually, maybe I should override stdenv too? Like so
could we get away with only overriding stdenv and then pulling the gcc version from that stdenv?
@samuela Good. I'm thinking about exposing that stdenv in cudaPackages (rather than cudatoolkit) then. I, however, feel uneasy about exposing it as cudaPackages.stdenv because it might affect people's expectations... E.g. since clangStdenv contains clang, people might think cudaPackages.stdenv contains nvcc
Alt names I'm thinking of: cudaStdenv (might be misinterpreted the same way), backendStdenv (exactly what it is, but hard to pronounce 🤣 ). Do you like any?
Good point! I agree that cudaStdenv could be slightly misleading. It's a little tricky to name... Maybe matchingStdenv? compatibleStdenv? Idk I'm happy with whatever you feel is most appropriate
I kept the ugly name, because "backend for nvcc" seemed like the clearest description...
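For context, a minimal sketch of what such a backendStdenv amounts to; the gcc11 pin here is an assumption standing in for whatever versions.toml dictates:

```nix
# Hypothetical sketch: a stdenv whose C++ compiler is the one this CUDA
# release supports, so nvcc's -ccbin backend and libstdc++ agree.
{ pkgs ? import <nixpkgs> {} }:
let
  nvccCompatibleCC = pkgs.gcc11; # assumed pin; really driven by versions.toml
in
pkgs.overrideCC pkgs.stdenv nvccCompatibleCC
```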
This is needed for faster builds when debugging the opencv derivation, and it's more consistent with other cuda-enabled packages.

-DCUDA_GENERATION seems to expect architecture names, so we refactor cudaFlags to facilitate easier extraction of the configured archnames.
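As an aside, extracting the configured architecture names could look something like this sketch; the capability-to-name mapping shown is partial and assumed for illustration:

```nix
# Hypothetical sketch: recover architecture *names* (what
# -DCUDA_GENERATION expects) from capabilities.
let
  lib = (import <nixpkgs> {}).lib;
  archNames = { "7.5" = "Turing"; "8.0" = "Ampere"; "8.6" = "Ampere"; };
  capToName = cap: archNames.${cap} or (throw "unknown capability: ${cap}");
in
lib.unique (map capToName [ "8.0" "8.6" ]) # => [ "Ampere" ]
```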
Make tensorflow (and a bunch of other things) use a CUDA-compatible toolchain. Introduces cudaPackages.backendStdenv.
Co-authored-by: Connor Baker <ConnorBaker01@Gmail.com>
Wooo, thanks for seeing this through @SomeoneSerge! diff LGTM. @ConnorBaker are you ok with these changes? I saw that you two still have some convos open.
@samuela looks good to me! @SomeoneSerge thank you for all the work you put into this :)
ConnorBaker left a comment
Looks good! Just questions for future reference/work
  minArch' = builtins.head (builtins.sort builtins.lessThan cudaArchitectures);
in
# If this fails some day, something must've changed and we should re-validate our assumptions
assert builtins.stringLength minArch' == 2;
Nit for later: This is lexicographic sorting right? Won't we run into issues starting with Blackwell (post-Hopper) because we'll have capabilities starting with a one? E.g., "100" < "50".
Again, just for future stuff! We'd have until at least 2024 before this becomes a problem, and that's assuming they keep the same naming scheme.
My preference is to see this PR merged sooner rather than later so I can work on rebasing my PRs ;)
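For the record, a sketch of a numerically-aware sort that would survive three-digit capabilities (the capability values here are made up); lib.versionOlder compares dotted version strings component-wise:

```nix
# Hypothetical sketch: sort capabilities as versions, not strings, so
# that "10.0" no longer sorts before "5.0".
let
  lib = (import <nixpkgs> {}).lib;
  caps = [ "8.6" "7.5" "10.0" ]; # made-up Blackwell-era input
  sorted = lib.sort lib.versionOlder caps; # => [ "7.5" "8.6" "10.0" ]
in
builtins.head sorted # => "7.5", even with a three-digit capability present
```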
-  nativeBuildInputs = [ which addOpenGLRunpath ];
+  nativeBuildInputs = [
+    which
+    addOpenGLRunpath
Would the autoAddOpenGLRunpathHook also work here, or do we need to manually invoke it in postFixup?
autoAddOpenGLRunpathHook should work, but we better test
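If it does work, usage would presumably collapse to just listing the hook; an untested sketch with placeholder package fields:

```nix
# Hypothetical sketch (untested): let the hook patch ELF runpaths during
# fixupPhase instead of calling addOpenGLRunpath by hand in postFixup.
{ stdenv, which, autoAddOpenGLRunpathHook }:
stdenv.mkDerivation {
  pname = "example-cuda-redist"; # placeholder
  version = "0.0.0";
  dontUnpack = true; # sketch only, no real source
  installPhase = "mkdir -p $out";
  nativeBuildInputs = [
    which
    autoAddOpenGLRunpathHook # replaces the manual addOpenGLRunpath step
  ];
}
```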
Another note for later (but not to delay the merge):
In the interest of keeping things moving I went ahead and merged. AFAIU there are still 4 things left as TODOs for future PRs, however:
Thanks @SomeoneSerge!
cat <<EOF >> $out/nix-support/setup-hook
cmakeFlags+=' -DCUDA_TOOLKIT_ROOT_DIR=$out'
cmakeFlags+=' -DCUDA_HOST_COMPILER=${backendStdenv.cc}/bin'
cmakeFlags+=' -DCMAKE_CUDA_HOST_COMPILER=${backendStdenv.cc}/bin'
NOTE: nvidia/thrust treats this as a path to the executable, not parent directory
TODO: check if maybe nvidia/thrust actually does this right
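If thrust really does expect an executable, the fix would presumably point at the wrapped compiler binary rather than its bin/ directory; a sketch, assuming backendStdenv is in scope and that the cc-wrapper exposes bin/g++:

```nix
# Hypothetical sketch: pass the compiler executable itself where thrust
# expects an executable path, not its parent directory. The bin/g++
# layout of the cc-wrapper is an assumption.
postFixup = ''
  cat <<EOF >> $out/nix-support/setup-hook
  cmakeFlags+=' -DCUDA_HOST_COMPILER=${backendStdenv.cc}/bin/g++'
  cmakeFlags+=' -DCMAKE_CUDA_HOST_COMPILER=${backendStdenv.cc}/bin/g++'
  EOF
'';
```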
EDIT of 2023-04-01: linking the old libstdc++ was a mistake; we should link the newest possible libstdc++, even if we use an older compiler; libstdc++ is backwards-compatible in the sense that if a process loads a newer libstdc++ first, and then loads a library built against an older libstdc++, everything should work just fine
This PR is a rather dirty hot-fix needed to resume caching cuda-enabled packages. I hope we can just merge this and remove the scum later
This PR specifically does not address the issue of downstream packages (like torch) consuming several major versions of gcc and libstdc++ (one from stdenv, and one from cudatoolkit.cc).

Desiderata
- NVCC uses a compatible backend by default. E.g. cuda11 uses gcc11 and links to gcc11.cc.lib, even if stdenv is at gcc12. This means successful buildPhase-s.
- No runtime linkage errors. E.g. no libstdc++.so.6: version `GLIBCXX_3.4.30' not found after a successful build. This means successful checkPhase-s.
- In default configuration, a downstream package's runtime closure only includes one toolchain. E.g. we don't link cached packages against multiple versions of libstdc++.so at once, and maybe there's a warning if we accidentally try to. This means smaller closures and fewer conflicts.

Description of changes
This is a hot-fix to un-break cuda-enabled packages (like tensorflow, jaxlib, faiss, opencv, ...) after the gcc11->gcc12 bump. We should probably build the whole downstream packages with a compatible stdenv (such as gcc11Stdenv for cudaPackages_11), but just pointing nvcc at the right compiler seems to do the trick
We already used this hack for non-redist cudatoolkit. Now we use it more consistently.
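Concretely, the hack amounts to a setup-hook line along these lines (a sketch, assuming backendStdenv is in scope; the exact wording in the PR may differ):

```nix
# Hypothetical sketch: make nvcc default to the compatible host
# compiler; a second -ccbin appended later by e.g. cmake is what
# produces the warnings noted below.
postFixup = ''
  echo "export NVCC_PREPEND_FLAGS+=' -ccbin=${backendStdenv.cc}/bin'" \
    >> $out/nix-support/setup-hook
'';
```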
This commit also re-links cuda packages against libstdc++ from the same "compatible" gcc, rather than the current stdenv. We didn't test if this is necessary -> need to revise in further PRs
NOTE: long-term we should make it possible to override -ccbin and use e.g. clang
NOTE: the NVCC_PREPEND_FLAGS line pollutes build logs with warnings when e.g. cmake appends another -ccbin

Things done
- Built: python3Packages.jaxlib, opencv4, python3Packages.opencv4, faiss, python3Packages.torch
- python3Packages.torchvision: still fails
- Ran nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD" (note: all changes have to be committed; also see nixpkgs-review usage)
- Tested basic functionality of all binary files (usually in ./result/bin/)

Related
Notify maintainers
@NixOS/cuda-maintainers @ConnorBaker @mcwitt