Skip to content

cudaPackages.cudnn: migrate to redist cuda, fix missing zlib#168748

Merged
FRidh merged 7 commits intoNixOS:masterfrom
SomeoneSerge:fix-cudnn-zlib
Apr 19, 2022
Merged

cudaPackages.cudnn: migrate to redist cuda, fix missing zlib#168748
FRidh merged 7 commits intoNixOS:masterfrom
SomeoneSerge:fix-cudnn-zlib

Conversation

@SomeoneSerge
Copy link
Contributor

@SomeoneSerge SomeoneSerge commented Apr 15, 2022

Description of changes
  • Removes cudnn's dependency on (runfile-based) cudatoolkit, except for temporarily keeping it in propagatedBuildInputs
  • Adds a checkPhase that compares Runpath against DT_NEEDED via ldd
    • Try and replace it with autoPatchelfHook?
  • Adds missing zlib dependency to Runpath

The mentioned missing dependency leads to runtime errors when evaluating jit-traced models in cuda-enabled pytorch

CC @NixOS/cuda-maintainers

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.05 Release Notes (or backporting 21.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

@ofborg ofborg bot added 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. labels Apr 15, 2022
@SomeoneSerge
Copy link
Contributor Author

Since I've removed propagatedBuildInputs after all, let's wait for the builds to finish: https://hercules-ci.com/github/SomeoneSerge/nixpkgs-unfree/jobs/397

@SomeoneSerge
Copy link
Contributor Author

I've copied autoAddOpenGLRunpath approach over from #168755 and rebuilt the derivation: all seems fine, /run/opengl-driver/lib in place, ldd has no complaints.

Also re-runing CI.

@SomeoneSerge SomeoneSerge marked this pull request as ready for review April 15, 2022 07:23
@ofborg ofborg bot added 8.has: clean-up This PR removes packages or removes other cruft 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. and removed 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. labels Apr 15, 2022
@github-actions github-actions bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: changelog This PR adds or changes release notes 8.has: documentation This PR adds or changes documentation labels Apr 15, 2022
@SomeoneSerge SomeoneSerge requested review from FRidh and samuela April 15, 2022 14:04
@SomeoneSerge

This comment was marked as outdated.

@SomeoneSerge SomeoneSerge force-pushed the fix-cudnn-zlib branch 2 times, most recently from 5818564 to a1feea7 Compare April 15, 2022 14:30
@SomeoneSerge SomeoneSerge marked this pull request as draft April 15, 2022 17:27
@github-actions github-actions bot removed 8.has: documentation This PR adds or changes documentation 8.has: changelog This PR adds or changes release notes 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS labels Apr 15, 2022
@SomeoneSerge SomeoneSerge marked this pull request as ready for review April 15, 2022 18:04
@ofborg ofborg bot added 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. and removed 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. labels Apr 15, 2022
@SomeoneSerge SomeoneSerge marked this pull request as draft April 15, 2022 19:19
@SomeoneSerge
Copy link
Contributor Author

SomeoneSerge commented Apr 15, 2022

Ok, autoPatchelf has removed $ORIGIN and we can't dlopen("libcudnn_ops_infer.so") again (I guess...)

@SomeoneSerge SomeoneSerge marked this pull request as ready for review April 15, 2022 20:46
@SomeoneSerge SomeoneSerge changed the title cudaPackages.cudnn: migrate to redist cuda and fix missing dependencies cudaPackages.cudnn: migrate to redist cuda, fix missing zlib Apr 15, 2022
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move it to args, but I'd have to split 8.3.2.44 into two instances - pre and post cuda 11.4 - and adjust the selector expression smh... I'll keep it this way, I guess

@SomeoneSerge
Copy link
Contributor Author

@SuperSandro2000 @FRidh would you have any objections now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May seem redundant, but the idea is to repeat a pattern used elsewhere anyway, e.g. in pytorch. The pattern being: that packages use cuda as a "toolchain" with chosen hosts and targets, and we decide in the derivation which modules to include in the toolchain. Runfile being a maybe-non-redist throw-all-in toolchain.

Higher-level goal: I want people like Sandro (maintainers but out of narrow cuda context) not to get confused when just skimming through changes

@SomeoneSerge
Copy link
Contributor Author

@FRidh @SuperSandro2000 I'd like to either get new change requests or a merge, because right now pytorch and cudnn are still broken (runtime errors) on master1

Footnotes

  1. ...I'm still trying to distill my failure example into something I can publish, but so far not too successful

@FRidh
Copy link
Member

FRidh commented Apr 18, 2022

warnings are not allowed and need to be removed. Other than that, it looks good to me.

@SomeoneSerge
Copy link
Contributor Author

SomeoneSerge commented Apr 18, 2022

I removed the warnings, rebased and squashed the branch to remove the noise, fixed an error with cudnn_7_6_5 that I only noticed with nixpkgs-review.

It seems the full nixpkgs-review is going to take ages (triggered tensorflow, jaxlib, caffe, etc).
I think I may have broken katago - it may have relied on cudnn's propagatedBuildInputs.
I'll see if I can fix it quick, or if I should mark it as conditionally broken

EDIT: actually why, I'll just give katago the old cudatoolkit in buildInputs
EDIT2: now katago finds cuda compiler just fine, but otherwise it was broken on master anyway

SomeoneSerge and others added 5 commits April 18, 2022 19:32
instead of custom find ... -exec ldd | grep routine
mark libcudnn_cnn_infer.so as needed for libcudnn.so on cudnn>=8.0.5
- a hint for autoPatchelf, as an alternative to manually adding $ORIGIN
as a more common way to use addOpenGLRunpath and autoPatchelf with cudaPackages
...since cudnn is part of the cuda package set

- introduces the scary useCudatoolkitRunfile function argument
  to discourage usage of the runfile-based cudatoolkit
- instead of the rather hidden useRedist term in let ... in
- repeats cudatoolkit_root pattern after cuda_joined in pytorch &c
  (the "toolchain view")
- redist packages are marked optional to support cuda<11.4 where the
  attributes for redist packages do not exist
Co-authored-by: Sandro <sandro.jaeckel@gmail.com>
@SomeoneSerge
Copy link
Contributor Author

12 packages updated:
caffe caffe caffe katago python3.10-cupy python3.10-pytorch python3.10-Theano python3.9-cupy python3.9-jaxlib python3.9-pytorch python3.9-tensorflow-gpu python3.9-Theano

@SomeoneSerge
Copy link
Contributor Author

@FRidh Ready

@ofborg ofborg bot requested a review from OmnipotentEntity April 18, 2022 16:48
Copy link
Member

@samuela samuela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes LGTM. @FRidh @SuperSandro2000 are we good to merge?

@SomeoneSerge SomeoneSerge mentioned this pull request Apr 18, 2022
15 tasks
@FRidh FRidh merged commit d924de5 into NixOS:master Apr 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

6.topic: cuda Parallel computing platform and API 8.has: clean-up This PR removes packages or removes other cruft 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants