fix release-cuda eval errors#379768
Conversation
|
Requires proper scoping of commit messages, if you don't intend them to get squashed. |
|
@mweinelt my bad, I totally forgot about the commit messages. I am fine with either squashing or not. Although, I guess, the changes are big enough to keep the separate commits. The "commit conventions for P.S.: looks like this requires a manual rebase anyway |
d707bff to
c65ed11
Compare
|
I rebased and updated the commit names to be more in line with the commit conventions. Let me know if I missed anything else. |
|
Triggered an eval with these changes: https://hydra.nix-community.org/eval/265502 |
|
@zowoq thanks! Would I be correct in assuming that you've rebased this PR on top of 58edd1e (264446) before running it? Unfortunately, it's a bit hard to verify whether the eval errors decreased as expected since the Also, the |
Yes. I ran it again to get the errors: |
|
👍 Looks correct to me. Here is the same list of errors, but without the stack traces, one line per eval error. As mentioned previously, I'll probably work on the remaining errors in separate PRs. |
|
Rename the jobs from cudaPackages*.cudaPackages*.blah to just cudaPackages*.blah so that they match the nixpkgs attribute paths.
The theano package itself was originally removed in NixOS#313148.
Both `pytorch` and `torch` refer to the same package, but only the second one is the canonical name. The canonical `torch` name is already mentioned once a bit lower in the file.
Both `Keras` and `keras` refer to the same package, but only the second one is the canonical name.
Both `scikitimage` and `scikit-image` refer to the same package, but only the second one is the canonical name.
Using primary/canonical names for packages makes it easier to keep track of stuff. Additionally, this stops the recursion into cudaPackages* from including fake "aliases" for old (unsupported) cuda versions.
c65ed11 to
cf624a8
Compare
|
Rebased to fix merge conflict. |
cf624a8 to
04edda0
Compare
04edda0 to
98e9f51
Compare
|
As discussed in one of the threads and on matrix. Commits in the latest force-push remove the overengineered This should be ready to merge unless someone has any further concerns. |
pkgs/top-level/release-cuda.nix
Outdated
| # of their meta.{hydraPlatforms,platforms,badPlatforms} attributes | ||
| packageSets = builtins.filter (lib.strings.hasPrefix "cudaPackages") (builtins.attrNames pkgs); | ||
| evalPackageSet = pset: mapTestOn { ${pset} = packagePlatforms pkgs.${pset}; }; | ||
| evalPackageSetPlatforms = lib.genAttrs packageSets (pset: packagePlatforms pkgs.${pset}); |
There was a problem hiding this comment.
IIRC eval in evalPackageSet was meant as a verb (it was a function); the new variable is an attrset? packageSets? platformPackageSets?
Does the new code generate identical tree structure? AFAIU python3Packages.tensorflow.x86_64-linux in https://hydra.nix-community.org/job/nixpkgs/cuda/python3Packages.tensorflow.x86_64-linux corresponds to a path in this attrset
There was a problem hiding this comment.
This refactoring is technically no longer needed. We can drop commit 7d9676c, if you want (or we can keep the refactoring, but bikeshed the variable name a bit). This commit does not change any behavior (it just inlines the function call that used to be done on line 162).
I just left this refactor in for now, because I found the original structure a bit unintuitive.
evalPackageSetPlatforms is a nested attrSet where each leaf contains a list of Platforms on which each Package should be evaluated. I admit that the name could use some improvements.
So for example evalPackageSetPlatforms.cudaPackages.cudnn == [ "x86_64-linux" ], but evalPackageSetPlatforms.cudaPackages_12_3.cudnn_8_0 == [ ] (because of the badPlatformConditions).
mapTestOn then converts this attrset into an attrset of actual jobs that Hydra is going to evaluate/build by looking up those attribute paths in pkgs. The only change to the job names were done in a8e9088 - 0c12821 (and those changes should have made the job names match the attr paths better). So yes, the job name should match the attribute path in nixpkgs.
I find the current setup a bit easier to follow, because evalPackageSetPlatforms has the exact same structure as the explicit/literal attrset defined later in the file (with all the somepackage = linux; lines).
Perhaps it should be called something along the lines of packageSetPlatformsToEval or packagePlatformsToEval or packageSetPlatforms.
There was a problem hiding this comment.
There was a problem hiding this comment.
PS I played with
❯ nix repl -I nixpkgsRuro=https://github.com/ruro/nixpkgs/archive/fix-release-cuda-eval-errors.tar.gz
nix-repl> orig = import <nixpkgs/pkgs/top-level/release-cuda.nix> { }
nix-repl> new = import <nixpkgsRuro/pkgs/top-level/release-cuda.nix> { }
...and everything seems good
98e9f51 to
eeaac37
Compare
|
@SomeoneSerge Force pushed the refactoring changes as discussed on matrix. |
|
@ruro thank you so much for your contributions and for the persistence! |
Currently, the nix-community CUDA builder has a bunch of Eval Errors. This PR fixes some of the "lower hanging fruit" among those eval errors.
You can see the list of current eval errors by opening the "Evaluation Errors" tab for the
nixpkgs:cudajobset. Note that the "Evaluation Errors" tab for individual runs is currently broken, so only the jobset-level "Evaluation Errors" tab is populated (which contains evaluation errors for the latest job).You can reproduce the evaluation errors locally by running
(
hydra-eval-jobsis provided by thehydrapackage). When reviewing this PR, I recommend running the above command on each commit starting with the merge base and thendiffing the json data printed onstdoutto verify that each commit does what I claim it does.Things done
cudaPackages*prefix twiceThis is mostly a cosmetic change so that the job name matches the nixpkgs attribute path.
The next bunch of commits remove the accidental use of non-standard aliases for package names and then sets
allowAliases = falseto ensure that aliases aren't used in the future.Theanofrom the list of evaluated packages (Theano was removed from nixpkgs)pytorchwhich is a deprecated alias fortorchwhich is already being evaluatedKeras(deprecated alias) tokerasscikitimage(deprecated alias) toscikit-imageallowAliases = false(this ensures that aliases aren't used accidentally in the future)Notably, the last commit also stops trying to evaluate the deprecated
cudaPackages_10_*package sets and should also automatically stop evaluating CUDA versions prior to 12.0 after 25.05 gets released.release-cuda.nixThe next few commits disable the evaluation of "known" broken derivations by moving some
brokenConditionstobadPlatformConditions:cudnnnsight_systemsnvidia_driverAdditionally, the
cudaPackages.tensorrtpackage is marked withhydraPlatforms = none, because this package is fundamentally impossible to build on Hydra due torequireFiled sources and its non-redistributable LICENSE.tensorrtApplying the above changes should fix 238 out of 259 eval errors that are currently happening in the
nixpkgs:cudajobset on the nix-community builders. The remaining 21 errors are slightly more tricky, so I am leaving them for a later PR.BuiltEvaluated on platform(s)nix.conf? (See Nix manual)sandbox = relaxedsandbox = truenix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/)Add a 👍 reaction to pull requests you find important.