GPU access in the sandbox #256230
Conversation
IIRC this requires the nix-command experimental feature. Should I just assume the first argument is always the path to the `.drv` and read it directly?
When using remote builds, the `.drv` apparently isn't available on the remote builder by the time the pre-build-hook executes.
Thoughts?
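For context, a pre-build hook receives the derivation path as its first argument and answers on stdout. The sketch below is purely illustrative (it is not the hook from this PR, and the grep heuristic over the ATerm text is an assumption about how one might detect the tag):

```shell
# Hypothetical sketch of the two halves of such a hook. Nix invokes the hook
# as `hook <drv-path> [<sandbox-build-dir>]`.

wants_cuda() {
  # Crude heuristic: .drv files are ATerm text, so grep for the feature tag.
  # $1: path to a .drv file
  grep -q '"cuda"' "$1"
}

emit_mounts() {
  # Reply Nix understands: a list of extra paths to bind-mount into the
  # sandbox, terminated by an empty line.
  echo "extra-sandbox-paths"
  echo "/dev/nvidiactl"
  echo "/run/opengl-driver/lib"
  echo ""
}
```

A real hook would combine the two: scan `$1`, and print the mount directives only on a match.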
I think it boils down to this feature/optimization:
When you are privileged (or building a content-addressed derivation), you can build a derivation without uploading all the .drv closure first. This is of course less secure (because you can forge arbitrary output hashes) but faster if the closure contains very large files as inputs. Or if the closure is large. The .drv you send is not trusted, nor trustable, and not written to the builder store. Just used for building and sending back the result.
When you are not privileged (and not building a content-addressed derivation), you need to send the whole closure, and the .drv file will be present, as its validity can be recursively verified.
See https://github.com/NixOS/nix/blob/d12c614ac75171421844f3706d89913c3d841460/src/build-remote/build-remote.cc#L302-L334 And the referenced comment https://github.com/NixOS/nix/blob/d12c614ac75171421844f3706d89913c3d841460/src/libstore/daemon.cc#L589-L622
samuela left a comment
you a real og for this one @SomeoneSerge !! 🚀
is there a way we could avoid having a stringly-typed API here? eg something like
```nix
requiredSystemFeatures = [ "cuda" ];
# vs. something like:
requiredSystemFeatures = [ system-features.cuda ];
```
I'm a fan of whatever works, but I think avoiding a stringly-typed interface would be preferable, ceteris paribus.
There's no precedent for that AFAIK, but we could introduce some attribute in cudaPackages.
A related concern that I have (mentioned in the linked issue) is that in principle I'd like to have the feature name indicate the relevant cuda capabilities, but there are obvious complications...
Actually, let me clarify: I made a preset with "cuda" because it seemed like the simplest string I could come up with. I'm wary of introducing other names or more infra around these names, because it's easy to slip down the rabbit hole. One obvious rabbit hole is that you could start making derivations that ask for `requiredSystemFeatures = let formatFeature = capabilities: concatMapStringsSep "-or-" (x: "sm_${x}") capabilities; in [ (formatFeature cudaCapabilities) ]` and deploying hosts that declare `programs.nix-required-mounts.allowedPatterns.cuda.onFeatures = map formatFeature (genSubsets cudaCapabilities)`.
The final value, at least for now, would still be a string, but we could define somewhere a dictionary of valid literals and have people reference them rather than write the strings directly. It could be as silly as `cudaPackages.cudaFeature = "cuda"` and `requiredSystemFeatures = [ cudaPackages.cudaFeature ]`. We could then even extend this to `cudaFeature = f cudaFlags`.
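A minimal sketch of that idea (the attribute name `cudaFeature` is hypothetical, as discussed above):

```nix
# Hypothetical: cudaPackages carries the valid literal, so use sites never
# spell out the raw string.
{
  cudaPackages.cudaFeature = "cuda";
}
# At a use site:
#   requiredSystemFeatures = [ cudaPackages.cudaFeature ];
```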
But what are your thoughts on modeling the capabilities?
I think `(import <nixpkgs> { config.cudaCapabilities = [ "8.6" ]; }).blender.tests.cuda-available.requiredSystemFeatures = [ "cuda-sm_86" ]` would be really great, but we run into the limitations of this mechanism very soon.
(Nix with an external scheduler when)
I think we should model capabilities -- they're essentially features of the hardware which determine how different pieces of software should be built and whether or not they can run on the system.
I do think it's a bit out of scope of this PR though; @SomeoneSerge would you make an issue so we can track this and have a longer discussion there?
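For reference, the capability-derived feature names floated above could be computed along these lines (purely illustrative; `genFeatures` and the naming scheme are assumptions, not part of this PR):

```nix
{ lib }:
{
  # "8.6" -> "cuda-sm_86", matching the hypothetical
  # requiredSystemFeatures = [ "cuda-sm_86" ] from the comment above.
  genFeatures = capabilities:
    map (cap: "cuda-sm_" + lib.replaceStrings [ "." ] [ "" ] cap) capabilities;
}
```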
Otherwise we crash Ofborg
An unwrapped check for `nix run`-ing on the host platform,
instead of `nix build`-ing in the sandbox
E.g.:
```
❯ nix run -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L blender.gpuChecks.cudaAvailable.unwrapped
Blender 4.0.1
Read prefs: "/home/user/.config/blender/4.0/config/userpref.blend"
CUDA is available
Blender quit
❯ nix build -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L blender.gpuChecks
blender> Blender 4.0.1
blender> could not get a list of mounted file-systems
blender> CUDA is available
blender> Blender quit
```
Wrong Monday, now fixed. Waiting for ofborg and merging in the morning.
Result: 2 packages blacklisted, 2 packages built.
Runtime tests (derivations asking for a relaxed sandbox) are now expected at `p.gpuCheck`, `p.gpuChecks.<name>`, or at `p.tests.<name>.gpuCheck`.
This time around Ofborg dismisses 🚅 🚄 🚄

In some locations
Description of changes
The PR introduces a way to expose devices and drivers (e.g. GPUs and `/run/opengl-driver/lib`) in the Nix build sandbox for specially marked derivations. The approach is the one described in #225912: we introduce a pre-build hook which scans the `.drv` being built for the presence of a chosen tag in the `requiredSystemFeatures` attribute and, upon a match, instructs Nix to mount a list of device-related paths. E.g. one could mark a derivation as `mkDerivation { ...; requiredSystemFeatures = [ "cuda" ]; }` to indicate that it needs to use CUDA during the build. Untagged derivations still observe a "pure" environment without any devices. Further, the `requiredSystemFeatures` attribute is special: Nix uses it to determine whether a host is capable of building the derivation, and which remote builders to choose. A host that hasn't been set up with `nix.settings.system-features` (`system-features` in `nix.conf(5)`) will refuse to build the marked derivation.

One application of this hook is implementing GPU tests as normal derivations (cf. the torch example in the diff).
Another use is running ML/DL/number-crunching pipelines that utilize GPUs through Nix (e.g. https://github.com/SomeoneSerge/pkgs/blob/5bf5ea2b16696e70ade5a68063042dcf2e8f033b/python-packages/by-name/om/omnimotion/package.nix#L270-L291). Similarly, one could write derivations that render things using OpenGL or Vulkan.
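A GPU test written as an ordinary derivation could look roughly like this (a sketch only; the `runCommand`/`withPackages` wiring is illustrative, not the exact code from this PR):

```nix
# Hypothetical GPU check as a normal derivation.
{ runCommand, python3 }:

runCommand "torch-cuda-available"
  {
    # The tag that makes the pre-build hook expose the GPU paths:
    requiredSystemFeatures = [ "cuda" ];
  }
  ''
    ${python3.withPackages (ps: [ ps.torch ])}/bin/python -c \
      'import torch; assert torch.cuda.is_available()'
    touch $out
  ''
```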
How to test:

1. Take a NixOS machine with an nvidia GPU and `hardware.opengl.enable = true`.
2. Try building `python3Packages.torch.tests.cudaAvailable`; observe that it's refused evaluation.
3. Rebuild with `programs.nix-required-mounts.enable = true`.
4. Build `python3Packages.torch.gpuChecks.cudaAvailable` successfully.
5. Try dropping `cuda` from `gpuChecks.cudaAvailable`'s `requiredSystemFeatures`, and observe the test fail.

CC @NixOS/cuda-maintainers
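Put together, the host side of this test setup might look like the following (a sketch; whether `system-features` must also be set explicitly, rather than being handled by the module, is an assumption here):

```nix
# NixOS configuration sketch for the steps above (assumes an nvidia host).
{
  hardware.opengl.enable = true;
  programs.nix-required-mounts.enable = true;
  # Assumption: the host may also need to advertise the feature to the
  # scheduler (note this setting replaces, not extends, the defaults):
  nix.settings.system-features = [ "cuda" ];
}
```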
Alternatives and related work

- @tomberek suggests that one could have derivations ask for specific paths using `__impureHostDeps`, which is currently undocumented and used for Darwin builds.
- For GPU tests, @Kiskae suggests one could instead use NixOS tests with PCI passthrough.
- We could also possibly consider supporting variables like `CUDA_VISIBLE_DEVICES` and/or re-using nvidia-docker/nvidia-container-toolkit, because that's the interface many tools (singularity, docker) rely on. These currently do not support any static configuration, and in particular they rely on glibc internals to discover the host drivers in a very non-cross-platform way.

Things done
- `sandbox = true` set in `nix.conf`? (See Nix manual)
- Tested via `nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"`. Note: all changes have to be committed; also see nixpkgs-review usage.
- Tested basic functionality of all binary files (usually in `./result/bin/`)