Implement resource-management experimental feature #14466
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -107,4 +107,18 @@ file included in `builders` via the syntax `@/path/to/file`. For example, | |
| causes the list of machines in `/etc/nix/machines` to be included. | ||
| (This is the default.) | ||
|
|
||
| [Nix instance]: @docroot@/glossary.md#gloss-nix-instance | ||
| [Nix instance]: @docroot@/glossary.md#gloss-nix-instance | ||
|
|
||
| ## Resource Management | ||
|
|
||
| Adding `resource-management` to the `experimental-features` setting in `nix.conf` enables a basic resource management scheme for system features. This is akin to what can be accomplished with job schedulers like Slurm, where a remote machine can have a limited quantity of a resource that can be temporarily "consumed" by a job. This can be used with memory-heavy builds, or derivations that require exclusive access to particular hardware resources. | ||
|
|
||
| Resource management is supported in both the supported features and mandatory features of a remote machine configuration, by appending a colon `:` to a feature name followed by the quantity that this machine has. This is tracked on a per-store basis, so different users on a multi-user installation share the same pool of resources for their remote build machines. A derivation specifies that it consumes a resource with the same notation in the `requiredSystemFeatures` attribute. | ||
|
|
||
| For example, this builder can provide exclusive access to two GPUs and 128G of memory for remote builds: | ||
|
|
||
| builders = ssh://gpu-node x86_64-linux - 32 1 gpu:2,mem:128 | ||
|
Just to cement my understanding of this entire feature: This line is set in the configuration of the machine-that-wants-to-dispatch-to-builders, and the values are arbitrary (i.e. Recently, some of the more complex settings have been getting implemented as a JSON string (i.e. the external builders setting: #14145) -- I wonder if instead of making this part of the Not to sign you up for the much more involved work that would involve, but wondering if you've considered how more complex scheduling could be tackled as well (aside from rewriting the scheduler in its entirety...).
It means the same as just
I hadn't, but I think the existing implementation can be used to at least satisfy the examples you've given. A machine with >8G but <32G could be chosen by adding a I'd be interested to see what others think on the question of making it its own field / setting. I'd also be interested in the question of forgoing this in favor of support for direct integration with job schedulers, which could handle the resource management themselves. I have some free time on my hands right now (read: laid off), so I'd be happy to work on this further depending on what others think the direction should be.
If
Agreed, I've added a check for this. |
||
|
|
||
| A derivation that might use this machine may set its `requiredSystemFeatures` to `["gpu:1" "mem:4"]` to indicate that it requires a GPU and consumes 4G of system memory. A particularly memory-heavy derivation that doesn't need a GPU may still use the machine with a value of `["mem:64"]`. This helps ensure that limited system resources are not over-consumed by remote builds. Note that Nix does not do any actual delegation or enforcement of GPU, memory, or other resource usage; that is up to the derivations to manage. | ||
|
|
||
| When configuring the `system-features` setting in the remote machine's `nix.conf`, only include the name of the consumable feature, not the quantity available. Resource limits are tracked on the dispatching end within the local store. | ||
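
To make the `name:quantity` notation in the doc entry above concrete, here is a minimal bash sketch of the "does this derivation fit on this machine" check. It is an illustration only: the helper names `quantity_of` and `fits` are hypothetical and not part of Nix, the sketch handles only entries written as `name:quantity`, and it does not model the bookkeeping of resources being consumed and released over time. The PR's actual scheduling code may behave differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of Nix or this PR.

# Print the quantity attached to feature $2 in the comma-separated list $1.
# Prints 0 when the feature is absent. Only "name:quantity" entries are handled.
quantity_of() {
    local list="$1" name="$2" entry
    local IFS=,
    for entry in $list; do
        if [ "${entry%%:*}" = "$name" ]; then
            echo "${entry#*:}"
            return
        fi
    done
    echo 0
}

# Succeed if every "name:quantity" entry in $2 (what a derivation requires)
# fits within the quantities listed in $1 (what the machine offers).
fits() {
    local available="$1" required="$2" entry name want
    local IFS=,
    for entry in $required; do
        name="${entry%%:*}"
        want="${entry#*:}"
        if [ "$want" -gt "$(quantity_of "$available" "$name")" ]; then
            return 1
        fi
    done
    return 0
}

# The machine from the doc example offers gpu:2,mem:128.
fits "gpu:2,mem:128" "gpu:1,mem:4" && echo "gpu:1,mem:4 fits"
fits "gpu:2,mem:128" "mem:64"      && echo "mem:64 fits"
fits "gpu:2,mem:128" "gpu:3"       || echo "gpu:3 does not fit"
```

The check is purely arithmetic on the declared quantities, which mirrors the doc entry's point that Nix does not enforce actual GPU or memory usage.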
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| source common.sh | ||
|
|
||
| enableFeatures "resource-management" | ||
|
|
||
| requireSandboxSupport | ||
| [[ $busybox =~ busybox ]] || skipTest "no busybox" | ||
|
|
||
| here=$(readlink -f "$(dirname "${BASH_SOURCE[0]}")") | ||
| export NIX_USER_CONF_FILES=$here/config/nix-with-resource-management.conf | ||
|
|
||
| expectStderr 1 nix build -Lvf resource-management.nix \ | ||
| --arg busybox "$busybox" \ | ||
| --out-link "$TEST_ROOT/result-from-remote" \ | ||
| --store "$TEST_ROOT/local" \ | ||
| --builders "ssh-ng://localhost?system-features=testf - - 4 1 testf:1" \ | ||
| | grepQuiet "Failed to find a machine for remote build!" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| source common.sh | ||
|
|
||
| enableFeatures "resource-management" | ||
|
|
||
| requireSandboxSupport | ||
| [[ $busybox =~ busybox ]] || skipTest "no busybox" | ||
|
|
||
| here=$(readlink -f "$(dirname "${BASH_SOURCE[0]}")") | ||
| export NIX_USER_CONF_FILES=$here/config/nix-with-resource-management.conf | ||
|
|
||
| nix build -Lvf resource-management.nix \ | ||
| --arg busybox "$busybox" \ | ||
| --out-link "$TEST_ROOT/result-from-remote" \ | ||
| --store "$TEST_ROOT/local" \ | ||
| --builders "ssh-ng://localhost?system-features=test - - 4 1 test:4" | ||
|
|
||
| grepQuiet 'Hello World!' < "$TEST_ROOT/result-from-remote/hello" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| experimental-features = resource-management nix-command | ||
| system-features = test |
What does "this is tracked on a per-store basis" mean (which store)? The machine-that-wants-to-build-this-on-a-remote-builder's Nix store, or the machine-that-can-actually-build-this's Nix store? How is it tracked?
Good point, clarified this at the end of the doc entry.