Commit c97f3f6

Implement resource-management experimental feature
Adds basic resource tracking to system features used by distributed builds, similar to resource management in job schedulers like Slurm. Includes a positive and negative functional test and a documentation update to the distributed builds section.

Resolves #2307

Signed-off-by: Lisanna Dettwyler <[email protected]>
1 parent 5390bba commit c97f3f6

13 files changed

+286
-15
lines changed

doc/manual/source/advanced-topics/distributed-builds.md

Lines changed: 15 additions & 1 deletion
@@ -107,4 +107,18 @@ file included in `builders` via the syntax `@/path/to/file`. For example,
107107
causes the list of machines in `/etc/nix/machines` to be included.
108108
(This is the default.)
109109

110-
[Nix instance]: @docroot@/glossary.md#gloss-nix-instance
110+
[Nix instance]: @docroot@/glossary.md#gloss-nix-instance
111+
112+
## Resource Management
113+
114+
Adding `resource-management` to the `experimental-features` setting in `nix.conf` enables a basic resource management scheme for system features. This is akin to what can be accomplished with job schedulers like Slurm, where a remote machine can have a limited quantity of a resource that can be temporarily "consumed" by a job. This can be used with memory-heavy builds, or derivations that require exclusive access to particular hardware resources.
115+
116+
Quantities can be given in both the supported features and the mandatory features of a remote machine configuration, by appending a colon `:` to a feature name followed by the quantity that this machine has. This is tracked on a per-store basis, so different users on a multi-user installation share the same pool of resources for their remote build machines. A derivation specifies that it consumes a resource with the same notation in the `requiredSystemFeatures` attribute.
117+
118+
For example, this builder can provide exclusive access to two GPUs and 128G of memory for remote builds:
119+
120+
builders = ssh://gpu-node x86_64-linux - 32 1 gpu:2,mem:128
121+
122+
A derivation that uses this machine might set its `requiredSystemFeatures` to `["gpu:1" "mem:4"]` to indicate that it requires one GPU and consumes 4G of system memory. A particularly memory-heavy derivation that doesn't need a GPU may still use the machine with a value of `["mem:64"]`. This helps ensure that limited system resources are not over-consumed by remote builds. Note that Nix does not do any actual delegation or enforcement of GPU, memory, or other resource usage; that is up to the derivations to manage.
123+
124+
When configuring the `system-features` setting in the remote machine's `nix.conf`, only include the name of the consumable feature, not the quantity available. Resource limits are tracked on the dispatching end, on a per-store basis.
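The matching rule described in this documentation hunk can be sketched in a few lines. This is only an illustration under stated assumptions: `Quantities` and `fits` are hypothetical names, not identifiers from the commit, and a capacity of 0 is treated as "not counted", mirroring how `Machine::allSupported` handles a feature listed without a quantity.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a parsed feature list: feature name -> quantity.
using Quantities = std::map<std::string, unsigned long>;

// Returns true if every requested feature is offered by the machine and the
// requested quantity does not exceed the machine's capacity.
// Assumption: an offered quantity of 0 means the feature is not counted.
bool fits(const Quantities & offered, const Quantities & requested)
{
    for (const auto & [name, need] : requested) {
        auto it = offered.find(name);
        if (it == offered.end())
            return false; // feature not offered at all
        if (it->second != 0 && it->second < need)
            return false; // counted feature, capacity too small
    }
    return true;
}
```

With the builder from the example (`gpu:2,mem:128`), a request for `gpu:1` and `mem:4` fits, as does `mem:64` alone, while a request for `gpu:3` does not. This check only covers static capacity; whether the resource is currently free is a separate question handled at dispatch time.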

src/libstore/derivation-options.cc

Lines changed: 11 additions & 3 deletions
@@ -6,6 +6,7 @@
66
#include "nix/util/types.hh"
77
#include "nix/util/util.hh"
88
#include "nix/store/globals.hh"
9+
#include "nix/store/machines.hh"
910

1011
#include <optional>
1112
#include <string>
@@ -305,9 +306,16 @@ bool DerivationOptions::canBuildLocally(Store & localStore, const BasicDerivatio
305306
if (settings.maxBuildJobs.get() == 0 && !drv.isBuiltin())
306307
return false;
307308

308-
for (auto & feature : getRequiredSystemFeatures(drv))
309-
if (!localStore.config.systemFeatures.get().count(feature))
310-
return false;
309+
auto features = getRequiredSystemFeatures(drv);
310+
if (experimentalFeatureSettings.isEnabled(Xp::ResourceManagement)) {
311+
auto featureCount = Machine::countFeatures(features);
312+
for (auto & feature : featureCount)
313+
if (!localStore.config.systemFeatures.get().count(feature.first))
314+
return false;
315+
} else
316+
for (auto & feature : features)
317+
if (!localStore.config.systemFeatures.get().count(feature))
318+
return false;
311319

312320
return true;
313321
}

src/libstore/include/nix/store/machines.hh

Lines changed: 9 additions & 0 deletions
@@ -12,6 +12,8 @@ struct Machine;
1212

1313
typedef std::vector<Machine> Machines;
1414

15+
typedef std::map<std::string, unsigned int> FeatureCount;
16+
1517
struct Machine
1618
{
1719

@@ -21,7 +23,9 @@ struct Machine
2123
const unsigned int maxJobs;
2224
const float speedFactor;
2325
const StringSet supportedFeatures;
26+
const FeatureCount supportedFeaturesCount;
2427
const StringSet mandatoryFeatures;
28+
const FeatureCount mandatoryFeaturesCount;
2529
const std::string sshPublicHostKey;
2630
bool enabled = true;
2731

@@ -77,6 +81,11 @@ struct Machine
7781
* the same format.
7882
*/
7983
static Machines parseConfig(const StringSet & defaultSystems, const std::string & config);
84+
85+
/**
86+
* Parse a set of `name:quantity` feature strings into a map from feature
* name to quantity. A feature given without a quantity maps to 0.
87+
*/
88+
static FeatureCount countFeatures(const StringSet & features);
8089
};
8190

8291
/**

src/libstore/machines.cc

Lines changed: 54 additions & 7 deletions
@@ -31,7 +31,9 @@ Machine::Machine(
3131
, maxJobs(maxJobs)
3232
, speedFactor(speedFactor == 0.0f ? 1.0f : speedFactor)
3333
, supportedFeatures(supportedFeatures)
34+
, supportedFeaturesCount(countFeatures(supportedFeatures))
3435
, mandatoryFeatures(mandatoryFeatures)
36+
, mandatoryFeaturesCount(countFeatures(mandatoryFeatures))
3537
, sshPublicHostKey(sshPublicHostKey)
3638
{
3739
if (speedFactor < 0.0)
@@ -45,16 +47,48 @@ bool Machine::systemSupported(const std::string & system) const
4547

4648
bool Machine::allSupported(const StringSet & features) const
4749
{
48-
return std::all_of(features.begin(), features.end(), [&](const std::string & feature) {
49-
return supportedFeatures.count(feature) || mandatoryFeatures.count(feature);
50-
});
50+
if (experimentalFeatureSettings.isEnabled(Xp::ResourceManagement)) {
51+
auto featuresCount = countFeatures(features);
52+
return std::all_of(featuresCount.begin(), featuresCount.end(), [&](const auto & f) {
53+
return (
54+
supportedFeaturesCount.count(f.first) > 0 && ( // feature is supported, and
55+
supportedFeaturesCount.at(f.first) >= f.second || // we have the quantity of it needed or
56+
supportedFeaturesCount.at(f.first) == 0 // we have a limitless supply of it
57+
)
58+
) || (
59+
mandatoryFeaturesCount.count(f.first) > 0 && (
60+
mandatoryFeaturesCount.at(f.first) >= f.second ||
61+
mandatoryFeaturesCount.at(f.first) == 0
62+
)
63+
);
64+
});
65+
} else {
66+
return std::all_of(features.begin(), features.end(), [&](const std::string & feature) {
67+
return supportedFeatures.count(feature) || mandatoryFeatures.count(feature);
68+
});
69+
}
5170
}
5271

5372
bool Machine::mandatoryMet(const StringSet & features) const
5473
{
55-
return std::all_of(mandatoryFeatures.begin(), mandatoryFeatures.end(), [&](const std::string & feature) {
56-
return features.count(feature);
57-
});
74+
if (experimentalFeatureSettings.isEnabled(Xp::ResourceManagement)) {
75+
auto featureCount = countFeatures(features);
76+
return std::all_of(mandatoryFeaturesCount.begin(), mandatoryFeaturesCount.end(), [&](const auto & feature) {
77+
return featureCount.count(feature.first);
78+
});
79+
} else {
80+
return std::all_of(mandatoryFeatures.begin(), mandatoryFeatures.end(), [&](const std::string & feature) {
81+
return features.count(feature);
82+
});
83+
}
84+
}
85+
86+
std::string escapeUri(std::string uri)
87+
{
88+
if (uri.find(':') != std::string::npos) {
89+
uri.replace(uri.find(':'), 3, "%3A");
90+
}
91+
return uri;
5892
}
5993

6094
StoreReference Machine::completeStoreReference() const
@@ -81,7 +115,7 @@ StoreReference Machine::completeStoreReference() const
81115
for (auto & f : feats) {
82116
if (fs.size() > 0)
83117
fs += ' ';
84-
fs += f;
118+
fs += escapeUri(f);
85119
}
86120
};
87121
append(supportedFeatures);
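A point worth checking in the `escapeUri` helper added above: `std::string::replace(pos, 3, "%3A")` replaces three characters starting at the colon, not just the colon. The sketch below copies the helper verbatim under a different name and places a hypothetical one-character variant next to it for comparison. Whether dropping part of the quantity is intended is not stated in the commit, so this is an observation, not a confirmed bug.

```cpp
#include <cassert>
#include <string>

// Copy of the helper as it appears in the hunk above: up to three characters,
// starting at the first ':', are replaced by "%3A".
std::string escapeUriAsCommitted(std::string uri)
{
    if (uri.find(':') != std::string::npos) {
        uri.replace(uri.find(':'), 3, "%3A");
    }
    return uri;
}

// Hypothetical variant (not part of the commit) that replaces only the ':'.
std::string escapeColonOnly(std::string uri)
{
    auto pos = uri.find(':');
    if (pos != std::string::npos)
        uri.replace(pos, 1, "%3A");
    return uri;
}
```

For the input `mem:128`, the committed helper yields `mem%3A8`, and for `gpu:2` it yields `gpu%3A`. The one-character variant yields `mem%3A128` and `gpu%3A2`. Inputs without a colon pass through both unchanged.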
@@ -207,6 +241,19 @@ Machines Machine::parseConfig(const StringSet & defaultSystems, const std::strin
207241
return parseBuilderLines(defaultSystems, builderLines);
208242
}
209243

244+
FeatureCount Machine::countFeatures(const StringSet & features)
245+
{
246+
FeatureCount fc;
247+
for (auto & f : features) {
248+
std::istringstream fss(f);
249+
std::string name;
250+
std::string quantity;
251+
std::getline(fss, name, ':');
252+
fc.emplace(name, std::getline(fss, quantity) ? std::stoul(quantity) : 0);
253+
}
254+
return fc;
255+
}
256+
210257
Machines getMachines()
211258
{
212259
return Machine::parseConfig({settings.thisSystem}, settings.builders);
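`countFeatures` relies on `std::stoul`, which throws `std::invalid_argument` when no digits can be converted and silently stops at the first non-digit (so `2x` parses as 2). A stricter parse could look roughly like the following. `parseFeature` is a hypothetical helper, not part of the commit, and rejecting malformed quantities is an assumption about the desired behaviour.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>
#include <utility>

// Hypothetical helper: split "name:quantity" into its parts.
// A bare "name" yields quantity 0; a malformed quantity yields std::nullopt.
std::optional<std::pair<std::string, unsigned int>> parseFeature(std::string_view feature)
{
    auto colon = feature.find(':');
    if (colon == std::string_view::npos)
        return std::make_pair(std::string(feature), 0u);

    auto name = feature.substr(0, colon);
    auto digits = feature.substr(colon + 1);
    const char * last = digits.data() + digits.size();

    unsigned int quantity = 0;
    auto [end, ec] = std::from_chars(digits.data(), last, quantity);

    // Reject empty names, non-numeric or out-of-range quantities,
    // and trailing characters after the number.
    if (name.empty() || ec != std::errc() || end != last)
        return std::nullopt;
    return std::make_pair(std::string(name), quantity);
}
```

A caller could then report a configuration error for entries such as `gpu:abc` or `gpu:2x` instead of letting an exception escape from the parser.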

src/libutil/experimental-features.cc

Lines changed: 9 additions & 1 deletion
@@ -25,7 +25,7 @@ struct ExperimentalFeatureDetails
2525
* feature, we either have no issue at all if few features are not added
2626
* at the end of the list, or a proper merge conflict if they are.
2727
*/
28-
constexpr size_t numXpFeatures = 1 + static_cast<size_t>(Xp::BLAKE3Hashes);
28+
constexpr size_t numXpFeatures = 1 + static_cast<size_t>(Xp::ResourceManagement);
2929

3030
constexpr std::array<ExperimentalFeatureDetails, numXpFeatures> xpFeatureDetails = {{
3131
{
@@ -321,6 +321,14 @@ constexpr std::array<ExperimentalFeatureDetails, numXpFeatures> xpFeatureDetails
321321
)",
322322
.trackingUrl = "",
323323
},
324+
{
325+
.tag = Xp::ResourceManagement,
326+
.name = "resource-management",
327+
.description = R"(
328+
Enables support for resource management in remote build system features.
329+
)",
330+
.trackingUrl = "",
331+
}
324332
}};
325333

326334
static_assert(

src/libutil/include/nix/util/experimental-features.hh

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ enum struct ExperimentalFeature {
3939
PipeOperators,
4040
ExternalBuilders,
4141
BLAKE3Hashes,
42+
ResourceManagement,
4243
};
4344

4445
/**

src/nix/build-remote/build-remote.cc

Lines changed: 64 additions & 3 deletions
@@ -37,9 +37,14 @@ std::string escapeUri(std::string uri)
3737

3838
static std::string currentLoad;
3939

40-
static AutoCloseFD openSlotLock(const Machine & m, uint64_t slot)
40+
static AutoCloseFD openSlotLock(const std::string storeUri, uint64_t slot)
4141
{
42-
return openLockFile(fmt("%s/%s-%d", currentLoad, escapeUri(m.storeUri.render()), slot), true);
42+
return openLockFile(fmt("%s/%s-%d", currentLoad, escapeUri(storeUri), slot), true);
43+
}
44+
45+
static AutoCloseFD openFeatureSlotLock(const std::string storeUri, const std::string feature, unsigned int slot)
46+
{
47+
return openLockFile(fmt("%s/%s-%s-%d", currentLoad, escapeUri(storeUri), feature, slot), true);
4348
}
4449

4550
static bool allSupportedLocally(Store & store, const StringSet & requiredFeatures)
@@ -50,6 +55,47 @@ static bool allSupportedLocally(Store & store, const StringSet & requiredFeature
5055
return true;
5156
}
5257

58+
using FeatureSlotLocks = std::map<std::string, std::vector<AutoCloseFD>>;
59+
60+
static bool tryReserveFeatures(
61+
const Machine & m,
62+
const FeatureCount requiredFeatures,
63+
FeatureSlotLocks & featureSlotLocks
64+
) {
65+
bool allSatisfied = true;
66+
for (auto & f : requiredFeatures) {
67+
if (!f.second) {
68+
continue;
69+
}
70+
std::vector<AutoCloseFD> locks(f.second);
71+
unsigned int numLocked = 0;
72+
for (unsigned int s = 0;
73+
numLocked < f.second && (
74+
(m.supportedFeaturesCount.find(f.first) != m.supportedFeaturesCount.end() &&
75+
s < m.supportedFeaturesCount.at(f.first)) ||
76+
(m.mandatoryFeaturesCount.find(f.first) != m.mandatoryFeaturesCount.end() &&
77+
s < m.mandatoryFeaturesCount.at(f.first))
78+
); ++s) {
79+
auto lock = openFeatureSlotLock(m.storeUri.render(), f.first, s);
80+
if (lockFile(lock.get(), ltWrite, false)) {
81+
locks[numLocked] = std::move(lock);
82+
++numLocked;
83+
}
84+
}
85+
if (numLocked < f.second) {
86+
allSatisfied = false;
87+
break;
88+
}
89+
auto & fslDest = featureSlotLocks[f.first];
90+
fslDest.insert(fslDest.end(), std::make_move_iterator(locks.begin()),
91+
std::make_move_iterator(locks.end()));
92+
}
93+
if (!allSatisfied) {
94+
featureSlotLocks.clear();
95+
}
96+
return allSatisfied;
97+
}
98+
5399
static int main_build_remote(int argc, char ** argv)
54100
{
55101
{
@@ -93,6 +139,7 @@ static int main_build_remote(int argc, char ** argv)
93139

94140
std::shared_ptr<Store> sshStore;
95141
AutoCloseFD bestSlotLock;
142+
FeatureSlotLocks bestFeatureSlotLocks;
96143

97144
auto machines = getMachines();
98145
debug("got %d remote builders", machines.size());
@@ -119,6 +166,12 @@ static int main_build_remote(int argc, char ** argv)
119166
auto neededSystem = readString(source);
120167
drvPath = store->parseStorePath(readString(source));
121168
auto requiredFeatures = readStrings<StringSet>(source);
169+
auto requiredFeaturesCount = Machine::countFeatures(requiredFeatures);
170+
bool needsResourceManagement = 0 < std::accumulate(
171+
requiredFeaturesCount.begin(), requiredFeaturesCount.end(), 0,
172+
[](auto total, auto feature) {
173+
return std::move(total) + feature.second;
174+
});
122175

123176
/* It would be possible to build locally after some builds clear out,
124177
so don't show the warning now: */
@@ -150,7 +203,7 @@ static int main_build_remote(int argc, char ** argv)
150203
AutoCloseFD free;
151204
uint64_t load = 0;
152205
for (uint64_t slot = 0; slot < m.maxJobs; ++slot) {
153-
auto slotLock = openSlotLock(m, slot);
206+
auto slotLock = openSlotLock(m.storeUri.render(), slot);
154207
if (lockFile(slotLock.get(), ltWrite, false)) {
155208
if (!free) {
156209
free = std::move(slotLock);
@@ -162,6 +215,13 @@ static int main_build_remote(int argc, char ** argv)
162215
if (!free) {
163216
continue;
164217
}
218+
FeatureSlotLocks featureSlotLocks;
219+
if (needsResourceManagement &&
220+
experimentalFeatureSettings.isEnabled(Xp::ResourceManagement)) {
221+
if (!tryReserveFeatures(m, requiredFeaturesCount, featureSlotLocks)) {
222+
continue;
223+
}
224+
}
165225
bool best = false;
166226
if (!bestSlotLock) {
167227
best = true;
@@ -179,6 +239,7 @@ static int main_build_remote(int argc, char ** argv)
179239
if (best) {
180240
bestLoad = load;
181241
bestSlotLock = std::move(free);
242+
bestFeatureSlotLocks = std::move(featureSlotLocks);
182243
bestMachine = &m;
183244
}
184245
}
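The reservation logic in `tryReserveFeatures` is all-or-nothing: either every counted feature gets the number of slots it asked for, or nothing is kept. The commit implements slots as lock files (`openFeatureSlotLock` plus `lockFile`). The sketch below models the same idea with an in-memory table instead, so `SlotTable`, `Request`, and `tryReserve` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory model: busy[feature][s] is true when slot s is taken.
using SlotTable = std::map<std::string, std::vector<bool>>;
using Request = std::map<std::string, unsigned int>;

// Try to take `need` free slots for every requested feature.
// On failure the table is left exactly as it was.
bool tryReserve(SlotTable & busy, const Request & request)
{
    SlotTable scratch = busy; // work on a copy so failure has no side effects
    for (const auto & [name, need] : request) {
        if (need == 0)
            continue; // feature requested without a quantity: nothing to reserve
        auto it = scratch.find(name);
        if (it == scratch.end())
            return false;
        unsigned int taken = 0;
        for (std::size_t s = 0; s < it->second.size() && taken < need; ++s) {
            if (!it->second[s]) {
                it->second[s] = true;
                ++taken;
            }
        }
        if (taken < need)
            return false; // not enough free slots for this feature
    }
    busy = std::move(scratch); // commit all reservations at once
    return true;
}
```

Copying the table keeps the rollback trivial. The lock-file approach in the commit gets the same effect by collecting the acquired locks in local containers and clearing them when any feature cannot be satisfied.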
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
#!/usr/bin/env bash
2+
3+
source common.sh
4+
5+
enableFeatures "resource-management"
6+
7+
requireSandboxSupport
8+
[[ $busybox =~ busybox ]] || skipTest "no busybox"
9+
10+
here=$(readlink -f "$(dirname "${BASH_SOURCE[0]}")")
11+
export NIX_USER_CONF_FILES=$here/config/nix-with-resource-management.conf
12+
13+
expectStderr 1 nix build -Lvf resource-management.nix \
14+
--arg busybox "$busybox" \
15+
--out-link "$TEST_ROOT/result-from-remote" \
16+
--store "$TEST_ROOT/local" \
17+
--builders "ssh-ng://localhost?system-features=testf - - 4 1 testf:1" \
18+
| grepQuiet "Failed to find a machine for remote build!"
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
1+
#!/usr/bin/env bash
2+
3+
source common.sh
4+
5+
enableFeatures "resource-management"
6+
7+
requireSandboxSupport
8+
[[ $busybox =~ busybox ]] || skipTest "no busybox"
9+
10+
here=$(readlink -f "$(dirname "${BASH_SOURCE[0]}")")
11+
export NIX_USER_CONF_FILES=$here/config/nix-with-resource-management.conf
12+
13+
nix build -Lvf resource-management.nix \
14+
--arg busybox "$busybox" \
15+
--out-link "$TEST_ROOT/result-from-remote" \
16+
--store "$TEST_ROOT/local" \
17+
--builders "ssh-ng://localhost?system-features=test - - 4 1 test:4"
18+
19+
grepQuiet 'Hello World!' < "$TEST_ROOT/result-from-remote/hello"
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
1+
# shellcheck shell=bash
2+
3+
# All variables should be defined externally by the scripts that source
4+
# this, `set -u` will catch any that are forgotten.
5+
# shellcheck disable=SC2154
6+
7+
source common.sh
8+
9+
enableFeatures "resource-management"
10+
11+
TODO_NixOS
12+
restartDaemon
13+
14+
file=resource-management.nix
15+
16+
requireSandboxSupport
17+
requiresUnprivilegedUserNamespaces
18+
[[ "$busybox" =~ busybox ]] || skipTest "no busybox"
19+
20+
unset NIX_STORE_DIR
21+
22+
remoteDir=$TEST_ROOT/remote
23+
24+
here=$(readlink -f "$(dirname "${BASH_SOURCE[0]}")")
25+
export NIX_USER_CONF_FILES=$here/config/nix-with-resource-management.conf
26+
27+
# Note: ssh{-ng}://localhost bypasses ssh. See tests/functional/build-remote.sh for
28+
# more details.
29+
nix-build "$file" -o "$TEST_ROOT/result" --max-jobs 0 \
30+
--arg busybox "$busybox" \
31+
--store "$TEST_ROOT/local" \
32+
--builders "ssh-ng://localhost?system-features=test&remote-store=${remoteDir} - - 4 1 test:4"
