Commit abc7293

RandomWithDistinctSleds region allocation strategy
PR #3650 introduced the Random region allocation strategy to allocate regions randomly across the rack. This PR expands on that by adding the RandomWithDistinctSleds region allocation strategy. The new strategy behaves the same way, but requires that the 3 crucible regions be allocated on 3 different sleds, to improve resiliency against a whole-sled failure. The Random strategy still exists and does not require 3 distinct sleds, which is useful in one-sled environments such as the integration tests and lab setups.

This PR adds the ability to configure the allocation strategy in the Nexus PackageConfig toml. Anyone running in a one-sled setup will need to configure it for the Random strategy (as is done for the integration test environment).

This also fixes a shortcoming of #3650 whereby multiple datasets on a single zpool could be selected. That fix applies to both the old Random strategy and the new RandomWithDistinctSleds strategy.

`smf/nexus/config-partial.toml` is configured for RandomWithDistinctSleds, as that is what we want to use on prod.

As mentioned above, the integration tests do not use the distinct-sleds allocation strategy. I attempted to add 2 extra sleds to the simulated environment but found that this broke more things than I had the understanding to fix in this PR. It would be nice in the future for the sim environment to have 3 sleds, not just for this but for anything else that might behave differently in a multi-sled setup. For now, I have unit tests that verify the allocation behavior works correctly with cockroachdb, and we can try it out on dogfood.
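The placement rules described above can be sketched in plain Rust. This is only an illustration under stated assumptions, not the SQL query the PR builds: the `Candidate` struct, its integer IDs, and the `pick_regions` function are hypothetical names introduced here, and the input is assumed to be already shuffled.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a candidate dataset row. The field names and
/// integer IDs are assumptions for illustration, not the real Nexus schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Candidate {
    pub dataset_id: u32,
    pub pool_id: u32,
    pub sled_id: u32,
}

/// Walk an already-shuffled candidate list and keep the first `count`
/// datasets that satisfy the placement rules:
/// - never two datasets from the same zpool (both strategies);
/// - never two datasets from the same sled (only when `distinct_sleds`).
/// Returns `None` if fewer than `count` candidates qualify.
pub fn pick_regions(
    shuffled: &[Candidate],
    count: usize,
    distinct_sleds: bool,
) -> Option<Vec<Candidate>> {
    let mut used_pools = HashSet::new();
    let mut used_sleds = HashSet::new();
    let mut picked = Vec::with_capacity(count);

    for c in shuffled {
        if picked.len() == count {
            break;
        }
        // Both strategies: at most one dataset per zpool.
        if used_pools.contains(&c.pool_id) {
            continue;
        }
        // Distinct-sleds strategy only: at most one dataset per sled.
        if distinct_sleds && used_sleds.contains(&c.sled_id) {
            continue;
        }
        used_pools.insert(c.pool_id);
        used_sleds.insert(c.sled_id);
        picked.push(*c);
    }

    if picked.len() == count {
        Some(picked)
    } else {
        None
    }
}
```

Calling `pick_regions(&candidates, 3, false)` on a one-sled candidate list can succeed as long as three distinct zpools exist, while the same call with `distinct_sleds` set to `true` returns `None`. That is the reason a one-sled environment needs the Random strategy.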
1 parent 9b1867b commit abc7293

11 files changed, +442 -171 lines changed

common/src/nexus_config.rs

Lines changed: 38 additions & 0 deletions
@@ -372,6 +372,8 @@ pub struct PackageConfig {
     pub dendrite: HashMap<SwitchLocation, DpdConfig>,
     /// Background task configuration
     pub background_tasks: BackgroundTaskConfig,
+    /// Default Crucible region allocation strategy
+    pub default_region_allocation_strategy: RegionAllocationStrategy,
 }

 #[derive(Clone, Debug, PartialEq, Deserialize, Serialize)]
@@ -594,6 +596,9 @@ mod test {
             dns_external.period_secs_propagation = 7
             dns_external.max_concurrent_server_updates = 8
             external_endpoints.period_secs = 9
+            [default_region_allocation_strategy]
+            type = "random"
+            seed = 0
             "##,
         )
         .unwrap();
@@ -677,6 +682,10 @@ mod test {
                         period_secs: Duration::from_secs(9),
                     }
                 },
+                default_region_allocation_strategy:
+                    crate::nexus_config::RegionAllocationStrategy::Random {
+                        seed: Some(0)
+                    }
             },
         }
     );
@@ -724,6 +733,8 @@ mod test {
             dns_external.period_secs_propagation = 7
             dns_external.max_concurrent_server_updates = 8
             external_endpoints.period_secs = 9
+            [default_region_allocation_strategy]
+            type = "random"
             "##,
         )
         .unwrap();
@@ -894,3 +905,30 @@ mod test {
         );
     }
 }
+
+/// Defines a strategy for choosing what physical disks to use when allocating
+/// new crucible regions.
+///
+/// NOTE: More strategies can - and should! - be added.
+///
+/// See <https://rfd.shared.oxide.computer/rfd/0205> for a more
+/// complete discussion.
+///
+/// Longer-term, we should consider:
+/// - Storage size + remaining free space
+/// - Sled placement of datasets
+/// - What sort of loads we'd like to create (even split across all disks
+///   may not be preferable, especially if maintenance is expected)
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
+#[serde(tag = "type", rename_all = "snake_case")]
+pub enum RegionAllocationStrategy {
+    /// Choose disks pseudo-randomly. An optional seed may be provided to make
+    /// the ordering deterministic, otherwise the current time in nanoseconds
+    /// will be used. Ordering is based on sorting the output of `md5(UUID of
+    /// candidate dataset + seed)`. The seed does not need to come from a
+    /// cryptographically secure source.
+    Random { seed: Option<u64> },
+
+    /// Like Random, but ensures that each region is allocated on its own sled.
+    RandomWithDistinctSleds { seed: Option<u64> },
+}
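The seed handling in the `Random` doc comment can be sketched with the standard library alone. This is a hedged illustration, not the code Nexus runs: `resolve_seed`, `sort_key`, and `shuffle_candidates` are hypothetical names, dataset IDs are plain strings here, and `DefaultHasher` is swapped in for the `md5(...)` call that the doc comment describes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::{SystemTime, UNIX_EPOCH};

/// Use the configured seed if present; otherwise fall back to the current
/// time in nanoseconds (truncated to 64 bits), as the doc comment describes.
pub fn resolve_seed(seed: Option<u64>) -> u64 {
    seed.unwrap_or_else(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos() as u64)
            .unwrap_or(0)
    })
}

/// Sort key for one candidate dataset.
/// ASSUMPTION: `DefaultHasher` stands in for md5 over the dataset UUID plus
/// the seed; only the "hash, then sort" idea is taken from the source.
fn sort_key(dataset_id: &str, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    dataset_id.hash(&mut hasher);
    seed.hash(&mut hasher);
    hasher.finish()
}

/// Return the candidate IDs in a pseudo-random order that is fully
/// determined by `seed`: the same seed always yields the same order.
pub fn shuffle_candidates(ids: &[&str], seed: u64) -> Vec<String> {
    let mut out: Vec<String> = ids.iter().map(|s| s.to_string()).collect();
    out.sort_by_key(|id| sort_key(id, seed));
    out
}
```

With a fixed seed such as `seed = 0` from the test config, `shuffle_candidates(&ids, resolve_seed(Some(0)))` gives a repeatable ordering, which makes tests reproducible. Passing `None` to `resolve_seed` gives an ordering that varies from call to call.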

nexus/db-model/src/queries/region_allocation.rs

Lines changed: 22 additions & 0 deletions
@@ -47,6 +47,13 @@ table! {
     }
 }

+table! {
+    shuffled_candidate_datasets {
+        id -> Uuid,
+        pool_id -> Uuid,
+    }
+}
+
 table! {
     candidate_regions {
         id -> Uuid,
@@ -89,6 +96,19 @@ table! {
     }
 }

+table! {
+    one_zpool_per_sled (pool_id) {
+        pool_id -> Uuid
+    }
+}
+
+table! {
+    one_dataset_per_zpool {
+        id -> Uuid,
+        pool_id -> Uuid
+    }
+}
+
 table! {
     inserted_regions {
         id -> Uuid,
@@ -141,6 +161,7 @@ diesel::allow_tables_to_appear_in_same_query!(
 );

 diesel::allow_tables_to_appear_in_same_query!(old_regions, dataset,);
+diesel::allow_tables_to_appear_in_same_query!(old_regions, zpool,);

 diesel::allow_tables_to_appear_in_same_query!(
     inserted_regions,
@@ -149,6 +170,7 @@ diesel::allow_tables_to_appear_in_same_query!(

 diesel::allow_tables_to_appear_in_same_query!(candidate_zpools, dataset,);
 diesel::allow_tables_to_appear_in_same_query!(candidate_zpools, zpool,);
+diesel::allow_tables_to_appear_in_same_query!(candidate_datasets, dataset);

 // == Needed for random region allocation ==
