incorporate SP updates into planner #8269
Changes from 42 commits

    @@ -13,7 +13,9 @@ use crate::blueprint_builder::Error;
     use crate::blueprint_builder::Operation;
     use crate::blueprint_editor::DisksEditError;
     use crate::blueprint_editor::SledEditError;
    +use crate::mgs_updates::plan_mgs_updates;
     use crate::planner::omicron_zone_placement::PlacementError;
    +use gateway_client::types::SpType;
     use nexus_sled_agent_shared::inventory::OmicronZoneType;
     use nexus_sled_agent_shared::inventory::ZoneKind;
     use nexus_types::deployment::Blueprint;


    @@ -50,6 +52,11 @@ pub use self::rng::SledPlannerRng;
     mod omicron_zone_placement;
     pub(crate) mod rng;

    +enum UpdateStepResult {
    +    ContinueToNextStep,
    +    Waiting,
    +}
    +
     pub struct Planner<'a> {
         log: Logger,
         input: &'a PlanningInput,


    @@ -115,7 +122,11 @@ impl<'a> Planner<'a> {
             self.do_plan_expunge()?;
             self.do_plan_add()?;
             self.do_plan_decommission()?;
    -        self.do_plan_zone_updates()?;
    +        if let UpdateStepResult::ContinueToNextStep =
    +            self.do_plan_mgs_updates()?
    +        {
    +            self.do_plan_zone_updates()?;
    +        }
             self.do_plan_cockroachdb_settings();
             Ok(())
         }

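The effect of the new control flow in `do_plan` can be shown with a small, self-contained sketch. It is not the real planner: `ToyPlanner`, its `next_mgs_updates: Vec<String>` field (a stand-in for the result of `plan_mgs_updates`), and the `steps_run` log are all hypothetical, and each step just records its name. As in the diff, an empty set of MGS updates yields `ContinueToNextStep` and anything else yields `Waiting`; only the zone-update step is skipped while waiting.

```rust
/// Outcome of an update step, shaped like the enum added in the diff.
#[derive(Debug, PartialEq)]
enum UpdateStepResult {
    ContinueToNextStep,
    Waiting,
}

/// Hypothetical stand-in for `Planner`: it only records which steps ran.
struct ToyPlanner {
    /// Stand-in for the pending MGS updates computed by `plan_mgs_updates`
    /// (one entry per device that still needs updating).
    next_mgs_updates: Vec<String>,
    /// Names of the planning steps that ran, in order.
    steps_run: Vec<&'static str>,
}

impl ToyPlanner {
    fn do_plan(&mut self) -> Result<(), String> {
        self.steps_run.push("expunge");
        self.steps_run.push("add");
        self.steps_run.push("decommission");
        // Zone updates only proceed when no MGS-managed update is pending.
        if let UpdateStepResult::ContinueToNextStep =
            self.do_plan_mgs_updates()?
        {
            self.steps_run.push("zone_updates");
        }
        // This step runs regardless of the MGS update result.
        self.steps_run.push("cockroachdb_settings");
        Ok(())
    }

    fn do_plan_mgs_updates(&mut self) -> Result<UpdateStepResult, String> {
        self.steps_run.push("mgs_updates");
        // As in the diff: an empty set of pending updates means the planner
        // may move on; otherwise it waits for the in-flight update.
        let rv = if self.next_mgs_updates.is_empty() {
            UpdateStepResult::ContinueToNextStep
        } else {
            UpdateStepResult::Waiting
        };
        Ok(rv)
    }
}
```

With an empty `next_mgs_updates`, `do_plan` records `zone_updates`; with one pending entry it skips that step but still reaches `cockroachdb_settings`, which is the behavior questioned in the review discussion on this PR.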

    @@ -901,6 +912,63 @@ impl<'a> Planner<'a> {
             Ok(())
         }

    +    /// Update at most one MGS-managed device (SP, RoT, etc.), if any are out of
    +    /// date.
    +    fn do_plan_mgs_updates(&mut self) -> Result<UpdateStepResult, Error> {
    +        // Determine which baseboards we will consider updating.
    +        //
    +        // Sleds may be present but not adopted as part of the control plane.
    +        // In deployed systems, this would probably only happen if a sled was
    +        // about to be added. In dev/test environments, it's common to leave
    +        // some number of sleds out of the control plane for various reasons.
    +        // Inventory will still report them, but we don't want to touch them.
    +        //
    +        // For better or worse, switches and PSCs do not have the same idea of
    +        // being adopted into the control plane. If they're present, they're
    +        // part of the system, and we will update them.

Comment on lines +951 to +953

Should we have an issue about this? I don't think trust quorum interacts with non-sled components at all, but I think in the fullness of time we want some kind of auth on the management network, which presumably involves the control plane being aware of PSCs/switches and knowing whether or not it's okay to talk to them?

We certainly could. I'm not sure exactly what I'd file at this point. Something like: "eventually we will have the business requirement to lock down this network, and when we do, we'll have to better manage the lifecycle of these components". Or "lifecycle of switches and PSCs could be controlled, like sleds". I know @bnaecker has been thinking about this a bit in the context of multi-rack.

    +        let included_sled_baseboards: BTreeSet<_> = self
    +            .input
    +            .all_sleds(SledFilter::SpsUpdatedByReconfigurator)
    +            .map(|(_sled_id, details)| &details.baseboard_id)
    +            .collect();
    +        let included_baseboards =
    +            self.inventory
    +                .sps
    +                .iter()
    +                .filter_map(|(baseboard_id, sp_state)| {
    +                    let do_include = match sp_state.sp_type {
    +                        SpType::Sled => included_sled_baseboards
    +                            .contains(baseboard_id.as_ref()),
    +                        SpType::Power => true,
    +                        SpType::Switch => true,
    +                    };
    +                    do_include.then_some(baseboard_id.clone())
    +                })
    +                .collect();
    +
    +        // Compute the new set of PendingMgsUpdates.
    +        let current_updates =
    +            &self.blueprint.parent_blueprint().pending_mgs_updates;
    +        let current_artifacts = self.input.tuf_repo();
    +        let next = plan_mgs_updates(
    +            &self.log,
    +            &self.inventory,
    +            &included_baseboards,
    +            &current_updates,
    +            current_artifacts,
    +            1,
    +        );
    +
    +        // TODO This is not quite right. See oxidecomputer/omicron#8285.
    +        let rv = if next.is_empty() {
    +            UpdateStepResult::ContinueToNextStep
    +        } else {
    +            UpdateStepResult::Waiting
    +        };
    +        self.blueprint.pending_mgs_updates_replace_all(next);
    +        Ok(rv)
    +    }
    +
         /// Update at most one existing zone to use a new image source.
         fn do_plan_zone_updates(&mut self) -> Result<(), Error> {
             // We are only interested in non-decommissioned sleds.

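The baseboard selection in `do_plan_mgs_updates` can be sketched in isolation. This is a simplified, hypothetical version: baseboards are plain `String` serials, inventory is a `BTreeMap` from serial to a local `SpType` enum that mirrors the `Sled`/`Power`/`Switch` variants matched in the diff, and the set of adopted sleds is passed in directly instead of coming from `all_sleds(SledFilter::SpsUpdatedByReconfigurator)`. Sleds are included only if adopted; switches and PSCs are always included.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Local stand-in for the SP kinds matched in the diff (not the real
/// `gateway_client::types::SpType`).
#[derive(Clone, Copy, Debug)]
enum SpType {
    Sled,
    Power,
    Switch,
}

/// Hypothetical helper: pick the baseboards eligible for MGS-driven updates.
///
/// `inventory_sps` maps a baseboard serial (simplified to a `String`) to its
/// SP type; `adopted_sleds` holds the serials of sleds that are part of the
/// control plane.
fn included_baseboards(
    inventory_sps: &BTreeMap<String, SpType>,
    adopted_sleds: &BTreeSet<String>,
) -> BTreeSet<String> {
    inventory_sps
        .iter()
        .filter_map(|(baseboard, sp_type)| {
            let do_include = match sp_type {
                // Sleds must have been adopted into the control plane.
                SpType::Sled => adopted_sleds.contains(baseboard),
                // Switches and PSCs have no notion of adoption, so any that
                // show up in inventory are eligible.
                SpType::Power | SpType::Switch => true,
            };
            do_include.then_some(baseboard.clone())
        })
        .collect()
}
```

Merging the `Power` and `Switch` arms into one `|` pattern is only a stylistic choice; listing them separately, as the diff does, behaves the same.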
This might be nitpicky, but it seems a little weird that if we don't get ContinueToNextStep, we skip the immediate next step but still do the rest of planning (which admittedly is not very much). Maybe this should be specific to "can we do further updates"? (#8284 / #8285 / #8298 are all closely related, so also fine to defer this until we do some combination of them.)

Yeah, I've been assuming we'd iterate on the exact control flow / pattern here as we add more steps.
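One way to read the suggestion above, making the result mean "can we do further updates", is sketched here. It is hypothetical and not code from the PR: update steps are given as `(name, fn() -> UpdateStepResult)` pairs and run in order until one returns `Waiting`, while the non-update steps (standing in for expunge/add/decommission and `do_plan_cockroachdb_settings`) run unconditionally.

```rust
/// Outcome of one update step (same shape as the enum in the diff).
#[derive(Clone, Copy, Debug, PartialEq)]
enum UpdateStepResult {
    ContinueToNextStep,
    Waiting,
}

/// Hypothetical planner driver: returns the names of the steps that ran.
///
/// `update_steps` run in order; the first one that returns `Waiting`
/// stops all later update steps. Non-update steps always run.
fn plan(
    update_steps: &[(&'static str, fn() -> UpdateStepResult)],
) -> Vec<&'static str> {
    // Non-update steps (expunge, add, decommission) run unconditionally.
    let mut ran = vec!["expunge", "add", "decommission"];
    for (name, step) in update_steps {
        ran.push(*name);
        if step() == UpdateStepResult::Waiting {
            // Something is still being updated: defer all later update steps.
            break;
        }
    }
    // Settings-style steps still run regardless of update progress.
    ran.push("cockroachdb_settings");
    ran
}
```

Compared with the `if let` in the diff, this skips every later update step, not just the one immediately after MGS updates, once any step reports `Waiting`.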