Conversation

@davepacheco (Collaborator) commented Jun 4, 2025:

This is the first of a sequence of PRs that fully integrates planning of SP updates into the planner.

Testing for this is in #8283 in the form of a reconfigurator-cli test.

Depends on #8024. Fixes #7414 and #7819.

This branch is also sync'd up with "main" in order to get #8261 -- hence most of these commits.

Still to do here:

@davepacheco self-assigned this Jun 4, 2025
@davepacheco (Collaborator, Author) commented:

With #8273 and #8283, I'm able to use reconfigurator-cli to execute several blueprint planning steps that together update all the SPs in a (simulated) rack. I've added a test for this in #8283. That basically makes up the testing for this PR.

@davepacheco marked this pull request as ready for review June 5, 2025 19:39
Base automatically changed from plan-target-release to main June 6, 2025 21:41
Comment on lines 125 to 129
if let UpdateStepResult::ContinueToNextStep =
self.do_plan_mgs_updates()?
{
self.do_plan_zone_updates()?;
}
@jgallagher (Contributor) commented Jun 13, 2025:

This might be nitpicky, but it seems a little weird that if we don't get ContinueToNextStep, we skip the immediate next step but still do the rest of planning (which admittedly is not very much). Maybe this should be specific to "can we do further updates"? (#8284 / #8285 / #8298 are all closely related, so also fine to defer this until we do some combination of them.)

@davepacheco (Collaborator, Author) replied:

Yeah, I've been assuming we'd iterate on the exact control flow / pattern here as we add more steps.
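
For readers following along, here is a small, self-contained sketch of the gating pattern under discussion. The enum variants and the `do_plan_mgs_updates` / `do_plan_zone_updates` names mirror the diff above, but the surrounding `Planner` scaffolding is made up; this is not the actual planner code. The point is that only the update-dependent step is skipped while an SP update is pending, and any unrelated planning would still run afterwards.

    // Hypothetical, standalone sketch; not the actual planner code.
    #[derive(Debug)]
    enum UpdateStepResult {
        ContinueToNextStep,
        Waiting,
    }

    struct Planner;

    impl Planner {
        fn do_plan_mgs_updates(&mut self) -> UpdateStepResult {
            // Pretend no SP/MGS-driven updates are pending.
            UpdateStepResult::ContinueToNextStep
        }

        fn do_plan_zone_updates(&mut self) {
            // Zone updates would be planned here.
        }

        fn plan(&mut self) {
            // Gate only the update-dependent step on the MGS-update result.
            match self.do_plan_mgs_updates() {
                UpdateStepResult::ContinueToNextStep => self.do_plan_zone_updates(),
                UpdateStepResult::Waiting => {
                    // An SP update is still pending; defer zone updates.
                }
            }
            // ... any remaining, update-independent planning steps ...
        }
    }

    fn main() {
        let mut planner = Planner;
        planner.plan();
    }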

Comment on lines +926 to +928
// For better or worse, switches and PSCs do not have the same idea of
// being adopted into the control plane. If they're present, they're
// part of the system, and we will update them.
Contributor comment:

Should we have an issue about this? I don't think trust quorum interacts with non-sled components at all, but I think in the fullness of time we want some kind of auth on the management network, which presumably involves the control plane being aware of PSCs/switches and knowing whether or not it's okay to talk to them?

@davepacheco (Collaborator, Author) replied:

We certainly could. I'm not sure exactly what I'd file at this point. Something like: "eventually we will have the business requirement to lock down this network, and when we do, we'll have to better manage the lifecycle of these components". Or "lifecycle of switches and PSCs could be controlled, like sleds".

I know @bnaecker has been thinking about this a bit in the context of multi-rack.

&included_baseboards,
&current_updates,
current_artifacts,
1,
Contributor comment:

Can we add a comment for what this is, or put it in a constant? I assume it's "number of upgrades to attempt"?

@davepacheco (Collaborator, Author) replied:

Good call. Added in 29df3f6.
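
For illustration only (the constant name below is hypothetical and may not match what actually landed in 29df3f6), the idea is simply to give the bare `1` a name and a doc comment at its definition:

    // Hypothetical sketch of naming the magic number; see 29df3f6 for the
    // actual change.

    /// How many MGS-managed (e.g., SP) updates the planner will keep
    /// pending at any one time.
    const NUM_CONCURRENT_MGS_UPDATES: usize = 1;

    // ...and then pass NUM_CONCURRENT_MGS_UPDATES instead of a literal `1`
    // at the call site shown above.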

UpdateStepResult::Waiting
};
self.blueprint.pending_mgs_updates_replace_all(next);
Ok(rv)
Contributor comment:

It doesn't look like it's possible for this step to fail - do you think it will be in the future? Or could we change the return type to just UpdateStepResult?

@davepacheco (Collaborator, Author) replied:

I've changed it to not fail in 29df3f6.

(I had it this way because I thought it was clearer to have the planning functions have the same signature, and because I think it's possible we will want to allow this to fail, but I think we'll work this out in the follow-on PRs like #8284 so I'm happy to keep it simpler for now.)

@davepacheco (Collaborator, Author) left a review comment:

Thanks! I've addressed the feedback and will enable auto-merge.

@davepacheco enabled auto-merge (squash) June 13, 2025 17:54
@davepacheco merged commit d333576 into main Jun 13, 2025
16 checks passed
@davepacheco deleted the dap/sp-planning-real branch June 13, 2025 19:27
hawkw added a commit that referenced this pull request Aug 1, 2025
PR #8269 added CRDB tables for storing ereports received from both
service processors and the sled host OS. These ereports are generated to
indicate a fault or other important event, so they contain information
that's probably worth including in service bundles. So we should do
that.

This branch adds code to the `SupportBundleCollector` background task
for querying the database for ereports and putting them in the bundle.
This, in turn, required adding code for querying ereports over a
specified time range. The `BundleRequest` can be constructed with a set
of filters for ereports, including the time window, and a list of serial
numbers to collect ereports from. Presently, we always just use the
default: we collect ereports from all serial numbers for the 7 days
prior to bundle collection. But I anticipate that this will be
used more in the future when we add a notion of targeted support
bundles: for instance, if we generate a support bundle for a particular
sled, we would probably only grab ereports from that sled.
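
As a purely illustrative sketch of that default (the type and field names below are hypothetical, not the real `BundleRequest` API): all serial numbers, with a window covering the 7 days before collection.

    use chrono::{DateTime, Duration, Utc};

    /// Hypothetical filter shape; the real BundleRequest fields may differ.
    struct EreportFilters {
        /// Only collect ereports reported within this window.
        start: DateTime<Utc>,
        end: DateTime<Utc>,
        /// Restrict collection to these serial numbers; None means all.
        serials: Option<Vec<String>>,
    }

    impl EreportFilters {
        /// The default described above: all serials, last 7 days before
        /// bundle collection.
        fn default_for_collection(now: DateTime<Utc>) -> Self {
            Self { start: now - Duration::days(7), end: now, serials: None }
        }
    }

    fn main() {
        let filters = EreportFilters::default_for_collection(Utc::now());
        println!("window: {} .. {}", filters.start, filters.end);
        assert!(filters.serials.is_none());
    }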

Ereports are stored in an `ereports` directory in the bundle, with
subdirectories for each serial number that emitted an ereport. Each
serial number directory has a subdirectory for each ereport restart ID
of that serial, and the individual ereports are stored within the
restart ID directory as JSON files. The path to an individual ereport
will be `ereports/${SERIAL_NUMBER}/${RESTART_ID}/${ENA}.json`. I'm open
to changing this organization scheme if others think there's a better
approach --- for example, we could place the restart ID in the filename
rather than in a subdirectory if that would be more useful.
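
Concretely (a trivial sketch of the layout described above; the serial number and IDs below are made-up examples, and the real code may construct paths differently):

    /// Sketch of the documented layout:
    /// ereports/${SERIAL_NUMBER}/${RESTART_ID}/${ENA}.json
    fn ereport_path(serial: &str, restart_id: &str, ena: &str) -> String {
        format!("ereports/{serial}/{restart_id}/{ena}.json")
    }

    fn main() {
        // Made-up example values, purely for illustration.
        println!("{}", ereport_path("SERIAL123", "restart-uuid", "0x1a"));
        // Prints: ereports/SERIAL123/restart-uuid/0x1a.json
    }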

Ereport collection is done in parallel to the rest of the support bundle
collection by spawning Tokio tasks to collect host OS and service
processor ereports. `tokio_util::task::AbortOnDropHandle` is used to
wrap the `JoinHandle`s for these tasks to ensure they're aborted if the
ereport collection future is dropped, so that we stop collecting
ereports if the support bundle is cancelled.
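
For readers unfamiliar with the pattern, here is a minimal standalone sketch of the abort-on-drop approach described above. The task body and names are invented (this is not the support-bundle code itself), and it assumes the `tokio` and `tokio-util` crates with their runtime and time features are available.

    use std::time::Duration;
    use tokio_util::task::AbortOnDropHandle;

    // Stand-in for the real ereport collection logic.
    async fn collect_sp_ereports() -> usize {
        // Pretend to query the database for SP ereports.
        tokio::time::sleep(Duration::from_secs(60)).await;
        0
    }

    async fn collect_bundle() {
        // Spawn the collection task and wrap its JoinHandle so the task is
        // aborted if this future is dropped (e.g., the bundle is cancelled).
        let sp_task = AbortOnDropHandle::new(tokio::spawn(collect_sp_ereports()));

        // Dropping `sp_task` (for instance, if `collect_bundle` is cancelled)
        // aborts the spawned task instead of leaking it.
        match sp_task.await {
            Ok(count) => println!("collected {count} SP ereports"),
            Err(err) => println!("SP ereport collection failed: {err}"),
        }
    }

    #[tokio::main]
    async fn main() {
        // Cancel the whole collection after a short timeout to show that the
        // spawned task is aborted when its handle is dropped.
        let _ = tokio::time::timeout(Duration::from_millis(10), collect_bundle()).await;
    }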

Fixes #8649