[reconfigurator] Planner should wait for all MGS updates before proceeding to zones #9001

karencfv · 2025-09-05T05:07:57Z

Before any zones are updated, we need to make sure that all MGS-driven updates have succeeded and/or are at the correct version. This PR enforces that by bailing out of zone updates if any MGS-driven update failed. All skipped updates are also saved as part of the planner report.
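
For readers skimming the description, the gating rule can be sketched roughly as follows. This is a simplified illustration, not the planner's actual code: `MgsUpdateOutcome`, `PlanningReport`, and `plan` are hypothetical stand-ins, and blocked reasons are plain strings here.

```rust
/// Hypothetical, simplified outcome of one MGS-driven update attempt.
#[derive(Debug, Clone, PartialEq)]
enum MgsUpdateOutcome {
    /// The component is already at the target version.
    UpToDate,
    /// An update has been planned and is still pending.
    Pending,
    /// The update could not be planned; the reason is kept for the report.
    Blocked(String),
}

/// Hypothetical report: records blocked updates and whether zone updates ran.
#[derive(Debug, Default, PartialEq)]
struct PlanningReport {
    blocked_mgs_updates: Vec<String>,
    zones_updated: bool,
}

/// Zones are only touched once every MGS-driven update is at its target.
fn plan(outcomes: &[MgsUpdateOutcome]) -> PlanningReport {
    let mut report = PlanningReport::default();
    let mut all_up_to_date = true;
    for outcome in outcomes {
        match outcome {
            MgsUpdateOutcome::UpToDate => {}
            MgsUpdateOutcome::Pending => all_up_to_date = false,
            MgsUpdateOutcome::Blocked(reason) => {
                all_up_to_date = false;
                // Keep every blocked record so the report is complete.
                report.blocked_mgs_updates.push(reason.clone());
            }
        }
    }
    report.zones_updated = all_up_to_date;
    report
}
```

In this sketch a single pending or blocked update is enough to hold back zone updates, and each blocked reason ends up in the report.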

Closes: #8285

TODO:

Fix failing zone tests
Better testing

karencfv

Thanks for the thorough review @davepacheco! I think I've addressed all of your comments. The update zone tests will need to be modified a bit since #8921 is set on auto-merge and will break the testing here. But other than that, I think this is ready to go!

nexus/reconfigurator/planning/src/mgs_updates/host_phase_1.rs

nexus/reconfigurator/planning/src/mgs_updates/mod.rs

karencfv · 2025-09-23T07:16:22Z

nexus/reconfigurator/planning/src/mgs_updates/mod.rs

+        let PlannedMgsUpdates {
+            pending_updates: updates,
+            pending_host_phase_2_changes: mut host_phase_2,
+            skipped_mgs_updates: mut skipped_updates,
+        } = try_make_update(log, board, inventory, current_artifacts);


Left a comment, hope it's enough to clear things up?

nexus/reconfigurator/planning/src/mgs_updates/mod.rs

nexus/reconfigurator/planning/src/planner.rs

nexus/types/src/deployment/planning_report.rs

jgallagher · 2025-09-23T16:14:13Z

nexus/reconfigurator/planning/src/mgs_updates/host_phase_1.rs

    inventory: &Collection,
    current_artifacts: &TufRepoDescription,
-) -> Option<(PendingMgsUpdate, PendingHostPhase2Changes)> {
+) -> Result<


We don't necessarily need to change this on this PR, but this type is pretty beefy. I wonder if it would be clearer for us to define an enum with three states; something like (names are hard)

    enum UpdateAttempt {
        NoUpdateNeeded,
        Planned(PendingMgsUpdate, PendingHostPhase2Changes),
        Error(FailedMgsUpdateReason),
    }

so details like "Ok(None) means there's no update needed" are more explicit?
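
A minimal sketch of how such an enum could wrap the existing return shape. It is made generic over the update and error types so that it compiles on its own (the suggestion above uses the concrete planner types), and the `From` impl is an assumption, not something from this PR:

```rust
/// Three explicit states instead of `Result<Option<_>, _>`.
/// Generic here only so the sketch is self-contained.
#[derive(Debug, PartialEq)]
enum UpdateAttempt<U, E> {
    NoUpdateNeeded,
    Planned(U),
    Error(E),
}

impl<U, E> From<Result<Option<U>, E>> for UpdateAttempt<U, E> {
    fn from(result: Result<Option<U>, E>) -> Self {
        match result {
            // `Ok(None)` is the implicit "nothing to do" case being named.
            Ok(None) => UpdateAttempt::NoUpdateNeeded,
            Ok(Some(update)) => UpdateAttempt::Planned(update),
            Err(reason) => UpdateAttempt::Error(reason),
        }
    }
}
```

Call sites then match on `NoUpdateNeeded` directly rather than having to remember what `Ok(None)` stands for.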

I like that idea! I've left some TODOs to finish up in a follow up PR in cb197db

jgallagher · 2025-09-23T16:16:27Z

nexus/reconfigurator/planning/src/mgs_updates/host_phase_1.rs

-                    baseboard_id,
-                );
-                return None;
+                return Err(FailedMgsUpdateReason::NoMatchingArtifactFound);


Could we add some context to NoMatchingArtifactFound? It may be obvious for the other kinds of updates, but for the host it looks like we've lost whether the problem is phase 1 or phase 2.

Another thought, although this might be nonsense after reading the rest of the PR - we could potentially have different error types for each kind of update, so we could have more host-specific error variants, and then combine the error types in a higher-level enum?

That makes sense, I'll address this in the follow up PR that will break up FailedMgsUpdateReason cb197db

jgallagher · 2025-09-23T16:20:35Z

nexus/types/src/deployment/blueprint_diff.rs

            writeln!(f, "{}", table)?;
        }

+        // TODO-K: Add skipped updates in a follow up PR


What does "skipped updates" mean in the context of the difference between two blueprints?

nexus/types/src/deployment/planning_report.rs

jgallagher · 2025-09-23T16:24:26Z

nexus/types/src/deployment/planning_report.rs

+)]
+#[serde(rename_all = "snake_case")]
+#[serde(tag = "type", content = "value")]
+pub enum FailedMgsUpdateReason {


I think I more strongly like the idea of breaking this down into something like

    pub enum FailedMgsUpdateReason {
        RotBootloader(FailedRotBootloaderUpdateReason),
        Rot(FailedRotUpdateReason),
        Sp(FailedSpUpdateReason),
        Host(FailedHostUpdateReason),
    }

If we did this, we might not even need to track component in BlockedMgsUpdate, because we could infer it from which reason variant we have?

This is a style thing that could definitely be done separately though; wouldn't affect the real work here.
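
To make the "infer the component from the variant" point concrete, here is a rough sketch. The inner variants (including the phase 1 / phase 2 split for the host) and the `component()` helper are hypothetical, as are the `MgsUpdateComponent` variants other than `RotBootloader`; serde attributes are left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MgsUpdateComponent {
    RotBootloader,
    Rot,
    Sp,
    HostOs,
}

// Hypothetical per-component reasons; the real variants may differ.
#[derive(Debug, PartialEq)]
enum FailedRotBootloaderUpdateReason { NoMatchingArtifactFound }
#[derive(Debug, PartialEq)]
enum FailedRotUpdateReason { NoMatchingArtifactFound }
#[derive(Debug, PartialEq)]
enum FailedSpUpdateReason { NoMatchingArtifactFound }
#[derive(Debug, PartialEq)]
enum FailedHostUpdateReason {
    // Host-specific variants keep the phase 1 / phase 2 distinction.
    NoMatchingPhase1Artifact,
    NoMatchingPhase2Artifact,
}

#[derive(Debug, PartialEq)]
enum FailedMgsUpdateReason {
    RotBootloader(FailedRotBootloaderUpdateReason),
    Rot(FailedRotUpdateReason),
    Sp(FailedSpUpdateReason),
    Host(FailedHostUpdateReason),
}

impl FailedMgsUpdateReason {
    /// The component follows from the variant, so it need not be stored
    /// separately alongside the reason.
    fn component(&self) -> MgsUpdateComponent {
        match self {
            FailedMgsUpdateReason::RotBootloader(_) => MgsUpdateComponent::RotBootloader,
            FailedMgsUpdateReason::Rot(_) => MgsUpdateComponent::Rot,
            FailedMgsUpdateReason::Sp(_) => MgsUpdateComponent::Sp,
            FailedMgsUpdateReason::Host(_) => MgsUpdateComponent::HostOs,
        }
    }
}
```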

I like this idea a lot! Will implement in a follow up PR cb197db

nexus/reconfigurator/planning/src/mgs_updates/mod.rs

jgallagher · 2025-09-23T16:33:39Z

nexus/reconfigurator/planning/src/mgs_updates/mod.rs

-        // necessary.
-        return Some((update, PendingHostPhase2Changes::empty()));
+    type UpdateResult = Result<Option<PendingMgsUpdate>, FailedMgsUpdateReason>;
+    let attempts: [(MgsUpdateComponent, UpdateResult); 3] = [


I think as written this eagerly calls all three update functions, but then throws away the result of the later ones if an earlier one returns Some(_), right? I think this might be clearer as a for loop with a match; roughly

    for component in [MgsUpdateComponent::RotBootloader, /* ... the rest ... */] {
        let attempt = match component {
            MgsUpdateComponent::RotBootloader => try_make_update_rot_bootloader(..),
            // ... the rest ...
        };
        // handle attempt - either return or continue to the next component
    }

which I think might also let us include the host updates in this loop without having to duplicate some of the boilerplate below?
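
The difference between the eager array and the lazy loop can be shown with a small self-contained sketch. Everything here is illustrative: `first_actionable_attempt` is a hypothetical name, updates and reasons are plain strings, the component order after `RotBootloader` is assumed, and stopping at the first error is an assumption about how an attempt might be handled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MgsUpdateComponent {
    RotBootloader,
    Rot,
    Sp,
    HostOs,
}

type UpdateResult = Result<Option<String>, String>;

/// Calls `attempt_for` one component at a time and stops at the first
/// component that either plans an update or reports an error. Later
/// components are never evaluated, unlike an array built up front.
fn first_actionable_attempt<F>(
    mut attempt_for: F,
) -> Option<(MgsUpdateComponent, UpdateResult)>
where
    F: FnMut(MgsUpdateComponent) -> UpdateResult,
{
    for component in [
        MgsUpdateComponent::RotBootloader,
        MgsUpdateComponent::Rot,
        MgsUpdateComponent::Sp,
        MgsUpdateComponent::HostOs,
    ] {
        match attempt_for(component) {
            // Nothing to do for this component; try the next one.
            Ok(None) => continue,
            // Either a planned update or an error: hand it back.
            outcome => return Some((component, outcome)),
        }
    }
    None
}
```

With a closure that records which components it was asked about, only the components up to and including the first actionable one show up in the record.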

Makes sense, thanks! I updated the code, and it's a little verbose because the return type for the host OS update function is different than the others, but I think it's pretty clear regardless. Let me know what you think!

I think we could squish almost all the repetition out by having the host OS arm convert its tuple into something that matches the other arms; this diff should compile and I think be equivalent? https://gist.github.com/jgallagher/1ab46a641af92f5a372b41a0521ed99e

Oh yeah, true! Done :)
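
The linked gist is not reproduced here. As a rough idea of what "convert its tuple" could look like, the following sketch stashes the host phase 2 changes in a caller-owned variable and returns the same shape as the other arms. `HostPhase2Changes`, `normalize_host`, and the string-typed update and reason are all hypothetical stand-ins.

```rust
/// Hypothetical stand-in for the host phase 2 changes.
#[derive(Debug, Default, PartialEq)]
struct HostPhase2Changes(Vec<String>);

type Update = String;
type Reason = String;

/// Turns the host arm's `(update, phase 2 changes)` tuple into the plain
/// `Result<Option<Update>, Reason>` shape the other arms produce, storing
/// the phase 2 changes in `host_phase_2` as a side effect.
fn normalize_host(
    host_result: Result<Option<(Update, HostPhase2Changes)>, Reason>,
    host_phase_2: &mut HostPhase2Changes,
) -> Result<Option<Update>, Reason> {
    let maybe_planned = host_result?;
    Ok(maybe_planned.map(|(update, changes)| {
        *host_phase_2 = changes;
        update
    }))
}
```

After this conversion all arms share one result type, so the handling code below the `match` only has to be written once.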

nexus/reconfigurator/planning/src/mgs_updates/mod.rs

karencfv

Thanks for the review! I think I've addressed all of the comments

karencfv added 30 commits August 29, 2025 16:04

scaffolding

5b0d884

Some thoughts

477cb9b

Implement functionality for RoT bootloader

9f4968d

Plumb skipped updates through

b8766df

populate SkippedMgsUpdates

0e5bd57

remove unnecessary checks

0c5197a

do the todos

5a1a74f

expectorate

60240f2

Make SkippedMgsUpdates a vec because we need all records

ad2e83e

fix no pending updates bug

36c788e

Clean up

ed81591

error type clean up

4efa776

clean up tuple mess

14a43ac

clean up

b040913

fully working sample

70f8495

refactor try_make_update

e1ea3d2

use builder pattern

f2c7c2a

remove unnecessary struct

3293c4f

improve error messages

6d866d6

clippy

e8f027b

Make the tests pass

1da94b0

Fix openapi generation

d0d37e6

Mull over tests

03b7432

at least the error is different now 😑

4f15d18

Fix test_update_boundary_ntp and test_update_crucible_pantry

dd8f911

fix test_update_cockroach

23e8f54

finally all the tests pass

f47e1cd

merge main

4fb55c4

fix after merge

34a3906

clean up

0781a21

karencfv added 4 commits September 23, 2025 18:08

Bail on failed update and improve testing

cc8e07a

address style comments

2444040

Get rid of SkippedMgsUpdates

19e8e26

use blocked instead of skipped

a78d9b1

karencfv commented Sep 23, 2025

jgallagher reviewed Sep 23, 2025

karencfv added 10 commits September 24, 2025 15:59

address comments

1298942

Merge main

b29203f

expectorate

45f9dbd

tests are passing 🎉

7d9d1a0

remove unnecessary blueprint updates

0e367a9

jfc merge again

4d18593

generate openapi spec

7e4e36e

fmt

f921c50

add the todos

cb197db

fmt

01391ed

karencfv commented Sep 24, 2025

karencfv requested review from davepacheco and jgallagher September 24, 2025 07:14

This was referenced Sep 24, 2025

[reconfigurator-cli] Add testing for blocked MGS driven updates #9067

Closed

[reconfigurator] Improve blocked MGS driven update code #9068

Closed

karencfv added 4 commits September 25, 2025 09:56

Address comments

52d399f

merge main

46ef2b2

fixes, expectorations, and openapi doc gen after merge with main

4ca45ba

fmt 😑

536996d

jgallagher approved these changes Sep 25, 2025

karencfv merged commit 227d9df into oxidecomputer:main Sep 25, 2025
18 checks passed

karencfv deleted the planner-update-zones-after-sp branch September 25, 2025 20:49

karencfv mentioned this pull request Sep 30, 2025

[reconfigurator] Refactor MGS driven updates #9118

Merged

karencfv added a commit that referenced this pull request Oct 2, 2025

[reconfigurator] Refactor MGS driven updates (#9118)

f2a3273

Just a few style changes as suggested in #9001 Closes: #9068
