Contributor

@karencfv karencfv commented Sep 5, 2025

Before any zones are updated, we need to make sure all MGS-driven updates have succeeded and/or are at the correct version. This PR enforces that by bailing out if any MGS-driven update failed. All skipped updates are also saved as part of the planner reports.

Closes: #8285

TODO:

  • Fix failing zone tests
  • Better testing

Contributor Author

@karencfv karencfv left a comment

Thanks for the thorough review @davepacheco! I think I've addressed all of your comments. The update zone tests will need to be modified a bit since #8921 is set on auto-merge and will break the testing here. But other than that, I think this is ready to go!

Comment on lines 255 to 259
let PlannedMgsUpdates {
    pending_updates: updates,
    pending_host_phase_2_changes: mut host_phase_2,
    skipped_mgs_updates: mut skipped_updates,
} = try_make_update(log, board, inventory, current_artifacts);
Contributor Author

Left a comment, hope it's enough to clear things up?

     inventory: &Collection,
     current_artifacts: &TufRepoDescription,
- ) -> Option<(PendingMgsUpdate, PendingHostPhase2Changes)> {
+ ) -> Result<
Contributor

We don't necessarily need to change this in this PR, but this type is pretty beefy. I wonder if it would be clearer for us to define an enum with three states; something like (names are hard):

enum UpdateAttempt {
    NoUpdateNeeded,
    Planned(PendingMgsUpdate, PendingHostPhase2Changes),
    Error(FailedMgsUpdateReason),
}

so details like "Ok(None) means there's no update needed" are more explicit?
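
To make the suggestion concrete, here is a minimal sketch of how the three-state enum could sit on top of the nested return type. The struct bodies are placeholders rather than the real definitions, the `NestedResult` alias and the `From` conversion are assumptions added for illustration, and the exact shape of the current return type is assumed from the `Ok(None)` remark above.

```rust
// Placeholder types: the real PendingMgsUpdate, PendingHostPhase2Changes and
// FailedMgsUpdateReason carry far more data than these stand-ins.
#[derive(Debug, PartialEq)]
pub struct PendingMgsUpdate(pub &'static str);

#[derive(Debug, PartialEq)]
pub struct PendingHostPhase2Changes(pub Vec<&'static str>);

#[derive(Debug, PartialEq)]
pub enum FailedMgsUpdateReason {
    NoMatchingArtifactFound,
}

/// The three outcomes, each with an explicit name.
#[derive(Debug, PartialEq)]
pub enum UpdateAttempt {
    NoUpdateNeeded,
    Planned(PendingMgsUpdate, PendingHostPhase2Changes),
    Error(FailedMgsUpdateReason),
}

/// Assumed shape of the nested return type under discussion (hypothetical alias).
pub type NestedResult = Result<
    Option<(PendingMgsUpdate, PendingHostPhase2Changes)>,
    FailedMgsUpdateReason,
>;

impl From<NestedResult> for UpdateAttempt {
    fn from(result: NestedResult) -> Self {
        match result {
            // "Ok(None) means there's no update needed" becomes a named state.
            Ok(None) => UpdateAttempt::NoUpdateNeeded,
            Ok(Some((update, changes))) => UpdateAttempt::Planned(update, changes),
            Err(reason) => UpdateAttempt::Error(reason),
        }
    }
}
```

A conversion like this would let the enum be introduced at the call sites first, without rewriting every `try_make_update_*` function in the same change.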

Contributor Author

I like that idea! I've left some TODOs to finish up in a follow up PR in cb197db

         baseboard_id,
     );
-    return None;
+    return Err(FailedMgsUpdateReason::NoMatchingArtifactFound);
Contributor

Could we add some context to NoMatchingArtifactFound? It may be obvious for the other kinds of updates, but for the host it looks like we've lost whether the problem is phase 1 or phase 2.

Another thought, although this might be nonsense after reading the rest of the PR - we could potentially have different error types for each kind of update, so we could have more host-specific error variants, and then combine the error types in a higher-level enum?
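
A rough sketch of both ideas together, under stated assumptions: `HostPhase`, the `phase` field, `find_host_artifacts` and `plan_host_update` are hypothetical names invented here, and the boolean parameters stand in for real artifact lookups. The host-specific error records which phase had no artifact, and `From` impls let `?` lift it into the higher-level enum.

```rust
/// Which host OS phase a failure refers to (hypothetical).
#[derive(Debug, PartialEq)]
pub enum HostPhase {
    Phase1,
    Phase2,
}

/// Hypothetical host-specific reasons: the phase travels with the error.
#[derive(Debug, PartialEq)]
pub enum FailedHostUpdateReason {
    NoMatchingArtifactFound { phase: HostPhase },
}

/// Hypothetical SP-specific reasons: there is no phase to report.
#[derive(Debug, PartialEq)]
pub enum FailedSpUpdateReason {
    NoMatchingArtifactFound,
}

/// Higher-level enum combining the per-component error types.
#[derive(Debug, PartialEq)]
pub enum FailedMgsUpdateReason {
    Sp(FailedSpUpdateReason),
    Host(FailedHostUpdateReason),
}

impl From<FailedSpUpdateReason> for FailedMgsUpdateReason {
    fn from(reason: FailedSpUpdateReason) -> Self {
        Self::Sp(reason)
    }
}

impl From<FailedHostUpdateReason> for FailedMgsUpdateReason {
    fn from(reason: FailedHostUpdateReason) -> Self {
        Self::Host(reason)
    }
}

/// Stand-in for the host artifact lookup; the booleans replace real queries.
fn find_host_artifacts(
    has_phase_1: bool,
    has_phase_2: bool,
) -> Result<(), FailedHostUpdateReason> {
    if !has_phase_1 {
        return Err(FailedHostUpdateReason::NoMatchingArtifactFound {
            phase: HostPhase::Phase1,
        });
    }
    if !has_phase_2 {
        return Err(FailedHostUpdateReason::NoMatchingArtifactFound {
            phase: HostPhase::Phase2,
        });
    }
    Ok(())
}

/// Callers see the combined type; `?` converts through the `From` impl.
pub fn plan_host_update(
    has_phase_1: bool,
    has_phase_2: bool,
) -> Result<(), FailedMgsUpdateReason> {
    find_host_artifacts(has_phase_1, has_phase_2)?;
    Ok(())
}
```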

Contributor Author

That makes sense, I'll address this in the follow-up PR that will break up FailedMgsUpdateReason (see cb197db).

writeln!(f, "{}", table)?;
}

// TODO-K: Add skipped updates in a follow up PR
Contributor

What does "skipped updates" mean in the context of the difference between two blueprints?

)]
#[serde(rename_all = "snake_case")]
#[serde(tag = "type", content = "value")]
pub enum FailedMgsUpdateReason {
Contributor

I think I more strongly like the idea of breaking this down into something like

pub enum FailedMgsUpdateReason {
    RotBootloader(FailedRotBootloaderUpdateReason),
    Rot(FailedRotUpdateReason),
    Sp(FailedSpUpdateReason),
    Host(FailedHostUpdateReason),
}

If we did this, we might not even need to track component in BlockedMgsUpdate, because we could infer it from which reason variant we have?

This is a style thing that could definitely be done separately though; wouldn't affect the real work here.
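
As a sketch of the "infer it from the variant" point: the per-component reason enums below are single-variant placeholders, the `MgsUpdateComponent` variant names other than `RotBootloader` are assumed, and the `BlockedMgsUpdate` fields are hypothetical. With the nested layout, the component becomes a derived accessor instead of a stored field.

```rust
/// Component kinds; variant names other than RotBootloader are assumed.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MgsUpdateComponent {
    RotBootloader,
    Rot,
    Sp,
    HostOs,
}

// Placeholder per-component reasons; real ones would have richer variants.
#[derive(Debug, PartialEq)]
pub enum FailedRotBootloaderUpdateReason { NoMatchingArtifactFound }
#[derive(Debug, PartialEq)]
pub enum FailedRotUpdateReason { NoMatchingArtifactFound }
#[derive(Debug, PartialEq)]
pub enum FailedSpUpdateReason { NoMatchingArtifactFound }
#[derive(Debug, PartialEq)]
pub enum FailedHostUpdateReason { NoMatchingArtifactFound }

#[derive(Debug, PartialEq)]
pub enum FailedMgsUpdateReason {
    RotBootloader(FailedRotBootloaderUpdateReason),
    Rot(FailedRotUpdateReason),
    Sp(FailedSpUpdateReason),
    Host(FailedHostUpdateReason),
}

impl FailedMgsUpdateReason {
    /// The variant already says which component failed.
    pub fn component(&self) -> MgsUpdateComponent {
        match self {
            Self::RotBootloader(_) => MgsUpdateComponent::RotBootloader,
            Self::Rot(_) => MgsUpdateComponent::Rot,
            Self::Sp(_) => MgsUpdateComponent::Sp,
            Self::Host(_) => MgsUpdateComponent::HostOs,
        }
    }
}

/// Hypothetical slimmed-down BlockedMgsUpdate with no `component` field.
#[derive(Debug, PartialEq)]
pub struct BlockedMgsUpdate {
    pub baseboard: &'static str, // placeholder for the real baseboard id type
    pub reason: FailedMgsUpdateReason,
}

impl BlockedMgsUpdate {
    pub fn component(&self) -> MgsUpdateComponent {
        self.reason.component()
    }
}
```

One consequence of this design is that a mismatched pair (say, an SP component with a host reason) can no longer be constructed.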

Contributor Author

I like this idea a lot! Will implement in a follow-up PR (see cb197db).

// necessary.
return Some((update, PendingHostPhase2Changes::empty()));
type UpdateResult = Result<Option<PendingMgsUpdate>, FailedMgsUpdateReason>;
let attempts: [(MgsUpdateComponent, UpdateResult); 3] = [
Contributor

I think as written this eagerly calls all three update functions, but then throws away the result of the later ones if an earlier one returns Some(_), right? I think this might be clearer as a for loop with a match; roughly

for component in [MgsUpdateComponent::RotBootloader, /* ... the rest ... */] {
    let attempt = match component {
        MgsUpdateComponent::RotBootloader => try_make_update_rot_bootloader(..),
        // ... the rest ...
    };

    // handle attempt - either return or continue to the next component
}

which I think might also let us include the host updates in this loop without having to duplicate some of the boilerplate below?
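
A self-contained sketch of the lazy loop, to show that later planners are never invoked once an earlier one yields an update. The three planner functions are hard-coded stand-ins that take a `calls` log instead of the real arguments, `plan_first_update` is a hypothetical name, and recording an error as skipped before moving on is an assumption about the handling, not a description of the merged code.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MgsUpdateComponent {
    RotBootloader,
    Rot,
    Sp,
}

#[derive(Debug, PartialEq)]
pub struct PendingMgsUpdate(pub MgsUpdateComponent); // placeholder payload

#[derive(Debug, PartialEq)]
pub enum FailedMgsUpdateReason {
    NoMatchingArtifactFound,
}

type UpdateResult = Result<Option<PendingMgsUpdate>, FailedMgsUpdateReason>;

// Stand-in planners: each logs that it ran, then returns a fixed outcome.
fn try_make_update_rot_bootloader(calls: &mut Vec<MgsUpdateComponent>) -> UpdateResult {
    calls.push(MgsUpdateComponent::RotBootloader);
    Err(FailedMgsUpdateReason::NoMatchingArtifactFound)
}

fn try_make_update_rot(calls: &mut Vec<MgsUpdateComponent>) -> UpdateResult {
    calls.push(MgsUpdateComponent::Rot);
    Ok(Some(PendingMgsUpdate(MgsUpdateComponent::Rot)))
}

fn try_make_update_sp(calls: &mut Vec<MgsUpdateComponent>) -> UpdateResult {
    calls.push(MgsUpdateComponent::Sp);
    Ok(None)
}

/// Tries each component in order and stops at the first planned update.
pub fn plan_first_update(
    calls: &mut Vec<MgsUpdateComponent>,
) -> (Option<PendingMgsUpdate>, Vec<(MgsUpdateComponent, FailedMgsUpdateReason)>) {
    let mut skipped = Vec::new();
    for component in [
        MgsUpdateComponent::RotBootloader,
        MgsUpdateComponent::Rot,
        MgsUpdateComponent::Sp,
    ] {
        // Only the planner for the current component runs on each iteration.
        let attempt = match component {
            MgsUpdateComponent::RotBootloader => try_make_update_rot_bootloader(calls),
            MgsUpdateComponent::Rot => try_make_update_rot(calls),
            MgsUpdateComponent::Sp => try_make_update_sp(calls),
        };
        match attempt {
            Ok(Some(update)) => return (Some(update), skipped),
            Ok(None) => continue,
            // Assumed handling: remember why this component was skipped.
            Err(reason) => skipped.push((component, reason)),
        }
    }
    (None, skipped)
}
```

With these stand-ins, the `calls` log ends up holding only the RoT bootloader and RoT entries, because the loop returns before the SP planner is reached.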

Contributor Author

Makes sense, thanks! I updated the code, and it's a little verbose because the return type for the host OS update function is different than the others, but I think it's pretty clear regardless. Let me know what you think!

Contributor

@jgallagher jgallagher Sep 24, 2025

I think we could squish almost all the repetition out by having the host OS arm convert its tuple into something that matches the other arms; this diff should compile and I think be equivalent? https://gist.github.com/jgallagher/1ab46a641af92f5a372b41a0521ed99e
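
For readers without the gist open, one way such a conversion could look is sketched below. This is not the contents of the linked gist: `normalize_host`, the type aliases, and the placeholder structs are all hypothetical. The host result's tuple is split so that the phase 2 changes are stashed in a side variable and only the update is passed along, matching the shape of the other arms.

```rust
// Placeholder types standing in for the real ones.
#[derive(Debug, PartialEq)]
pub struct PendingMgsUpdate(pub &'static str);

#[derive(Debug, Default, PartialEq)]
pub struct PendingHostPhase2Changes(pub Vec<&'static str>);

#[derive(Debug, PartialEq)]
pub enum FailedMgsUpdateReason {
    NoMatchingArtifactFound,
}

/// Shape returned by the non-host planners.
pub type UpdateResult = Result<Option<PendingMgsUpdate>, FailedMgsUpdateReason>;

/// Shape returned by the host OS planner: the update plus phase 2 changes.
pub type HostUpdateResult = Result<
    Option<(PendingMgsUpdate, PendingHostPhase2Changes)>,
    FailedMgsUpdateReason,
>;

/// Converts the host result into the common shape. When an update is
/// planned, its phase 2 changes are written to `host_phase_2`; otherwise
/// `host_phase_2` is left untouched.
pub fn normalize_host(
    result: HostUpdateResult,
    host_phase_2: &mut PendingHostPhase2Changes,
) -> UpdateResult {
    result.map(|maybe_update| {
        maybe_update.map(|(update, changes)| {
            *host_phase_2 = changes;
            update
        })
    })
}
```

After this step every arm yields the same `UpdateResult`, so a single `match` on the attempt can handle all components.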

Contributor Author

Oh yeah, true! Done :)

Contributor Author

@karencfv karencfv left a comment

Thanks for the review! I think I've addressed all of the comments

@karencfv karencfv merged commit 227d9df into oxidecomputer:main Sep 25, 2025
18 checks passed
@karencfv karencfv deleted the planner-update-zones-after-sp branch September 25, 2025 20:49
karencfv added a commit that referenced this pull request Oct 2, 2025
Just a few style changes as suggested in #9001 

Closes: #9068

Development

Successfully merging this pull request may close these issues.

planner: do not update any zones until we're sure SPs are updated

3 participants