-
Notifications
You must be signed in to change notification settings - Fork 60
[reconfigurator] Planner should wait for all MGS updates before proceeding to zones #9001
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[reconfigurator] Planner should wait for all MGS updates before proceeding to zones #9001
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thorough review @davepacheco! I think I've addressed all of your comments. The update zone tests will need to be modified a bit since #8921 is set on auto-merge and will break the testing here. But other than that, I think this is ready to go!
| let PlannedMgsUpdates { | ||
| pending_updates: updates, | ||
| pending_host_phase_2_changes: mut host_phase_2, | ||
| skipped_mgs_updates: mut skipped_updates, | ||
| } = try_make_update(log, board, inventory, current_artifacts); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment, hope it's enough to clear things up?
| inventory: &Collection, | ||
| current_artifacts: &TufRepoDescription, | ||
| ) -> Option<(PendingMgsUpdate, PendingHostPhase2Changes)> { | ||
| ) -> Result< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't necessarily need to change this on this PR, but this type is pretty beefy. I wonder if it would be clearer for us to define an enum with three states; something like (names are hard)
enum UpdateAttempt {
NoUpdateNeeded,
Planned(PendingMgsUpdate, PendingHostPhase2Changes),
Error(FailedMgsUpdateReason),
}so details like "Ok(None) means there's no update needed" are more explicit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea! I've left some TODOs to finish up in a follow up PR in cb197db
| baseboard_id, | ||
| ); | ||
| return None; | ||
| return Err(FailedMgsUpdateReason::NoMatchingArtifactFound); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add some context to NoMatchingArtifactFound? It may be obvious for the other kinds of updates, but for the host it looks like we've lost whether the problem is phase 1 or phase 2.
Another thought, although this might be nonsense after reading the rest of the PR - we could potentially have different error types for each kind of update, so we could have more host-specific error variants, and then combine the error types in a higher-level enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, I'll address this in the follow up PR that will break up FailedMgsUpdateReason cb197db
| writeln!(f, "{}", table)?; | ||
| } | ||
|
|
||
| // TODO-K: Add skipped updates in a follow up PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "skipped updates" mean in the context of the difference between two blueprints?
| )] | ||
| #[serde(rename_all = "snake_case")] | ||
| #[serde(tag = "type", content = "value")] | ||
| pub enum FailedMgsUpdateReason { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I more strongly like the idea of breaking this down into something like
pub enum FailedMgsUpdateReason {
RotBootloader(FailedRotBootloaderUpdateReason),
Rot(FailedRotUpdateReason),
Sp(FailedSpUpdateReason),
Host(FailedHostUpdateReason),
}If we did this, we might not even need to track component in BlockedMgsUpdate, because we could infer it from which reason variant we have?
This is a style thing that could definitely be done separately though; wouldn't affect the real work here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea a lot! Will implement in a follow up PR cb197db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // necessary. | ||
| return Some((update, PendingHostPhase2Changes::empty())); | ||
| type UpdateResult = Result<Option<PendingMgsUpdate>, FailedMgsUpdateReason>; | ||
| let attempts: [(MgsUpdateComponent, UpdateResult); 3] = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think as written this eagerly calls all three update functions, but then throws away the result of the later ones if an earlier one returns Some(_), right? I think this might be clearer as a for loop with a match; roughly
for component in [MgsUpdateComponent::RotBootloader, /* ... the rest ... */] {
let attempt = match component {
MgsUpdateComponent::RotBootloader => try_make_update_rot_bootloader(..),
// ... the rest ...
};
// handle attempt - either return or continue to the next component
}which I think might also let us include the host updates in this loop without having to duplicate some of the boilerplate below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks! I updated the code, and it's a little verbose because the return type for the host OS update function is different than the others, but I think it's pretty clear regardless. Let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could squish all most all the repetition out by having the host OS arm convert its tuple into something that matches the other arms; this diff should compile and I think be equivalent? https://gist.github.com/jgallagher/1ab46a641af92f5a372b41a0521ed99e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, true! Done :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! I think I've addressed all of the comments
Before any zones are updated, we need to make sure all MGS driven updates have succeeded and/or are at the correct version. This PR makes sure this is so, by bailing out if any MGS driven updates failed. All skipped updates are also saved as part of the planner reports
Closes: #8285
TODO: