incorporate SP updates into planner #8269
Conversation
Also remove now-superfluous `OrderedComponent` enum.
    if let UpdateStepResult::ContinueToNextStep =
        self.do_plan_mgs_updates()?
    {
        self.do_plan_zone_updates()?;
    }
This might be nitpicky, but it seems a little weird that if we don't get ContinueToNextStep, we skip the immediate next step but still do the rest of planning (which admittedly is not very much). Maybe this should be specific to "can we do further updates"? (#8284 / #8285 / #8298 are all closely related, so also fine to defer this until we do some combination of them.)
Yeah, I've been assuming we'd iterate on the exact control flow / pattern here as we add more steps.
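For illustration, here is a minimal, self-contained sketch of the two control-flow shapes being discussed. `UpdateStepResult` and the `do_plan_*` step names mirror the diff above, but the step bodies and the `plan_current`/`plan_suggested` wrappers are hypothetical:

```rust
// Illustrative sketch only -- not the planner's actual code. The step
// bodies are stubs; only the control-flow shape matters here.

enum UpdateStepResult {
    ContinueToNextStep,
    Waiting,
}

struct Planner;

impl Planner {
    fn do_plan_mgs_updates(&mut self) -> UpdateStepResult {
        // Pretend an SP update is still outstanding.
        UpdateStepResult::Waiting
    }

    fn do_plan_zone_updates(&mut self) {}

    fn do_plan_everything_else(&mut self) {}

    // Current shape: a non-Continue result skips only zone updates; the
    // rest of planning still runs.
    fn plan_current(&mut self) {
        if let UpdateStepResult::ContinueToNextStep = self.do_plan_mgs_updates() {
            self.do_plan_zone_updates();
        }
        self.do_plan_everything_else();
    }

    // Suggested shape: treat the result as "is it safe to plan further
    // updates?" and gate every later update-producing step on it.
    fn plan_suggested(&mut self) {
        if let UpdateStepResult::ContinueToNextStep = self.do_plan_mgs_updates() {
            self.do_plan_zone_updates();
            self.do_plan_everything_else();
        }
    }
}

fn main() {
    let mut planner = Planner;
    planner.plan_current();
    planner.plan_suggested();
}
```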
    // For better or worse, switches and PSCs do not have the same idea of
    // being adopted into the control plane. If they're present, they're
    // part of the system, and we will update them.
Should we have an issue about this? I don't think trust quorum interacts with non-sled components at all, but I think in the fullness of time we want some kind of auth on the management network, which presumably involves the control plane being aware of PSCs/switches and knowing whether or not it's okay to talk to them?
We certainly could. I'm not sure exactly what I'd file at this point. Something like: "eventually we will have the business requirement to lock down this network, and when we do, we'll have to better manage the lifecycle of these components". Or "lifecycle of switches and PSCs could be controlled, like sleds".
I know @bnaecker has been thinking about this a bit in the context of multi-rack.
    &included_baseboards,
    &current_updates,
    current_artifacts,
    1,
Can we add a comment for what this is, or put it in a constant? I assume it's "number of upgrades to attempt"?
Good call. Added in 29df3f6.
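For readers following along, a hedged sketch of that kind of change (the constant name, doc comment, and `plan_mgs_updates` signature here are assumptions for illustration, not the actual code from 29df3f6):

```rust
// Illustrative only: the constant name and the `plan_mgs_updates` signature
// below are assumptions for this sketch, not the actual planner code.

/// Maximum number of MGS-managed (SP) updates the planner will schedule in
/// a single blueprint.
const NUM_CONCURRENT_MGS_UPDATES: usize = 1;

fn plan_mgs_updates(
    included_baseboards: &[String],
    _current_updates: &[String],
    _current_artifacts: &[String],
    max_updates: usize,
) -> Vec<String> {
    // Stub: pick at most `max_updates` baseboards to update.
    included_baseboards.iter().take(max_updates).cloned().collect()
}

fn main() {
    let included_baseboards =
        vec!["baseboard-0".to_string(), "baseboard-1".to_string()];
    let current_updates: Vec<String> = Vec::new();
    let current_artifacts: Vec<String> = Vec::new();

    // The bare `1` at the call site becomes a named constant.
    let next = plan_mgs_updates(
        &included_baseboards,
        &current_updates,
        &current_artifacts,
        NUM_CONCURRENT_MGS_UPDATES,
    );
    println!("planned {} update(s)", next.len());
}
```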
        UpdateStepResult::Waiting
    };
    self.blueprint.pending_mgs_updates_replace_all(next);
    Ok(rv)
It doesn't look like it's possible for this step to fail - do you think it will be in the future? Or could we change the return type to just UpdateStepResult?
I've changed it to not fail in 29df3f6.
(I had it this way because I thought it was clearer to have the planning functions have the same signature, and because I think it's possible we will want to allow this to fail, but I think we'll work this out in the follow-on PRs like #8284 so I'm happy to keep it simpler for now.)
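A minimal sketch of the signature change under discussion, with stand-in types (not the code from 29df3f6): once nothing in the step can fail, it can return `UpdateStepResult` directly rather than `Result<UpdateStepResult, _>`.

```rust
// Illustrative only: the planner, blueprint, and pending-update types are
// stand-ins. The point is the return type: a step with no fallible work
// drops the `Result` wrapper.

enum UpdateStepResult {
    ContinueToNextStep,
    Waiting,
}

#[derive(Default)]
struct BlueprintBuilder {
    pending_mgs_updates: Vec<String>,
}

impl BlueprintBuilder {
    fn pending_mgs_updates_replace_all(&mut self, next: Vec<String>) {
        self.pending_mgs_updates = next;
    }
}

struct Planner {
    blueprint: BlueprintBuilder,
}

impl Planner {
    // Before: fn do_plan_mgs_updates(&mut self) -> Result<UpdateStepResult, Error>
    // After: nothing here can fail, so return the step result directly.
    fn do_plan_mgs_updates(&mut self) -> UpdateStepResult {
        let next = Vec::new(); // compute the next set of pending updates
        let rv = if next.is_empty() {
            UpdateStepResult::ContinueToNextStep
        } else {
            UpdateStepResult::Waiting
        };
        self.blueprint.pending_mgs_updates_replace_all(next);
        rv
    }
}

fn main() {
    let mut planner = Planner { blueprint: BlueprintBuilder::default() };
    match planner.do_plan_mgs_updates() {
        UpdateStepResult::ContinueToNextStep => println!("no SP updates pending"),
        UpdateStepResult::Waiting => println!("waiting on SP updates"),
    }
}
```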
Thanks! I've addressed the feedback and will enable auto-merge.
This is the first of a sequence of PRs that fully integrates planning of SP updates into the planner. There's:

- "`reconfigurator-cli` should support setting target release" (#8283)

Testing for this is in #8283 in the form of a `reconfigurator-cli` test. Depends on #8024. Fixes #7414 and #7819.

This branch is also sync'd up with "main" in order to get #8261 -- hence most of these commits.

Still to do here:

- [x] filed "planner: do not update any zones until we're sure SPs are updated" (#8285)
- [x] add/update tests (these are now in "`reconfigurator-cli` should support setting target release", #8283)
- [x] prototyped `reconfigurator-cli` support (in "reconfigurator-cli could let you set SP versions", #8273, and "`reconfigurator-cli` should support setting target release", #8283)