Conversation

@davepacheco (Collaborator) commented Jun 4, 2025:

This is the first of a sequence of PRs that fully integrates planning of SP updates into the planner.

Testing for this is in #8283 in the form of a reconfigurator-cli test.

Depends on #8024. Fixes #7414 and #7819.

This branch is also sync'd up with "main" in order to get #8261 -- hence most of these commits.

Still to do here:

@davepacheco self-assigned this Jun 4, 2025
@davepacheco (Collaborator, Author) commented:

With #8273 and #8283, I'm able to use reconfigurator-cli to execute several blueprint planning steps that together update all the SPs in a (simulated) rack. I've added a test for this in #8283. That basically makes up the testing for this PR.

@davepacheco marked this pull request as ready for review June 5, 2025 19:39
Base automatically changed from plan-target-release to main June 6, 2025 21:41
Comment on lines 125 to 129
if let UpdateStepResult::ContinueToNextStep =
self.do_plan_mgs_updates()?
{
self.do_plan_zone_updates()?;
}
@jgallagher (Contributor) commented Jun 13, 2025:

This might be nitpicky, but it seems a little weird that if we don't get ContinueToNextStep, we skip the immediate next step but still do the rest of planning (which admittedly is not very much). Maybe this should be specific to "can we do further updates"? (#8284 / #8285 / #8298 are all closely related, so also fine to defer this until we do some combination of them.)

@davepacheco (Collaborator, Author) replied:

Yeah, I've been assuming we'd iterate on the exact control flow / pattern here as we add more steps.
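
For readers following along, here is a small, self-contained sketch of the gating pattern under discussion. The enum variants and the `do_plan_mgs_updates` / `do_plan_zone_updates` names mirror the diff above, but the surrounding `Planner` scaffolding is made up; this is not the actual planner code. The point is that only the update-dependent step is skipped while an SP update is pending, and any unrelated planning would still run afterwards.

    // Hypothetical, standalone sketch; not the actual planner code.
    #[derive(Debug)]
    enum UpdateStepResult {
        ContinueToNextStep,
        Waiting,
    }

    struct Planner;

    impl Planner {
        fn do_plan_mgs_updates(&mut self) -> UpdateStepResult {
            // Pretend no SP/MGS-driven updates are pending.
            UpdateStepResult::ContinueToNextStep
        }

        fn do_plan_zone_updates(&mut self) {
            // Zone updates would be planned here.
        }

        fn plan(&mut self) {
            // Gate only the update-dependent step on the MGS-update result.
            match self.do_plan_mgs_updates() {
                UpdateStepResult::ContinueToNextStep => self.do_plan_zone_updates(),
                UpdateStepResult::Waiting => {
                    // An SP update is still pending; defer zone updates.
                }
            }
            // ... any remaining, update-independent planning steps ...
        }
    }

    fn main() {
        let mut planner = Planner;
        planner.plan();
    }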

Comment on lines +926 to +928
// For better or worse, switches and PSCs do not have the same idea of
// being adopted into the control plane. If they're present, they're
// part of the system, and we will update them.
Contributor comment:

Should we have an issue about this? I don't think trust quorum interacts with non-sled components at all, but I think in the fullness of time we want some kind of auth on the management network, which presumably involves the control plane being aware of PSCs/switches and knowing whether or not it's okay to talk to them?

@davepacheco (Collaborator, Author) replied:

We certainly could. I'm not sure exactly what I'd file at this point. Something like: "eventually we will have the business requirement to lock down this network, and when we do, we'll have to better manage the lifecycle of these components". Or "lifecycle of switches and PSCs could be controlled, like sleds".

I know @bnaecker has been thinking about this a bit in the context of multi-rack.

&included_baseboards,
&current_updates,
current_artifacts,
1,
Contributor comment:

Can we add a comment for what this is, or put it in a constant? I assume it's "number of upgrades to attempt"?

@davepacheco (Collaborator, Author) replied:

Good call. Added in 29df3f6.
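
For illustration only (the constant name below is hypothetical and may not match what actually landed in 29df3f6), the idea is simply to give the bare `1` a name and a doc comment at its definition:

    // Hypothetical sketch of naming the magic number; see 29df3f6 for the
    // actual change.

    /// How many MGS-managed (e.g., SP) updates the planner will keep
    /// pending at any one time.
    const NUM_CONCURRENT_MGS_UPDATES: usize = 1;

    // ...and then pass NUM_CONCURRENT_MGS_UPDATES instead of a literal `1`
    // at the call site shown above.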

UpdateStepResult::Waiting
};
self.blueprint.pending_mgs_updates_replace_all(next);
Ok(rv)
Contributor comment:

It doesn't look like it's possible for this step to fail - do you think it will be in the future? Or could we change the return type to just UpdateStepResult?

@davepacheco (Collaborator, Author) replied:

I've changed it to not fail in 29df3f6.

(I had it this way because I thought it was clearer to have the planning functions have the same signature, and because I think it's possible we will want to allow this to fail, but I think we'll work this out in the follow-on PRs like #8284 so I'm happy to keep it simpler for now.)

@davepacheco (Collaborator, Author) left a review comment:

Thanks! I've addressed the feedback and will enable auto-merge.

@davepacheco enabled auto-merge (squash) June 13, 2025 17:54
@davepacheco merged commit d333576 into main Jun 13, 2025
16 checks passed
@davepacheco deleted the dap/sp-planning-real branch June 13, 2025 19:27
hawkw added a commit that referenced this pull request Aug 1, 2025
PR #8269 added CRDB tables for storing ereports received from both
service processors and the sled host OS. These ereports are generated to
indicate a fault or other important event, so they contain information
that's probably worth including in service bundles. So we should do
that.

This branch adds code to the `SupportBundleCollector` background task
for querying the database for ereports and putting them in the bundle.
This, in turn, required adding code for querying ereports over a
specified time range. The `BundleRequest` can be constructed with a set
of filters for ereports, including the time window, and a list of serial
numbers to collect ereports from. Presently, we always just use the
default: we collect ereports from all serial numbers for the 7 days
prior to bundle collection. But I anticipate that this will be
used more in the future when we add a notion of targeted support
bundles: for instance, if we generate a support bundle for a particular
sled, we would probably only grab ereports from that sled.
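
As a purely illustrative sketch of that default (the type and field names below are hypothetical, not the real `BundleRequest` API): all serial numbers, with a window covering the 7 days before collection.

    use chrono::{DateTime, Duration, Utc};

    /// Hypothetical filter shape; the real BundleRequest fields may differ.
    struct EreportFilters {
        /// Only collect ereports reported within this window.
        start: DateTime<Utc>,
        end: DateTime<Utc>,
        /// Restrict collection to these serial numbers; None means all.
        serials: Option<Vec<String>>,
    }

    impl EreportFilters {
        /// The default described above: all serials, last 7 days before
        /// bundle collection.
        fn default_for_collection(now: DateTime<Utc>) -> Self {
            Self { start: now - Duration::days(7), end: now, serials: None }
        }
    }

    fn main() {
        let filters = EreportFilters::default_for_collection(Utc::now());
        println!("window: {} .. {}", filters.start, filters.end);
        assert!(filters.serials.is_none());
    }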

Ereports are stored in an `ereports` directory in the bundle, with
subdirectories for each serial number that emitted an ereport. Each
serial number directory has a subdirectory for each ereport restart ID
of that serial, and the individual ereports are stored within the
restart ID directory as JSON files. The path to an individual ereport
will be `ereports/${SERIAL_NUMBER}/${RESTART_ID}/${ENA}.json`. I'm open
to changing this organization scheme if others think there's a better
approach --- for example, we could place the restart ID in the filename
rather than in a subdirectory if that would be more useful.
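
Concretely (a trivial sketch of the layout described above; the serial number and IDs below are made-up examples, and the real code may construct paths differently):

    /// Sketch of the documented layout:
    /// ereports/${SERIAL_NUMBER}/${RESTART_ID}/${ENA}.json
    fn ereport_path(serial: &str, restart_id: &str, ena: &str) -> String {
        format!("ereports/{serial}/{restart_id}/{ena}.json")
    }

    fn main() {
        // Made-up example values, purely for illustration.
        println!("{}", ereport_path("SERIAL123", "restart-uuid", "0x1a"));
        // Prints: ereports/SERIAL123/restart-uuid/0x1a.json
    }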

Ereport collection is done in parallel to the rest of the support bundle
collection by spawning Tokio tasks to collect host OS and service
processor ereports. `tokio_util::task::AbortOnDropHandle` is used to
wrap the `JoinHandle`s for these tasks to ensure they're aborted if the
ereport collection future is dropped, so that we stop collecting
ereports if the support bundle is cancelled.
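
For readers unfamiliar with the pattern, here is a minimal standalone sketch of the abort-on-drop approach described above. The task body and names are invented (this is not the support-bundle code itself), and it assumes the `tokio` and `tokio-util` crates with their runtime and time features are available.

    use std::time::Duration;
    use tokio_util::task::AbortOnDropHandle;

    // Stand-in for the real ereport collection logic.
    async fn collect_sp_ereports() -> usize {
        // Pretend to query the database for SP ereports.
        tokio::time::sleep(Duration::from_secs(60)).await;
        0
    }

    async fn collect_bundle() {
        // Spawn the collection task and wrap its JoinHandle so the task is
        // aborted if this future is dropped (e.g., the bundle is cancelled).
        let sp_task = AbortOnDropHandle::new(tokio::spawn(collect_sp_ereports()));

        // Dropping `sp_task` (for instance, if `collect_bundle` is cancelled)
        // aborts the spawned task instead of leaking it.
        match sp_task.await {
            Ok(count) => println!("collected {count} SP ereports"),
            Err(err) => println!("SP ereport collection failed: {err}"),
        }
    }

    #[tokio::main]
    async fn main() {
        // Cancel the whole collection after a short timeout to show that the
        // spawned task is aborted when its handle is dropped.
        let _ = tokio::time::timeout(Duration::from_millis(10), collect_bundle()).await;
    }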

Fixes #8649