[reconfigurator] Pre-checks and post_update actions for RoT bootloader update #8325
    // We'll loop for 3 minutes to wait for any ongoing RoT bootloader update.
    // We need to wait for 2 resets which have a timeout of 60 seconds each,
    // and an attempt to retrieve boot info, which has a time out of 30 seconds.
    // We give an additional 30 seconds to as a buffer for the other actions.
    Ok(PrecheckStatus::WaitingForOngoingRotBootloaderUpdate) => {
        if before.elapsed()
            >= WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT
        {
            return Err(UpdateWaitError::Timeout(
                WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT,
            ));
        }

        tokio::time::sleep(ROT_BOOLOADER_UPDATE_PROGRESS_INTERVAL)
            .await;
        continue;
    }
@davepacheco is this implementation accurate with #7988 (comment) ? Or is there something I missed?
I think not. More on this in my comment above.
    // TODO-K: In the RoT bootloader update code in wicket, there is a set of
    // known bootloader FWIDs that don't have cabooses. Is this something we
    // should care about here?
    // https://github.com/oxidecomputer/omicron/blob/89ce370f0a96165c777e90a008257a6085897f2a/wicketd/src/update_tracker.rs#L1817-L1841

    // TODO-K: There are also older versions of the SP have a bug that prevents
    // setting the active slot for the RoT bootloader. Is this something we should
    // care about here?
    // https://github.com/oxidecomputer/omicron/blob/89ce370f0a96165c777e90a008257a6085897f2a/wicketd/src/update_tracker.rs#L1705-L1710
Would be good to get some input from @davepacheco or @lzrd here
I spoke with @lzrd IRL about this. We do want to keep these checks in place for development experience. But they're not urgent. These can be added later.
When you say "these checks", do you mean both of the above (the two comments in L64-L72)? I would have thought we could leave these out because we'd never expect to find these versions on systems where we'd be running automated update, especially if the failure mode is just that we'll not do the update.
If we want to add these because we think somehow we might see these in dev systems, can we strike these comments and file issues instead?
Done in e06418e
    // TODO-K: In post_update we'll be restarting the RoT twice to do signature
    // checks, and to set stage0 to the new version. What happens if the RoT
    // itself is being updated (during the reset stage)? Should we check for that
    // here before setting the RoT bootloader as ready to update?
Is this even possible? I know the planner will be doing the SP, RoT, bootloader, host OS updates sequentially. But could it be possible that a rogue nexus may attempt to do an RoT update while a bootloader one is happening or vice versa?
I don't know about a "rogue" Nexus, but I think we should assume it's always possible any given Nexus could be executing an older blueprint concurrently with a different Nexus executing a newer blueprint.
Thanks! Yeah, that makes sense. I agree.
I guess my question now is what happens if a Nexus is resetting an RoT as part of an RoT update, and another Nexus is resetting an RoT as part of an RoT bootloader update?
I think all of the prechecks should prevent this from happening, even if a Nexus is operating on a blueprint. Assuming all of this is working as intended:
- The planner ensures there is at most one PendingMgsUpdate in a given blueprint
- The planner only removes a PendingMgsUpdate if it's completed or become impossible
- The prechecks of any given update prevent a Nexus from starting an update if the target isn't in the same state it was when the planner decided to perform the update
I don't think it's possible for two different Nexuses to attempt to reset two different components simultaneously:
- "reset" happens at the end of the update
- ... which means all the prechecks passed
- ... which means the update couldn't have been completed yet
- ... which means the planner couldn't have created a new blueprint with a different PendingMgsUpdate (unless the update has become impossible, which should have caused any in-flight update to fail before it got to "reset")
Maybe there's some path through here where this is possible, but if there is it seems like something we have to fix?
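To make the precondition argument above concrete, here is a minimal sketch of a precheck that refuses to start when the device no longer matches what the planner saw. The `Expected` struct and the `Precheck` variants are hypothetical simplifications for illustration; they are not the actual PendingMgsUpdate type or the precheck API in this PR.

```rust
/// Hypothetical, simplified stand-in for the preconditions carried by a
/// PendingMgsUpdate (assumption: a single active version and one target).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Expected {
    active_version: String,
    target_version: String,
}

/// Hypothetical precheck outcomes, named for illustration only.
#[derive(Debug, PartialEq, Eq)]
enum Precheck {
    /// The device already runs the target version; nothing to do.
    UpdateComplete,
    /// The device is in the state the planner observed; safe to start.
    ReadyForUpdate,
    /// The device changed since planning; do not start this update.
    Mismatch { found: String },
}

/// Compare what the device reports against the planner's preconditions.
fn precheck(expected: &Expected, found_active: &str) -> Precheck {
    if found_active == expected.target_version {
        Precheck::UpdateComplete
    } else if found_active == expected.active_version {
        Precheck::ReadyForUpdate
    } else {
        Precheck::Mismatch { found: found_active.to_string() }
    }
}
```

Under this model, a Nexus working from a stale blueprint gets `Mismatch` (or `UpdateComplete`) as soon as the device has moved on, which is the property the list above relies on.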
Maybe I'm misunderstanding, but that's assuming we're talking about the same blueprint version, yes? Just a comment above you mention:
> ... I think we should assume it's always possible any given Nexus could be executing an older blueprint concurrently with a different Nexus executing a newer blueprint.
So, we could assume it's possible a Nexus is resetting an RoT as part of an RoT update of an older blueprint, and another Nexus is resetting an RoT as part of an RoT bootloader update of a newer blueprint.
Something like:
- Nexus#1 with a blueprint with a new RoT version starts an RoT update.
- Nexus#2 with a different blueprint with a new RoT bootloader version (and no changes to the RoT) starts an update.
- Both Nexus#1 and Nexus#2 enter the post-update stage at similar times, and clash resetting the RoT
Is this possible?
If so, it would probably make sense for the RoT bootloader update to have pre-checks that validate the expected state of the RoT, and for the RoT update to have pre-checks that validate the expected state of the RoT bootloader.
Is it overkill to add those additional checks even if we're almost certain that this scenario is near impossible?
Ah! gotcha. OK, that's reassuring, thanks.
I guess there's a question here of: in step 5
curious to know what happens in this case as well!
That sequence does seem possible. Further, it's really hard (at best) for the control plane to avoid this. It's the old problem: you can imagine implementing a lock, but then you have the problem of: what if Nexus#1 actually dies permanently with the lock held? That's why people use leases instead of locks, but a lease has the same problem as what we have here: something could validate its lease, then go out to lunch for a long time right before performing the action that's supposed to be protected by the lease. I guess when we revoke the lease we could use Ignition to power-cycle the sled hosting the Nexus, but that raises more questions.
Instead, we've generally opted to allow these sequences but make sure that the end result is acceptable. I think that's largely the case here, though I'm not positive. I think we have to assume that:
- if any of these devices is externally reset (or if the rack loses power) at any point in the process, the device will come up again
- whatever working state the device is in, there is a PendingMgsUpdate that can get it into the desired state
In that case, if two updates are stomping on each other, they might cause each other's updates to fail. But as long as they're also both trying to sync up with the latest PendingMgsUpdate, and the planner is updating the latest PendingMgsUpdate's preconditions to match reality, this should converge to the desired end state, right?
I'm a little nervous that a bunch of resets setting versions of two different components will leave one of the two in a state where the device is no longer capable of updating. Is this possible?
In all cases, if the software we actually try to deploy is itself broken (like, has logic bugs in the Hubris software), then all bets are off. Similarly, if the bits rot in the flash, all bets are off. Let's ignore those cases for now.
No matter what reset is generated or when, as long as there's good software in the currently-active slot, I'd expect the device should be able to boot up and report what is in each slot (if anything). The planner can then generate a PendingMgsUpdate whose preconditions reflect what was found and we should be able to perform that update.
For the SP: this is a straightforward two-slot approach where we don't switch to a new slot until we know it contains a copy of the software (which we assumed above to be working). So I don't see how the active slot could ever not have working software. I assume here that switching slots is atomic -- we're not copying data from one place to another, which could fail partway through and leave the destination corrupted.
For the RoT: we have the extra requirement that the signature matches. But if we start with working software in slot A, we will not update slot A again until we have working, signed software in slot B (and vice versa). So I don't know how we could ever not have working, correctly signed software in one of these slots. Similarly, I assume here that switching slots is atomic.
For the RoT bootloader: as I understood it, the device will not allow us to replace stage0 unless it's validated the signature on stage0next. The switch of slots here is not atomic and it's conceivable that we lose power while that copy is happening and brick the device. But again as I understand it that's not possible as a result of a reset from us because the code that processes that reset request is busy doing the copy and won't handle it until the copy is complete.
So I don't see how we could reset the device into a state where we couldn't update it. Maybe @lzrd or @labbott could say more?
What you wrote sounds correct. RoT hubris will not respond to further messages from the SP until it completes the image swap for both Hubris and bootloader so those should be atomic operations (assuming no external power loss).
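The two-slot reasoning for the SP and RoT can be sketched as a toy model. This is only an illustration under the assumptions stated in the thread (switching slots is atomic, and the switch is refused unless the inactive slot holds a complete image); the `Device` type and its methods are hypothetical and do not correspond to any MGS or Hubris API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Slot {
    A,
    B,
}

/// Toy device: each slot holds Some(version) when it contains a complete,
/// valid image, and None when it is empty or partially written.
#[derive(Debug)]
struct Device {
    active: Slot,
    slot_a: Option<String>,
    slot_b: Option<String>,
}

impl Device {
    fn inactive(&self) -> Slot {
        match self.active {
            Slot::A => Slot::B,
            Slot::B => Slot::A,
        }
    }

    fn image(&self, slot: Slot) -> Option<&str> {
        match slot {
            Slot::A => self.slot_a.as_deref(),
            Slot::B => self.slot_b.as_deref(),
        }
    }

    /// Writes only ever touch the inactive slot. Passing None models an
    /// interrupted write that left the slot invalid.
    fn write_inactive(&mut self, image: Option<&str>) {
        let image = image.map(str::to_string);
        match self.inactive() {
            Slot::A => self.slot_a = image,
            Slot::B => self.slot_b = image,
        }
    }

    /// Assumed-atomic switch: refused unless the inactive slot is valid.
    fn switch(&mut self) -> bool {
        let target = self.inactive();
        if self.image(target).is_none() {
            return false;
        }
        self.active = target;
        true
    }

    /// The device can boot as long as the active slot holds a valid image.
    fn can_boot(&self) -> bool {
        self.image(self.active).is_some()
    }
}
```

In this model a reset at any point leaves `can_boot()` true, because no operation ever invalidates the active slot. The RoT bootloader case differs, as noted above, since the stage0next-to-stage0 copy is not atomic.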
This PR will need to be updated if #8398 lands before this PR is merged.
    // The name for the SP component here is STAGE0
    // it's a little confusing because we're really
    // trying to reach STAGE0NEXT, and there is no
    // ROT_BOOTLOADER variant. We specify that we
    // want STAGE0NEXT by setting the firmware slot
    // to 1, which is where it will always be.
Yeah, I've also been confused here, but I'm not sure why. Isn't this just like the SP, where you have one component ("SP"/"Stage0") and an active slot (0) and an inactive slot (1)? Is "stage0_next" just the name for "the inactive slot for stage0"? (I'm not positive about this -- I'm really asking!)
Is the confusion just that stage0next sometimes seems to be its own component, whereas the inactive slot for the SP isn't?
I think what confused me is that in the case of the SP you have the overall component SpComponent::SP_ITSELF which has two slots 0 and 1. And for the bootloader you have the component SpComponent::STAGE0 which has two slots stage0 (0) and stage0_next (1). So the name of the component is basically just the name of the active slot. This makes it really weird when you want to fetch information about stage0_next! I have to do
    mgs_client.sp_component_caboose_get(
        update.sp_type,
        update.slot_id,
        &SpComponent::STAGE0.to_string(), // This is the name of the active slot!
        1,
    )
instead of something like
    mgs_client.sp_component_caboose_get(
        update.sp_type,
        update.slot_id,
        &SpComponent::ROT_BOOTLOADER.to_string(),
        1,
    )
Yeah, I think that's right. "stage0" is the name of the RoT bootloader, and we also use it synonymously with "the active slot for the RoT bootloader". I think I see now what you meant in the comment but I didn't understand it when I read it. Maybe something like: "The naming here is a bit confusing because "stage0" sometimes refers to the component (RoT bootloader) and sometimes refers to the active slot for that component. Here, we're accessing the inactive slot for it. The component is still "stage0"."
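One way to keep the naming confusion out of call sites is a small helper that maps a bootloader slot to the (component, firmware slot) pair the request needs. This is only a sketch: `RotBootloaderSlot`, `ROT_BOOTLOADER_COMPONENT`, and `caboose_address` are hypothetical names, and the literal "stage0" is an assumption about what `SpComponent::STAGE0.to_string()` produces.

```rust
/// Hypothetical enum naming the two RoT bootloader slots explicitly.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RotBootloaderSlot {
    /// Active slot, firmware slot 0.
    Stage0,
    /// Inactive slot, firmware slot 1.
    Stage0Next,
}

/// Assumption: the component is always addressed as "stage0", regardless of
/// which of its two slots is being queried.
const ROT_BOOTLOADER_COMPONENT: &str = "stage0";

/// Returns the (component name, firmware slot) pair to pass to a caboose
/// request for the given bootloader slot.
fn caboose_address(slot: RotBootloaderSlot) -> (&'static str, u16) {
    let firmware_slot = match slot {
        RotBootloaderSlot::Stage0 => 0,
        RotBootloaderSlot::Stage0Next => 1,
    };
    (ROT_BOOTLOADER_COMPONENT, firmware_slot)
}
```

With a helper like this, the call site reads as "the inactive slot of the RoT bootloader" rather than "component STAGE0, slot 1".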
    // TODO-K: Again, we're resetting the ROT twice here, what happens
    // if an RoT update is happening at the same time?
As I mentioned above:
- I think if we reset the device while the RoT update is going on (which should be really unlikely), that update may fail, but the device should come back up one way or another in some state that can be updated again, right?
- If the RoT update changes the contents of the RoT slots (A or B) or changes which slot is active, that doesn't affect this update.
- If the RoT update resets the device while we're doing this, one of these will happen:
  - it's fine (e.g., if a reset happened while we were stuck at L224, it would just be an extra reset and wouldn't affect us)
  - it causes this update to fail (e.g., because we're unable to do the reset)
  - we hit the window mentioned in #7819 ("Do not update more than one RoT stage0 at a time in a rack to minimize risk"). This seems possible but very unlikely. It would brick the device. That's bad: let's say the sled would be out of commission. But it's not worse than that (rack service is unaffected, we've just eroded some of our fault tolerance margin). I don't think this is meaningfully more likely than losing power in the same window, and I don't think we can do anything to meaningfully reduce that likelihood any further.
    I think we can't hit the window I was worried about because I believe @lzrd mentioned the device cannot process an externally-requested reset during this window.
> I think we can't hit the window I was worried about because I believe @lzrd mentioned the device cannot process an externally-requested reset during this window.
I'm a little worried about this case, I'll prod him again when he's back from leave and double check this. I'll document what he tells me :)
    // This is the first time a Nexus instance is attempting to
    // update the RoT bootloader, we don't need to wait for an
    // ongoing update.
    Ok(PrecheckStatus::WaitingForOngoingRotBootloaderUpdate) => (),
I'm not sure this is wrong, but it isn't what I had in mind in #7988 (comment). What I was proposing there was that:
- precheck() would return:
  - ReadyForUpdate if it looks like no update is in progress (probably: stage0next is valid and matches stage0)
  - WaitForOngoingUpdate (nit: I wouldn't have this be specific to "RoT bootloader") if it looks like an update might be going on (probably: stage0next is invalid or it's valid but doesn't match stage0)
- If we got WaitForOngoingUpdate here, we'd wait for up to PROGRESS_TIMEOUT for it to instead return ReadyForUpdate. If the timeout elapsed, we'd proceed as though we got ReadyForUpdate (but consider it like the "takeover" case -- log it as a takeover and report how accordingly).
The problem with what's here is that we don't know that there's no update ongoing and we might wind up trying to write to stage0next when some other update is trying to validate it and/or persist it. I think that would actually be fine if it happened once, but I don't see anything to prevent it from continuing to happen -- each Nexus constantly interrupting update attempts by other Nexus instances.
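The proposal above can be sketched roughly as follows. This is a synchronous illustration, not the PR's implementation: the real code is async and uses tokio, and the `How` enum, `wait_until_ready`, and its parameters are hypothetical names introduced here. `PrecheckStatus` is reduced to the two variants named in the comment.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PrecheckStatus {
    ReadyForUpdate,
    WaitForOngoingUpdate,
}

/// Hypothetical record of how this attempt came to start the update.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum How {
    /// precheck reported ReadyForUpdate before the timeout elapsed.
    Normal,
    /// The timeout elapsed while another update still appeared to be in
    /// progress; proceed anyway and log it as a takeover.
    Takeover,
}

/// Poll `precheck` until it reports ReadyForUpdate or `timeout` elapses.
/// Either way the caller proceeds; the return value says which case it was.
fn wait_until_ready(
    mut precheck: impl FnMut() -> PrecheckStatus,
    timeout: Duration,
    poll_interval: Duration,
) -> How {
    let start = Instant::now();
    loop {
        match precheck() {
            PrecheckStatus::ReadyForUpdate => return How::Normal,
            PrecheckStatus::WaitForOngoingUpdate => {
                if start.elapsed() >= timeout {
                    return How::Takeover;
                }
                std::thread::sleep(poll_interval);
            }
        }
    }
}
```

The point of waiting before taking over is that each Nexus gives an in-flight update a bounded chance to finish, rather than interrupting it immediately.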
Done!
    // We give an additional 30 seconds to as a buffer for the other actions.
    Ok(PrecheckStatus::WaitingForOngoingRotBootloaderUpdate) => {
        if before.elapsed()
            >= WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT
Could we use the caller-provided timeout here and bump that one if necessary to match the value you're using here? That seems a lot simpler to me than having multiple timeouts, some caller-provided and some hardcoded, plus special knowledge of which timeouts to use for which devices.
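As a rough illustration of that suggestion, the 3-minute figure from the code comment (two 60-second resets, a 30-second boot-info fetch, and a 30-second buffer) could be expressed as a minimum that the single caller-provided timeout is bumped to. The constant and function names below are hypothetical; only the durations come from the comment under review.

```rust
use std::time::Duration;

// Durations taken from the code comment under review; names are hypothetical.
const RESET_TIMEOUT: Duration = Duration::from_secs(60);
const BOOT_INFO_TIMEOUT: Duration = Duration::from_secs(30);
const BUFFER: Duration = Duration::from_secs(30);

/// Minimum time needed to wait out an ongoing RoT bootloader update:
/// two resets, one boot-info fetch, and a buffer for other actions.
fn min_wait_for_ongoing_update() -> Duration {
    2 * RESET_TIMEOUT + BOOT_INFO_TIMEOUT + BUFFER
}

/// Use the caller-provided timeout, raised to the minimum if it is shorter.
fn effective_timeout(caller_timeout: Duration) -> Duration {
    caller_timeout.max(min_wait_for_ongoing_update())
}
```

This keeps one timeout in play at the call site instead of a mix of caller-provided and hardcoded values.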
    #[error("invalid RoT bootloader image: {error:?}")]
    RotBootloaderImageError { error: RotImageError },
I'm curious for @jgallagher's take on this but it would seem nice to me if the generic parts of this package (this file, the driver, and apply_update) didn't know so much about specific devices. This would preclude this type from including more specific typed errors like RotImageError, but I believe the only thing consumers of this error type care about is that the error is fatal to the update attempt.
So I'd consider renaming RotCommunicationFailed to TransientError and RotBootloaderImageError to FatalError. Both would just contain message: String.
Yeah, that makes sense. Specifically, RotBootloaderImageError doesn't really mean anything without context. I'll make these more generic
Done in e06418e
    error!(log, "post_update failed"; &error);
    return Err(ApplyUpdateError::SpResetFailed(error.to_string()));
    match error {
        PostUpdateError::GatewayClientError(error) => {
suggestion: add an is_transient() (or is_fatal()) to PostUpdateError. Then replace this whole match block with:
    if !error.is_transient() {
        let error = InlineErrorChain::new(&error);
        error!(log, "post_update failed"; &error);
        return Err(ApplyUpdateError::SpResetFailed(
            error.to_string(),
        ));
    }
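For illustration, here is a minimal sketch of an error type with the generic transient/fatal split and an is_transient() helper. It combines two suggestions from this review under assumptions: the variant names TransientError and FatalError were proposed for a different error type, the real PostUpdateError has other variants (such as GatewayClientError), and `check_post_update` is a hypothetical stand-in for the calling code. It uses only the standard library rather than thiserror.

```rust
use std::fmt;

/// Hypothetical, simplified error type; each variant carries only a message.
#[derive(Debug)]
enum PostUpdateError {
    TransientError { message: String },
    FatalError { message: String },
}

impl PostUpdateError {
    /// True if the caller may keep retrying the update attempt.
    fn is_transient(&self) -> bool {
        matches!(self, PostUpdateError::TransientError { .. })
    }
}

impl fmt::Display for PostUpdateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PostUpdateError::TransientError { message } => {
                write!(f, "transient error: {}", message)
            }
            PostUpdateError::FatalError { message } => {
                write!(f, "fatal error: {}", message)
            }
        }
    }
}

impl std::error::Error for PostUpdateError {}

/// Hypothetical caller: fatal errors abort the attempt with a message,
/// transient errors let the surrounding retry loop continue.
fn check_post_update(error: &PostUpdateError) -> Result<(), String> {
    if !error.is_transient() {
        return Err(error.to_string());
    }
    Ok(())
}
```

The generic parts of the package then only need to ask whether an error is transient, without knowing which device produced it.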
Thanks for taking a look @davepacheco! I've made some changes and I'll finish the rest tomorrow hopefully.
Thanks for taking the time to review @davepacheco! I think I've addressed all of your comments.
Manual testing:
Before update:
coatlicue@centzon:~/src/omicron$ target/debug/omdb --dns-server [::1]:64561 db inventory collections show latest sp
<...>
Switch SimSidecar1
part number: FAKE_SIM_SIDECAR
power: A2
revision: 0
MGS slot: Switch 1
found at: 2025-06-27 00:22:32.100950 UTC from http://[::1]:33476
cabooses:
SLOT BOARD NAME VERSION GIT_COMMIT SIGN
SpSlot0 SimSidecarSp SimSidecar 0.0.2 ffffffff n/a
SpSlot1 SimSidecarSp SimSidecar 0.0.1 fefefefe n/a
RotSlotA SimRot SimSidecarRot 0.0.4 eeeeeeee 11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf
RotSlotB SimRot SimSidecarRot 0.0.3 edededed 11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf
Stage0 SimRotStage0 SimSidecarRot 0.0.200 ddddddddd 11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf
Stage0Next SimRotStage0 SimSidecarRot 0.0.200 dadadadad 11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf
RoT pages:
SLOT DATA_BASE64
Cmpa c2lkZWNhci1jbXBhAAAAAAAAAAAAAAAA...
CfpaActive c2lkZWNhci1jZnBhLWFjdGl2ZQAAAAAA...
CfpaInactive c2lkZWNhci1jZnBhLWluYWN0aXZlAAAA...
CfpaScratch c2lkZWNhci1jZnBhLXNjcmF0Y2gAAAAA...
RoT: active slot: slot A
RoT: persistent boot preference: slot A
RoT: pending persistent boot preference: -
RoT: transient boot preference: -
RoT: slot A SHA3-256: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
RoT: slot B SHA3-256: bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
Letting reconfigurator-sp-updater run an update attempt twice: one where it completes the update, and another where it finds no changes needed.
〉set SimSidecar1 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236 1.0.0 rot-bootloader -a 0.0.200 -i 0.0.200
updated configuration for SimSidecar1
Jun 27 01:37:07.971 INFO begin update attempt for baseboard, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.028 DEBG client request, body: None, uri: http://[::]:49985/artifact/sha256/005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, method: GET, repo_depot_url: http://[::]:49985
Jun 27 01:37:08.029 DEBG client response, result: Ok(Response { url: "http://[::]:49985/artifact/sha256/005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236", status: 200, headers: {"content-type": "application/octet-stream", "x-request-id": "a505f129-b737-4b2e-b460-9f4467de6ffd", "content-length": "750", "date": "Fri, 27 Jun 2025 01:37:08 GMT"} }), repo_depot_url: http://[::]:49985
Jun 27 01:37:08.030 DEBG loaded artifact contents, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.030 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.031 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1", status: 200, headers: {"content-type": "application/json", "x-request-id": "b83006af-812a-4c39-9981-87e9aea0ab7f", "content-length": "734", "date": "Fri, 27 Jun 2025 01:37:07 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.032 DEBG found SP state, state: SpState { base_mac_address: [0, 0, 0, 0, 0, 0], hubris_archive_id: "0000000000000000", model: "FAKE_SIM_SIDECAR", power_state: A2, revision: 0, rot: V3 { active: A, pending_persistent_boot_preference: None, persistent_boot_preference: A, slot_a_error: None, slot_a_fwid: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", slot_b_error: None, slot_b_fwid: "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb", stage0_error: None, stage0_fwid: "cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc", stage0next_error: None, stage0next_fwid: "dddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd", transient_boot_preference: None }, serial_number: "SimSidecar1" }, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.032 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1/component/stage0/caboose?firmware_slot=0, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.033 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/stage0/caboose?firmware_slot=0", status: 200, headers: {"content-type": "application/json", "x-request-id": "c1a24c95-84f5-4796-97ed-5f6efc434050", "content-length": "179", "date": "Fri, 27 Jun 2025 01:37:07 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.033 DEBG found active slot caboose, caboose: SpComponentCaboose { board: "SimRotStage0", epoch: None, git_commit: "ddddddddd", name: "SimSidecarRot", sign: Some("11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf"), version: "0.0.200" }, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.034 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1/component/stage0/caboose?firmware_slot=1, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.034 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/stage0/caboose?firmware_slot=1", status: 200, headers: {"content-type": "application/json", "x-request-id": "7e3f2725-6dd9-46b4-a1a6-e829183f2ec6", "content-length": "179", "date": "Fri, 27 Jun 2025 01:37:08 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.035 DEBG ready to start update, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.035 DEBG client request, body: Some(Body), uri: http://[::1]:60958/sp/switch/1/component/stage0/update?firmware_slot=1&id=f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, method: POST, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.036 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/stage0/update?firmware_slot=1&id=f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd", status: 204, headers: {"x-request-id": "6fc35ed5-766e-4cfb-a6fa-3fdb0e339980", "date": "Fri, 27 Jun 2025 01:37:08 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.036 INFO update started, mgs_addr: http://[::1]:60958, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.036 DEBG started update, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.036 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1/component/stage0/update-status, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.037 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/stage0/update-status", status: 200, headers: {"content-type": "application/json", "x-request-id": "7d43f685-1d75-456c-8240-183c8bab00b5", "content-length": "107", "date": "Fri, 27 Jun 2025 01:37:07 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:08.037 DEBG got update status, status: InProgress { bytes_received: 978, id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, total_bytes: 1024 }, mgs_addr: http://[::1]:60958, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.038 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1/component/stage0/update-status, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.039 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/stage0/update-status", status: 200, headers: {"content-type": "application/json", "x-request-id": "37b595ae-7538-4739-84a1-748b90c12f31", "content-length": "64", "date": "Fri, 27 Jun 2025 01:37:11 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.040 DEBG got update status, status: Complete { id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd }, mgs_addr: http://[::1]:60958, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.040 DEBG delivered artifact, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.041 DEBG attempting to reset device to do bootloader signature check, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.041 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1/component/rot/reset, method: POST, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.042 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/rot/reset", status: 204, headers: {"x-request-id": "0fdf52b5-430f-47c8-9b13-0689b6fd9840", "date": "Fri, 27 Jun 2025 01:37:11 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.042 DEBG attempting to retrieve boot info to verify image validity, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.043 DEBG client request, body: Some(Body), uri: http://[::1]:60958/sp/switch/1/component/rot/rot-boot-info, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.043 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/rot/rot-boot-info", status: 200, headers: {"content-type": "application/json", "x-request-id": "e3cbdcea-302b-4482-819b-c247ab1aa8bb", "content-length": "565", "date": "Fri, 27 Jun 2025 01:37:10 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.043 DEBG attempting to set RoT bootloader active slot, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.044 DEBG client request, body: Some(Body), uri: http://[::1]:60958/sp/switch/1/component/stage0/active-slot?persist=true, method: POST, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.044 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/stage0/active-slot?persist=true", status: 204, headers: {"x-request-id": "546d31b5-f5f8-46c7-a7e0-30d96d66b3d0", "date": "Fri, 27 Jun 2025 01:37:11 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.044 DEBG attempting to reset device to set to new RoT bootloader version, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.044 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1/component/rot/reset, method: POST, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.045 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/rot/reset", status: 204, headers: {"x-request-id": "50ed22e7-01e0-4681-aec8-79556e64641e", "date": "Fri, 27 Jun 2025 01:37:11 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.045 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.045 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1", status: 200, headers: {"content-type": "application/json", "x-request-id": "734236b8-71ff-48aa-88e1-f9c82fc21e8f", "content-length": "734", "date": "Fri, 27 Jun 2025 01:37:11 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.046 DEBG found SP state, state: SpState { base_mac_address: [0, 0, 0, 0, 0, 0], hubris_archive_id: "0000000000000000", model: "FAKE_SIM_SIDECAR", power_state: A2, revision: 0, rot: V3 { active: A, pending_persistent_boot_preference: None, persistent_boot_preference: A, slot_a_error: None, slot_a_fwid: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", slot_b_error: None, slot_b_fwid: "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb", stage0_error: None, stage0_fwid: "01368372b4c730e54ef9efe240bea5e9d277a3708ddd7eac7115727fde52dda4", stage0next_error: None, stage0next_fwid: "01368372b4c730e54ef9efe240bea5e9d277a3708ddd7eac7115727fde52dda4", transient_boot_preference: None }, serial_number: "SimSidecar1" }, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.046 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1/component/stage0/caboose?firmware_slot=0, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.046 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/stage0/caboose?firmware_slot=0", status: 200, headers: {"content-type": "application/json", "x-request-id": "6b26ea4f-c653-44d0-ac73-d353618f32d6", "content-length": "132", "date": "Fri, 27 Jun 2025 01:37:11 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.047 DEBG found active slot caboose, caboose: SpComponentCaboose { board: "SimRotStage0", epoch: None, git_commit: "this-is-fake-data", name: "SimRotStage0", sign: Some("SimRotStage0"), version: "1.0.0" }, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.047 DEBG precheck result, precheck: Ok(UpdateComplete), update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.047 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.048 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1", status: 200, headers: {"content-type": "application/json", "x-request-id": "d7605e95-518f-45a4-a295-e136a6eab863", "content-length": "734", "date": "Fri, 27 Jun 2025 01:37:11 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.048 DEBG found SP state, state: SpState { base_mac_address: [0, 0, 0, 0, 0, 0], hubris_archive_id: "0000000000000000", model: "FAKE_SIM_SIDECAR", power_state: A2, revision: 0, rot: V3 { active: A, pending_persistent_boot_preference: None, persistent_boot_preference: A, slot_a_error: None, slot_a_fwid: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", slot_b_error: None, slot_b_fwid: "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb", stage0_error: None, stage0_fwid: "01368372b4c730e54ef9efe240bea5e9d277a3708ddd7eac7115727fde52dda4", stage0next_error: None, stage0next_fwid: "01368372b4c730e54ef9efe240bea5e9d277a3708ddd7eac7115727fde52dda4", transient_boot_preference: None }, serial_number: "SimSidecar1" }, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.048 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1/component/stage0/caboose?firmware_slot=0, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.048 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/stage0/caboose?firmware_slot=0", status: 200, headers: {"content-type": "application/json", "x-request-id": "9f7413ce-d377-4cda-b279-702f6bb01076", "content-length": "132", "date": "Fri, 27 Jun 2025 01:37:11 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.049 DEBG found active slot caboose, caboose: SpComponentCaboose { board: "SimRotStage0", epoch: None, git_commit: "this-is-fake-data", name: "SimRotStage0", sign: Some("SimRotStage0"), version: "1.0.0" }, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:11.049 INFO update attempt done, result: CompletedUpdate, elapsed_millis: 3076, update_id: f5fcc5d6-29cf-4f02-8d8d-acd7db44fefd, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:31.048 INFO dispatching new attempt (retry timer expired), part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1
Jun 27 01:37:31.048 INFO begin update attempt for baseboard, update_id: d19527d1-0c0b-4707-8333-1df802ba6440, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:31.098 DEBG client request, body: None, uri: http://[::]:49985/artifact/sha256/005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, method: GET, repo_depot_url: http://[::]:49985
Jun 27 01:37:31.099 DEBG client response, result: Ok(Response { url: "http://[::]:49985/artifact/sha256/005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236", status: 200, headers: {"content-type": "application/octet-stream", "x-request-id": "f5da3756-a4ee-48f2-befa-fd26cf4808e3", "content-length": "750", "date": "Fri, 27 Jun 2025 01:37:31 GMT"} }), repo_depot_url: http://[::]:49985
Jun 27 01:37:31.101 DEBG loaded artifact contents, update_id: d19527d1-0c0b-4707-8333-1df802ba6440, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:31.101 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: d19527d1-0c0b-4707-8333-1df802ba6440, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:31.103 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1", status: 200, headers: {"content-type": "application/json", "x-request-id": "4d6237ea-1f04-4348-a40f-f44f4520ae9e", "content-length": "734", "date": "Fri, 27 Jun 2025 01:37:31 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: d19527d1-0c0b-4707-8333-1df802ba6440, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:31.103 DEBG found SP state, state: SpState { base_mac_address: [0, 0, 0, 0, 0, 0], hubris_archive_id: "0000000000000000", model: "FAKE_SIM_SIDECAR", power_state: A2, revision: 0, rot: V3 { active: A, pending_persistent_boot_preference: None, persistent_boot_preference: A, slot_a_error: None, slot_a_fwid: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", slot_b_error: None, slot_b_fwid: "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb", stage0_error: None, stage0_fwid: "01368372b4c730e54ef9efe240bea5e9d277a3708ddd7eac7115727fde52dda4", stage0next_error: None, stage0next_fwid: "01368372b4c730e54ef9efe240bea5e9d277a3708ddd7eac7115727fde52dda4", transient_boot_preference: None }, serial_number: "SimSidecar1" }, update_id: d19527d1-0c0b-4707-8333-1df802ba6440, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:31.104 DEBG client request, body: None, uri: http://[::1]:60958/sp/switch/1/component/stage0/caboose?firmware_slot=0, method: GET, mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: d19527d1-0c0b-4707-8333-1df802ba6440, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:31.104 DEBG client response, result: Ok(Response { url: "http://[::1]:60958/sp/switch/1/component/stage0/caboose?firmware_slot=0", status: 200, headers: {"content-type": "application/json", "x-request-id": "899c1dde-366b-4ce6-b4de-530e9128aeb2", "content-length": "132", "date": "Fri, 27 Jun 2025 01:37:31 GMT"} }), mgs_backend_addr: [::1]:60958, mgs_backend_name: dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal., update_id: d19527d1-0c0b-4707-8333-1df802ba6440, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:31.105 DEBG found active slot caboose, caboose: SpComponentCaboose { board: "SimRotStage0", epoch: None, git_commit: "this-is-fake-data", name: "SimRotStage0", sign: Some("SimRotStage0"), version: "1.0.0" }, update_id: d19527d1-0c0b-4707-8333-1df802ba6440, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
Jun 27 01:37:31.106 INFO update attempt done, result: FoundNoChangesNeeded, elapsed_millis: 57, update_id: d19527d1-0c0b-4707-8333-1df802ba6440, part_number: FAKE_SIM_SIDECAR, serial_number: SimSidecar1, sp_type: Switch, sp_slot: 1, component: rot_bootloader, expected_stage0_version: 0.0.200, expected_stage0_next_version: Version(ArtifactVersion("0.0.200")), artifact_hash: 005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236, artifact_version: 1.0.0
After the update:
$ ./target/debug/faux-mgs --sp-sim-addr [::1]:56988 read-component-caboose --component stage0 -s 0 VERS
Jun 27 01:38:23.194 INFO creating SP handle on to talk to SP simulator at [::1]:56988, component: faux-mgs
Jun 27 01:38:23.195 INFO initial discovery complete, addr: [::1]:56988, component: faux-mgs
1.0.0
If there isn't anything further to change and you approve this PR, would you mind hitting the merge button as well? (I try my best not to log into work stuff while on vacation 😅)
| // This is the first time a Nexus instance is attempting to | ||
| // update the RoT bootloader, we don't need to wait for an | ||
| // ongoing update. | ||
| Ok(PrecheckStatus::WaitingForOngoingRotBootloaderUpdate) => (), |
Done!
| sp_slot: u32, | ||
| timeout: Duration, | ||
| ) -> Result<Option<RotImageError>, PostUpdateError> { | ||
| let mut ticker = tokio::time::interval(Duration::from_secs(1)); |
Done!
| } | ||
| } | ||
| }, | ||
| Err(error) => { |
It could be a communication error because the SP itself is rebooting. Doesn't hurt to wait a bit? 🤷‍♀️
| return Err(PostUpdateError::TransientError { | ||
| message, | ||
| }); |
I feel like maybe we should wait longer here (WAIT_FOR_BOOT_INFO_TIMEOUT longer, like maybe 2m) because it feels like we really don't want to hit this if the device was going to come back. If we hit this, we're going to wind up returning TransientError, which will cause the caller to call post_update() again, which will reset the device again before polling again. Or maybe we should return a PermanentError here?
It looks like for the SP, post_update() only does the reset and then apply_update() will retry precheck() in a loop, which is more like what we want. The simplest way to mimic that here would be to have post_update() for the RoT bootloader retry a lot longer and return a PermanentError when it gives up.
Done!
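For reference, a minimal sketch of the "retry a lot longer, then give up permanently" shape discussed above. This is not the code in this PR: `wait_for_boot_info`, the closure-based `fetch`, and the single-variant `PostUpdateError` are hypothetical stand-ins, and it uses blocking `std::thread::sleep` instead of tokio to stay self-contained.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the PR's `PostUpdateError`; only the variant
/// needed for this sketch is modeled.
#[derive(Debug, PartialEq)]
enum PostUpdateError {
    PermanentError { message: String },
}

/// Polls `fetch` until it succeeds, sleeping `interval` between attempts.
///
/// Errors are assumed to mean "the device may still be rebooting", so we
/// keep polling. Once `timeout` has elapsed we return a permanent error
/// rather than a transient one, because a transient error would make the
/// caller run the post-update step again and reset the device again.
fn wait_for_boot_info<T>(
    mut fetch: impl FnMut() -> Result<T, String>,
    timeout: Duration,
    interval: Duration,
) -> Result<T, PostUpdateError> {
    let start = Instant::now();
    loop {
        match fetch() {
            Ok(info) => return Ok(info),
            Err(message) => {
                if start.elapsed() >= timeout {
                    // Give up for good and report the last error we saw.
                    return Err(PostUpdateError::PermanentError { message });
                }
                std::thread::sleep(interval);
            }
        }
    }
}
```

In the real code the closure would be the MGS boot-info request and the timeout would be something generous like the 2 minutes suggested above.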
| }); | ||
| } | ||
|
|
||
| // This operation is very delicate. Here, we're overwriting the device |
I love this comment 😆 but it needs to be rewrapped (it runs to over 80 columns).
| match (&expected_stage0_next_version, &found_stage0_next_version) { | ||
| // expected garbage, found garbage | ||
| ( | ||
| ExpectedVersion::NoValidVersion, | ||
| FoundVersion::MissingVersion, | ||
| ) => (), | ||
| // expected a specific version and found it | ||
| ( | ||
| ExpectedVersion::Version(artifact_version), | ||
| FoundVersion::Version(found_stage0_next_version), | ||
| ) if artifact_version.to_string() | ||
| == *found_stage0_next_version => | ||
| { | ||
| () | ||
| } | ||
| // anything else is a mismatch | ||
| (ExpectedVersion::NoValidVersion, FoundVersion::Version(_)) | ||
| | (ExpectedVersion::Version(_), FoundVersion::MissingVersion) | ||
| | (ExpectedVersion::Version(_), FoundVersion::Version(_)) => { | ||
| return Err(PrecheckError::WrongInactiveVersion { | ||
| expected: expected_stage0_next_version.clone(), | ||
| found: found_stage0_next_version, | ||
| }); | ||
| } | ||
| }; |
This isn't critical, but it feels like this (and maybe even going back to L139) could be commonized between SP, RoT, and RoT bootloader. Maybe a method like ExpectedVersion::matches(&self, found: FoundVersion)?
Done!
| /// With the RoT bootloader need to wait for 2 resets which have a timeout | ||
| /// of 60 seconds each, and an attempt to retrieve boot info, which has a | ||
| /// time out of 30 seconds. We then give ourselves a few more minutes to act | ||
| /// as a buffer for other pending actions. |
| /// With the RoT bootloader need to wait for 2 resets which have a timeout | |
| /// of 60 seconds each, and an attempt to retrieve boot info, which has a | |
| /// time out of 30 seconds. We then give ourselves a few more minutes to act | |
| /// as a buffer for other pending actions. | |
| /// With the RoT bootloader, we need to wait for 2 resets, which have a timeout | |
| /// of 60 seconds each; plus an attempt to retrieve boot info, which has a | |
| /// timeout of 30 seconds. We then give ourselves a few more minutes to act | |
| /// as a buffer for other pending actions. |
Alternatively, I'm wondering if we should be a lot more specific. Something like:
// Generally, this value covers two different things:
//
// 1. While we're uploading an image to the SP or it's being prepared, how long can the status stay the same before we give up altogether and try again? In practice, this would rarely pause for more than a few seconds.
// 2. The period where we might wait for an update to complete -- either our own update (in which case this is the period after the final device reset until the device comes up reporting the new version) or another instance's update (in which case this could cover almost the _entire_ update process).
//
// In both cases, if the timeout is reached, the whole update attempt will fail. This behavior is only intended to deal with pathological cases, like an MGS crash (which could cause an upload to hang indefinitely) or a Nexus crash (which could cause any update to hang indefinitely at any point). So we can afford to be generous here. Further, we really don't want to trip this erroneously in a working system because we're likely to get stuck continuing to retry and give up before each attempt finishes.
//
// In terms of sizing this timeout:
// - For all updates, the upload phase generally takes 10-20 seconds.
// - For SP updates, the post-reset phase can take about 30s (with Sidecar SPs being the longest).
// - For RoT and RoT bootloader updates, two resets and an intervening "set active slot" operation are required. Together, these could take just a few seconds.
//
// Adding all the above together, and giving ourselves plenty of margin, we choose 10 minutes.
Done!
| // Check the live state first to see if: | ||
| // - this update has already been completed, or | ||
| // - this update has already been completed, | ||
| // - we are waiting for an ongoing update, or |
| // - we are waiting for an ongoing update, or | |
| // - we should wait a bit because an update may be in-progress, or |
Done!
| Ok(PrecheckStatus::ReadyForUpdate) => break, | ||
| Ok(PrecheckStatus::WaitingForOngoingUpdate) => { | ||
| if before.elapsed() >= progress_timeout { | ||
| warn!( |
It'd be nice to have it so that the returned how on success in this case reflects a takeover.
Sorry if this question is nonsensical, but: if WaitingForOngoingUpdate wasn't a variant of PrecheckStatus at all, and was instead included as a variant of PrecheckError, could we remove all of this? If we're not ready to start the update yet because there's another update running, that seems consistent with other kinds of PrecheckErrors that fire while another update is running (e.g., when we see WrongInactiveVersion because someone else has already started writing the inactive slot).
@karencfv: @jgallagher and I discussed this offline. I think I've led us down a bit of an unnecessary path here. Going back to my comment here:
#7988 (comment)
All of this:
- the new PrecheckStatus::WaitingForOngoingUpdate variant
- the looping here when we see that variant
- the logic in the RoT bootloader impl that returns it
was just for dealing with case (3) in that comment, which is:
if another Nexus enters apply_update() at this point, how does it know not to try to upload another image?
I now believe that this is sufficiently unlikely even without any of the above code to handle it.
First, a reminder that it's okay if this does happen sometimes. The device will not allow us to brick it. If we kick off a new upload while someone else is about to activate stage0next or if someone resets it while we're uploading it or about to activate stage0next, the worst that happens is that an update attempt fails. What we need is for these things to be sufficiently unlikely that eventually (and quickly) an operation will succeed.
There are two ways that I can see this case happening:
1. Two Nexus instances doing concurrent updates, both passing precheck before the other has started an upload. Let's say Nexus 1 starts the upload first.
   a. It's very likely that Nexus 2 will try to start an upload before Nexus 1 resets the device. That will fail and Nexus 2 will bug out and wait for Nexus 1 to finish -- great.
   b. If Nexus 1 resets the device first, then Nexus 2 starts the upload, that could blow Nexus 1's update out of the water because Nexus 2 is changing stage0next. When Nexus 1 goes to activate it, that will fail because it no longer matches a signed image, since the contents have changed. But in this case, Nexus 2's update attempt should complete successfully, unless we somehow hit this again. We could hit it up to three times (as many as there are Nexus instances), but no more than that. That's because once any Nexus has started an upload, stage0next's contents will change, and any other Nexus instance will fail preconditions and abandon any attempt until the planner changes the expected preconditions.
2. Alternatively: Nexus 1 starts an update and gets as far as the first reset. No other Nexus passed preconditions so there will be no concurrent updates ... except that the planner sees the updated stage0next and generates a new blueprint with new preconditions, allowing Nexus 2 to immediately come in and start another update. This plays out like 1a and 1b above, and again, this is fine as long as it doesn't happen that often.
Case 2 here is very similar to the problem described in #8483 and the solution described there (have the planner wait a few minutes before changing preconditions of any MGS update) should make this very unlikely.
All of this is a bit unsatisfying and could benefit from some more formal modeling. But from what @jgallagher and I could think through, it seems pretty unlikely that we'd get stuck here.
The net result of all of this is that I think we can simplify this PR quite a bit by just ignoring this problem altogether:
- rip out the new PrecheckStatus variant
- rip out the code that returned it from the RoT bootloader precheck
- rip out most of the changes to this function (all the stuff around this comment)
and I'm sorry for leading us down the wrong path!
Thanks for the thorough explanation! This all makes sense to me. Always great to remove complexity so I'm all in. Done.
| // error or an RoT bootloader image error. There is intentionally no | ||
| // timeout here. If we've staged an update but not managed to reset | ||
| // the device, there's no point where we'd want to stop trying to do so. |
| // error or an RoT bootloader image error. There is intentionally no | |
| // timeout here. If we've staged an update but not managed to reset | |
| // the device, there's no point where we'd want to stop trying to do so. | |
| // error or some other transient error. There is intentionally no | |
| // timeout here. If we've staged an update but not managed to reset | |
| // the device, there's no point where we'd want to stop trying to do so. |
Done!
| if before.elapsed() >= progress_timeout { | ||
| warn!( | ||
| log, | ||
| "update takeover: timed out while waiting for ongoing update" |
Do we have a test for this case?
Not necessary anymore since we removed this logic
Thanks! The behavior here looks a lot better. I've got some suggestions for cleanup. Given how tricky this stuff is it'd be nice to get @jgallagher's eyes on it too, if you've got the time.
| } = &update.details | ||
| else { | ||
| unreachable!( | ||
| "pending MGS update details within ReconfiguratorSpUpdater \ |
| "pending MGS update details within ReconfiguratorSpUpdater \ | |
| "pending MGS update details within ReconfiguratorRotBootloaderUpdater \ |
Done!
| if v == found_stage0_version { | ||
| Ok(PrecheckStatus::ReadyForUpdate) | ||
| } else { | ||
| Ok(PrecheckStatus::WaitingForOngoingUpdate) |
I think I feel more strongly that this should be a PrecheckError (following up on my comment above). This feels basically the same as a version mismatch that we expect to see because another update is in progress.
(See the other thread on this)
| // Before setting stage0 to the new version we want to ensure | ||
| // the image is good and we're not going to brick the device. |
It doesn't seem like we should be able to brick an RoT in any way by sending normal API commands in some kind of bad order. (This is ignoring cases like "lost power at just the wrong time" as described below, since that's a pretty extenuating circumstance.)
If we didn't do this check and the image wasn't good, would it actually brick it?
Yeah if my understanding is right, I think it's more like:
"To protect against bricking itself, the device will only activate a new image after it's been verified. Images are only verified at device boot time. Thus, we'll reset the device once to cause the signature to be verified. Then we can activate the new image and reset the device again."
Yeah, I meant what @davepacheco said. Clearly my English needs improving 😄
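To make the reset / verify / activate / reset sequence described above concrete, here is a small sketch against an in-memory model. Everything in it is an assumption for illustration: `SimRot`, `Stage0NextStatus`, and the rule that an image is "signed" if it starts with `signed:` are invented, and none of this is the real MGS or RoT API.

```rust
/// Verification state of the stage0next slot in this toy model.
#[derive(Debug, Clone, PartialEq)]
enum Stage0NextStatus {
    /// Written since the last boot; nothing has checked it yet.
    Unverified,
    /// Passed the signature check at the last boot.
    Valid,
    /// Failed the signature check at the last boot.
    Invalid(String),
}

/// Hypothetical in-memory model of the two bootloader slots.
#[derive(Debug)]
struct SimRot {
    stage0: String,
    stage0next: String,
    stage0next_status: Stage0NextStatus,
}

impl SimRot {
    fn new(stage0: &str) -> Self {
        SimRot {
            stage0: stage0.to_string(),
            stage0next: String::new(),
            stage0next_status: Stage0NextStatus::Unverified,
        }
    }

    /// Uploading a new image invalidates any earlier verification result.
    fn write_stage0next(&mut self, image: &str) {
        self.stage0next = image.to_string();
        self.stage0next_status = Stage0NextStatus::Unverified;
    }

    /// Images are only verified at boot time. In this model, an image is
    /// considered correctly signed if it starts with "signed:".
    fn reset(&mut self) {
        self.stage0next_status = if self.stage0next.starts_with("signed:") {
            Stage0NextStatus::Valid
        } else {
            Stage0NextStatus::Invalid("signature check failed".to_string())
        };
    }

    /// The device protects itself: it only activates a verified image.
    fn activate_stage0next(&mut self) -> Result<(), String> {
        match &self.stage0next_status {
            Stage0NextStatus::Valid => {
                self.stage0 = self.stage0next.clone();
                Ok(())
            }
            other => Err(format!("refusing to activate: {other:?}")),
        }
    }
}

/// The post-update sequence: reset so the image gets verified, bail if it
/// is invalid, activate it, then reset again to boot the new stage0.
fn post_update(rot: &mut SimRot) -> Result<(), String> {
    rot.reset();
    if let Stage0NextStatus::Invalid(e) = &rot.stage0next_status {
        return Err(format!("stage0next image is invalid: {e}"));
    }
    rot.activate_stage0next()?;
    rot.reset();
    Ok(())
}
```

The point of the model is that the explicit validity check in `post_update` is an early, clearer failure rather than the thing preventing a brick: `activate_stage0next` would refuse a bad image on its own.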
| // If the image is not valid we bail | ||
| if let Some(e) = stage0next_error { | ||
| return Err(PostUpdateError::FatalError { | ||
| error: e.to_string(), |
Should this use InlineErrorChain? Or even better, could the type of error be whatever the real type of e is to avoid having to stringify it here?
I think we want InlineErrorChain here.
I suggested in the last round of review that PostUpdateError::FatalError just contain a string so that we didn't need PostUpdateError to contain the union of all different errors that each impl might return.
Done!
| WAIT_FOR_BOOT_INFO_TIMEOUT, | ||
| ) | ||
| .await?; | ||
| // If the image is not valid we bail |
What would it mean for the update if the image is not valid here? Or maybe: how could this happen?
I've updated the comment for more clarity
| // The minimum we will ever return is 3. | ||
| // Additionally, V2 does not report image errors, so we cannot | ||
| // know with certainty if a signature check came back with errors | ||
| RotState::V2 { .. } => unreachable!(), |
I think we should return a permanent error here instead of panicking. As far as we know we'll only ever see V3 or later, but this is entirely under the control of an external entity; we should not panic if we get unexpected messages from it.
Done!
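A minimal sketch of the "return an error instead of panicking" shape for the V2 arm. The types here are simplified, hypothetical stand-ins (the real `RotState` has many more fields), and using `FatalError` as the permanent error is an assumption based on the variant quoted elsewhere in this review.

```rust
/// Simplified stand-in for the RoT state reported via MGS.
#[derive(Debug)]
enum RotState {
    /// Older format: does not report image errors at all.
    V2 { active: char },
    /// Newer format: carries the stage0next verification error, if any.
    V3 { stage0next_error: Option<String> },
}

/// Hypothetical stand-in for the PR's `PostUpdateError`.
#[derive(Debug, PartialEq)]
enum PostUpdateError {
    FatalError { error: String },
}

/// Extracts the stage0next image error from the reported state.
///
/// The state comes from an external device, so an unexpected version is
/// reported as an error to the caller rather than hitting `unreachable!()`.
fn stage0next_error(
    state: &RotState,
) -> Result<Option<String>, PostUpdateError> {
    match state {
        RotState::V3 { stage0next_error } => Ok(stage0next_error.clone()),
        RotState::V2 { .. } => Err(PostUpdateError::FatalError {
            error: "unexpected RoT state V2: image errors are not reported"
                .to_string(),
        }),
    }
}
```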
Thanks @jgallagher and @davepacheco for taking the time to review! I think I've addressed all of your comments. Please let me know if there's anything missing!
Thanks - this is really intricate stuff. Just a few minor nit / wording suggestions.
| // Check the live state first to see if: | ||
| // - this update has already been completed, or | ||
| // - this update has already been completed, | ||
| // - we should wait a bit because an update may be in-progress, or |
Is this comment change still applicable now that we've removed the waiting loop?
Oops! Thanks for catching that!
|
|
||
| impl FoundVersion { | ||
| pub fn matches( | ||
| self, |
Nit - this should take &self. We can .clone() ourselves in the (rare) error case; better than forcing callers to always clone to call us.
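Roughly what the borrowed version could look like, as a sketch only. The enums are simplified stand-ins (the real `ExpectedVersion::Version` holds an `ArtifactVersion`, modeled here as a `String`), and the parameter list and return type of `matches` are assumptions, since the full signature is not shown in this thread.

```rust
#[derive(Debug, Clone, PartialEq)]
enum ExpectedVersion {
    NoValidVersion,
    Version(String),
}

#[derive(Debug, Clone, PartialEq)]
enum FoundVersion {
    MissingVersion,
    Version(String),
}

#[derive(Debug, PartialEq)]
enum PrecheckError {
    WrongInactiveVersion { expected: ExpectedVersion, found: FoundVersion },
}

impl FoundVersion {
    /// Checks the found version against the expected one by reference.
    /// Cloning only happens on the mismatch path, to build the error.
    pub fn matches(
        &self,
        expected: &ExpectedVersion,
    ) -> Result<(), PrecheckError> {
        match (expected, self) {
            // expected garbage, found garbage
            (ExpectedVersion::NoValidVersion, FoundVersion::MissingVersion) => {
                Ok(())
            }
            // expected a specific version and found it
            (ExpectedVersion::Version(e), FoundVersion::Version(f))
                if e == f =>
            {
                Ok(())
            }
            // anything else is a mismatch
            (ExpectedVersion::NoValidVersion, FoundVersion::Version(_))
            | (ExpectedVersion::Version(_), FoundVersion::MissingVersion)
            | (ExpectedVersion::Version(_), FoundVersion::Version(_)) => {
                Err(PrecheckError::WrongInactiveVersion {
                    expected: expected.clone(),
                    found: self.clone(),
                })
            }
        }
    }
}
```

Callers keep ownership of their `FoundVersion` after the check, which is the point of the nit: no `.clone()` at every call site just to satisfy the signature.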
| // If boot info contains any error with the image loaded onto | ||
| // stage0_next, we run the risk of bricking the device if this image | ||
| // is loaded onto stage0. We return a fatal error. |
If there's an error, the device won't let us load it onto stage0, right? (In particular: nothing we can do using the normal API can brick the device?) If that's right, I'd maybe reword this to something like
| // If boot info contains any error with the image loaded onto | |
| // stage0_next, we run the risk of bricking the device if this image | |
| // is loaded onto stage0. We return a fatal error. | |
| // If boot info contains any error with the image loaded onto | |
| // stage0_next, the device won't let us load this image onto | |
| // stage0. We return a fatal error. |
This commit implements several checks that must happen before updating an RoT bootloader, and post-update actions.
Manual testing on a simulated Omicron:
Previous state
Updating via reconfigurator-sp-updater:
State after the update
Related: #7988