Skip to content

Conversation

karencfv
Copy link
Contributor

@karencfv karencfv commented Jun 23, 2025

This commit implements support for RoT updates in the planner and support for reconfigurator-cli updates as well.

TODO:

@karencfv karencfv changed the title [reconfigurator] CLI support for setting RoT versions [reconfigurator] RoT planner support Jun 25, 2025
@karencfv
Copy link
Contributor Author

@davepacheco just as an FYI this is what I have for the RoT planner support so far. It's not finished yet, but the main bulk of it is done. I just need to refine some bits and add more tests. A big chunk of the lines of code are generated test files btw.

@karencfv karencfv marked this pull request as ready for review June 27, 2025 04:34
@karencfv
Copy link
Contributor Author

I'm not sure how soon we're planning to demo this as part of the update bring-up. As you all know, I'll be out next week. So, if you'd like to use this as part of the demo soon, please feel free to modify/merge/close/superseed this PR as you all see fit, if this is something you'd like to do to speed up things 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#8478 may change the way this test works if implemented before I merge this PR

@karencfv
Copy link
Contributor Author

karencfv commented Jul 7, 2025

Heya! 👋 Just a tiny ping that this is ready to review :)

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is looking good but there are a few details to nail down.

Comment on lines +246 to +250
let found_active_version =
ArtifactVersion::new(active_caboose.caboose.version.clone())
.map_err(|e| {
MgsUpdateStatusError::FailedArtifactVersionParse(e)
})?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't very important but it looks like for the SP case we just operated on strings without assuming we could parse them. Do we need to parse the RoT versions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to populate ExpectedActiveRotSlot. I thought it'd be cleaner to use the same type as
expected_active_slot since it is a struct containing two fields rather than just a version or a slot.

Don't have a super strong opinion about this though. Happy to change if you think what I did doesn't make sense

@davepacheco davepacheco modified the milestones: 15, 16 Jul 8, 2025
Copy link
Contributor Author

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright! Finally got all the moving pieces in place 🎉 I think I've addressed all the comments and this should be ready for a new round of reviews. Thanks everyone!

),
make_artifact(
"oxide-rot-1",
ArtifactKind::GIMLET_ROT_IMAGE_A,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so too! Had to adapt like this unfortunately 🤷‍♀️

};

let board = &active_caboose.caboose.board;
let Some(rkth) = &active_caboose.caboose.sign else {
Copy link
Contributor Author

@karencfv karencfv Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We spoke about comparing against the CMPA page's rkth field instead. But unfortunately, all I get from inventory is a base64 encoded string. I did not find a way in omicron or hubtools to decode that safely into a known structure. While a worthy goal to compare the artifact's sign to the CMPA's rkth field, in the interest of time, I think for now I'd like to compare against the inventory caboose's sign field.

I can write up a follow up issue to tackle this at a later time. #8799

@karencfv karencfv requested a review from davepacheco August 7, 2025 06:50
RoT bootloader stage 0 next version: None
SP active version: Some("0.0.1")
SP inactive version: None
RoT bootloader stage 0 version: Some("0.0.1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit - this has extra spaces but not enough to line it up with the None on the next line. Should it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Urgh, yeah. That was a bad copy-paste. I made them all line up, but it looks awful RoT pending persistent boot preference is too long 😄, and it's not really a table. I think it's better if they don't line up at all.

> # for another sled.
> sled-update-sp 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6 --active 1.0.0
set sled 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6 SP versions: active -> 1.0.0
> # for an SP on the same sled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the expectation that we'll update the bootloader -> RoT -> SP on one sled before we move to the next sled? I naively thought it would be "update all sleds' bootloaders" -> "update all sleds' RoTs" -> "update all sleds' SPs".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! That is my understanding. Or rather host OS -> bootloader -> RoT -> SP as specified in RFD 565. I'm not sure how this will specifically affect the work you're doing with host OS though. Maybe @davepacheco has more input on this?

),
make_artifact(
"oxide-rot-1",
ArtifactKind::GIMLET_ROT_IMAGE_A,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a lot to be desired here and it's strongly related to the "do something more reasonable with TUF" stuff that came up recently with the measurements work, maybe?

Copy link
Contributor Author

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! I think I've addressed all of the comments

I've rolled back the changes for the cmds-target-release test. I need to make some changes to the fake artifacts to include a caboose so the planner can choose an artifact using the sign within them. I have an idea on how to do this based on https://github.com/oxidecomputer/hubtools/blob/main/hubtools/src/archive_builder.rs#L33-L77 (thanks Laura for the tip), but it will take some digging into.

I'd like to work on that test on a follow up PR, to no longer keep holding this one up if people are OK with that. @jgallagher will be working on the planner changes for host OS soon, and this PR makes several changes to the structure of the MGS driven updates in the planner.

Opened: #8798

RoT bootloader stage 0 next version: None
SP active version: Some("0.0.1")
SP inactive version: None
RoT bootloader stage 0 version: Some("0.0.1")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Urgh, yeah. That was a bad copy-paste. I made them all line up, but it looks awful RoT pending persistent boot preference is too long 😄, and it's not really a table. I think it's better if they don't line up at all.

> # for another sled.
> sled-update-sp 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6 --active 1.0.0
set sled 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6 SP versions: active -> 1.0.0
> # for an SP on the same sled.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! That is my understanding. Or rather host OS -> bootloader -> RoT -> SP as specified in RFD 565. I'm not sure how this will specifically affect the work you're doing with host OS though. Maybe @davepacheco has more input on this?

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

return MgsUpdateStatus::Impossible;
}

// If found pending persistent boot preference is not empty, then an update
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble figuring out when we would hit this case. In a normal update, I assume the expected pending persistent boot preference would be None. During the update, we might find it to be Some, but in that case, we would have bailed out above with Impossible. (Is that correct? I guess it's fine, and potentially necessary if this is part of the precondition check.) In that case, we might write a new PendingMgsUpdate where the expected pending persistent boot preference was Some. Next time around, we might wind up here. But is it right to return NotDone? Don't we need to check whether mgs_update_status_inactive_versions() returns Impossible?

Put differently: I'd expect us to check all the Impossible cases, and if none of this is true, then it's NotDone (rather than have any explicit conditions under which we return NotDone). If that's true, I think we could just strike these two if blocks altogether (this one and the next one).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible to end up in a state where found_pending_persistent_boot_preference and expected_pending_persistent_boot_preference are Some(), which would not be caught in the impossible cases above. But, it's true that then this would be caught as a NotDone in mgs_update_status_inactive_versions(). I liked being explicit as to why we may find ourselves in a NotDone state, but I guess it's not really necessary and introduces complexity.

Comment on lines 707 to 712
"gimlet_rot_image_a" => ARTIFACT_HASH_ROT_GIMLET_A,
"gimlet_rot_image_b" => ARTIFACT_HASH_ROT_GIMLET_B,
"psc_rot_image_a" => ARTIFACT_HASH_ROT_PSC_A,
"psc_rot_image_b" => ARTIFACT_HASH_ROT_PSC_B,
"switch_rot_image_a" => ARTIFACT_HASH_ROT_SWITCH_A,
"switch_rot_image_b" => ARTIFACT_HASH_ROT_SWITCH_B,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some way we could avoid hardcoding these strings?

It looks like there are some constants for these in tufaceous_artifact (e.g., ArtifactKind::GIMLET_ROT_IMAGE_A). I'm not sure if you can use these on the left side of the match. If not, maybe a bunch of ifs? It seems worth it to avoid hardcoding these strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really use the constants as they are. It would require StructuralPartialEq. Since this is just a test, it didn't seem worth it to go make changes in tufaceous with a nightly-only experimental API. But sure! I can use a bunch of ifs.

Comment on lines +1324 to +1325
let PendingMgsUpdateDetails::Rot {
expected_active_slot: old_expected_active_slot,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding right, this changed because the surrounding code was just trying to do an update, previously that was always an SP update, but the way you changed things at L1293 - L1295 meant that two updates were possible, and the RoT one would be first.

I would suggest doing one of these:

  • have this do the SP update just like it used to. (Change L1294 to be an empty map, I think?)
  • make it clearer that this will do an RoT update by making that the only one possible (change L1291 to be an empty map?)
  • duplicate the test, once for each device kind

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! In the end I decided to completely extract everything RoT related and create another test for RoT only. test_basic was getting pretty long and complex, and it was only going to get worse when bootloader and host OS are implemented 😅

Copy link
Contributor Author

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough review! I think I've addressed all of your comments. Let me know if I'm missing anything!

Comment on lines 707 to 712
"gimlet_rot_image_a" => ARTIFACT_HASH_ROT_GIMLET_A,
"gimlet_rot_image_b" => ARTIFACT_HASH_ROT_GIMLET_B,
"psc_rot_image_a" => ARTIFACT_HASH_ROT_PSC_A,
"psc_rot_image_b" => ARTIFACT_HASH_ROT_PSC_B,
"switch_rot_image_a" => ARTIFACT_HASH_ROT_SWITCH_A,
"switch_rot_image_b" => ARTIFACT_HASH_ROT_SWITCH_B,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really use the constants as they are. It would require StructuralPartialEq. Since this is just a test, it didn't seem worth it to go make changes in tufaceous with a nightly-only experimental API. But sure! I can use a bunch of ifs.

Comment on lines +1324 to +1325
let PendingMgsUpdateDetails::Rot {
expected_active_slot: old_expected_active_slot,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! In the end I decided to completely extract everything RoT related and create another test for RoT only. test_basic was getting pretty long and complex, and it was only going to get worse when bootloader and host OS are implemented 😅

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Facilities for making choices about RoT updates
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can tackle that in a follow up PR :)

return MgsUpdateStatus::Impossible;
}

// If found pending persistent boot preference is not empty, then an update
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible to end up in a state where found_pending_persistent_boot_preference and expected_pending_persistent_boot_preference are Some(), which would not be caught in the impossible cases above. But, it's true that then this would be caught as a NotDone in mgs_update_status_inactive_versions(). I liked being explicit as to why we may find ourselves in a NotDone state, but I guess it's not really necessary and introduces complexity.

// than 1 artifact for the same board and root key table hash (RKTH)
// that can be verified afgainst the RoT's CMPA/CFPA. But it doesn't
// prevent us from picking one and proceeding. Make a note and proceed.
error!(log, "found more than one matching artifact for RoT update");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh 🤔 I'm not sure why I changed that

@karencfv karencfv enabled auto-merge (squash) August 11, 2025 20:32
@karencfv karencfv disabled auto-merge August 11, 2025 20:36
@karencfv
Copy link
Contributor Author

Looks like there is a merge freeze for R16. I'll merge once that's open again

@karencfv karencfv enabled auto-merge (squash) August 11, 2025 23:35
@karencfv karencfv merged commit 1591e21 into oxidecomputer:main Aug 12, 2025
17 checks passed
@karencfv karencfv deleted the rot-planner branch August 12, 2025 00:39
karencfv added a commit that referenced this pull request Aug 12, 2025
…ule (#8826)

This is just extracting code and moving it to another file. No changes
have been made. Follow up to
#8421 (comment)
karencfv added a commit that referenced this pull request Aug 13, 2025
Analogous to #8421 for RoT
bootloader. Reconfigurator CLI support has already been implemented
#8620

Closes: #8668
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants