- 
                Notifications
    You must be signed in to change notification settings 
- Fork 59
          [Reconfigurator] Fix planner skipping RoT / bootloader updates (add TufArtifactMeta::board)
          #8872
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
        
          
                Cargo.toml
              
                Outdated
          
        
      | tufaceous-artifact = { git = "https://github.com/oxidecomputer/tufaceous", branch = "main", features = ["proptest", "schemars"] } | ||
| tufaceous-brand-metadata = { git = "https://github.com/oxidecomputer/tufaceous", branch = "main" } | ||
| tufaceous-lib = { git = "https://github.com/oxidecomputer/tufaceous", branch = "main" } | ||
| tufaceous = { git = "https://github.com/oxidecomputer/tufaceous", branch = "john/fake-rot-distinct-sign" } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-  TODO: change this back to main
|  | ||
| # This will attempt to update the first sled's host OS. Walk through that update | ||
| # and the host OS of the two other sleds. | ||
| # This will attempt to update the RoT bootloader on the first sled. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of followup to #8832 (comment). On that PR, I had to change this test to update the host OS so that we could move on to the planner doing things with zones, but in review, the question came up of "why don't we also have to do the same thing for the bootloader, RoT, and SP?". The superficial answer was "the fake TUF repo doesn't have any matching artifacts for those so the planner is skipping them", but I didn't dig into why that is. With the changes on this PR, the fake TUF repo now does have matching artifacts. So instead of just updating the host OS here, we have to do all the MGS-based updates for all three fake sleds.
| // - "sign" matching the rkth (found above from caboose) | ||
|  | ||
| if a.id.name != *board { | ||
| if a.board.as_ref() != Some(board) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and the corresponding change in rot_bootloader.rs) is the actual bugfix.
| // - "kind" matching one of the known SP kinds | ||
|  | ||
| if a.id.name != *board { | ||
| if a.board.as_ref() != Some(board) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing this for the SP is not strictly required, but seems cleaner.
| ), | ||
| make_artifact( | ||
| "oxide-rot-1", | ||
| "oxide-rot-1-fake-key", | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing you changed this to verify that we're not accidentally depending on the target's name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, exactly.
| Testing on  and the planner performed both bootloader and RoT updates. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a bunch for catching and fixing this bug!!!!
| board "SimRotStage0" name "SimGimletRot" version "0.0.200" git_commit "dadadadad" sign Some("11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf") | ||
| board "SimRotStage0" name "SimSidecarRot" version "0.0.200" git_commit "dadadadad" sign Some("11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf") | ||
| board "SimRotStage0" name "SimGimletRot" version "0.0.200" git_commit "ddddddddd" sign Some("11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf") | ||
| board "SimRotStage0" name "SimSidecarRot" version "0.0.200" git_commit "ddddddddd" sign Some("11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that the cabooses for the RoT bootloader are being removed from these tests, I'm not sure this is something we want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm confused; I'll look at this tomorrow. We do still see the RoT bootloader cabooses down below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I get this, and I believe the change is correct albeit surprising. Inventory collections contain two maps that are relevant here:
    /// unique caboose contents that were found in this collection
    ///
    /// In practice, these will be inserted into the `sw_caboose` table.
    pub cabooses: BTreeSet<Arc<Caboose>>,
    /// all caboose contents found, keyed first by the kind of caboose
    /// (`CabooseWhich`), then the baseboard id of the sled where they were
    /// found
    ///
    /// In practice, these will be inserted into the `inv_caboose` table.
    #[serde_as(as = "BTreeMap<_, Vec<(_, _)>>")]
    pub cabooses_found:
        BTreeMap<CabooseWhich, BTreeMap<Arc<BaseboardId>, CabooseFound>>,The table lower in this file is dumping cabooses_found, and it confirms we did still collect all of the RoTs and RoT bootloaders. This table that appears to have lost the bootloader cabooses is because now the RoT and RoT bootloader cabooses are identical. Caboose only contains these fields:
pub struct Caboose {
    pub board: String,
    pub git_commit: String,
    pub name: String,
    pub version: String,
    pub sign: Option<String>,
}In dublin, name, board, sign always match for a given sled's RoT and RoT bootloader; they only differ by git_commit and version (and that's mainly because the bootloader is updated much less frequently than the RoT). In this test data, we use a fixed git commit and version 1.0.0 for both, so all five fields match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, gotcha! That makes sense.
In dublin, name, board, sign always match for a given sled's RoT and RoT bootloader; they only differ by git_commit and version (and that's mainly because the bootloader is updated much less frequently than the RoT).
This doesn't seem great. If the version and git commit of the RoT/RoT bootloader cabooses start matching this could become troublesome 🤔 Is this something we may want to address in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In dublin, name, board, sign always match for a given sled's RoT and RoT bootloader; they only differ by git_commit and version (and that's mainly because the bootloader is updated much less frequently than the RoT).
This doesn't seem great. If the version and git commit of the RoT/RoT bootloader cabooses start matching this could become troublesome 🤔 Is this something we may want to address in the future?
Can you expand on what doesn't seem great? I think it's fine if these match entirely; we still also record CabooseWhich so we know which kind / slot these fields came from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and merged this to land the db migration, but if there's more to do here (or an issue to file) I'm happy to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries! My concern is unrelated to this PR.
Can you expand on what doesn't seem great?
The line above the cabooses field in the code above says /// In practice, these will be inserted into the sw_caboose table..
That would mean that in a scenario like this, where all of the fields in two or more Cabooses are the same, not every existing caboose would have it's own row.
Is there a possibility that this could break something down the line? It's common all over the codebase to check the length of things to make a decision (e.g. verifying we have the correct amount of retrieved items before proceeding to do something else).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line above the
caboosesfield in the code above says/// In practice, these will be inserted into the sw_caboose table..That would mean that in a scenario like this, where all of the fields in two or more Cabooses are the same, not every existing caboose would have it's own row.
Yeah, but this is intentional and by design. We expect to see a bunch of identical Cabooses in practice (e.g., every gimlet-d SP with the same version will have an identical caboose), and this map and its associated table is to intentionally deduplicate those. If we happen to have identical Cabooses across two different kinds of device, that's fine too.
Is there a possibility that this could break something down the line? It's common all over the codebase to check the length of things to make a decision (e.g. verifying we have the correct amount of retrieved items before proceeding to do something else).
I think "no", given the above - anything checking lengths of this map or the associated table already has to account for the fact that it dedups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough! Seems good to me then :)
        
          
                update-common/manifests/fake.toml
              
                Outdated
          
        
      |  | ||
| [[artifact.gimlet_rot_bootloader]] | ||
| name = "SimRotStage0" | ||
| name = "SimRot" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the impression that the name of the artifact needed to be unique 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Artifact IDs are the triple of (name, version, kind), so I think duplicate names are okay as long as one of the other fields (kind in this case) is different.
That said: these names are not the same in prod, so I changed them in ef839f7. I think I changed these early in this branch before I fully understood the name vs board thing; nice catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
| board "SimRotStage0" name "SimGimletRot" version "0.0.200" git_commit "dadadadad" sign Some("11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf") | ||
| board "SimRotStage0" name "SimSidecarRot" version "0.0.200" git_commit "dadadadad" sign Some("11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf") | ||
| board "SimRotStage0" name "SimGimletRot" version "0.0.200" git_commit "ddddddddd" sign Some("11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf") | ||
| board "SimRotStage0" name "SimSidecarRot" version "0.0.200" git_commit "ddddddddd" sign Some("11594bb5548a757e918e6fe056e2ad9e084297c9555417a025d8788eacf55daf") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, gotcha! That makes sense.
In dublin, name, board, sign always match for a given sled's RoT and RoT bootloader; they only differ by git_commit and version (and that's mainly because the bootloader is updated much less frequently than the RoT).
This doesn't seem great. If the version and git commit of the RoT/RoT bootloader cabooses start matching this could become troublesome 🤔 Is this something we may want to address in the future?
When trying Reconfigurator-based updates on a racklette today with a real TUF repo, the planner skipped all the RoT bootloader and RoT updates because none of the artifact
names matched theboards reported in inventory. Our tests were assuming these were the same (and in prod they are the same on the SP), but in real TUF repos these don't match: all the RoT and bootloaders report aboardofoxide-rot-1, but the artifact names are pretty varied:A couple things of note here:
oxide-rot-1at all.oxide-rot-1-staging-develandoxide-rot-1-selfsigned-staging-develartifacts of eachkindhave the sameSIGNvalue.The first point means we can't (easily) determine which artifact matches the reported
BORD, because we don't always have any artifact metadata that contains the board name. The second point means we can't ignore theBORDs entirely and work of just theSIGN.Instead, this PR adds a new
boardvalue to thetuf_artifacttable and theTufArtifactMetastruct. Outside of tests, the changes are almost identical to those from #8729 that added asignfield in this same spot. This allows us to stop assuming thatartifact.name==inventory_caboose.board, and instead checkartifact.board==inventory_caboose.board. It depends on oxidecomputer/tufaceous#40, which changes the way our fake data is generated to be more consistent with the state of production devices and artifacts. This perturbs some of the reconfiguartor-cli tests a bit, which significantly inflates the diff; the actual code changes here are pretty small.