Description
While testing #8291, I found that the autoplanner created several blueprints for an SP update. Here's an example sequence for updating just one SP:
root@oxz_switch1:~# omdb reconfigurator history --diff
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221,[fd00:1122:3344:102::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (153.0.0)
VERSN TIME BLUEPRINT
1 2025-06-30T13:57:44.369Z dbda732f-92a7-4bac-8b73-5ef2852bc9f6 disabled: initial blueprint from rack setup
2 2025-06-30T17:53:50.331Z dbda732f-92a7-4bac-8b73-5ef2852bc9f6 enabled
3 2025-06-30T17:59:54.185Z c74c0e87-3fd9-4fc4-96b4-32f611c8466e enabled:
from: blueprint dbda732f-92a7-4bac-8b73-5ef2852bc9f6
to: blueprint c74c0e87-3fd9-4fc4-96b4-32f611c8466e
COCKROACHDB SETTINGS:
state fingerprint::::::::::::::::: d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
cluster.preserve_downgrade_option: "22.1" (unchanged)
METADATA:
internal DNS version::: 1 (unchanged)
external DNS version::: 2 (unchanged)
target release min gen: 1 (unchanged)
OXIMETER SETTINGS:
generation: 1 (unchanged)
read from:: SingleNode (unchanged)
PENDING MGS UPDATES:
Pending MGS-managed updates (all baseboards):
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sp_type slot part_number serial_number artifact_hash artifact_version details
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ switch 0 913-0000006 BRM23230002 100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145 1.0.39 Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: Version(ArtifactVersion("1.0.38")) }
4 2025-06-30T18:00:16.757Z ac6dc10b-1e6e-4302-a37d-65bee7d452d1 enabled:
from: blueprint c74c0e87-3fd9-4fc4-96b4-32f611c8466e
to: blueprint ac6dc10b-1e6e-4302-a37d-65bee7d452d1
COCKROACHDB SETTINGS:
state fingerprint::::::::::::::::: d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
cluster.preserve_downgrade_option: "22.1" (unchanged)
METADATA:
internal DNS version::: 1 (unchanged)
external DNS version::: 2 (unchanged)
target release min gen: 1 (unchanged)
OXIMETER SETTINGS:
generation: 1 (unchanged)
read from:: SingleNode (unchanged)
PENDING MGS UPDATES:
Pending MGS-managed updates (all baseboards):
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sp_type slot part_number serial_number artifact_hash artifact_version details
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
* switch 0 913-0000006 BRM23230002 100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145 1.0.39 - Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: Version(ArtifactVersion("1.0.38")) }
└─ + Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: NoValidVersion }
5 2025-06-30T18:00:18.391Z 5bf45c64-d209-4715-aa83-c3cddbfa7368 enabled:
from: blueprint ac6dc10b-1e6e-4302-a37d-65bee7d452d1
to: blueprint 5bf45c64-d209-4715-aa83-c3cddbfa7368
COCKROACHDB SETTINGS:
state fingerprint::::::::::::::::: d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
cluster.preserve_downgrade_option: "22.1" (unchanged)
METADATA:
internal DNS version::: 1 (unchanged)
external DNS version::: 2 (unchanged)
target release min gen: 1 (unchanged)
OXIMETER SETTINGS:
generation: 1 (unchanged)
read from:: SingleNode (unchanged)
PENDING MGS UPDATES:
Pending MGS-managed updates (all baseboards):
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sp_type slot part_number serial_number artifact_hash artifact_version details
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
* switch 0 913-0000006 BRM23230002 100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145 1.0.39 - Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: NoValidVersion }
└─ + Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: Version(ArtifactVersion("1.0.38")) }
6 2025-06-30T18:00:19.000Z 45396397-b669-4573-be39-28457f158fcb enabled:
from: blueprint 5bf45c64-d209-4715-aa83-c3cddbfa7368
to: blueprint 45396397-b669-4573-be39-28457f158fcb
COCKROACHDB SETTINGS:
state fingerprint::::::::::::::::: d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
cluster.preserve_downgrade_option: "22.1" (unchanged)
METADATA:
internal DNS version::: 1 (unchanged)
external DNS version::: 2 (unchanged)
target release min gen: 1 (unchanged)
OXIMETER SETTINGS:
generation: 1 (unchanged)
read from:: SingleNode (unchanged)
PENDING MGS UPDATES:
Pending MGS-managed updates (all baseboards):
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sp_type slot part_number serial_number artifact_hash artifact_version details
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
* switch 0 913-0000006 BRM23230002 100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145 1.0.39 - Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: Version(ArtifactVersion("1.0.38")) }
└─ + Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: NoValidVersion }
7 2025-06-30T18:00:42.235Z a610ee4a-d76f-4d94-b975-726202c77732 enabled:
from: blueprint 45396397-b669-4573-be39-28457f158fcb
to: blueprint a610ee4a-d76f-4d94-b975-726202c77732
COCKROACHDB SETTINGS:
state fingerprint::::::::::::::::: d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
cluster.preserve_downgrade_option: "22.1" (unchanged)
METADATA:
internal DNS version::: 1 (unchanged)
external DNS version::: 2 (unchanged)
target release min gen: 1 (unchanged)
OXIMETER SETTINGS:
generation: 1 (unchanged)
read from:: SingleNode (unchanged)
PENDING MGS UPDATES:
Pending MGS-managed updates (all baseboards):
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sp_type slot part_number serial_number artifact_hash artifact_version details
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
- switch 0 913-0000006 BRM23230002 100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145 1.0.39 Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: NoValidVersion }
+ switch 1 913-0000006 BRM31230002 100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145 1.0.39 Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: Version(ArtifactVersion("1.0.38")) }
To summarize:
- blueprint 3 added a PendingMgsUpdate for switch 0, expecting active 1.0.35, inactive 1.0.38
- blueprint 4 altered that PendingMgsUpdate to expect active 1.0.35, inactive NoValidVersion
- blueprint 5 altered that PendingMgsUpdate to expect active 1.0.35, inactive 1.0.38
- blueprint 6 altered that PendingMgsUpdate to expect active 1.0.35, inactive NoValidVersion
- blueprint 7 removed that PendingMgsUpdate and added a new one (for switch 1)
Here's what I'd guess happened:
- Initially, the switch had 1.0.35 in the active slot and 1.0.38 in the inactive slot. The first update was configured in blueprint 3 based on that.
- As part of the update process, we wind up clobbering the inactive slot, and the caboose winds up containing NoValidVersion. Some Nexus probably saw an inventory collection in this state and decided to update the expected preconditions of the update to reflect that change.
- In blueprint 5, some other Nexus probably tried to plan based on a slightly older collection that still had 1.0.38 in the inactive slot, so it changed the plan back.
- Some Nexus (possibly one of these two) then saw an updated collection and created blueprint 6 with the same change that was in blueprint 4.
- Finally, the update completed, allowing us to remove that SP update altogether and configure one for a different board.
In this case, blueprints 4-6 were unnecessary. It's not clear there's much impact to all of this aside from the extra activity writing and later deleting blueprints.
What's tricky is that blueprint 4 might have been necessary. Consider the case where the rack loses power during the update. If Nexus didn't create blueprint 4, then the update system would be stuck when it tried to re-execute an update whose precondition doesn't match reality. (Power loss isn't the only case where this would happen; it's really any kind of interruption to the update.)
Blueprint 5 is arguably always wrong, in that it's probably never right to replace a blueprint with one that's created from an older inventory. But that's somewhere between annoying and impossible to determine programmatically (we might not have the old inventory). We could keep the inventory timestamp in the blueprint, but even that's not well-defined. Besides wall-clock timestamps not necessarily being in sync or monotonic, two inventory collections can even overlap in time. I'm not sure it's worth trying to solve this problem. (It might be -- I'm really not sure.)
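To make the "overlap in time" point concrete, here is a minimal sketch, assuming each collection records a start and a finish time. The `CollectionSpan` type and `compare_collections` function are hypothetical names for illustration, not existing Omicron APIs. Even with perfectly synchronized clocks, the comparison is only a partial order:

```rust
use std::cmp::Ordering;
use std::time::SystemTime;

/// Hypothetical stand-in for the time range covered by one inventory
/// collection (assumption: a collection has a start and a finish time).
#[derive(Clone, Copy, Debug)]
pub struct CollectionSpan {
    pub time_started: SystemTime,
    pub time_done: SystemTime,
}

/// Orders two collections only when one finished before the other started.
/// Returns `None` when the spans overlap, because then neither collection
/// is unambiguously "older".
pub fn compare_collections(
    a: &CollectionSpan,
    b: &CollectionSpan,
) -> Option<Ordering> {
    if a.time_done <= b.time_started {
        Some(Ordering::Less)
    } else if b.time_done <= a.time_started {
        Some(Ordering::Greater)
    } else {
        None
    }
}
```

Any rule of the form "never plan from an older inventory" would have to decide what to do in the `None` case, and that is before accounting for clock skew between the hosts that produced the timestamps.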
A simpler idea: for at least N minutes after the parent blueprint was created, do not make a change that alters only the preconditions of an existing PendingMgsUpdate. That is, if a blueprint B1 wrote a PendingMgsUpdate, and you're generating a new blueprint B2 based on B1 less than N minutes later, and the only thing you would change is the preconditions of that PendingMgsUpdate, just don't. Give things a few minutes to settle.
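A rough sketch of that settle-window rule, under stated assumptions: `SpPreconditions`, `PreconditionDecision`, and `decide_preconditions` are hypothetical names invented for this illustration, and the real planner types differ. The check only looks at the age of the parent blueprint and whether the observed preconditions differ:

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for the preconditions of a PendingMgsUpdate.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SpPreconditions {
    pub expected_active_version: String,
    /// `None` models `NoValidVersion`.
    pub expected_inactive_version: Option<String>,
}

/// What to do with the preconditions of an existing pending update.
#[derive(Debug, PartialEq, Eq)]
pub enum PreconditionDecision {
    Keep,
    Replace(SpPreconditions),
}

/// Replace the preconditions only if they differ from what inventory shows
/// AND the parent blueprint is at least `settle_window` old.
pub fn decide_preconditions(
    parent_created: SystemTime,
    now: SystemTime,
    settle_window: Duration,
    current: &SpPreconditions,
    observed: &SpPreconditions,
) -> PreconditionDecision {
    if current == observed {
        return PreconditionDecision::Keep;
    }
    // `duration_since` fails if `now` is earlier than `parent_created`
    // (clock skew). Treat that as "not settled yet" and keep the plan.
    let settled = now
        .duration_since(parent_created)
        .map(|age| age >= settle_window)
        .unwrap_or(false);
    if settled {
        PreconditionDecision::Replace(observed.clone())
    } else {
        PreconditionDecision::Keep
    }
}
```

Applied to the history above, blueprints 4, 5, and 6 were created roughly 22.6 s, 1.6 s, and 0.6 s after their respective parents, so a window of even one minute would have suppressed all three. The cost is that a genuinely interrupted update (the power-loss case) would wait up to N minutes before its preconditions are corrected.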
Another idea: don't generate a blueprint based on an inventory collection that is older than the parent blueprint. This is trying to say: if the collection is older than the last blueprint, then it's probably too old to be useful. This seems tricky though: it could be the collection is technically newer but wasn't available when we generated the parent blueprint, and so it might still have new information. Or maybe there are cases where other things have changed and we still want to generate a new blueprint.
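As a sketch only, that check could look like the following. `PlanningInput` and `check_collection_age` are hypothetical names, and comparing the collection's finish time (rather than its start time) against the parent blueprint's creation time is an assumption made for this illustration:

```rust
use std::time::SystemTime;

/// Whether a collection should be used as planning input.
#[derive(Debug, PartialEq, Eq)]
pub enum PlanningInput {
    Usable,
    TooOld,
}

/// Reject a collection that finished before the parent blueprint was
/// created. Assumption: both timestamps come from comparable clocks.
pub fn check_collection_age(
    collection_time_done: SystemTime,
    parent_blueprint_created: SystemTime,
) -> PlanningInput {
    if collection_time_done < parent_blueprint_created {
        PlanningInput::TooOld
    } else {
        PlanningInput::Usable
    }
}
```

This does not address the caveats above: a collection rejected as `TooOld` may still carry information the parent blueprint never saw, and the comparison is only as good as the clocks behind the two timestamps.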
I'm assuming for the time being that this is not an R17 blocker, since it seems like the only problem is extra blueprint planning laps.