
extra, noisy blueprints during SP update #8483

@davepacheco

While testing #8291, I found that the autoplanner created several blueprints for an SP update. Here's an example sequence for updating just one SP:

root@oxz_switch1:~# omdb reconfigurator history --diff 
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221,[fd00:1122:3344:102::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (153.0.0)
VERSN TIME                     BLUEPRINT                           
    1 2025-06-30T13:57:44.369Z dbda732f-92a7-4bac-8b73-5ef2852bc9f6 disabled: initial blueprint from rack setup
    2 2025-06-30T17:53:50.331Z dbda732f-92a7-4bac-8b73-5ef2852bc9f6  enabled
    3 2025-06-30T17:59:54.185Z c74c0e87-3fd9-4fc4-96b4-32f611c8466e  enabled: 
from: blueprint dbda732f-92a7-4bac-8b73-5ef2852bc9f6
to:   blueprint c74c0e87-3fd9-4fc4-96b4-32f611c8466e

 COCKROACHDB SETTINGS:
    state fingerprint:::::::::::::::::   d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
    cluster.preserve_downgrade_option:   "22.1" (unchanged)

 METADATA:
    internal DNS version:::   1 (unchanged)
    external DNS version:::   2 (unchanged)
    target release min gen:   1 (unchanged)

 OXIMETER SETTINGS:
    generation:   1 (unchanged)
    read from::   SingleNode (unchanged)

 PENDING MGS UPDATES:

    Pending MGS-managed updates (all baseboards):
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    sp_type   slot   part_number   serial_number   artifact_hash                                                      artifact_version   details                                                                                                                 
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+   switch    0      913-0000006   BRM23230002     100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145   1.0.39             Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: Version(ArtifactVersion("1.0.38")) }


    4 2025-06-30T18:00:16.757Z ac6dc10b-1e6e-4302-a37d-65bee7d452d1  enabled: 
from: blueprint c74c0e87-3fd9-4fc4-96b4-32f611c8466e
to:   blueprint ac6dc10b-1e6e-4302-a37d-65bee7d452d1

 COCKROACHDB SETTINGS:
    state fingerprint:::::::::::::::::   d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
    cluster.preserve_downgrade_option:   "22.1" (unchanged)

 METADATA:
    internal DNS version:::   1 (unchanged)
    external DNS version:::   2 (unchanged)
    target release min gen:   1 (unchanged)

 OXIMETER SETTINGS:
    generation:   1 (unchanged)
    read from::   SingleNode (unchanged)

 PENDING MGS UPDATES:

    Pending MGS-managed updates (all baseboards):
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    sp_type   slot   part_number   serial_number   artifact_hash                                                      artifact_version   details                                                                                                                   
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
*   switch    0      913-0000006   BRM23230002     100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145   1.0.39             - Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: Version(ArtifactVersion("1.0.38")) }
     └─                                                                                                                                  + Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: NoValidVersion }                    


    5 2025-06-30T18:00:18.391Z 5bf45c64-d209-4715-aa83-c3cddbfa7368  enabled: 
from: blueprint ac6dc10b-1e6e-4302-a37d-65bee7d452d1
to:   blueprint 5bf45c64-d209-4715-aa83-c3cddbfa7368

 COCKROACHDB SETTINGS:
    state fingerprint:::::::::::::::::   d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
    cluster.preserve_downgrade_option:   "22.1" (unchanged)

 METADATA:
    internal DNS version:::   1 (unchanged)
    external DNS version:::   2 (unchanged)
    target release min gen:   1 (unchanged)

 OXIMETER SETTINGS:
    generation:   1 (unchanged)
    read from::   SingleNode (unchanged)

 PENDING MGS UPDATES:

    Pending MGS-managed updates (all baseboards):
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    sp_type   slot   part_number   serial_number   artifact_hash                                                      artifact_version   details                                                                                                                   
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
*   switch    0      913-0000006   BRM23230002     100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145   1.0.39             - Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: NoValidVersion }                    
     └─                                                                                                                                  + Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: Version(ArtifactVersion("1.0.38")) }


    6 2025-06-30T18:00:19.000Z 45396397-b669-4573-be39-28457f158fcb  enabled: 
from: blueprint 5bf45c64-d209-4715-aa83-c3cddbfa7368
to:   blueprint 45396397-b669-4573-be39-28457f158fcb

 COCKROACHDB SETTINGS:
    state fingerprint:::::::::::::::::   d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
    cluster.preserve_downgrade_option:   "22.1" (unchanged)

 METADATA:
    internal DNS version:::   1 (unchanged)
    external DNS version:::   2 (unchanged)
    target release min gen:   1 (unchanged)

 OXIMETER SETTINGS:
    generation:   1 (unchanged)
    read from::   SingleNode (unchanged)

 PENDING MGS UPDATES:

    Pending MGS-managed updates (all baseboards):
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    sp_type   slot   part_number   serial_number   artifact_hash                                                      artifact_version   details                                                                                                                   
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
*   switch    0      913-0000006   BRM23230002     100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145   1.0.39             - Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: Version(ArtifactVersion("1.0.38")) }
     └─                                                                                                                                  + Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: NoValidVersion }                    


    7 2025-06-30T18:00:42.235Z a610ee4a-d76f-4d94-b975-726202c77732  enabled: 
from: blueprint 45396397-b669-4573-be39-28457f158fcb
to:   blueprint a610ee4a-d76f-4d94-b975-726202c77732

 COCKROACHDB SETTINGS:
    state fingerprint:::::::::::::::::   d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
    cluster.preserve_downgrade_option:   "22.1" (unchanged)

 METADATA:
    internal DNS version:::   1 (unchanged)
    external DNS version:::   2 (unchanged)
    target release min gen:   1 (unchanged)

 OXIMETER SETTINGS:
    generation:   1 (unchanged)
    read from::   SingleNode (unchanged)

 PENDING MGS UPDATES:

    Pending MGS-managed updates (all baseboards):
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    sp_type   slot   part_number   serial_number   artifact_hash                                                      artifact_version   details                                                                                                                 
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-   switch    0      913-0000006   BRM23230002     100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145   1.0.39             Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: NoValidVersion }                    
+   switch    1      913-0000006   BRM31230002     100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145   1.0.39             Sp { expected_active_version: ArtifactVersion("1.0.35"), expected_inactive_version: Version(ArtifactVersion("1.0.38")) }

To summarize:

  1. blueprint 3 added a PendingMgsUpdate for switch 0, expecting active 1.0.35 and inactive 1.0.38
  2. blueprint 4 altered that PendingMgsUpdate to expect active 1.0.35 and inactive NoValidVersion
  3. blueprint 5 altered that PendingMgsUpdate back to expect active 1.0.35 and inactive 1.0.38
  4. blueprint 6 altered that PendingMgsUpdate again to expect active 1.0.35 and inactive NoValidVersion
  5. blueprint 7 removed that PendingMgsUpdate and added a new one for switch 1

Here's what I'd guess happened:

  • Initially, the switch had 1.0.35 in the active slot and 1.0.38 in the inactive slot. The first update was configured in blueprint 3 based on that.
  • As part of the update process, we wind up clobbering the inactive slot, and the caboose winds up containing NoValidVersion. Some Nexus probably saw an inventory collection in this state and decided to update the expected preconditions of the update to reflect that change.
  • In blueprint 5, some other Nexus probably tried to plan based on a slightly older collection that still had 1.0.38 in the inactive slot, so it changed the plan back.
  • Some Nexus (possibly one of these two) then saw an updated collection and created blueprint 6 with the same change that was in blueprint 4.
  • Finally, the update completed, allowing us to remove that SP update altogether and configure one for a different board.

In this case, blueprints 4-6 were unnecessary. It's not clear there's much impact to all of this aside from the extra activity writing and later deleting blueprints.

What's tricky is that blueprint 4 might have been necessary. Consider the case where the rack loses power during the update. If Nexus didn't create blueprint 4, then the update system would be stuck when it tried to re-execute an update whose precondition doesn't match reality. (Power loss isn't the only case where this would happen; it's really any kind of interruption to the update.)
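
To make the "stuck" case concrete, here is a minimal sketch of a precondition check. The type and function names (`InactiveSlot`, `Attempt`, `check_preconditions`) are made up for illustration and are not the real Omicron types. It only shows why an interrupted update needs the blueprint 4 style of change before it can be retried.

```rust
/// Hypothetical stand-in for what the inactive slot is expected to hold
/// (or is found to hold).
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum InactiveSlot {
    Version(String),
    NoValidVersion,
}

/// Hypothetical outcome of attempting to (re-)execute a pending update.
#[derive(Debug, PartialEq, Eq)]
pub enum Attempt {
    Proceed,
    PreconditionFailed,
}

/// Refuse to start unless the device matches what the blueprint expects.
pub fn check_preconditions(
    expected_active: &str,
    expected_inactive: &InactiveSlot,
    found_active: &str,
    found_inactive: &InactiveSlot,
) -> Attempt {
    if expected_active == found_active && expected_inactive == found_inactive {
        Attempt::Proceed
    } else {
        Attempt::PreconditionFailed
    }
}
```

With blueprint 3's preconditions (active 1.0.35, inactive 1.0.38) and a device whose inactive slot has already been clobbered, this returns `PreconditionFailed` on every retry. With blueprint 4's preconditions (inactive `NoValidVersion`) it returns `Proceed`.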

Blueprint 5 is arguably always wrong, in that it's probably never right to replace a blueprint with one that was created from an older inventory collection. But that's somewhere between annoying and impossible to determine programmatically (we might not have the old inventory any more). We could keep the inventory timestamp in the blueprint, but even that's not super well-defined. Besides wall-clock timestamps not necessarily being in sync and monotonic, two inventory collections can even overlap in time. I'm not sure it's worth trying to solve this problem. (It might be -- I'm really not sure.)

A simpler idea: for at least N minutes after the parent blueprint was created, do not generate a change that only alters the preconditions of an existing PendingMgsUpdate. That is, if a blueprint B1 wrote a PendingMgsUpdate, and you're generating a new blueprint B2 based on B1 less than N minutes later, and you would otherwise change the preconditions of that PendingMgsUpdate, just don't. Give things a few minutes to settle.
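
A rough sketch of that settle-window rule, under the assumption that the planner knows the age of the parent blueprint and has both the parent's preconditions and the ones it would compute from the latest inventory. The names here (`SpPreconditions`, `ExpectedInactive`, `settle_preconditions`) are hypothetical and simplified, not the real planner types.

```rust
use std::time::Duration;

/// Hypothetical, simplified form of the expected inactive-slot contents.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum ExpectedInactive {
    Version(String),
    NoValidVersion,
}

/// Hypothetical, simplified preconditions of a pending SP update.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SpPreconditions {
    pub expected_active: String,
    pub expected_inactive: ExpectedInactive,
}

/// Pick the preconditions the child blueprint (B2) should carry.
///
/// While the parent blueprint (B1) is younger than `settle_window`, keep
/// B1's preconditions even if inventory disagrees. After that, accept the
/// preconditions derived from the observed inventory.
pub fn settle_preconditions(
    parent: &SpPreconditions,
    observed: &SpPreconditions,
    parent_age: Duration,
    settle_window: Duration,
) -> SpPreconditions {
    if parent_age < settle_window {
        parent.clone()
    } else {
        observed.clone()
    }
}
```

In the sequence above, blueprint 4 was created about 22 seconds after blueprint 3, so with any window of a minute or more this rule would have suppressed blueprints 4 through 6. The cost is that a genuinely interrupted update stays stuck until the window expires.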

Another idea: don't generate a blueprint based on an inventory collection that is older than the parent blueprint. This is trying to say: if the collection is older than the last blueprint, then it's probably too old to be useful. This seems tricky though: it could be the collection is technically newer but wasn't available when we generated the parent blueprint, and so it might still have new information. Or maybe there are cases where other things have changed and we still want to generate a new blueprint.
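
As a sketch only, the check might look like the following, assuming each collection records when it finished and each blueprint records when it was created. The struct and field names are hypothetical. The comparison relies on wall-clock time, so it inherits the clock-skew caveat mentioned above.

```rust
use std::time::SystemTime;

/// Hypothetical minimal view of an inventory collection.
pub struct Collection {
    /// When the collection finished (wall-clock; may be skewed across hosts).
    pub time_done: SystemTime,
}

/// Hypothetical minimal view of a blueprint.
pub struct Blueprint {
    /// When the blueprint was created (wall-clock).
    pub time_created: SystemTime,
}

/// Returns true if the collection finished before the parent blueprint was
/// created, in which case the planner would skip generating a new blueprint.
pub fn collection_predates_parent(
    collection: &Collection,
    parent: &Blueprint,
) -> bool {
    collection.time_done < parent.time_created
}
```

This would have rejected the collection behind blueprint 5 only if that collection actually finished before blueprint 4 was created, which the history output does not show either way.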

I'm assuming for the time being that this is not an R17 blocker, since it seems like the only problem is extra blueprint planning laps.
