update process must work around dendrite#49

Currently, [Dendrite does not tolerate the switch going away and coming back](https://github.com/oxidecomputer/dendrite/issues/49).  More specifically, @rmustacc reported that:

> 1) Nothing tells dendrite the device is gone.
> 2) dendrite doesn't survive it going and usually enters maintenance
> 3) I don't know if we can dynamically add a device into the zone or not via zonecfg
> 4) We tie the switch zone's existence to the device presence, but my guess is we don't want to tear the zone down if it disappears probably

The net result is that today, if the switch resets (as happens when the SP resets as part of an update), it's necessary to reboot the host OS on the corresponding Scrimlet, too.

Assuming this is not fixable in the R17 timeframe, there are a few obvious approaches to working around this in the update system (and probably others I haven't considered):

1. Have the update process plan Scrimlet SP updates immediately after switch SP updates.
2. Have the update process schedule a Scrimlet reboot (a new kind of operation) immediately after a switch SP update.
3. Have the execution of a switch SP update go and bounce the corresponding Scrimlet.
4. Have the Scrimlet itself detect this condition and reboot the box.
5. Power off the Scrimlet before doing the switch SP update and power it back on afterwards.  As with (1)/(2) vs. (3), there's a choice here about whether to do this via the planner or the executor.

Problem with (1): if the Scrimlet doesn't need any updates, the reboot won't happen.
Downside of (2): if the Scrimlet _does_ need an update, then we're taking an extra bounce for no reason.
Downside of doing (1) most of the time and falling back to (2) only when the Scrimlet didn't need an update: case (2) would almost never get tested.
Problem with (2) and (3): it's not super obvious how to implement an idempotent "reboot exactly once" operation.  It's not impossible -- e.g., it could be "reboot if the system has synchronized its clock and the system-reported boot time is not newer than time T".  This does seem doable.  It'd probably have to be evaluated by sled agent.
Problem with (1) - (4): all of these assume it's okay to live for a little while in the intermediate state.  Is that true?  Might this situation prevent us from getting ourselves out of it?  (e.g., might a broken dendrite impede the system's ability to carry out one of these options?)  Would the customer be impacted in the meantime?
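
To make the "reboot exactly once" condition from (2)/(3) concrete, here is a minimal sketch of the check as a pure function. All names here (`RebootDecision`, `decide_reboot`, the parameters) are hypothetical and not an existing sled agent API; how the clock-sync status and boot time would actually be obtained is left out.

```rust
use std::time::SystemTime;

/// Hypothetical outcome of evaluating a "reboot if not booted since T" request.
#[derive(Debug, PartialEq, Eq)]
enum RebootDecision {
    /// The clock is not synchronized yet, so the reported boot time cannot
    /// be compared against T. The caller should retry later.
    Defer,
    /// The system booted at or before T: the requested reboot has not
    /// happened yet.
    Reboot,
    /// The system booted after T: the request is already satisfied.
    AlreadyDone,
}

/// Assumption: the caller supplies whether the clock is synchronized, the
/// system-reported boot time, and the threshold time T from the request.
fn decide_reboot(
    clock_synchronized: bool,
    boot_time: SystemTime,
    not_newer_than: SystemTime,
) -> RebootDecision {
    if !clock_synchronized {
        return RebootDecision::Defer;
    }
    if boot_time <= not_newer_than {
        RebootDecision::Reboot
    } else {
        RebootDecision::AlreadyDone
    }
}
```

Under these assumptions, retrying the same request with the same T after the reboot yields `AlreadyDone`, because the new boot time is later than T. That is what would make the operation safe to re-issue.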

Does any of this change for systems with only one switch?

update process must work around dendrite#49 #8480
