Cold boot should handle scrimlet sled-agent restarts

During an update of rack 2, we encountered the following.

As sled agents began to launch, a bug (introduced by yours truly) prevented them from getting out of early bootstrap: a new field added to the early network config caused a deserialization error, so sled agents could not fully start up. To work around this, we read the persistent early network config file kept by the bootstore in `/pool/int`, added the missing field, and wrote the file back to `/pool/int`. We then restarted `sled-agent`, which read the updated early network config and was now able to parse it. We had also bumped the generation number of the config, which caused the bootstore protocol to propagate the new value to all the other `sled-agent`s.
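For readers who want the workaround spelled out, here is a minimal sketch of the patch-and-bump step. It is illustrative only: the file is assumed to be a flat JSON object with a `generation` key, the field name `new_field` and its default are hypothetical (the added field is not named above), and `should_accept` is an assumed simplification of how a generation-based protocol decides whether to adopt a peer's copy, not the actual bootstore logic.

```python
import json

# Hypothetical field name and default value; the real field is not named in this issue.
NEW_FIELD = "new_field"
NEW_FIELD_DEFAULT = None


def patch_early_network_config(raw):
    """Add the missing field if absent and bump the generation once."""
    config = json.loads(raw)
    if NEW_FIELD not in config:
        config[NEW_FIELD] = NEW_FIELD_DEFAULT
        # Bump the generation so that peers treat this copy as newer.
        config["generation"] = config["generation"] + 1
    return json.dumps(config, sort_keys=True)


def should_accept(local_generation, offered_generation):
    """Assumed rule: adopt an offered config only if it is strictly newer."""
    return offered_generation > local_generation
```

Patching is idempotent in this sketch: running it on an already-patched file changes nothing, so the generation is only bumped when the content actually changes.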

At this point, things started to move forward again. Sled agents were transitioning from `bootstrap-agent` to `sled-agent`. However, we then hit another roadblock: the switches were not fully initialized. The `sled-agent` we restarted was a scrimlet `sled-agent`, so restarting it took down the switch zone and everything in it. When the switch zone came back up, it came up without any configuration. The `dendrite` service was not listening on the underlay, links had not been configured, addresses had not been configured, etc.

After looking through logs and various states in the system, we decided to restart the same sled agent again. It got much further this time, with configured links and various other dpd state. However, the system was still not coming up. One node in the cluster had synchronized with an upstream NTP server and had already launched Nexus (presumably in a brief period where the network was fully set up). The other nodes had not made any real progress because their NTP zones had not yet reached synchronization. After looking around more, we discovered that this was because NAT entries and some address entries were missing on the switches.

It appears that NAT entries were created before our scrimlet `sled-agent` restart, and that restarting that `sled-agent` took out the switch zone, clobbering these entries. I believe these entries were created by a different `sled-agent`, one with a boundary NTP zone that needed NAT. So when we restarted the scrimlet `sled-agent`, it had no idea that there were missing NAT entries to repopulate. The missing address entries were uplink addresses. They were present in the `uplink` SMF service properties, but they had not been added to the ASIC via Dendrite as local addresses. Not sure how that happened.
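One way to think about the missing entries is as a gap between desired and actual switch state that nothing reconciled after the restart. The sketch below shows that idea with an in-memory stand-in; it is not the Dendrite API, and `FakeSwitch`, `reconcile`, and the example entry strings are all hypothetical. It also assumes the desired set is recorded somewhere that survives a switch zone restart, which this issue does not establish.

```python
def reconcile(desired, actual):
    """Return (to_add, to_remove) needed to make `actual` match `desired`."""
    return desired - actual, actual - desired


class FakeSwitch:
    """In-memory stand-in for switch NAT state; not a real switch client."""

    def __init__(self):
        self.nat_entries = set()

    def restart(self):
        # Models the switch zone coming back up with no configuration.
        self.nat_entries.clear()

    def apply(self, to_add, to_remove):
        self.nat_entries |= to_add
        self.nat_entries -= to_remove


# Entries requested by any sled, kept outside the switch zone (assumption).
desired = {"boundary-ntp-a", "boundary-ntp-b"}

switch = FakeSwitch()
switch.apply(*reconcile(desired, switch.nat_entries))
switch.restart()  # entries are clobbered here
switch.apply(*reconcile(desired, switch.nat_entries))  # and restored here
```

Because the diff is computed against whatever the switch currently holds, running the reconcile step again after any restart converges on the same state, regardless of which `sled-agent` originally requested an entry.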

The takeaway here is that we need to be able to handle scrimlet `sled-agent` restarts during cold boot and keep driving toward the system coming back online, rather than getting stuck in half-configured states.

Cold boot should handle scrimlet sled-agent restarts #4592
