During the investigation of #1364, Josh raised the general question of fault reporting. See this comment thread for context. This issue tracks adding preliminary (prototype) reporting of persistent faults on a sled. In #1364, a failure to delete an OPTE port means that the sled cannot be used further, at least for hosting that particular guest instance. We'd like a simple way to record that fact, ideally in CockroachDB, and to use it in Nexus to direct instances (or, potentially, Oxide services) to other sleds.
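
To make the idea concrete, here is a rough sketch of what the placement side could look like. Everything in it is hypothetical: `FaultLog`, `FaultScope`, `SledFault`, and the integer IDs are illustrative names, not existing Nexus or sled-agent types, and the in-memory map is only a stand-in for whatever table we might end up with in CockroachDB.

```rust
use std::collections::HashMap;

// Hypothetical identifiers; a real implementation would likely use UUIDs.
type SledId = u32;
type InstanceId = u32;

/// Hypothetical scope of a persistent fault reported for a sled.
#[derive(Debug, Clone, PartialEq, Eq)]
enum FaultScope {
    /// The sled should not receive any new work.
    WholeSled,
    /// Only this particular instance can no longer be hosted on the sled.
    Instance(InstanceId),
}

/// One recorded fault (assumed shape, for illustration only).
#[derive(Debug, Clone)]
struct SledFault {
    scope: FaultScope,
    reason: String,
}

/// In-memory stand-in for a fault table that would live in CockroachDB.
#[derive(Default)]
struct FaultLog {
    faults: HashMap<SledId, Vec<SledFault>>,
}

impl FaultLog {
    /// Record a persistent fault against a sled.
    fn report(&mut self, sled: SledId, scope: FaultScope, reason: &str) {
        self.faults.entry(sled).or_default().push(SledFault {
            scope,
            reason: reason.to_string(),
        });
    }

    /// Human-readable reasons recorded for a sled, in report order.
    fn reasons(&self, sled: SledId) -> Vec<&str> {
        self.faults
            .get(&sled)
            .map(|fs| fs.iter().map(|f| f.reason.as_str()).collect())
            .unwrap_or_default()
    }

    /// A sled can host an instance unless a recorded fault covers it.
    fn can_host(&self, sled: SledId, instance: InstanceId) -> bool {
        self.faults.get(&sled).map_or(true, |fs| {
            !fs.iter().any(|f| match &f.scope {
                FaultScope::WholeSled => true,
                FaultScope::Instance(id) => *id == instance,
            })
        })
    }

    /// Filter a list of sleds down to those still eligible for the instance.
    fn candidates(&self, sleds: &[SledId], instance: InstanceId) -> Vec<SledId> {
        sleds
            .iter()
            .copied()
            .filter(|&s| self.can_host(s, instance))
            .collect()
    }
}
```

In this sketch, the OPTE port failure from #1364 would be reported with an instance-scoped fault, so the sled drops out of `candidates` for that instance but stays eligible for others. Whether we want that distinction, or simply mark the whole sled as faulted, is an open question.
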
cc @jclulow