Ingest ereports from SPs #370
Conversation
gateway-messages/src/ereport.rs
#[derive(
    Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, SerializedSize,
)]
#[repr(transparent)]
pub struct RestartId(pub u128);
I did consider using the uuid crate for this, as it supports no_std, but it didn't really seem worth adding another dependency that would have to be compiled into the Hubris binaries basically just to get UUID-like formatting in Debug impls that are only used by MGS...
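For illustration, here's a minimal sketch of the kind of hand-rolled UUID-like `Debug` formatting being described, done directly over the `u128` so no `uuid` dependency is needed (the exact format string is an assumption, not the actual impl):

```rust
use core::fmt;

pub struct RestartId(pub u128);

impl fmt::Debug for RestartId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Format the u128 as hyphenated 8-4-4-4-12 hex, like a UUID.
        let x = self.0;
        write!(
            f,
            "RestartId({:08x}-{:04x}-{:04x}-{:04x}-{:012x})",
            (x >> 96) as u32,
            (x >> 80) as u16,
            (x >> 64) as u16,
            (x >> 48) as u16,
            x & 0xffff_ffff_ffff, // low 48 bits
        )
    }
}
```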
I felt that it was nicer to put all the ereport messages (both the SP-to-MGS messages and the MGS-to-SP messages) in their own module, rather than putting some of them in sp_to_mgs and others in mgs_to_sp. This way, a reader interested in the ereport stuff need only read this module, and a reader interested in the control-plane-agent protocol doesn't have to scroll past ereport messages. Future additions to the ereport protocol would change only the code in this file.
Similarly to https://github.com/oxidecomputer/management-gateway-service/pull/370/files#r2031615326, it felt nicer to have all the ereport bits defined in their own module, rather than smeared across shared_socket.rs and single_sp.rs. That way, all the ereport-specific code is in one place and it's easier to see the relationship between the code in the ereport socket receive handler and the single-SP handler, rather than having to trace it between modules.
No argument from me. Do you think it would be clearer to move the control-plane-agent stuff to its own submodule too? (Not as part of this PR, but maybe alongside the renamings I suggested in another comment?)
I'm definitely open to doing that in a subsequent change, I think it seems pretty reasonable (especially if we ever add a third socket/protocol for some other thing). I agree that we shouldn't mess with the control-plane-agent stuff in this PR though.
tokio.workspace = true
usdt.workspace = true
-uuid.workspace = true
+uuid = { workspace = true, features = ["v4"] }
This was necessary for unrelated reasons: the task dump code added in #316 uses Uuid::new_v4 in gateway-sp-comms, but gateway-sp-comms doesn't enable the "v4" feature (only faux-mgs does). So, this didn't compile for me using the v2 cargo feature resolver.
gateway-sp-comms/src/ereport.rs
Some(CborValue::Integer(i)) => task_names
    .get(i as usize)
    .cloned()
    .ok_or(DecodeError::BadTaskNameIndex {
        n,
        index: i as usize,
    })?,
@cbiffle this is what I was asking about in https://github.com/oxidecomputer/rfd/pull/849#discussion_r2027376877: right now, this code assumes that when an ereport's task name is an integer, it will always be the index of an ereport that was earlier in the packet than the current one. I wanted to get your confirmation of whether we could rely on that assumption or would need to handle indexes pointing ahead of the current ereport.
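To make the assumption concrete, here's a rough sketch (with hypothetical stand-in types, not the actual decoder) of the back-reference scheme: a task name is interned when first decoded, so an integer index can only resolve if it refers to an ereport earlier in the packet:

```rust
// Hypothetical stand-in for the decoder's CBOR value type.
enum CborValue {
    Text(String),
    Integer(u64),
}

fn resolve_task_name(
    value: CborValue,
    task_names: &mut Vec<String>,
) -> Option<String> {
    match value {
        // A string introduces a new task name; intern it so that later
        // ereports in the same packet can refer to it by index.
        CborValue::Text(name) => {
            task_names.push(name.clone());
            Some(name)
        }
        // An integer refers back to a previously interned name. If indexes
        // could point *ahead* of the current ereport, this lookup would
        // fail here, and the decoder would need a second pass instead.
        CborValue::Integer(i) => task_names.get(i as usize).cloned(),
    }
}
```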
I didn't look particularly closely at the details of the ereport parsing, etc. (I'll defer that to you and Cliff, if that's okay). The structural MGS changes look good; just a handful of nits and questions.
impl Drop for SharedSocket {

// Hand-rolled `Debug` impl as the message type (`T`) needn't be `Debug` for the
// `SharedSocket` to be debug.
This seems fine, but I'll plug https://docs.rs/derive-where/latest/derive_where/ since we use it in omicron, if you want to pull it in here too.
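For reference, a sketch of what that could look like with `derive-where` (the field names here are made up): the derive emits a `Debug` impl without adding a `T: Debug` bound, and the non-`Debug` field is skipped:

```rust
use derive_where::derive_where;

// No `T: Debug` bound is added to the generated impl.
#[derive_where(Debug)]
struct SharedSocket<T> {
    // Skip the field whose type needn't implement `Debug`.
    #[derive_where(skip)]
    handler: T,
    interface: String,
}
```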
pub struct SingleSp {
    interface: String,
    cmds_tx: mpsc::Sender<InnerCommand>,
    ereport_req_tx: mpsc::Sender<ereport::WorkerRequest>,
I would not do this as a part of this PR, but I'm curious for your thoughts: there are a bunch of places like this one where we're going from one "thing" (in this case, an mpsc::Sender) to two "things". The original "one" is always named generically, and the new one is named indicating it's related to ereports. Do you think we should go back and rename the generic things to indicate they're intended for control-plane-agent?
Yeah, I think it might be worthwhile to do that (especially if we ever add a third port to the management network...). You're right that I just didn't really want to touch all the control-plane-agent code in this PR, but I'd be happy to do it in a separate commit.
as per [this comment][1] from @jgallagher. this is similar to the control-plane-agent protocol. wow, it's almost like we're reimplementing TCP (but without flow control because that's hard). [1]: #370 (comment)
This branch updates our `zerocopy` dependency from v0.6.x to v0.8.x. I initially made the upgrade as I wanted to use some new `zerocopy` features in #370, but have factored it out to land separately. Of course, we'll need to update Hubris and MGS, as well. Note that `zerocopy`'s `read_from_prefix` now returns the rest of the buffer, making some code a little bit simpler where we were previously doing that manually. Other than that, there's not a lot to this change, besides deriving some additional marker traits (`Immutable` and `KnownLayout`).
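A small sketch of the `read_from_prefix` change described above (the `Header` type here is invented for illustration):

```rust
use zerocopy::{FromBytes, Immutable, IntoBytes, KnownLayout};

#[derive(FromBytes, IntoBytes, Immutable, KnownLayout)]
#[repr(C)]
struct Header {
    version: u32,
    len: u32,
}

// zerocopy v0.8: the parsed value and the remaining bytes come back
// together, so no manual split at `size_of::<Header>()` is needed.
fn parse(buf: &[u8]) -> Option<(Header, &[u8])> {
    Header::read_from_prefix(buf).ok()
}
```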
This commit refactors `gateway_sp_comms::SharedSocket` to make the received message handler a generic trait. This way, the `SharedSocket` type and its associated machinery for discovering SPs and forwarding received messages to per-SP handlers can be used for the ereport ingestion socket as well as for `control-plane-agent` messages.
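In rough outline (the method signature is an assumption based on this description, though the `RecvHandler` name comes from the PR summary), the refactor looks something like:

```rust
use std::net::SocketAddrV6;

/// How packets received on the shared socket get dispatched; implemented
/// once for control-plane-agent messages and once for ereports.
trait RecvHandler: Send + Sync {
    /// Called for each datagram received on the socket, forwarding it to
    /// the appropriate per-SP handler.
    fn handle_packet(&self, sender: SocketAddrV6, data: &[u8]);
}

/// The socket machinery (discovery, per-SP dispatch) is now generic over
/// the handler rather than hard-coded to control-plane-agent messages.
struct SharedSocket<H: RecvHandler> {
    handler: H,
    // ... UDP socket, per-SP state, etc.
}
```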
per today's chat with @cbiffle. this is to prevent a malformed ereport from wrecking the rest of the packet if it contains a break byte (0xff) or similar.
this way, if we encounter an individual ereport that's malformed, we decode the rest of the packet and let Nexus decide what to do.
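A simplified sketch of the strategy described in these two commit messages (the types are illustrative, not the actual decoder): each ereport is handled independently, so one malformed entry doesn't prevent decoding the rest of the packet:

```rust
enum Ereport {
    Decoded(String),
    // Preserved raw so upstack software (e.g. Nexus) can decide what to
    // do with a malformed ereport.
    Malformed(Vec<u8>),
}

fn decode_packet(frames: Vec<Vec<u8>>) -> Vec<Ereport> {
    frames
        .into_iter()
        .map(|frame| match try_decode(&frame) {
            Ok(report) => Ereport::Decoded(report),
            // A decode failure is contained to this one ereport; the
            // remaining frames in the packet are still decoded.
            Err(()) => Ereport::Malformed(frame),
        })
        .collect()
}

// Stand-in for the real CBOR decoding.
fn try_decode(frame: &[u8]) -> Result<String, ()> {
    std::str::from_utf8(frame).map(String::from).map_err(|_| ())
}
```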
this will make life easier for upstack software
this moves the message types out of `gateway-messages` (used by the `control-plane-agent` task) to their own crate, so that the snitch task can use them. i've also switched from hubpack serialization to `zerocopy`, as this is what the snitch is using. this should permit sharing the type defs between the SP and MGS. note that i've also updated `zerocopy` from v0.6 to v0.8 here, as (AFAICT) the older version doesn't know how to do fallible zerocopy from-bytes conversions for enums. i'm happy to land the version update separately.
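For illustration, the kind of fallible from-bytes enum conversion that zerocopy v0.8 enables (the enum here is hypothetical, not one of the actual message types):

```rust
use zerocopy::{Immutable, KnownLayout, TryFromBytes};

#[derive(Debug, TryFromBytes, Immutable, KnownLayout)]
#[repr(u8)]
enum MessageKind {
    Request = 0,
    Response = 1,
}

fn kind(bytes: &[u8]) -> Option<MessageKind> {
    // Fails cleanly (rather than being unrepresentable, as with plain
    // `FromBytes`) for any byte other than 0 or 1.
    MessageKind::try_read_from_bytes(bytes).ok()
}
```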
This branch updates our dependency on `zerocopy` from v0.6.x to v0.8.x. The primary motivation for this change is that I had wanted to use `zerocopy` v0.8's support for data-bearing `enum`s in the `gateway-ereport-messages` crate I added in oxidecomputer/management-gateway-service#370, and...I hadn't realized how painful taking the `zerocopy` update in Hubris would be. :) But, it's also just good to stay on top of new dependency versions regardless.

This is a _very_ large change, since pretty much every place where we derive or use `zerocopy`'s traits needed to be changed slightly to use the new APIs, but for the most part, it's not actually that *interesting*, so reviewing it should be pretty straightforward. The main API changes that are worth noting are:

- `AsBytes` is now called `IntoBytes`, which was an easy update.
- All the `_from_prefix`/`_from_suffix` APIs now return the rest of the slice, which is a nice improvement --- previously we would manually split off the rest of the slice when using those functions.
- Conversions from bytes now return `Result`s instead of `Option`s, which required very trivial changes in a few places.
- `LayoutVerified` is replaced by a new `Ref` type, but other API changes mean that you now basically never need to use it, which rocks!
- Some methods were renamed, which was also a pretty trivial find-and-replace.
- `zerocopy` adds new `Immutable` and `KnownLayout` marker traits. `Immutable` is required for `IntoBytes::as_bytes()`; types which do not derive it are now assumed to have interior mutability and only provide `IntoBytes::as_bytes_mut()`. So, basically everything now needs to derive `Immutable`. `KnownLayout` isn't required as commonly by the APIs we use, but I added it on everything anyway. This was most of what made this update annoying.
- `FromBytes::new_zeroed` is now provided by a new `FromZeros` trait, but the `FromBytes` derive implements that trait as well.

There may be some places where we can now make better use of new `zerocopy` features. In particular, we can now use `zerocopy` on data-bearing enums, which might allow us to replace `hubpack` with `zerocopy` in several places. I didn't do that in this PR, but it's worth looking into in follow-up changes --- I just wanted to get everything building with the new API and felt that improving our usage of it would be better off done in smaller commits.

One other important thing to note is that updating the `gateway-messages` dependency increased stack depth in `control-plane-agent` from 6000B to 6136B for non-sidecar targets, so I bumped them up to 6256B. This, in turn, increases RAM to 65525B for `control-plane-agent`, which exceeds the `max-sizes` config; per @cbiffle's previous advice, I just deleted all the `max-sizes` entries for `control-plane-agent` tasks.

Furthermore, this branch requires the following changes to other crates to pick up the latest `zerocopy`:

- oxidecomputer/management-gateway-service#384
- oxidecomputer/idolatry#57
- oxidecomputer/humpty#8

Those ought to merge first so we can point our Git dependencies on those repos back at the `main` branch.
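As a concrete illustration of two of the changes listed above, the `new_zeroed`/`FromZeros` move and the `AsBytes` → `IntoBytes` rename (the `Frame` type is invented for the example):

```rust
use zerocopy::{FromBytes, FromZeros, Immutable, IntoBytes, KnownLayout};

#[derive(FromBytes, IntoBytes, Immutable, KnownLayout)]
#[repr(C)]
struct Frame {
    kind: u8,
    _pad: [u8; 3], // explicit padding so `IntoBytes` can be derived
    len: u32,
}

fn example() {
    // v0.8: `new_zeroed` comes from the `FromZeros` super-trait, which
    // the `FromBytes` derive also implements.
    let frame = Frame::new_zeroed();
    // v0.6 called this `AsBytes::as_bytes()`; v0.8 renames the trait to
    // `IntoBytes` and requires `Immutable` for `as_bytes()`.
    let _bytes: &[u8] = frame.as_bytes();
}
```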
oxidecomputer/management-gateway-service#370 adds code to the `gateway-messages` and `gateway-sp-comms` crates to implement the MGS side of the ereport ingestion protocol. For more information on the protocol itself, refer to the following RFDs:

- [RFD 520 Control Plane Fault Ingestion and Data Model][RFD 520]
- [RFD 544 Embedded E-Report Formats][RFD 544]
- [RFD 545 Firmware E-Report Aggregation and Evacuation][RFD 545]

This branch integrates the changes from those crates into the actual MGS application, as well as adding simulated ereports to the SP simulator. I've added some simple tests based on this.

In addition, this branch restructures the initial implementation of the control plane ereport API I added in #7833. That branch proposed a single dropshot API that would be implemented by both sled-agent and MGS. This was possible because the initial design would have indexed all ereport producers (reporters) by a UUID. However, per recent conversations with @cbiffle and @jgallagher, we've determined that Nexus will instead request ereports from service processors indexed by SP physical topology (e.g. type and slot), like the rest of the MGS HTTP API. Therefore, we can no longer have a single HTTP API for ereporters that's implemented by both MGS and sled-agents; instead, SP ereport ingestion should be a new endpoint on the MGS API. This branch does that, moving the ereport query params into `ereport-types`, eliminating the separate `ereport-api` and `ereport-client` crates, and adding an ereport-ingestion-by-SP-location endpoint to the management gateway API.

Furthermore, there are some terminology changes. The ereport protocol has a value which we've variously referred to as an "instance ID", a "generation ID", and a "restart nonce", all of which have unfortunate name collisions that are potentially confusing or just unpleasant. We've agreed to refer to this value everywhere as a "restart ID", so this commit also changes that.

[RFD 520]: https://rfd.shared.oxide.computer/rfd/0520
[RFD 544]: https://rfd.shared.oxide.computer/rfd/0544
[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545
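For orientation, here's a hypothetical sketch of what the ereport query parameters could look like; the field names and types are assumptions for illustration (the real definitions live in `ereport-types` and may differ), but the concepts (restart ID, start/committed ENAs, a limit) come from the protocol described above:

```rust
use serde::Deserialize;
use uuid::Uuid;

/// Hypothetical sketch of ereport query parameters, not the actual API.
#[derive(Deserialize)]
struct EreportQuery {
    /// The restart ID the requester believes the SP currently has; a
    /// mismatch tells the requester that the SP has restarted and its
    /// metadata must be refreshed.
    restart_id: Uuid,
    /// The first ENA the requester wants (ereports it hasn't yet seen).
    start_at: Option<u64>,
    /// The highest ENA the requester has durably recorded; the SP may
    /// discard ereports up to and including this one.
    committed: Option<u64>,
    /// Maximum number of ereports to return in one response.
    limit: u32,
}
```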
This branch adds a Nexus background task for ingesting ereports from service processors via MGS, using the MGS API endpoint added in #7903. These APIs in turn expose the MGS/SP ereport ingestion protocol added in oxidecomputer/management-gateway-service#370. For more information on the protocol itself, refer to the following RFDs:

- [RFD 520 Control Plane Fault Ingestion and Data Model][RFD 520]
- [RFD 544 Embedded E-Report Formats][RFD 544]
- [RFD 545 Firmware E-Report Aggregation and Evacuation][RFD 545]

In addition to the ereport ingester background task, this branch also adds database tables for storing ereports from SPs, which are necessary to implement the ingestion task. I've also added a table for storing ereports from the sled host OS, which will eventually be ingested via sled-agent. While there isn't currently anything that populates that table, I wanted to begin sketching out how we would represent the two categories of ereports we expect to deal with, and how we would query both tables for ereports.

Finally, this branch also adds OMDB commands for querying the ereports stored in the database. These OMDB commands may be useful both for debugging the ereport ingestion subsystem itself *and* for diagnosing issues once the SP firmware actually emits ereports. At present, the higher-level components of the fault-management subsystem, which will process ereports, diagnose faults, and generate alerts, have yet to be implemented. Therefore, the OMDB ereport commands serve as an interim solution for accessing the lower-level data, which may be useful for debugging such faults until the higher-level FMA components exist.

[RFD 520]: https://rfd.shared.oxide.computer/rfd/0520
[RFD 544]: https://rfd.shared.oxide.computer/rfd/0544
[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545

Co-authored-by: Sean Klein <[email protected]>
Things fail. Not finding out about it sucks. This branch implements the Hubris side of the ereport ingestion system, as described in [RFD 545]. Work on this was started by @cbiffle in #2002, which implemented the core ring-buffer data structure used to store ereports. Meanwhile, oxidecomputer/management-gateway-service#370, oxidecomputer/omicron#7803, and oxidecomputer/omicron#8296 added the MGS and Omicron components of this system. This branch picks up where Cliff left off, and "draws the rest of the owl" by implementing the aggregation of ereports in the `packrat` task using this data structure, and adding a new `snitch` task, which acts as a proxy to allow ereports stored by `packrat` to be read over the management network.

## Architecture

Ereports are stored by `packrat` because we would like as many tasks as possible to be able to report errors by making an IPC call to the task responsible for ereport storage. This means that the task aggregating ereports must be a high-priority task, so that as many other tasks as possible may be its clients. Additionally, we would like to include the system's VPD identity as metadata for ereports, and this data is already stored by packrat. Finally, we would like to minimize the likelihood of the task that stores ereports crashing, as this would result in data loss, and packrat is already expected not to crash. On the other hand, the task that actually evacuates these ereports over the management network must run at a priority lower than that of the `net` task, of which it is a client. Thus the separation of responsibilities between `packrat` and the `snitch`.

The snitch task is fairly simple. It receives packets sent to the ereport socket, interprets the request message, and forwards the request to packrat. Any ereports sent back by packrat are sent in response to the request. The snitch ends up being a pretty dumb, stateless proxy: since the response packet is encoded by packrat, all we end up doing is taking the bytes received from packrat and stuffing them into the socket's send queue. The real purpose of this thing is just to serve as a trampoline between the high priority level of packrat and a priority level lower than that of the net task.

## `snitch-core` Fixes

While testing behavior when the ereport buffer is full, I found a potential panic in the existing `snitch-core` code. Previously, every time ereports were read from the buffer while it was in the `Losing` state (i.e., ereports have been discarded because the buffer was full), `snitch-core` attempted to insert a new loss record at the end of the buffer (calling `recover_if_needed()`). This ensures that the data loss is reported to the reader ASAP. The problem is that this code assumed that there would always be space for an additional loss record, and panicked if it didn't fit. I added a test reproducing this panic in ff93754, and fixed it in 22044d1 by changing the calculation of whether recovery is possible.

When `recover_if_needed` is called while in the `Losing` state, we call the `free_space()` method to determine whether we can recover. In the `Losing` state, [this method would calculate the free space by subtracting the space required for the loss record][1] that must be encoded to transition out of the `Losing` state. However, in the case where `recover_if_needed()` is called with `required_space: None` (which indicates that we're not trying to recover because we want to insert a new record, but just because we want to report ongoing data loss to the caller), [we check that the free space is greater than or equal to 0][2]. This means that we would still try to insert a loss record even if the free space was 0, resulting in a panic. I've fixed this by moving the space needed for a loss record out of the `free_space()` calculation and into the _required_ space, adding it to the requested value (which is 0 in the "we are inserting the loss record to report loss" case). This way, we only insert the loss record if it fits, which is the correct behavior.

I've also changed the assignment of ENAs in `snitch-core` to start at 1, rather than 0, since ENA 0 is reserved in the wire protocol to indicate "no ENA". In the "committed ENA" request field this means "don't flush any ereports", and in the "start ENA" response field, ENA 0 means "no ereports in this packet". Thus, the ereport store must start assigning ENAs at ENA 1 for the initial loss record.

## Testing

Currently, no tasks actually produce ereports. To test that everything works correctly, it was necessary to add a source of ereports, so I've added [a little task][3] that just generates test ereports when asked via `hiffy`. I've included some of that in [this comment][4]. This was also used for testing the data-loss behavior discussed above.

[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545
[1]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L110-L121
[2]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L297-L300
[3]: https://github.com/oxidecomputer/hubris/blob/864fa57a7c34a6225deddcffa0c7d54c3063eab6/task/ereportulator/src/main.rs
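In spirit, the fix amounts to the following (the sizes and names are simplified stand-ins for the `snitch-core` internals, not the actual code):

```rust
// Simplified stand-ins; the loss-record size here is made up.
const LOSS_RECORD_SIZE: usize = 16;

struct Store {
    capacity: usize,
    used: usize,
}

impl Store {
    fn free_space(&self) -> usize {
        // After the fix, this no longer subtracts the loss record's size.
        self.capacity - self.used
    }

    /// Can we insert a loss record (plus `requested` extra bytes, when
    /// recovering in order to insert a new ereport)?
    fn can_recover(&self, requested: Option<usize>) -> bool {
        // The loss record itself is now part of the *required* space.
        // Previously its size was subtracted inside `free_space()`, and
        // the `requested: None` path checked `free >= 0`, which passed
        // even when the record wouldn't actually fit.
        self.free_space() >= LOSS_RECORD_SIZE + requested.unwrap_or(0)
    }
}
```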
This pull request implements the MGS side of the SP ereport ingestion
protocol. For more information on the ereport ingestion protocol, refer
to the following RFDs:

- [RFD 520 Control Plane Fault Ingestion and Data Model](https://rfd.shared.oxide.computer/rfd/0520)
- [RFD 544 Embedded E-Report Formats](https://rfd.shared.oxide.computer/rfd/0544)
- [RFD 545 Firmware E-Report Aggregation and Evacuation](https://rfd.shared.oxide.computer/rfd/0545)
In particular, this branch makes the following changes:

- Adds types to `gateway-messages` representing the ereport protocol wire
  messages exchanged between MGS and the SP; these are defined in RFD 545.
- Refactors the `shared_socket` module in `gateway-sp-comms`. Currently, the
  `SharedSocket` code for handling received packets is tightly coupled to the
  control plane agent message types. Ereport requests and responses are sent
  on a separate UDP port. Therefore, I've hacked up this code a bit to allow
  `SharedSocket` to be generic over a `RecvHandler` trait that defines how to
  handle received packets and dispatch them to single-SP handlers. This is
  implemented for both the control-plane-agent protocol and, separately, for
  the ereport protocol.
- Adds an `ereport` module to `gateway-sp-comms`, containing code for decoding
  ereport packets and a per-SP worker task that tracks the metadata sent by
  the SP and adds it to each batch of ereports.
A corresponding Omicron branch, oxidecomputer/omicron#7903, depends on
this branch and integrates the ereport code into the MGS app binary and
the SP simulator.