packrat ereport storage and snitch implementation #2126
Merged
mkeeter reviewed Jul 11, 2025
mkeeter reviewed Jul 11, 2025
mkeeter reviewed Jul 11, 2025
mkeeter approved these changes Jul 11, 2025
hawkw added a commit to oxidecomputer/management-gateway-service that referenced this pull request Jul 11, 2025
Presently, the `ereport::Worker` struct [stores the metadata map in an `Option`][1]. Metadata refresh requests (`restart_id=0, start_ena=0, limit=0`) are [sent to the SP if the `Option` is `None`][2]. The option [is set to `Some`][3] if we receive a packet from the SP where the metadata map is non-empty, or if the restart ID mismatches the requested one. If I recall correctly, the `Option` was intended to distinguish between "we just started up" and "we received an explicit empty metadata map".

But I don't actually think we _should_ be distinguishing between those cases. When the SP has restarted and given us an empty metadata map, this may be because we requested ereports from `packrat` _before_ VPD has been loaded (as I discussed in oxidecomputer/hubris#2126 (comment)). In that case, when the SP sends us an empty metadata map, we want to keep requesting the metadata on every subsequent request, as it might be set later.

Thus, this commit just removes the `Option` and has it start out with an _empty_ map, and overwrites the existing map if the restart IDs are mismatched, *or* any time the current map is empty and the received one is non-empty. I've also added an additional test for this behavior.

Fixes #409

[1]: https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L79-L83
[2]: https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L109-L111
[3]: https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L351-L360
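A minimal sketch of that bookkeeping (the type and field names here are illustrative, not the actual `gateway-sp-comms` code): the metadata map is always present, and is only replaced on a restart-ID change or when a non-empty map finally arrives.

```rust
// Illustrative only; not the actual `ereport::Worker` fields.
use std::collections::HashMap;

type MetadataMap = HashMap<String, String>;

struct MetadataState {
    restart_id: u128,      // stand-in for the SP's restart ID
    metadata: MetadataMap, // starts out empty, never an `Option`
}

impl MetadataState {
    /// Merge a metadata map received from the SP.
    fn update(&mut self, received_restart_id: u128, received: MetadataMap) {
        if received_restart_id != self.restart_id {
            // The SP restarted: adopt whatever it sent, even if empty,
            // and keep asking until a non-empty map shows up.
            self.restart_id = received_restart_id;
            self.metadata = received;
        } else if self.metadata.is_empty() && !received.is_empty() {
            // Same incarnation, but metadata (e.g. VPD) has since become
            // available: adopt it now.
            self.metadata = received;
        }
    }

    /// The next request should ask for metadata whenever we have none.
    fn needs_metadata_refresh(&self) -> bool {
        self.metadata.is_empty()
    }
}
```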
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Jul 28, 2025
PR #8296 added the `sp_ereport_ingester` background task to Nexus for periodically collecting ereports from SPs via MGS. However, the Hubris PR adding the Hubris task that actually responds to these requests from the control plane, oxidecomputer/hubris#2126, won't make it in until after R17. This means that if we release R17 with a control plane that tries to collect ereports, and an SP firmware that doesn't know how to respond to such requests, the Nexus logs will be littered with 36 log lines like this every 30 seconds:

```
20:58:04.603Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): client response background_task = sp_ereport_ingester gateway_url = http://[fd00:1122:3344:108::2]:12225 result = Ok(Response { url: "http://[fd00:1122:3344:108::2]:12225/sp/sled/29/ereports?limit=255&restart_id=00000000-0000-0000-0000-000000000000", status: 503, headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"} })
20:58:04.603Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): ereport collection: unanticipated MGS request error: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" } background_task = sp_ereport_ingester committed_ena = None error = Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" } file = nexus/src/app/background/tasks/ereport_ingester.rs:380 gateway_addr = [fd00:1122:3344:108::2]:12225 restart_id = 00000000-0000-0000-0000-000000000000 (ereporter_restart) slot = 29 sp_type = sled start_ena = None
```

Similarly, MGS will also have a bunch of noisy complaints about these requests failing.

The consequences of this are really not terrible: it just means we'll be logging a lot of errors. But it seems mildly unfortunate to be constantly trying to do something that's invariably doomed to failure, and then yelling about how it didn't work. So, this commit adds a config flag for disabling the whole thing, which we can turn on for R17's production Nexus config and then turn back off when the Hubris changes make it in. I did this using a config setting, rather than hard-coding it to always be disabled, because there are also integration tests for this stuff, which would break if we disabled it everywhere.
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Jul 29, 2025
PR #8296 added the `sp_ereport_ingester` background task to Nexus for periodically collecting ereports from SPs via MGS. However, the Hubris PR adding the Hubris task that actually responds to these requests from the control plane, oxidecomputer/hubris#2126, won't make it in until after R16. This means that if we release R16 with a control plane that tries to collect ereports, and an SP firmware that doesn't know how to respond to such requests, the Nexus logs will be littered with 36 log lines like this every 30 seconds:

```
20:58:04.603Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): client response background_task = sp_ereport_ingester gateway_url = http://[fd00:1122:3344:108::2]:12225 result = Ok(Response { url: "http://[fd00:1122:3344:108::2]:12225/sp/sled/29/ereports?limit=255&restart_id=00000000-0000-0000-0000-000000000000", status: 503, headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"} })
20:58:04.603Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): ereport collection: unanticipated MGS request error: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" } background_task = sp_ereport_ingester committed_ena = None error = Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" } file = nexus/src/app/background/tasks/ereport_ingester.rs:380 gateway_addr = [fd00:1122:3344:108::2]:12225 restart_id = 00000000-0000-0000-0000-000000000000 (ereporter_restart) slot = 29 sp_type = sled start_ena = None
```

Similarly, MGS will also have a bunch of noisy complaints about these requests failing.

The consequences of this are really not terrible: it just means we'll be logging a lot of errors. But it seems mildly unfortunate to be constantly trying to do something that's invariably doomed to failure, and then yelling about how it didn't work. So, this commit adds a config flag for disabling the whole thing, which we can turn on for R16's production Nexus config and then turn back off when the Hubris changes make it in. I did this using a config setting, rather than hard-coding it to always be disabled, because there are also integration tests for this stuff, which would break if we disabled it everywhere.
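For illustration, such a flag could look something like the sketch below (the field names and defaults are hypothetical, not the actual Nexus config schema); the production config would then set it to false until the SP side ships, while the integration-test configs leave it on.

```rust
// Hypothetical field names; not the real Nexus background-task config.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct EreportIngesterConfig {
    /// How often the background task wakes up to poll MGS, in seconds.
    period_secs: u64,
    /// When false, the task still exists but skips collection entirely,
    /// so a release can ship with ingestion turned off.
    #[serde(default = "default_enabled")]
    enabled: bool,
}

fn default_enabled() -> bool {
    true
}
```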
We can afford to be expressive with keys in the metadata, since it doesn't compete with ereports for buffer space.
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Aug 13, 2025
For R16, the `sp_ereport_ingester` background task was disabled in the production Nexus config file (see #8709). This is because the corresponding Hubris code for evacuating ereports from the SP had not yet merged, resulting in Nexus yelling constantly about trying to collect ereports from SPs that weren't listening on the ereport port. Now, however, oxidecomputer/hubris#2126 has merged, and R16 has been cut, so we can turn this back on. This commit does that.
hawkw added a commit that referenced this pull request Aug 29, 2025
Based on [a suggestion][1] from @mkeeter, this branch adds an adapter that implements `minicbor::encode::write::Write` for a cursor into a writable lease. This lets us avoid double-buffering in the `read_ereports` path, where we would previously encode each ereport into the receive buffer, and then copy the contents of that into the lease if there is space for it. The new approach avoids the memcpy, and also doesn't encode the ereport in the case where there isn't space left for it.

I implemented the lease-writer adapter thing in a separate `minicbor-lease` crate, as I was hoping it would be useful elsewhere...but I'm not actually sure if it will be, now that I think of it: the IPC for delivering an ereport to packrat leases the data _from_ the caller, so callers will just encode into a normal buffer they own. Ah well.

[1]: #2126 (comment)
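A condensed sketch of the idea (the `RawWriter` trait below stands in for a Hubris writable lease, and none of these names are the real `minicbor-lease` API): a cursor adapter implements `minicbor::encode::write::Write` and refuses writes that would overflow, so an ereport that doesn't fit is never copied into the lease at all.

```rust
use minicbor::encode::write::Write;

/// Stand-in for a writable lease: a region we can copy bytes into at an offset.
trait RawWriter {
    fn capacity(&self) -> usize;
    fn write_at(&mut self, offset: usize, data: &[u8]) -> Result<(), ()>;
}

/// Error returned once the encoded output no longer fits in the region.
#[derive(Debug)]
struct EndOfRegion;

/// Cursor adapter: minicbor encodes straight into the region, no staging buffer.
struct CursorWriter<W> {
    inner: W,
    pos: usize,
}

impl<W: RawWriter> Write for CursorWriter<W> {
    type Error = EndOfRegion;

    fn write_all(&mut self, buf: &[u8]) -> Result<(), Self::Error> {
        // Fail (rather than truncate) if this chunk would overflow, so the
        // caller can bail out on an ereport that won't fit.
        if self.pos + buf.len() > self.inner.capacity() {
            return Err(EndOfRegion);
        }
        self.inner.write_at(self.pos, buf).map_err(|_| EndOfRegion)?;
        self.pos += buf.len();
        Ok(())
    }
}
```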
Things fail. Not finding out about it sucks.

This branch implements the Hubris side of the ereport ingestion system, as described in [RFD 545]. Work on this was started by @cbiffle in #2002, which implemented the core ring-buffer data structure used to store ereports. Meanwhile, oxidecomputer/management-gateway-service#370, oxidecomputer/omicron#7803, and oxidecomputer/omicron#8296 added the MGS and Omicron components of this system.

This branch picks up where Cliff left off, and "draws the rest of the owl" by implementing the aggregation of ereports in the `packrat` task using this data structure, and adding a new `snitch` task, which acts as a proxy to allow ereports stored by `packrat` to be read over the management network.

## Architecture

Ereports are stored by `packrat` because we would like as many tasks as possible to be able to report errors by making an IPC call to the task responsible for ereport storage. This means that the task aggregating ereports must be a high-priority task, so that as many other tasks as possible may be its clients. Additionally, we would like to include the system's VPD identity as metadata for ereports, and this data is already stored by packrat. Finally, we would like to minimize the likelihood of the task that stores ereports crashing, as this would result in data loss, and packrat is already expected not to crash.

On the other hand, the task that actually evacuates these ereports over the management network must run at a priority lower than that of the `net` task, of which it is a client. Thus the separation of responsibilities between `packrat` and the `snitch`. The snitch task is fairly simple. It receives packets sent to the ereport socket, interprets the request message, and forwards the request to packrat. Any ereports sent back by packrat are sent in response to the request. The snitch ends up being a pretty dumb, stateless proxy: since the response packet is encoded by packrat, all we end up doing is taking the bytes received from packrat and stuffing them into the socket's send queue. The real purpose of this thing is just to serve as a trampoline between the high priority level of packrat and a priority level lower than that of the net task.
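A schematic of that proxy loop (the `EreportSocket`/`EreportStore` traits and the size constants below are placeholders, not the real Hubris `net` or packrat IPC interfaces), just to make the "stateless trampoline" shape concrete:

```rust
/// Placeholder for the net-facing side of the snitch: receive one request
/// packet from the ereport socket, send one response packet back.
trait EreportSocket {
    type Peer;
    fn recv_request(&mut self, buf: &mut [u8]) -> (Self::Peer, usize);
    fn send_response(&mut self, peer: Self::Peer, data: &[u8]);
}

/// Placeholder for the packrat IPC: given the raw request bytes, packrat
/// encodes the entire response packet and returns its length.
trait EreportStore {
    fn read_ereports(&self, request: &[u8], response: &mut [u8]) -> Result<usize, ()>;
}

const REQUEST_MAX: usize = 128;
const RESPONSE_MAX: usize = 1024;

fn snitch_loop<S: EreportSocket, P: EreportStore>(net: &mut S, packrat: &P) -> ! {
    let mut rx = [0u8; REQUEST_MAX];
    let mut tx = [0u8; RESPONSE_MAX];
    loop {
        // 1. Wait for a request packet on the ereport socket.
        let (peer, req_len) = net.recv_request(&mut rx);
        // 2. Hand it to packrat, which encodes the whole response packet
        //    (header, metadata, ereports) directly into `tx`.
        if let Ok(rsp_len) = packrat.read_ereports(&rx[..req_len], &mut tx) {
            // 3. Relay packrat's bytes, unchanged, into the send queue.
            net.send_response(peer, &tx[..rsp_len]);
        }
        // Malformed requests are dropped; the snitch itself keeps no state.
    }
}
```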
## `snitch-core` Fixes

While testing behavior when the ereport buffer is full, I found a potential panic in the existing `snitch-core` code. Previously, every time ereports are read from the buffer while it is in the `Losing` state (i.e., ereports have been discarded because the buffer was full), `snitch-core` attempts to insert a new loss record at the end of the buffer (calling `recover_if_needed()`). This ensures that the data loss is reported to the reader ASAP. The problem is that this code assumed that there would always be space for an additional loss record, and panicked if it didn't fit. I added a test reproducing this panic in ff93754, and fixed it in 22044d1 by changing the calculation of whether recovery is possible.

When `recover_if_needed` is called while in the `Losing` state, we call the `free_space()` method to determine whether we can recover. In the `Losing` state, [this method would calculate the free space by subtracting the space required for the loss record][1] that must be encoded to transition out of the `Losing` state. However, in the case where `recover_if_required()` is called with `required_space: None` (which indicates that we're not trying to recover because we want to insert a new record, but just because we want to report ongoing data loss to the caller), [we check that the free space is greater than or equal to 0][2]. This means that we would still try to insert a loss record even if the free space was 0, resulting in a panic. I've fixed this by moving the check that there's space for a loss record out of the calculation of `free_space()` and into the _required_ space, in addition to the requested value (which is 0 in the "we are inserting the loss record to report loss" case). This way, we only insert the loss record if it fits, which is the correct behavior.
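In simplified form (the `Store` type below is a toy model, not the real `snitch-core` data structure), the fix amounts to accounting for the loss record on the required side of the comparison instead of inside `free_space()`:

```rust
/// Simplified model of the buffer's bookkeeping; not the real data structure.
struct Store {
    capacity: usize,
    used: usize,
    losing: bool, // true once ereports have been dropped for lack of space
}

impl Store {
    /// Raw free space; no longer pre-subtracts the loss record.
    fn free_space(&self) -> usize {
        self.capacity - self.used
    }

    /// Can we exit the `Losing` state and (optionally) append `requested`
    /// more bytes? `requested` is 0 when recovery is attempted only to
    /// report ongoing loss, which is exactly the case that used to slip
    /// past the old check and panic when the buffer was exactly full.
    fn can_recover(&self, requested: usize, loss_record_len: usize) -> bool {
        !self.losing || self.free_space() >= loss_record_len + requested
    }
}
```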
I've also changed the assignment of ENAs in `snitch-core` to start at 1, rather than 0, since ENA 0 is reserved in the wire protocol to indicate "no ENA". In the "committed ENA" request field this means "don't flush any ereports", and in the "start ENA" response field, ENA 0 means "no ereports in this packet". Thus, the ereport store must start assigning ENAs at ENA 1 for the initial loss record.
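As a tiny illustration of that convention (the counter type below is made up, not `snitch-core`'s actual ENA handling):

```rust
/// Made-up counter type; shows only the "start at 1" convention.
struct EnaCounter(u64);

impl EnaCounter {
    /// ENA 0 is reserved on the wire ("no ENA"), so the first record ever
    /// stored (the initial loss record) gets ENA 1.
    fn new() -> Self {
        EnaCounter(1)
    }

    fn next(&mut self) -> u64 {
        let ena = self.0;
        self.0 += 1;
        ena
    }
}
```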
## Testing

Currently, no tasks actually produce ereports. To test that everything works correctly, it was necessary to add a source of ereports, so I've added [a little task][3] that just generates test ereports when asked via `hiffy`. I've included some of that in [this comment][4]. This was also used for testing the data-loss behavior discussed above.

[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545
[1]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L110-L121
[2]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L297-L300
[3]: https://github.com/oxidecomputer/hubris/blob/864fa57a7c34a6225deddcffa0c7d54c3063eab6/task/ereportulator/src/main.rs