
Conversation


hawkw (Member) commented Jul 3, 2025

Things fail.

Not finding out about it sucks.

This branch implements the Hubris side of the ereport ingestion system,
as described in [RFD 545]. Work on this was started by @cbiffle in #2002,
which implemented the core ring-buffer data structure used to store
ereports. Meanwhile, oxidecomputer/management-gateway-service#370,
oxidecomputer/omicron#7803, and oxidecomputer/omicron#8296 added
the MGS and Omicron components of this system.

This branch picks up where Cliff left off, and "draws the rest of the
owl" by implementing the aggregation of ereports in the packrat task
using this data structure, and adding a new snitch task, which acts as
a proxy to allow ereports stored by packrat to be read over the
management network.

Architecture

Ereports are stored by packrat because we would like as many tasks as
possible to be able to report errors by making an IPC call to the task
responsible for ereport storage. This means that the task aggregating
ereports must be a high-priority task, so that as many other tasks as
possible may be its clients. Additionally, we would like to include the
system's VPD identity as metadata for ereports, and this data is already
stored by packrat. Finally, we would like to minimize the likelihood of
the task that stores ereports crashing, as this would result in data
loss, and packrat is already expected not to crash.

On the other hand, the task that actually evacuates these ereports over
the management network must run at a priority lower than that of the
net task, of which it is a client. Thus the separation of
responsibilities between packrat and the snitch. The snitch task is
fairly simple. It receives packets sent to the ereport socket,
interprets the request message, and forwards the request to packrat. Any
ereports sent back by packrat are sent in response to the request. The
snitch ends up being a pretty dumb, stateless proxy: since the response
packet is encoded by packrat, all we end up doing is taking the bytes
received from packrat and stuffing them into the socket's send queue.
The real purpose of this thing is just to serve as a trampoline between
the high priority level of packrat and a priority level lower than that
of the net task.
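
To make the shape of that concrete, here is a minimal sketch of the
proxy loop. All of the type and method names below are stand-ins of my
own invention; the real task goes through the Idol-generated clients
for the net task and packrat, and does some request validation that is
elided here.

```rust
/// Stand-in for the net task's ereport socket (illustrative only).
struct EreportSocket;
/// Stand-in for the packrat IPC client (illustrative only).
struct Packrat;
/// Opaque "who sent this" token so the reply can be routed back.
#[derive(Copy, Clone)]
struct Sender(u32);

impl EreportSocket {
    /// Block until a request packet arrives; fill `buf`, return (sender, len).
    fn recv(&self, _buf: &mut [u8]) -> (Sender, usize) {
        unimplemented!("provided by the net task in the real system")
    }
    /// Queue `pkt` on the socket's send queue, addressed back to `to`.
    fn send(&self, _to: Sender, _pkt: &[u8]) {
        unimplemented!()
    }
}

impl Packrat {
    /// Forward the request; packrat encodes the response packet itself and
    /// returns how many bytes of `response` it filled in.
    fn read_ereports(&self, _request: &[u8], _response: &mut [u8]) -> usize {
        unimplemented!("IPC call in the real system")
    }
}

fn snitch_main(socket: &EreportSocket, packrat: &Packrat) -> ! {
    let mut rx = [0u8; 1024];
    let mut tx = [0u8; 1024];
    loop {
        // 1. Receive a request from the ereport socket.
        let (sender, len) = socket.recv(&mut rx);
        // 2. Hand it to packrat, which encodes the response packet.
        let resp_len = packrat.read_ereports(&rx[..len], &mut tx);
        // 3. Stuff packrat's bytes into the send queue, unmodified.
        socket.send(sender, &tx[..resp_len]);
    }
}

fn main() {
    // The real task never returns; calling the sketch here would just hit
    // the unimplemented!() stubs, so we only reference it.
    let _ = snitch_main as fn(&EreportSocket, &Packrat) -> !;
}
```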

snitch-core Fixes

While testing behavior when the ereport buffer is full, I found a
potential panic in the existing snitch-core code. Previously, every
time ereports were read from the buffer while it was in the Losing state
(i.e., ereports had been discarded because the buffer was full),
snitch-core attempted to insert a new loss record at the end of the
buffer (by calling recover_if_needed()). This ensures that the data loss
is reported to the reader ASAP. The problem was that this code assumed
that there would always be space for an additional loss record, and
panicked if it didn't fit. I added a test reproducing this panic in
ff93754, and fixed it in
22044d1 by changing the calculation of
whether recovery is possible.

When recover_if_needed() is called while in the Losing state, we call
the free_space() method to determine whether we can recover. In the
Losing state, [this method would calculate the free space by
subtracting the space required for the loss record][1] that must be
encoded to transition out of the Losing state. However, in the case
where recover_if_needed() is called with required_space: None
(which indicates that we're not trying to recover because we want to
insert a new record, but just because we want to report ongoing data
loss to the caller), [we checked that the free space was greater than or
equal to 0][2]. This means that we would still try to insert a loss
record even if the free space was 0, resulting in a panic. I've fixed
this by moving the space needed for a loss record out of the
free_space() calculation and into the required space, adding it to the
requested value (which is 0 in the "we are inserting the loss record to
report loss" case). This way, we only insert the loss record if it
fits, which is the correct behavior.
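
A toy model of that change, with a made-up fixed loss-record size (the
real snitch-core computes these sizes from the actual ring-buffer
layout):

```rust
/// Illustrative size only; snitch-core derives this from its encoding.
const LOSS_RECORD_LEN: usize = 16;

/// Decide whether we can recover out of the Losing state. The loss-record
/// overhead now lives on the *required* side of the comparison, rather than
/// being subtracted inside free_space(), so a request with
/// `required_space: None` (a requested payload of 0 bytes) no longer passes
/// when the buffer is completely full.
fn can_recover(free_space: usize, required_space: Option<usize>) -> bool {
    let requested = required_space.unwrap_or(0);
    free_space >= requested + LOSS_RECORD_LEN
}

fn main() {
    // Buffer completely full: previously this still attempted the insert
    // and panicked; now we simply keep reporting loss until space frees up.
    assert!(!can_recover(0, None));
    // Exactly enough room for the loss record alone.
    assert!(can_recover(LOSS_RECORD_LEN, None));
    // Recovering in order to also insert a 32-byte ereport.
    assert!(can_recover(LOSS_RECORD_LEN + 32, Some(32)));
    assert!(!can_recover(LOSS_RECORD_LEN + 31, Some(32)));
}
```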

I've also changed the assignment of ENAs in snitch-core to start at 1,
rather than 0, since ENA 0 is reserved in the wire protocol to indicate
"no ENA". In the "committed ENA" request field this means "don't flush
any ereports", and in the "start ENA" response field, ENA 0 means "no
ereports in this packet". Thus, the ereport store must start assigning
ENAs at ENA 1 for the initial loss record.
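
As a small illustration of that reservation (the names here are mine,
not snitch-core's):

```rust
/// ENA 0 is reserved on the wire: as a request's "committed ENA" it means
/// "don't flush any ereports", and as a response's "start ENA" it means
/// "no ereports in this packet".
const ENA_NONE: u64 = 0;
/// The first ENA the store hands out (used for the initial loss record).
const ENA_FIRST: u64 = 1;

struct EnaAllocator {
    next: u64,
}

impl EnaAllocator {
    fn new() -> Self {
        Self { next: ENA_FIRST }
    }

    fn allocate(&mut self) -> u64 {
        let ena = self.next;
        self.next += 1;
        debug_assert_ne!(ena, ENA_NONE, "must never hand out the reserved ENA");
        ena
    }
}

fn main() {
    let mut enas = EnaAllocator::new();
    assert_eq!(enas.allocate(), 1); // the initial loss record gets ENA 1
    assert_eq!(enas.allocate(), 2);
}
```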

Testing

Currently, no tasks actually produce ereports. To test that everything
works correctly, it was necessary to add a source of ereports, so I've
added [a little task][3] that just generates test ereports when asked
via hiffy. I've included some of that testing in [this comment][4]. This
task was also used for testing the data-loss behavior discussed above.

[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545
[1]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L110-L121
[2]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L297-L300
[3]: https://github.com/oxidecomputer/hubris/blob/864fa57a7c34a6225deddcffa0c7d54c3063eab6/task/ereportulator/src/main.rs

hawkw force-pushed the eliza/snitch-again branch from e6ba297 to adaa20d on July 3, 2025 at 20:53
hawkw added a commit to oxidecomputer/management-gateway-service that referenced this pull request Jul 11, 2025
Presently, the `ereport::Worker` struct [stores the metadata map in an
`Option`][1]. Metadata refresh requests (`restart_id=0, start_ena=0,
limit=0`) are [sent to the SP if the `Option` is `None`][2]. The option
[is set to `Some`][3] if we receive a packet from the SP where the
metadata map is non-empty, or if the restart ID mismatches the requested
one.

If I recall correctly, the `Option` was intended to distinguish between
"we just started up" and "we received an explicit empty metadata map".
But, I don't actually think we _should_ be distinguishing between those
cases. When the SP has restarted and given us an empty metadata map,
this may be because we requested ereports from `packrat` _before_ VPD
has been loaded (as I discussed in
oxidecomputer/hubris#2126 (comment)).
In that case, when the SP sends us an empty metadata map, we want to
keep requesting the metadata on every subsequent request, as it might be
set later.

Thus, this commit just removes the `Option` and has it start out with an
_empty_ map, and overwrites the existing map if the restart IDs are
mismatched, *or* any time the current map is empty and the received one
is non-empty. I've also added an additional test for this behavior.

Fixes #409

[1]:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L79-L83
[2]:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L109-L111
[3]:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L351-L360
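
In rough Rust, the refresh rule described above amounts to something
like the following sketch (the field and type names are illustrative,
not gateway-sp-comms' actual ones):

```rust
use std::collections::BTreeMap;

type MetadataMap = BTreeMap<String, String>;

struct Worker {
    restart_id: u128,      // the SP restart ID we last saw
    metadata: MetadataMap, // starts out empty rather than `None`
}

impl Worker {
    /// Metadata is (re)requested on every poll while the map is still empty,
    /// so an SP that answered before VPD was loaded gets asked again later.
    fn needs_metadata(&self) -> bool {
        self.metadata.is_empty()
    }

    fn handle_packet(&mut self, restart_id: u128, metadata: MetadataMap) {
        if restart_id != self.restart_id {
            // The SP restarted: adopt whatever it sent, even an empty map.
            self.restart_id = restart_id;
            self.metadata = metadata;
        } else if self.metadata.is_empty() && !metadata.is_empty() {
            // Same incarnation, but metadata has since become available.
            self.metadata = metadata;
        }
    }
}

fn main() {
    let mut w = Worker { restart_id: 1, metadata: MetadataMap::new() };
    // SP answers before VPD is loaded: the map stays empty, so we keep asking.
    w.handle_packet(1, MetadataMap::new());
    assert!(w.needs_metadata());
    // Later, the same incarnation reports its metadata: now we stop asking.
    let mut md = MetadataMap::new();
    md.insert("baseboard_serial".into(), "BRM123".into());
    w.handle_packet(1, md);
    assert!(!w.needs_metadata());
}
```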
hawkw added the ⚠️ ereport label ("if you see something, say something!") on Jul 16, 2025
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Jul 29, 2025
PR #8296 added the `sp_ereport_ingester` background task to Nexus for
periodically collecting ereports from SPs via MGS. However, the Hubris
PR adding the Hubris task that actually responds to these requests from
the control plane, oxidecomputer/hubris#2126, won't make it in until
after R16. This means that if we release R16 with a control plane that
tries to collect ereports, and an SP firmware that doesn't know how to
respond to such requests, the Nexus logs will be littered with 36 log
lines like this every 30 seconds:

```
20:58:04.603Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): client response
    background_task = sp_ereport_ingester
    gateway_url = http://[fd00:1122:3344:108::2]:12225
    result = Ok(Response { url: "http://[fd00:1122:3344:108::2]:12225/sp/sled/29/ereports?limit=255&restart_id=00000000-0000-0000-0000-000000000000", status: 503, headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"} })
20:58:04.603Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): ereport collection: unanticipated MGS request error: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" }
    background_task = sp_ereport_ingester
    committed_ena = None
    error = Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" }
    file = nexus/src/app/background/tasks/ereport_ingester.rs:380
    gateway_addr = [fd00:1122:3344:108::2]:12225
    restart_id = 00000000-0000-0000-0000-000000000000 (ereporter_restart)
    slot = 29
    sp_type = sled
    start_ena = None
```

Similarly, MGS will also have a bunch of noisy complaints about these
requests failing.

The consequences of this are really not terrible: it just means we'll be
logging a lot of errors. But it seems mildly unfortunate to be
constantly trying to do something that's invariably doomed to failure,
and then yelling about how it didn't work. So, this commit adds a config
flag for disabling the whole thing, which we can turn on for R16's
production Nexus config and then turn back off when the Hubris changes
make it in. I did this with a config setting, rather than hard-coding
it to always be disabled, because there are also integration tests for
this stuff, which would break if we disabled it everywhere.
hawkw added 3 commits August 8, 2025 13:26
we can afford to be expressive with keys in the metadata since it
doesn't compete with ereports for buffer space.
hawkw enabled auto-merge (squash) on August 13, 2025 at 16:31
hawkw merged commit 10301be into master on Aug 13, 2025
135 checks passed
hawkw deleted the eliza/snitch-again branch on August 13, 2025 at 16:39
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Aug 13, 2025
For R16, the `sp_ereport_ingester` background task was disabled in the
production Nexus config file (see #8709). This is because the
corresponding Hubris code for evacuating ereports from the SP had not
yet merged, resulting in Nexus yelling constantly about trying to
collect ereports from SPs that weren't listening on the ereport port.
Now, however, oxidecomputer/hubris#2126 has merged, and R16 has been
cut, so we can turn this back on. This commit does that.
hawkw added a commit that referenced this pull request Aug 29, 2025
Based on [a suggestion][1] from @mkeeter, this branch adds an adapter
that implements `minicbor::encode::write::Write` for a cursor into a
writable lease. This lets us avoid double-buffering in the
`read_ereports` path, where we would previously encode each ereport into
the receive buffer, and then copy the contents of that into the lease if
there is space for it. The new approach avoids the memcpy, and also
doesn't encode the ereport in the case where there isn't space left for
it.

I implemented the lease-writer adapter thing in a separate
`minicbor-lease` crate, as I was hoping it would be useful
elsewhere...but I'm not actually sure if it will be, now that I think of
it: the IPC for delivering an ereport to packrat leases the data _from_
the caller, so callers will just encode into a normal buffer they own.
Ah well.

[1]:
#2126 (comment)
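
The rough shape of the adapter, sketched with a local stand-in for the
minicbor `Write` trait and a hypothetical lease handle (the real crate
implements the trait from `minicbor` itself and goes through Hubris'
lease APIs rather than a `Vec`):

```rust
/// Local stand-in with the same single-method shape as
/// `minicbor::encode::write::Write`, so this sketch is self-contained.
trait CborWrite {
    type Error;
    fn write_all(&mut self, buf: &[u8]) -> Result<(), Self::Error>;
}

/// Error returned when the lease has no room left for the next write.
#[derive(Debug)]
struct OutOfSpace;

/// Hypothetical writable-lease handle; modeled here as a fixed-size buffer.
struct WritableLease {
    data: Vec<u8>,
}

impl WritableLease {
    fn write_at(&mut self, offset: usize, bytes: &[u8]) -> Result<(), OutOfSpace> {
        let end = offset.checked_add(bytes.len()).ok_or(OutOfSpace)?;
        if end > self.data.len() {
            return Err(OutOfSpace);
        }
        self.data[offset..end].copy_from_slice(bytes);
        Ok(())
    }
}

/// Cursor into the lease: the CBOR encoder writes straight through it, so
/// there is no intermediate buffer and no memcpy afterwards, and running
/// out of space surfaces as an encode error instead of being discovered
/// only after the ereport has already been encoded.
struct LeaseCursor<'a> {
    lease: &'a mut WritableLease,
    pos: usize,
}

impl CborWrite for LeaseCursor<'_> {
    type Error = OutOfSpace;

    fn write_all(&mut self, buf: &[u8]) -> Result<(), OutOfSpace> {
        self.lease.write_at(self.pos, buf)?;
        self.pos += buf.len();
        Ok(())
    }
}

fn main() {
    let mut lease = WritableLease { data: vec![0; 8] };
    let mut cur = LeaseCursor { lease: &mut lease, pos: 0 };
    assert!(cur.write_all(&[0xa1, 0x01, 0x02]).is_ok()); // fits
    assert!(cur.write_all(&[0u8; 16]).is_err());         // doesn't fit
}
```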