
Conversation

@cbiffle
Collaborator

@cbiffle cbiffle commented Feb 5, 2025

No description provided.

@cbiffle cbiffle requested a review from hawkw February 5, 2025 22:20
Contributor

@github-actions github-actions bot left a comment

license-eye has totally checked 551 files.

| Valid | Invalid | Ignored | Fixed |
|-------|---------|---------|-------|
| 550   | 1       | 0       | 0     |

Invalid file list:
  • lib/snitch-core/src/lib.rs

&mut self,
first_ena: u64,
) -> impl Iterator<Item = Record<'_>> + '_ {
// Attempt to recover and insert a loss record if necessary.
Collaborator

@mkeeter mkeeter Feb 6, 2025

Why does this function also attempt to recover, versus iter_contents which doesn't mutate the storage?

Collaborator Author

I actually switched the behavior of iter_contents after writing tests against it; it turns out to be rather annoying to have it not attempt recovery.

}

self.earliest_ena += 1;
}
Collaborator

Would it make sense to call recover_if_required here (or even inline its contents), instead of in insert_impl? This appears to be the only point where we become able to recover.

Collaborator Author

So, this at least needs more comments, because it took me a bit to figure out what I intended.

There is a reason why recovery is done at insert, instead of here, though I need to re-convince myself that it's a good reason. It's to try and prevent a series of loss records, in the case where

  • we try to insert something big, and enter losing state with a count of 1.
  • Something gets flushed, which opens up enough space for the loss record but not any useful-sized data. If we recovered during flush, we'd emit a loss record here.
  • we try to insert something big again, re-entering losing state with a count of 1.
  • this generates another loss record
  • and so forth

By recovering at insert, and in particular by considering the inserted size plus the recovery requirement when deciding to recover, we can avoid generating sequences of loss records in this corner case.
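
Concretely, the decision looks roughly like this sketch (made-up names and a placeholder size, not the actual snitch-core code): recover out of the losing state only when both the loss record and the record we're about to insert will fit, so a too-big insert can't immediately push us back into losing.

```rust
/// Hypothetical illustration of the insert-time recovery decision; the
/// names and the constant are placeholders, not the real snitch-core API.
const LOSS_RECORD_LEN: usize = 8; // placeholder encoded size of a loss record

fn should_recover_on_insert(free_space: usize, incoming_len: usize) -> bool {
    // Only exit the losing state if there's room for the loss record
    // *and* the record being inserted; otherwise we'd emit a loss
    // record and immediately start losing again.
    free_space >= LOSS_RECORD_LEN + incoming_len
}
```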

Does that make sense?

Collaborator Author

I've added a trio of comments attempting to express this rationale in the source code for posterity.

Collaborator

This explanation + the new comments make sense to me!

Member

I like the new comments, thanks for that!

Comment on lines +193 to +215
// So it's weird, but, heapless::Deque only lets you pop individual
// bytes. Hopefully this is not too expensive.
Member

looks like it does this...

Collaborator Author

Yeah, the question is what the code looks like when called in a loop, in practice, on our target. I haven't evaluated that yet. But it'll probably be fine.
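
For context, the loop in question is roughly this shape (a standalone sketch against heapless::Deque, not the code under review; buffer sizes and the return convention are illustrative only):

```rust
use heapless::Deque;

/// Copies one record of `len` bytes out of the queue by popping bytes
/// one at a time, since heapless::Deque has no bulk-pop operation.
fn pop_record<const N: usize>(queue: &mut Deque<u8, N>, len: usize, out: &mut [u8]) -> usize {
    let mut copied = 0;
    while copied < len {
        match queue.pop_front() {
            Some(byte) => {
                out[copied] = byte;
                copied += 1;
            }
            // The queue ran dry early; return what we got.
            None => break,
        }
    }
    copied
}
```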

@cbiffle cbiffle changed the title from "draft PR for snitch impl" to "ereport storage data structure" Feb 13, 2025
@cbiffle cbiffle marked this pull request as ready for review February 13, 2025 21:09
@cbiffle
Collaborator Author

cbiffle commented Feb 13, 2025

Went ahead and marked this as non-draft since we're through like three rounds of reviews.

Member

@hawkw hawkw left a comment

I took another look at this, focusing on the tests, and I feel quite good about it! It could be worthwhile to consider some kind of "generate a bunch of random operations" style property testing, but...honestly, we don't exactly have a whole bunch of important invariants to assert in such a test, so constructing a double of the expected behavior might not be worth the effort. You've certainly covered all the cases I can immediately anticipate being important in practice based on the rest of the ereport ingestion design. This looks great overall!

/// Arranges for the queue to contain: valid data; a loss record; more valid
/// data. This helps exercise recovery behavior.
#[test]
fn data_loss_sandwich() {
Member

🥪


/// Tests flushing records that are already gone.
#[test]
fn old_enas_are_nops() {
Member

Thanks for including this; this will be an important property in practice.

Comment on lines +729 to +735
/// Ensures that we don't throw away the contents of the queue if someone
/// sends us a silly large number.
Member

Nice of you to handle this!

This implements queueing of records with tracking of data loss, in a
platform-independent fashion that lets it be tested easily.
@cbiffle cbiffle enabled auto-merge (rebase) February 14, 2025 19:15
@cbiffle cbiffle merged commit a2eda0c into master Feb 14, 2025
125 checks passed
@cbiffle cbiffle deleted the cbiffle/snitch2 branch February 14, 2025 19:20
hawkw added a commit that referenced this pull request Aug 13, 2025
Things fail.

Not finding out about it sucks.

This branch implements the Hubris side of the ereport ingestion system,
as described in [RFD 545]. Work on this was started by @cbiffle in #2002,
which implemented the core ring-buffer data structure used to store
ereports. Meanwhile, oxidecomputer/management-gateway-service#370, 
oxidecomputer/omicron#7803, and oxidecomputer/omicron#8296 added
the MGS and Omicron components of this system.

This branch picks up where Cliff left off, and "draws the rest of the
owl" by implementing the aggregation of ereports in the `packrat` task
using this data structure, and adding a new `snitch` task, which acts as
a proxy to allow ereports stored by `packrat` to be read over the
management network.

## Architecture

Ereports are stored by `packrat` because we would like as many tasks as
possible to be able to report errors by making an IPC call to the task
responsible for ereport storage. This means that the task aggregating
ereports must be a high-priority task, so that as many other tasks as
possible may be its clients. Additionally, we would like to include the
system's VPD identity as metadata for ereports, and this data is already
stored by packrat. Finally, we would like to minimize the likelihood of
the task that stores ereports crashing, as this would result in data
loss, and packrat is already expected not to crash.

On the other hand, the task that actually evacuates these ereports over
the management network must run at a priority lower than that of the
`net` task, of which it is a client. Thus the separation of
responsibilities between `packrat` and the `snitch`. The snitch task is
fairly simple. It receives packets sent to the ereport socket,
interprets the request message, and forwards the request to packrat. Any
ereports sent back by packrat are sent in response to the request. The
snitch ends up being a pretty dumb, stateless proxy: since the response
packet is encoded by packrat, all we end up doing is taking the bytes
received from packrat and stuffing them into the socket's send queue.
The real purpose of this thing is just to serve as a trampoline between
the high priority level of packrat and a priority level lower than that
of the net task.
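
In other words, the snitch's request handling is roughly this shape (an illustrative sketch only; `recv`, `ask_packrat`, and `send` are hypothetical stand-ins for the real net-socket and packrat IPC interfaces, which aren't shown here):

```rust
/// Illustrative proxy step, not the actual snitch task code.
fn proxy_one_request(
    recv: impl FnOnce(&mut [u8]) -> usize,
    ask_packrat: impl FnOnce(&[u8], &mut [u8]) -> usize,
    send: impl FnOnce(&[u8]),
) {
    let mut rx = [0u8; 1024];
    let mut tx = [0u8; 1024];
    // Receive a request packet from the ereport socket...
    let req_len = recv(&mut rx);
    // ...have packrat interpret it and encode the response packet...
    let resp_len = ask_packrat(&rx[..req_len], &mut tx);
    // ...and relay packrat's bytes straight into the socket's send queue.
    send(&tx[..resp_len]);
}
```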

## `snitch-core` Fixes

While testing behavior when the ereport buffer is full, I found a
potential panic in the existing `snitch-core` code. Whenever ereports
are read from the buffer while it is in the `Losing` state
(i.e., ereports have been discarded because the buffer was full),
`snitch-core` attempts to insert a new loss record at the end of the
buffer (by calling `recover_if_needed()`). This ensures that the data loss
is reported to the reader ASAP. The problem is that this code assumed
that there would always be space for an additional loss record, and
panicked if it didn't fit. I added a test reproducing this panic in
ff93754, and fixed it in
22044d1 by changing the calculation of
whether recovery is possible.

When `recover_if_needed` is called while in the `Losing` state, we call
the `free_space()` method to determine whether we can recover. In the
`Losing` state, [this method would calculate the free space by
subtracting the space required for the loss record][1] that must be
encoded to transition out of the `Losing` state. However, in the case
where `recover_if_required()` is called with `required_space: None`
(which indicates that we're not trying to recover because we want to
insert a new record, but just because we want to report ongoing data
loss to the caller), [we check that the free space is greater than or
equal to 0][2]. This means that we would still try to insert a loss
record even if the free space was 0, resulting in a panic. I've fixed
this by moving the space needed for a loss record out of the
`free_space()` calculation and into the _required_ space, adding it on
top of the requested value (which is 0 in the "we are inserting the
loss record to report loss" case). This way, we only insert the loss
record if it fits, which is the correct behavior.
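
As a rough before/after sketch of that change (placeholder names and sizes, not the actual snitch-core code):

```rust
const LOSS_RECORD_LEN: usize = 8; // placeholder encoded size of a loss record

// Before (simplified): free space in the Losing state already had the
// loss record subtracted, so a `requested` of 0 passed the check even
// when the loss record itself would not fit.
fn can_recover_before(raw_free: usize, requested: usize) -> bool {
    raw_free.saturating_sub(LOSS_RECORD_LEN) >= requested
}

// After (simplified): free space is reported as raw free bytes, and the
// loss record is counted as part of the required space instead, so
// recovery only happens when the loss record actually fits.
fn can_recover_after(raw_free: usize, requested: usize) -> bool {
    raw_free >= LOSS_RECORD_LEN + requested
}
```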

I've also changed the assignment of ENAs in `snitch-core` to start at 1,
rather than 0, since ENA 0 is reserved in the wire protocol to indicate
"no ENA". In the "committed ENA" request field this means "don't flush
any ereports", and in the "start ENA" response field, ENA 0 means "no
ereports in this packet". Thus, the ereport store must start assigning
ENAs at ENA 1 for the initial loss record.
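
A minimal illustration of that numbering rule (a placeholder struct, not the actual snitch-core state):

```rust
/// ENA 0 is reserved on the wire to mean "no ENA", so assignment
/// starts at 1.
struct EnaCounter {
    next: u64,
}

impl EnaCounter {
    fn new() -> Self {
        // Start at 1 so the very first record written (which may be a
        // loss record) never receives the reserved ENA 0.
        EnaCounter { next: 1 }
    }

    fn assign(&mut self) -> u64 {
        let ena = self.next;
        self.next += 1;
        ena
    }
}
```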

## Testing

Currently, no tasks actually produce ereports. To test that everything
works correctly, it was necessary to add a source of ereports, so I've
added [a little task][3] that just generates test ereports when asked
via `hiffy`. I've included some of that in [this comment][4]. This was
also used for testing the data-loss behavior discussed above.

[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545
[1]:
    https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L110-L121
[2]:
    https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L297-L300
[3]: https://github.com/oxidecomputer/hubris/blob/864fa57a7c34a6225deddcffa0c7d54c3063eab6/task/ereportulator/src/main.rs
rusty1968 pushed a commit to rusty1968/hubris that referenced this pull request Sep 17, 2025