ereport storage data structure #2002
Conversation
license-eye has totally checked 551 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 550 | 1 | 0 | 0 |
Invalid file list:
- lib/snitch-core/src/lib.rs
lib/snitch-core/src/lib.rs (outdated)
```rust
    &mut self,
    first_ena: u64,
) -> impl Iterator<Item = Record<'_>> + '_ {
    // Attempt to recover and insert a loss record if necessary.
```
Why does this function also attempt to recover, versus iter_contents which doesn't mutate the storage?
I actually switched the behavior of iter_contents after writing tests against it; it turns out to be rather annoying to have it not attempt recovery.
```rust
        }

        self.earliest_ena += 1;
    }
```
Would it make sense to call recover_if_required here (or even inline its contents), instead of in insert_impl? This appears to be the only point where we become able to recover.
So, this at least needs more comments, because it took me a bit to figure out what I intended.
There is a reason why recovery is done at insert, instead of here, though I need to re-convince myself that it's a good reason. It's to try and prevent a series of loss records, in the case where
- we try to insert something big, and enter losing state with a count of 1.
- Something gets flushed, which opens up enough space for the loss record but not any useful-sized data. If we recovered during flush, we'd emit a loss record here.
- we try to insert something big again, re-entering losing state with a count of 1.
- this generates another loss record
- and so forth
By recovering at insert, and in particular by considering the inserted size plus the recovery requirement when deciding to recover, we can avoid generating sequences of loss records in this corner case.
Does that make sense?
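(As an illustration of that decision, here is a minimal sketch with made-up names; the actual snitch-core signatures differ. The point is that the loss record and the incoming record are sized together, so we don't leave the losing state only to immediately re-enter it.)

```rust
// Illustrative sketch only, not the real snitch-core API: at insert time,
// recovering out of the losing state requires room for the loss record
// *plus* the record being inserted, which avoids emitting a loss record
// only to drop the very next insert and start a new loss run.
fn has_room_for_insert(
    free: usize,
    loss_record_size: usize,
    record_size: usize,
    losing: bool,
) -> bool {
    let needed = if losing {
        // Must fit both the loss record and the new record to recover.
        loss_record_size + record_size
    } else {
        record_size
    };
    free >= needed
}
```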
I've added a trio of comments attempting to express this rationale in the source code for posterity.
This explanation + the new comments make sense to me!
I like the new comments, thanks for that!
```rust
// So it's weird, but, heapless::Deque only lets you pop individual
// bytes. Hopefully this is not too expensive.
```
looks like it does this...
Yeah, the question is what the code looks like when called in a loop, in practice, on our target. I haven't evaluated that yet. But it'll probably be fine.
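For reference, the byte-at-a-time shape under discussion looks roughly like this (a sketch against heapless::Deque's public `pop_front`, not the actual snitch-core code):

```rust
use heapless::Deque;

/// Copies up to `out.len()` bytes off the front of the deque, one byte at a
/// time, since heapless::Deque has no bulk pop. Returns how many bytes were
/// copied. (Illustrative sketch only.)
fn pop_bytes<const N: usize>(q: &mut Deque<u8, N>, out: &mut [u8]) -> usize {
    let mut copied = 0;
    for slot in out.iter_mut() {
        match q.pop_front() {
            Some(byte) => {
                *slot = byte;
                copied += 1;
            }
            None => break,
        }
    }
    copied
}
```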
Went ahead and marked this as non-draft since we're through like three rounds of reviews.
I took another look at this, focusing on the tests, and I feel quite good about it! It could be worthwhile to consider some kind of "generate a bunch of random operations" style property testing, but...honestly, we don't exactly have a whole bunch of important invariants to assert in such a test, so constructing a double of the expected behavior might not be worth the effort. You've certainly covered all the cases I can immediately anticipate being important in practice based on the rest of the ereport ingestion design. This looks great overall!
```rust
/// Arranges for the queue to contain: valid data; a loss record; more valid
/// data. This helps exercise recovery behavior.
#[test]
fn data_loss_sandwich() {
```
🥪
```rust
/// Tests flushing records that are already gone.
#[test]
fn old_enas_are_nops() {
```
Thanks for including this; this will be an important property in practice.
```rust
/// Ensures that we don't throw away the contents of the queue if someone
/// sends us a silly large number.
```
Nice of you to handle this!
This implements queueing of records with tracking of data loss, in a platform-independent fashion that lets it be tested easily.
Things fail. Not finding out about it sucks.

This branch implements the Hubris side of the ereport ingestion system, as described in [RFD 545]. Work on this was started by @cbiffle in #2002, which implemented the core ring-buffer data structure used to store ereports. Meanwhile, oxidecomputer/management-gateway-service#370, oxidecomputer/omicron#7803, and oxidecomputer/omicron#8296 added the MGS and Omicron components of this system. This branch picks up where Cliff left off, and "draws the rest of the owl" by implementing the aggregation of ereports in the `packrat` task using this data structure, and adding a new `snitch` task, which acts as a proxy to allow ereports stored by `packrat` to be read over the management network.

## Architecture

Ereports are stored by `packrat` because we would like as many tasks as possible to be able to report errors by making IPC calls to the task responsible for ereport storage. This means that the task aggregating ereports must be a high-priority task, so that as many other tasks as possible may be its clients. Additionally, we would like to include the system's VPD identity as metadata for ereports, and this data is already stored by `packrat`. Finally, we would like to minimize the likelihood of the task that stores ereports crashing, as this would result in data loss, and `packrat` is already expected not to crash. On the other hand, the task that actually evacuates these ereports over the management network must run at a priority lower than that of the `net` task, of which it is a client. Thus the separation of responsibilities between `packrat` and the `snitch`.

The snitch task is fairly simple. It receives packets sent to the ereport socket, interprets the request message, and forwards the request to `packrat`. Any ereports sent back by `packrat` are sent in response to the request. The snitch ends up being a pretty dumb, stateless proxy: since the response packet is encoded by `packrat`, all we end up doing is taking the bytes received from `packrat` and stuffing them into the socket's send queue. The real purpose of this thing is just to serve as a trampoline between the high priority level of `packrat` and a priority level lower than that of the `net` task.

## `snitch-core` Fixes

While testing behavior when the ereport buffer is full, I found a potential panic in the existing `snitch-core` code. Previously, every time ereports were read from the buffer while it was in the `Losing` state (i.e., ereports had been discarded because the buffer was full), `snitch-core` would attempt to insert a new loss record at the end of the buffer (by calling `recover_if_needed()`), ensuring that the data loss is reported to the reader as soon as possible. The problem is that this code assumed there would always be space for an additional loss record, and panicked if it didn't fit. I added a test reproducing this panic in ff93754, and fixed it in 22044d1 by changing the calculation of whether recovery is possible.

When `recover_if_needed` is called while in the `Losing` state, we call the `free_space()` method to determine whether we can recover. In the `Losing` state, [this method would calculate the free space by subtracting the space required for the loss record][1] that must be encoded to transition out of the `Losing` state. However, in the case where `recover_if_required()` is called with `required_space: None` (which indicates that we're not trying to recover because we want to insert a new record, but just because we want to report ongoing data loss to the caller), [we check that the free space is greater than or equal to 0][2]. This means that we would still try to insert a loss record even if the free space was 0, resulting in a panic. I've fixed this by moving the space required for a loss record out of the calculation of `free_space()` and into the _required_ space, in addition to the requested value (which is 0 in the "we are inserting the loss record to report loss" case). This way, we only insert the loss record if it fits, which is the correct behavior.

I've also changed the assignment of ENAs in `snitch-core` to start at 1, rather than 0, since ENA 0 is reserved in the wire protocol to indicate "no ENA". In the "committed ENA" request field this means "don't flush any ereports", and in the "start ENA" response field, ENA 0 means "no ereports in this packet". Thus, the ereport store must start assigning ENAs at ENA 1 for the initial loss record.

## Testing

Currently, no tasks actually produce ereports. To test that everything works correctly, it was necessary to add a source of ereports, so I've added [a little task][3] that just generates test ereports when asked via `hiffy`. I've included some of that in [this comment][4]. This was also used for testing the data-loss behavior discussed above.

[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545
[1]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L110-L121
[2]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L297-L300
[3]: https://github.com/oxidecomputer/hubris/blob/864fa57a7c34a6225deddcffa0c7d54c3063eab6/task/ereportulator/src/main.rs
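To summarize the two `snitch-core` changes in code, a minimal sketch with illustrative names (the real types and signatures in `snitch-core` differ):

```rust
// ENA 0 is reserved on the wire to mean "no ENA", so the store hands out
// ENAs starting at 1; the initial loss record therefore gets ENA 1.
// (Illustrative constant, not the actual snitch-core definition.)
const FIRST_ENA: u64 = 1;

// After the fix, the loss-record overhead counts toward the *required*
// space instead of being subtracted inside free_space(). `requested` is 0
// when we only want to emit a loss record to report ongoing loss, so a
// completely full buffer no longer passes the check and panics.
fn can_recover(free_space: usize, loss_record_size: usize, requested: usize) -> bool {
    free_space >= loss_record_size + requested
}
```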