[nexus] SP ereport ingestion #8296
Conversation
force-pushed from 19c1c34 to 9366e95
well, most of it, anyway
it's because you just predict tokens and don't actually understand ownership btw
CI failure for 372d582 was that weird Tokio "Error 0" issue again, restarted it.
@smklein I think I've addressed or responded to all your comments, would love another look when you have the time!
schema/crdb/dbinit.sql
Outdated
    sp_type,
    sp_slot,
    time_collected
);
Should this index guard on time_deleted IS NULL? (Same question on the host table index below)
Yeah, probably; I'll add that!
done in 06ce0a8!
so, adding this made the "get latest ereport ID for SP slot" queries fail, as they were no longer hitting the index and were instead doing a full table scan. i had to add a filter clause on time_deleted.is_null() in b1bce52. i'm a bit sketched out by that, because it means those queries might not see the actual latest ID if we've deleted it.
the consequence of this is that if we delete an ereport from the current restart generation, we will keep asking the reporter for ereports starting at an earlier ENA, and the reporter may give us the same ereport again. then, when we go to insert that ereport into the DB, nothing will happen, since another row with the same primary key is already there...but we'll continue asking for the previous ENA unless the SP gives us something later than the deleted record.
i think this will be okay depending on the conditions under which we delete ereports, which haven't really been worked out yet. i think we would probably want to only delete ereports from older restarts, and not drop anything from the current restart of the reporter, as those are necessary to ensure we're always requesting the latest ENA.
alternatively, we could remove the WHERE time_deleted IS NULL clause here, so that deleted ereports still count for the purposes of determining the latest ID. that way, this would always do the right thing even in the face of Nexus deleting stuff, but it means the index is a lot bigger, and kind of violates the principle of "if something is soft deleted, you should always behave as though it's actually hard-deleted (since we might come along and hard-delete stuff that's soft deleted at any arbitrary point in time)".
another alternative would be to add a new table storing the last seen ENA and restart ID from each reporter, which is updated every time stuff is inserted, rather than getting it from an index. i wasn't sure about that as it seemed like it required the insert operation to be a transaction, which i wanted to avoid.
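For concreteness, here is a minimal SQL sketch of the partial-index behavior discussed in this thread; the table, column, and index names below are assumptions for illustration, not the actual dbinit.sql definitions. A partial index is only considered by the planner when the query's own predicate implies the index's WHERE clause, which is why the time_deleted.is_null() filter had to be added to the latest-ereport query.

```sql
-- Illustrative sketch; names are assumed, not taken from dbinit.sql.
CREATE INDEX IF NOT EXISTS lookup_sp_ereports_by_slot
    ON sp_ereport (sp_type, sp_slot, time_collected)
    WHERE time_deleted IS NULL;

-- To use the partial index (rather than doing a full table scan), the
-- "latest ereport for this SP slot" query must repeat the predicate:
SELECT restart_id, ena
    FROM sp_ereport
    WHERE sp_type = 'sled'
      AND sp_slot = 9
      AND time_deleted IS NULL
    ORDER BY time_collected DESC
    LIMIT 1;
```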
another alternative would be to add a new table storing the last seen ENA and restart ID from each reporter, which is updated every time stuff is inserted, rather than getting it from an index. i wasn't sure about that as it seemed like it required the insert operation to be a transaction, which i wanted to avoid.
hmm, i suppose another approach to this would be to use a CRDB trigger to update a table of "latest (restart ID, ena) tuples from each reporter" every time ereports are inserted. i'm not totally sure what the tradeoffs of this are --- i haven't noticed any use of triggers elsewhere in Nexus (though I may have overlooked something), but i'm not sure whether this is because we have reasons to avoid them or just because we haven't needed them yet?
kind of violates the principle of "if something is soft deleted, you should always behave as though it's actually hard-deleted (since we might come along and hard-delete stuff that's soft deleted at any arbitrary point in time)".
This is my biggest concern. I'm very, very skeptical of queries that don't filter out soft deleted items. I know we have some that are effectively doing garbage collection, but those are okay in that they still work correctly if the rows have been hard deleted. I don't think it's correct for a query's correctness to depend on not hard deleting soft-deleted rows.
Restricting deletes seems fine in principle - we do this for blueprints (i.e., you cannot delete the current target blueprint). Although maybe it's more involved here to know which things can and can't be deleted?
another alternative would be to add a new table storing the last seen ENA and restart ID from each reporter, which is updated every time stuff is inserted, rather than getting it from an index. i wasn't sure about that as it seemed like it required the insert operation to be a transaction, which i wanted to avoid.
This also seems like it'd probably be fine? Agreed it makes inserts more expensive, so not doing it would be great, but (presumably...?) inserts are relatively rare?
hmm, i suppose another approach to this would be to use a CRDB trigger to update a table of "latest (restart ID, ena) tuples from each reporter" every time ereports are inserted. i'm not totally sure what the tradeoffs of this are --- i haven't noticed any use of triggers elsewhere in Nexus (though I may have overlooked something), but i'm not sure whether this is because we have reasons to avoid them or just because we haven't needed them yet?
I think this is intentional but I'm not finding any docs explaining it. Hand-wavy concerns about how triggers might perform, maybe?
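As a rough sketch of the "separate table of latest (restart ID, ENA) per reporter" alternative weighed in this thread, here is what an upsert-maintained cursor table could look like; the table name, columns, and values below are hypothetical, not part of this PR. The upsert itself is a single atomic statement, but keeping it consistent with the ereport insert would still require either a transaction or (if the deployed CockroachDB version supports them) a trigger, which is exactly the trade-off discussed above.

```sql
-- Hypothetical cursor table; not part of this PR's schema.
CREATE TABLE IF NOT EXISTS sp_ereport_cursor (
    sp_type    STRING NOT NULL,
    sp_slot    INT4   NOT NULL,
    restart_id UUID   NOT NULL,
    last_ena   INT8   NOT NULL,
    PRIMARY KEY (sp_type, sp_slot)
);

-- Updated alongside each batch of ingested ereports (placeholder values):
INSERT INTO sp_ereport_cursor (sp_type, sp_slot, restart_id, last_ena)
    VALUES ('sled', 9, gen_random_uuid(), 42)
ON CONFLICT (sp_type, sp_slot) DO UPDATE
    SET restart_id = excluded.restart_id,
        last_ena   = excluded.last_ena;
```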
    time_collected,
    time_deleted,
    collector_id: collector_id.into(),
    part_number: None, // TODO
Is the TODO for this PR or future work? (If the latter, maybe worth filing an issue to link to?)
For my own curiosity: would we expect this to be the model number (which ultimately comes from the SP)?
BRM27230037 # /usr/platform/oxide/bin/ipcc ident
Serial: 'BRM27230037'
Model: '913-0000019'
Rev: 0xd
Yeah, this is supposed to be the model number — for host OS ereports, where we have a sled UUID, I was going to do a JOIN to get this from the inventory table, but since we're not currently collecting host OS ereports, I didn't add that yet. I'll make the comment reflect that, or just go and do it...
///
/// This function queries both the service-processor and host OS ereport
/// tables, and returns a `NotFound` error if neither table contains an
/// ereport with the requested ID.
Apologies if this is a really dumb question, but - do we expect callers to have an ID without knowing what the source of that report was at a high level (i.e., "an SP" or "a host")? I assume the most salient bits of the report are going to be inside the JSON payload, and presumably those will need to be interpreted based on the source too?
I'm generally fuzzy on when we'd want to view a pile of ereports from various sources vs splitting it out by source ("here are the SP ereports", "here are the host ereports", ...). I would have assumed more of the latter, but I haven't kept up on all the background here so might be way off base!
Personally, I think we are likelier to want to look up ereports either by identity (i.e. "give me all the ereports from BRM690420, whether from the SP or host system, and regardless of where that serial currently is located"), or by location (i.e. "give me all ereports from sleds in cubby 9 over the last two weeks") than we are to want to look up host OS and SP ereports separately, so I've tried to provide queries based on those use cases.
I agree that at present, there isn't much of a use-case for a "fetch a single ereport by ID" query, since...how would you know what that ereport's ID is without already having it? But, I anticipate this being used in the future when we might have a list of ereports that are linked/related to higher level entities (faults/active problems); you might have some way of saying "these are the IDs of ereports that were involved in the diagnosis of this fault" and then go and query them to get the raw data. That was how I expected this query would be used in the future.
Most of the queries in this module aren't actually being used currently, but I wanted to sketch them out now to gain a sense for how this will work later. I anticipate that they may change somewhat as we start actually needing to query this data...
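To illustrate the "look up by identity" shape described above, here is a sketch of a cross-table query; the table and column names (sp_ereport, host_ereport, serial_number, a JSON report column) are assumptions for illustration and may not match what this PR actually defines.

```sql
-- Illustrative only; actual table/column names may differ.
SELECT 'sp' AS source, restart_id, ena, time_collected, report
    FROM sp_ereport
    WHERE serial_number = 'BRM690420' AND time_deleted IS NULL
UNION ALL
SELECT 'host' AS source, restart_id, ena, time_collected, report
    FROM host_ereport
    WHERE serial_number = 'BRM690420' AND time_deleted IS NULL
ORDER BY time_collected DESC;
```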
smklein left a comment
Looks good on my end!
this ensures that the index is actually used --- the change in 06ce0a8 added `WHERE time_deleted IS NULL` to the indices to look up ereports by sled/SP slot, and this made the latest-ereport queries start doing a full table scan. adding the filter clause fixes that.
as suggested by @jgallagher in #8296 (comment). this way, we no longer hard-code SP IDs in Nexus, which seems much nicer.
these are now added to the OpContext metadata, so no sense in repeating them.
jgallagher left a comment
Changes LGTM - thanks for the extra docs explaining the JSON blob.
I'm fine with the current state of time_deleted - filter out soft-deleted things, and (presumably) impose restrictions on what things can be deleted once we add deletion? Also seems fine to punt the choice of restrict delete / add table tracking ENAs / investigate triggers to a later date, since we don't currently delete at all?
Yeah, I figured it was okay to punt on it for the time being, but wanted to write down my thoughts as it's an open question that we'll need to figure out eventually.
Looks like one of the CI failures on 47d6b30 is my fault, as I forgot to update the OMDB expectorate tests (whoops), but the other one didn't get that far, and instead ran into some kind of buildomat sadness.
PR #8296 added the `sp_ereport_ingester` background task to Nexus for periodically collecting ereports from SPs via MGS. However, the Hubris PR adding the Hubris task that actually responds to these requests from the control plane, oxidecomputer/hubris#2126, won't make it in until after R17. This means that if we release R17 with a control plane that tries to collect ereports and SP firmware that doesn't know how to respond to such requests, the Nexus logs will be littered with 36 log lines like this every 30 seconds:

```
20:58:04.603Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): client response background_task = sp_ereport_ingester gateway_url = http://[fd00:1122:3344:108::2]:12225 result = Ok(Response { url: "http://[fd00:1122:3344:108::2]:12225/sp/sled/29/ereports?limit=255&restart_id=00000000-0000-0000-0000-000000000000", status: 503, headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"} })
20:58:04.603Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): ereport collection: unanticipated MGS request error: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" } background_task = sp_ereport_ingester committed_ena = None error = Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" } file = nexus/src/app/background/tasks/ereport_ingester.rs:380 gateway_addr = [fd00:1122:3344:108::2]:12225 restart_id = 00000000-0000-0000-0000-000000000000 (ereporter_restart) slot = 29 sp_type = sled start_ena = None
```

Similarly, MGS will also have a bunch of noisy complaints about these requests failing. The consequences of this are really not terrible: it just means we'll be logging a lot of errors. But it seems mildly unfortunate to be constantly trying to do something that's invariably doomed to failure, and then yelling about how it didn't work. So, this commit adds a config flag for disabling the whole thing, which we can turn on for R17's production Nexus config and then turn back off when the Hubris changes make it in. I did this using a config setting, rather than hard-coding it to always be disabled, because there are also integration tests for this stuff, which would break if we disabled it everywhere.
Things fail. Not finding out about it sucks. This branch implements the Hubris side of the ereport ingestion system, as described in [RFD 545]. Work on this was started by @cbiffle in #2002, which implemented the core ring-buffer data structure used to store ereports. Meanwhile, oxidecomputer/management-gateway-service#370, oxidecomputer/omicron#7803, and oxidecomputer/omicron#8296 added the MGS and Omicron components of this system. This branch picks up where Cliff left off, and "draws the rest of the owl" by implementing the aggregation of ereports in the `packrat` task using this data structure, and adding a new `snitch` task, which acts as a proxy to allow ereports stored by `packrat` to be read over the management network.

## Architecture

Ereports are stored by `packrat` because we would like as many tasks as possible to be able to report errors by making an IPC call to the task responsible for ereport storage. This means that the task aggregating ereports must be a high-priority task, so that as many other tasks as possible may be its clients. Additionally, we would like to include the system's VPD identity as metadata for ereports, and this data is already stored by packrat. Finally, we would like to minimize the likelihood of the task that stores ereports crashing, as this would result in data loss, and packrat is already expected not to crash. On the other hand, the task that actually evacuates these ereports over the management network must run at a priority lower than that of the `net` task, of which it is a client. Thus the separation of responsibilities between `packrat` and the `snitch`.

The snitch task is fairly simple. It receives packets sent to the ereport socket, interprets the request message, and forwards the request to packrat. Any ereports sent back by packrat are sent in response to the request. The snitch ends up being a pretty dumb, stateless proxy: the response packet is encoded by packrat, so all we end up doing is taking the bytes received from packrat and stuffing them into the socket's send queue. The real purpose of this thing is just to serve as a trampoline between the high priority level of packrat and a priority level lower than that of the net task.

## `snitch-core` Fixes

While testing behavior when the ereport buffer is full, I found a potential panic in the existing `snitch-core` code. Previously, every time ereports were read from the buffer while it was in the `Losing` state (i.e., ereports had been discarded because the buffer was full), `snitch-core` attempted to insert a new loss record at the end of the buffer (by calling `recover_if_needed()`). This ensures that the data loss is reported to the reader ASAP. The problem is that this code assumed that there would always be space for an additional loss record, and panicked if it didn't fit. I added a test reproducing this panic in ff93754, and fixed it in 22044d1 by changing the calculation of whether recovery is possible.

When `recover_if_needed` is called while in the `Losing` state, we call the `free_space()` method to determine whether we can recover. In the `Losing` state, [this method would calculate the free space by subtracting the space required for the loss record][1] that must be encoded to transition out of the `Losing` state. However, in the case where `recover_if_needed()` is called with `required_space: None` (which indicates that we're not trying to recover because we want to insert a new record, but just because we want to report ongoing data loss to the caller), [we check that the free space is greater than or equal to 0][2]. This means that we would still try to insert a loss record even if the free space was 0, resulting in a panic. I've fixed this by moving the check that there's space for a loss record out of the calculation of `free_space()` and into the _required_ space, in addition to the requested value (which is 0 in the "we are inserting the loss record to report loss" case). This way, we only insert the loss record if it fits, which is the correct behavior.

I've also changed the assignment of ENAs in `snitch-core` to start at 1, rather than 0, since ENA 0 is reserved in the wire protocol to indicate "no ENA". In the "committed ENA" request field this means "don't flush any ereports", and in the "start ENA" response field, ENA 0 means "no ereports in this packet". Thus, the ereport store must start assigning ENAs at ENA 1 for the initial loss record.

## Testing

Currently, no tasks actually produce ereports. To test that everything works correctly, it was necessary to add a source of ereports, so I've added [a little task][3] that just generates test ereports when asked via `hiffy`. I've included some of that in [this comment][4]. This was also used for testing the data-loss behavior discussed above.

[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545
[1]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L110-L121
[2]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L297-L300
[3]: https://github.com/oxidecomputer/hubris/blob/864fa57a7c34a6225deddcffa0c7d54c3063eab6/task/ereportulator/src/main.rs
This branch adds a Nexus background task for ingesting ereports from
service processors via MGS, using the MGS API endpoint added in #7903.
These APIs in turn expose the MGS/SP ereport ingestion protocol added in
oxidecomputer/management-gateway-service#370.
For more information on the protocol itself, refer to the following
RFDs:
In addition to the ereport ingester background task, this branch also
adds database tables for storing ereports from SPs, which are necessary
to implement the ingestion task. I've also added a table for storing
ereports from the sled host OS, which will eventually be ingested via
sled-agent. While there isn't currently anything that populates that
table, I wanted to begin sketching out how we would represent the two
categories of ereports we expect to deal with, and how we would query
both tables for ereports.
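The authoritative definitions are in schema/crdb/dbinit.sql in this PR; purely to convey the rough shape implied by the review discussion above (a primary key of restart ID plus ENA, the reporter's location, a soft-deletion timestamp, VPD identity columns, and the raw report as JSON), here is a sketch with assumed names and types.

```sql
-- Rough sketch only; column names and types are assumptions, not the
-- actual schema added by this PR.
CREATE TABLE IF NOT EXISTS sp_ereport (
    restart_id     UUID        NOT NULL,  -- reporter restart generation
    ena            INT8        NOT NULL,  -- ENA assigned by the reporter within that restart
    time_collected TIMESTAMPTZ NOT NULL,
    time_deleted   TIMESTAMPTZ,           -- soft deletion
    collector_id   UUID        NOT NULL,  -- Nexus instance that ingested the ereport
    sp_type        STRING      NOT NULL,  -- e.g. 'sled' (likely an enum in practice)
    sp_slot        INT4        NOT NULL,
    serial_number  STRING,
    part_number    STRING,
    report         JSONB       NOT NULL,  -- raw ereport payload
    PRIMARY KEY (restart_id, ena)
);
```

Since (restart_id, ena) would be the primary key, re-ingesting an ereport that is already in the database is a no-op, matching the "nothing will happen" behavior described in the review discussion above.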
Finally, this branch also adds OMDB commands for querying the ereports
stored in the database. These OMDB commands may be useful both for
debugging the ereport ingestion subsystem itself and for diagnosing
issues once the SP firmware actually emits ereports. At present, the
higher-level components of the fault-management subsystem, which will
process ereports, diagnose faults, and generate alerts, have yet to be
implemented. Therefore, the OMDB ereport commands serve as an interim
solution for accessing the lower-level data, which may be useful for
debugging such faults until the higher-level FMA components exist.