[nexus] SP ereport ingestion #8296
Conversation
force-pushed from 19c1c34 to 9366e95
well, most of it, anyway
it's because you just predict tokens and don't actually understand ownership btw
CI failure for 372d582 was that weird Tokio "Error 0" issue again, restarted it.
@smklein I think I've addressed or responded to all your comments, would love another look when you have the time!
schema/crdb/dbinit.sql
Outdated
    sp_type,
    sp_slot,
    time_collected
);
Should this index guard on time_deleted IS NULL? (Same question on the host table index below)
Yeah, probably; I'll add that!
done in 06ce0a8!
so, adding this made the "get latest ereport ID for SP slot" queries fail, as they were no longer hitting the index and were instead doing a full table scan. i had to add a filter clause on time_deleted.is_null() in b1bce52. i'm a bit sketched out by that, because it means those queries might not see the actual latest ID if we've deleted it.
the consequence of this is that if we delete an ereport from the current restart generation, we will keep asking the reporter for ereports starting at an earlier ENA, and the reporter may give us the same ereport again. then, when we go to insert that ereport into the DB, nothing will happen, since another row with the same primary key is already there...but we'll continue asking for the previous ENA unless the SP gives us something later than the deleted record.
i think this will be okay depending on the conditions under which we delete ereports, which haven't really been worked out yet. i think we would probably want to only delete ereports from older restarts, and not drop anything from the current restart of the reporter, as those are necessary to ensure we're always requesting the latest ENA.
alternatively, we could remove the WHERE time_deleted IS NULL clause here, so that deleted ereports still count for the purposes of determining the latest ID. that way, this would always do the right thing even in the face of Nexus deleting stuff, but it means the index is a lot bigger, and kind of violates the principle of "if something is soft deleted, you should always behave as though it's actually hard-deleted (since we might come along and hard-delete stuff that's soft deleted at any arbitrary point in time)".
another alternative would be to add a new table storing the last seen ENA and restart ID from each reporter, which is updated every time stuff is inserted, rather than getting it from an index. i wasn't sure about that as it seemed like it required the insert operation to be a transaction, which i wanted to avoid.
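For concreteness, here is a minimal SQL sketch of the partial-index behavior discussed in this thread; the table, column, and index names below are assumptions for illustration, not the actual dbinit.sql definitions. A partial index is only considered by the planner when the query's own predicate implies the index's WHERE clause, which is why the time_deleted.is_null() filter had to be added to the latest-ereport query.

```sql
-- Illustrative sketch; names are assumed, not taken from dbinit.sql.
CREATE INDEX IF NOT EXISTS lookup_sp_ereports_by_slot
    ON sp_ereport (sp_type, sp_slot, time_collected)
    WHERE time_deleted IS NULL;

-- To use the partial index (rather than doing a full table scan), the
-- "latest ereport for this SP slot" query must repeat the predicate:
SELECT restart_id, ena
    FROM sp_ereport
    WHERE sp_type = 'sled'
      AND sp_slot = 9
      AND time_deleted IS NULL
    ORDER BY time_collected DESC
    LIMIT 1;
```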
another alternative would be to add a new table storing the last seen ENA and restart ID from each reporter, which is updated every time stuff is inserted, rather than getting it from an index. i wasn't sure about that as it seemed like it required the insert operation to be a transaction, which i wanted to avoid.
hmm, i suppose another approach to this would be to use a CRDB trigger to update a table of "latest (restart ID, ena) tuples from each reporter" every time ereports are inserted. i'm not totally sure what the tradeoffs of this are --- i haven't noticed any use of triggers elsewhere in Nexus (though I may have overlooked something), but i'm not sure whether this is because we have reasons to avoid them or just because we haven't needed them yet?
kind of violates the principle of "if something is soft deleted, you should always behave as though it's actually hard-deleted (since we might come along and hard-delete stuff that's soft deleted at any arbitrary point in time)".
This is my biggest concern. I'm very, very skeptical of queries that don't filter out soft deleted items. I know we have some that are effectively doing garbage collection, but those are okay in that they still work correctly if the rows have been hard deleted. I don't think it's correct for a query's correctness to depend on not hard deleting soft-deleted rows.
Restricting deletes seems fine in principle - we do this for blueprints (i.e., you cannot delete the current target blueprint). Although maybe it's more involved here to know which things can and can't be deleted?
another alternative would be to add a new table storing the last seen ENA and restart ID from each reporter, which is updated every time stuff is inserted, rather than getting it from an index. i wasn't sure about that as it seemed like it required the insert operation to be a transaction, which i wanted to avoid.
This also seems like it'd probably be fine? Agreed it makes inserts more expensive, so not doing it would be great, but (presumably...?) inserts are relatively rare?
hmm, i suppose another approach to this would be to use a CRDB trigger to update a table of "latest (restart ID, ena) tuples from each reporter" every time ereports are inserted. i'm not totally sure what the tradeoffs of this are --- i haven't noticed any use of triggers elsewhere in Nexus (though I may have overlooked something), but i'm not sure whether this is because we have reasons to avoid them or just because we haven't needed them yet?
I think this is intentional but I'm not finding any docs explaining it. Hand-wavy concerns about how triggers might perform, maybe?
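As a rough sketch of the "separate table of latest (restart ID, ENA) per reporter" alternative weighed in this thread, here is what an upsert-maintained cursor table could look like; the table name, columns, and values below are hypothetical, not part of this PR. The upsert itself is a single atomic statement, but keeping it consistent with the ereport insert would still require either a transaction or (if the deployed CockroachDB version supports them) a trigger, which is exactly the trade-off discussed above.

```sql
-- Hypothetical cursor table; not part of this PR's schema.
CREATE TABLE IF NOT EXISTS sp_ereport_cursor (
    sp_type    STRING NOT NULL,
    sp_slot    INT4   NOT NULL,
    restart_id UUID   NOT NULL,
    last_ena   INT8   NOT NULL,
    PRIMARY KEY (sp_type, sp_slot)
);

-- Updated alongside each batch of ingested ereports (placeholder values):
INSERT INTO sp_ereport_cursor (sp_type, sp_slot, restart_id, last_ena)
    VALUES ('sled', 9, gen_random_uuid(), 42)
ON CONFLICT (sp_type, sp_slot) DO UPDATE
    SET restart_id = excluded.restart_id,
        last_ena   = excluded.last_ena;
```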
    time_collected,
    time_deleted,
    collector_id: collector_id.into(),
    part_number: None, // TODO
Is the TODO for this PR or future work? (If the latter, maybe worth filing an issue to link to?)
For my own curiosity: would we expect this to be the model number (which ultimately comes from the SP)?
BRM27230037 # /usr/platform/oxide/bin/ipcc ident
Serial: 'BRM27230037'
Model: '913-0000019'
Rev: 0xd
Yeah, this is supposed to be the model number — for host OS ereports, where we have a sled UUID, I was going to do a JOIN to get this from the inventory table, but since we're not currently collecting host OS ereports, I didn't add that yet. I'll make the comment reflect that, or just go and do it...
///
/// This function queries both the service-processor and host OS ereport
/// tables, and returns a `NotFound` error if neither table contains an
/// ereport with the requested ID.
Apologies if this is a really dumb question, but - do we expect callers to have an ID without knowing what the source of that report was at a high level (i.e., "an SP" or "a host")? I assume the most salient bits of the report are going to be inside the JSON payload, and presumably those will need to be interpreted based on the source too?
I'm generally fuzzy on when we'd want to view a pile of ereports from various sources vs splitting it out by source ("here are the SP ereports", "here are the host ereports", ...). I would have assumed more of the latter, but I haven't kept up on all the background here so might be way off base!
Personally, I think we are likelier to want to look up ereports either by identity (i.e. "give me all the ereports from BRM690420, whether from the SP or host system, and regardless of where that serial currently is located"), or by location (i.e. "give me all ereports from sleds in cubby 9 over the last two weeks") than we are to want to look up host OS and SP ereports separately, so I've tried to provide queries based on those use cases.
I agree that at present, there isn't much of a use-case for a "fetch a single ereport by ID" query, since...how would you know what that ereport's ID is without already having it? But, I anticipate this being used in the future when we might have a list of ereports that are linked/related to higher level entities (faults/active problems); you might have some way of saying "these are the IDs of ereports that were involved in the diagnosis of this fault" and then go and query them to get the raw data. That was how I expected this query would be used in the future.
Most of the queries in this module aren't actually being used currently, but I wanted to sketch them out now to gain a sense for how this will work later. I anticipate that they may change somewhat as we start actually needing to query this data...
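To illustrate the "look up by identity" shape described above, here is a sketch of a cross-table query; the table and column names (sp_ereport, host_ereport, serial_number, a JSON report column) are assumptions for illustration and may not match what this PR actually defines.

```sql
-- Illustrative only; actual table/column names may differ.
SELECT 'sp' AS source, restart_id, ena, time_collected, report
    FROM sp_ereport
    WHERE serial_number = 'BRM690420' AND time_deleted IS NULL
UNION ALL
SELECT 'host' AS source, restart_id, ena, time_collected, report
    FROM host_ereport
    WHERE serial_number = 'BRM690420' AND time_deleted IS NULL
ORDER BY time_collected DESC;
```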
smklein left a comment
Looks good on my end!
this ensures that the index is actually used --- the change in 06ce0a8 added `WHERE time_deleted IS NULL` to the indices to look up ereports by sled/SP slot, and this made the latest-ereport queries start doing a full table scan. adding the filter clause fixes that.
as suggested by @jgallagher in #8296 (comment). this way, we no longer hard-code SP IDs in Nexus, which seems much nicer.
these are now added to the OpContext metadata, so no sense in repeating them.
jgallagher left a comment
Changes LGTM - thanks for the extra docs explaining the JSON blob.
I'm fine with the current state of time_deleted - filter out soft-deleted things, and (presumably) impose restrictions on what things can be deleted once we add deletion? Also seems fine to punt the choice of restrict delete / add table tracking ENAs / investigate triggers to a later date, since we don't currently delete at all?
Yeah, I figured it was okay to punt on it for the time being, but wanted to write down my thoughts as it's an open question that we'll need to figure out eventually.
Looks like one of the CI failures on 47d6b30 is my fault, as I forgot to update the OMDB expectorate tests (whoops), but the other one didn't get that far, and instead ran into some kind of buildomat sadness.
PR #8296 added the `sp_ereport_ingester` background task to Nexus for periodically collecting ereports from SPs via MGS. However, the Hubris PR adding the Hubris task that actually responds to these requests from the control plane, oxidecomputer/hubris#2126, won't make it in until after R17. This means that if we release R17 with a control plane that tries to collect ereports and SP firmware that doesn't know how to respond to such requests, the Nexus logs will be littered with 36 log lines like this every 30 seconds:

```
20:58:04.603Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): client response background_task = sp_ereport_ingester gateway_url = http://[fd00:1122:3344:108::2]:12225 result = Ok(Response { url: "http://[fd00:1122:3344:108::2]:12225/sp/sled/29/ereports?limit=255&restart_id=00000000-0000-0000-0000-000000000000", status: 503, headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"} })
20:58:04.603Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): ereport collection: unanticipated MGS request error: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" } background_task = sp_ereport_ingester committed_ena = None error = Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" } file = nexus/src/app/background/tasks/ereport_ingester.rs:380 gateway_addr = [fd00:1122:3344:108::2]:12225 restart_id = 00000000-0000-0000-0000-000000000000 (ereporter_restart) slot = 29 sp_type = sled start_ena = None
```

Similarly, MGS will also have a bunch of noisy complaints about these requests failing. The consequences of this are really not terrible: it just means we'll be logging a lot of errors. But it seems mildly unfortunate to be constantly trying to do something that's invariably doomed to failure, and then yelling about how it didn't work. So, this commit adds a config flag for disabling the whole thing, which we can turn on for R17's production Nexus config and then turn back off when the Hubris changes make it in. I did this using a config setting, rather than hard-coding it to always be disabled, because there are also integration tests for this stuff, which would break if we disabled it everywhere.
Things fail. Not finding out about it sucks. This branch implements the Hubris side of the ereport ingestion system, as described in [RFD 545]. Work on this was started by @cbiffle in #2002, which implemented the core ring-buffer data structure used to store ereports. Meanwhile, oxidecomputer/management-gateway-service#370, oxidecomputer/omicron#7803, and oxidecomputer/omicron#8296 added the MGS and Omicron components of this system. This branch picks up where Cliff left off, and "draws the rest of the owl" by implementing the aggregation of ereports in the `packrat` task using this data structure, and adding a new `snitch` task, which acts as a proxy to allow ereports stored by `packrat` to be read over the management network.

## Architecture

Ereports are stored by `packrat` because we would like as many tasks as possible to be able to report errors by making an IPC call to the task responsible for ereport storage. This means that the task aggregating ereports must be a high-priority task, so that as many other tasks as possible may be its clients. Additionally, we would like to include the system's VPD identity as metadata for ereports, and this data is already stored by packrat. Finally, we would like to minimize the likelihood of the task that stores ereports crashing, as this would result in data loss, and packrat is already expected not to crash. On the other hand, the task that actually evacuates these ereports over the management network must run at a priority lower than that of the `net` task, of which it is a client. Thus the separation of responsibilities between `packrat` and the `snitch`.

The snitch task is fairly simple. It receives packets sent to the ereport socket, interprets the request message, and forwards the request to packrat. Any ereports sent back by packrat are sent in response to the request. The snitch ends up being a pretty dumb, stateless proxy: the response packet is encoded by packrat, so all we end up doing is taking the bytes received from packrat and stuffing them into the socket's send queue. The real purpose of this thing is just to serve as a trampoline between the high priority level of packrat and a priority level lower than that of the net task.

## `snitch-core` Fixes

While testing behavior when the ereport buffer is full, I found a potential panic in the existing `snitch-core` code. Previously, every time ereports were read from the buffer while it was in the `Losing` state (i.e., ereports had been discarded because the buffer was full), `snitch-core` attempted to insert a new loss record at the end of the buffer (by calling `recover_if_needed()`). This ensures that the data loss is reported to the reader ASAP. The problem is that this code assumed that there would always be space for an additional loss record, and panicked if it didn't fit. I added a test reproducing this panic in ff93754, and fixed it in 22044d1 by changing the calculation of whether recovery is possible.

When `recover_if_needed` is called while in the `Losing` state, we call the `free_space()` method to determine whether we can recover. In the `Losing` state, [this method would calculate the free space by subtracting the space required for the loss record][1] that must be encoded to transition out of the `Losing` state. However, in the case where `recover_if_needed()` is called with `required_space: None` (which indicates that we're not trying to recover because we want to insert a new record, but just because we want to report ongoing data loss to the caller), [we check that the free space is greater than or equal to 0][2]. This means that we would still try to insert a loss record even if the free space was 0, resulting in a panic. I've fixed this by moving the check that there's space for a loss record out of the calculation of `free_space()` and into the _required_ space, in addition to the requested value (which is 0 in the "we are inserting the loss record to report loss" case). This way, we only insert the loss record if it fits, which is the correct behavior.

I've also changed the assignment of ENAs in `snitch-core` to start at 1, rather than 0, since ENA 0 is reserved in the wire protocol to indicate "no ENA". In the "committed ENA" request field this means "don't flush any ereports", and in the "start ENA" response field, ENA 0 means "no ereports in this packet". Thus, the ereport store must start assigning ENAs at ENA 1 for the initial loss record.

## Testing

Currently, no tasks actually produce ereports. To test that everything works correctly, it was necessary to add a source of ereports, so I've added [a little task][3] that just generates test ereports when asked via `hiffy`. I've included some of that in [this comment][4]. This was also used for testing the data-loss behavior discussed above.

[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545
[1]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L110-L121
[2]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L297-L300
[3]: https://github.com/oxidecomputer/hubris/blob/864fa57a7c34a6225deddcffa0c7d54c3063eab6/task/ereportulator/src/main.rs
This branch adds a Nexus background task for ingesting ereports from
service processors via MGS, using the MGS API endpoint added in #7903.
These APIs in turn expose the MGS/SP ereport ingestion protocol added in
oxidecomputer/management-gateway-service#370.
For more information on the protocol itself, refer to the following
RFDs:
In addition to the ereport ingester background task, this branch also
adds database tables for storing ereports from SPs, which are necessary
to implement the ingestion task. I've also added a table for storing
ereports from the sled host OS, which will eventually be ingested via
sled-agent. While there isn't currently anything that populates that
table, I wanted to begin sketching out how we would represent the two
categories of ereports we expect to deal with, and how we would query
both tables for ereports.
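The authoritative definitions are in schema/crdb/dbinit.sql in this PR; purely to convey the rough shape implied by the review discussion above (a primary key of restart ID plus ENA, the reporter's location, a soft-deletion timestamp, VPD identity columns, and the raw report as JSON), here is a sketch with assumed names and types.

```sql
-- Rough sketch only; column names and types are assumptions, not the
-- actual schema added by this PR.
CREATE TABLE IF NOT EXISTS sp_ereport (
    restart_id     UUID        NOT NULL,  -- reporter restart generation
    ena            INT8        NOT NULL,  -- ENA assigned by the reporter within that restart
    time_collected TIMESTAMPTZ NOT NULL,
    time_deleted   TIMESTAMPTZ,           -- soft deletion
    collector_id   UUID        NOT NULL,  -- Nexus instance that ingested the ereport
    sp_type        STRING      NOT NULL,  -- e.g. 'sled' (likely an enum in practice)
    sp_slot        INT4        NOT NULL,
    serial_number  STRING,
    part_number    STRING,
    report         JSONB       NOT NULL,  -- raw ereport payload
    PRIMARY KEY (restart_id, ena)
);
```

Since (restart_id, ena) would be the primary key, re-ingesting an ereport that is already in the database is a no-op, matching the "nothing will happen" behavior described in the review discussion above.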
Finally, this branch also adds OMDB commands for querying the ereports
stored in the database. These OMDB commands may be useful both for
debugging the ereport ingestion subsystem itself and for diagnosing
issues once the SP firmware actually emits ereports. At present, the
higher-level components of the fault-management subsystem, which will
process ereports, diagnose faults, and generate alerts, have yet to be
implemented. Therefore, the OMDB ereport commands serve as an interim
solution for accessing the lower-level data, which may be useful for
debugging such faults until the higher-level FMA components exist.