[nexus] config flag to disable SP ereport ingestion #8709
PR #8296 added the `sp_ereport_ingester` background task to Nexus for periodically collecting ereports from SPs via MGS. However, the Hubris PR adding the Hubris task that actually responds to these requests from the control plane, oxidecomputer/hubris#2126, won't make it in until after R17. This means that if we release R17 with a control plane that tries to collect ereports, and a SP firmware that doesn't know how to respond to such requests, the Nexus logs will be littered with 36 log lines like this every 30 seconds:

```
20:58:04.603Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): client response
    background_task = sp_ereport_ingester
    gateway_url = http://[fd00:1122:3344:108::2]:12225
    result = Ok(Response { url: "http://[fd00:1122:3344:108::2]:12225/sp/sled/29/ereports?limit=255&restart_id=00000000-0000-0000-0000-000000000000", status: 503, headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"} })
20:58:04.603Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): ereport collection: unanticipated MGS request error: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" }
    background_task = sp_ereport_ingester
    committed_ena = None
    error = Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" }
    file = nexus/src/app/background/tasks/ereport_ingester.rs:380
    gateway_addr = [fd00:1122:3344:108::2]:12225
    restart_id = 00000000-0000-0000-0000-000000000000 (ereporter_restart)
    slot = 29
    sp_type = sled
    start_ena = None
```

Similarly, MGS will also have a bunch of noisy complaints about these requests failing.

The consequences of this are really not terrible: it just means we'll be logging a lot of errors. But it seems mildly unfortunate to be constantly trying to do something that's invariably doomed to failure, and then yelling about how it didn't work. So, this commit adds a config flag for disabling the whole thing, which we can turn on for R17's production Nexus config and then turn back off when the Hubris changes make it in. I did this using a config setting, rather than hard-coding it to always be disabled, because there are also integration tests for this stuff, which will break if we disabled it everywhere.
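For readers who want the shape of the change without opening the diff, here is a minimal sketch of a background task gated on a config flag. It is not the code from this PR: the struct `EreportIngesterConfig`, the field name `disable`, the `Activation` enum, and the `activate` function are all hypothetical names chosen for illustration, and the real task does its collection through MGS rather than counting slots. The sketch also assumes the flag defaults to "not disabled", so that configs which never mention it (such as test configs) keep the task running.

```rust
/// Hypothetical config for the ereport ingester task.
/// The field name `disable` is an assumption, not taken from the PR.
#[derive(Debug, Clone, Default)]
pub struct EreportIngesterConfig {
    /// When true, the task does no collection at all.
    /// `Default` yields `false`, i.e. the task stays enabled.
    pub disable: bool,
}

/// What a single activation of the (hypothetical) task reports back.
#[derive(Debug, PartialEq, Eq)]
pub enum Activation {
    /// The flag was set, so no requests were sent.
    Disabled,
    /// The task ran and polled this many SPs.
    Collected { sps_polled: usize },
}

/// One activation of the task. `sp_slots` stands in for the set of SPs
/// that would be asked for ereports.
pub fn activate(config: &EreportIngesterConfig, sp_slots: &[u16]) -> Activation {
    if config.disable {
        // Return before issuing any request, so neither Nexus nor MGS
        // logs a failure for SPs that cannot answer yet.
        return Activation::Disabled;
    }
    // Placeholder for the real collection work.
    Activation::Collected { sps_polled: sp_slots.len() }
}
```

With a production-style config of `EreportIngesterConfig { disable: true }`, `activate` returns `Activation::Disabled` without touching any SP, while `EreportIngesterConfig::default()` leaves collection on, which is the behavior the integration tests depend on.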
Changes LGTM; looks like the failing tests are some other configs that need the new field?
(Also just checking: all the "R17"s in the PR description should be "R16" right?)
Co-authored-by: John Gallagher <[email protected]>
Oh, I see the problem, I thought I had made it default to
Agh, yeah, good catch.
For R16, the `sp_ereport_ingester` background task was disabled in the production Nexus config file (see #8709). This is because the corresponding Hubris code for evacuating ereports from the SP had not yet merged, resulting in Nexus yelling constantly about trying to collect ereports from SPs that weren't listening on the ereport port. Now, however, oxidecomputer/hubris#2126 has merged, and R16 has been cut, so we can turn this back on. This commit does that.