fix: do not fail daemon shutdown on OTel span flush errors (#390)

andreabadesso merged 2 commits into master
Conversation
Losing buffered telemetry during shutdown is not a service failure. When the OTLP endpoint is unreachable at shutdown time (e.g. Jaeger outage, network blip), the exporter throws an AggregateError. The previous handler logged it at error level and exited with code 1, which both tripped the "errors-in-logs" alert pattern and marked the pod's termination as unhealthy. Swallow the flush failure, log it as a non-fatal warning, and always exit 0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No actionable comments were generated in the recent review. 🎉

📝 Walkthrough: The daemon tracing module's shutdown handler was changed to always call `process.exit(0)`, logging span-flush failures as a non-fatal warning instead of exiting with code 1.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/daemon/src/tracing.ts`:
- Line 61: The current shutdown warn call passes the raw error object
(console.warn('OTel SDK: failed to flush spans during shutdown (non-fatal):',
err)), which can emit stack traces that trigger alert regexes; change the call
in the shutdown/flush handler to avoid logging the full error object—either log
a generic message only (e.g., "OTel SDK: failed to flush spans during shutdown
(non-fatal)") or log a sanitized value such as err?.message or err?.name (but
not the full err or stack) so console.warn does not output the raw error object.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8f528e09-b869-42da-9c23-de5a2904d36c
📒 Files selected for processing (1)
packages/daemon/src/tracing.ts
The previous catch passed the raw `err` to console.warn, which emits its stack trace. That stack typically contains strings like "AggregateError" / "Error:" and would still be picked up by the log-based alert regex, defeating the whole point of this handler. Drop the binding entirely and log a bland, fixed-text message. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Good catch @coderabbitai — this is the exact failure mode that fired the original alert: the log line containing the stringified error object (stack trace included) would still have matched the alert regex. Pushed d9102a9:
Trade-off: we lose diagnostic detail at shutdown time. If an operator ever needs to debug why a shutdown flush is failing, the right path is to check the OTLP endpoint health rather than grep the daemon logs — telemetry-loss-on-shutdown is expected behaviour when the collector is down. |
🧩 Analysis chain — 🏁 Script executed (length of output: 5155):

```ts
try {
  await sdk.shutdown();
} catch {
  console.warn('OTel flush skipped during shutdown (non-fatal)');
}
process.exit(0);
```

A few things that stand out as well-handled:
The trade-off you described (no shutdown-time diagnostics → check the OTLP endpoint health out-of-band) is a sound operational model and well-justified.

✏️ Learnings added
Motivation
When the OTLP endpoint is unreachable at shutdown time (e.g. Jaeger outage, network blip, pod eviction during a collector restart), `sdk.shutdown()` throws an `AggregateError [ECONNREFUSED]` from the exporter's final flush. The handler added in #383 logged it at `error` level and called `process.exit(1)`, which has two bad consequences:

- Logging at `error` trips the existing `errors-in-logs` alert pattern, paging on-call for a transient collector outage.
- Exiting with code 1 marks the pod's termination as unhealthy.

Losing a handful of buffered telemetry spans during shutdown is not a service failure. This was surfaced in `dev` during an OTLP endpoint session where the collector briefly went away; the same class of alert will fire against `prod` the first time Jaeger has any hiccup.

Acceptance Criteria

- `sdk.shutdown()` failures during SIGTERM/SIGINT no longer cause the daemon to exit with code 1
- The failure is logged at `warn` level with a message that does not contain the word `error` (so the `errors-in-logs` alert pattern is not triggered)
- A daemon with `OTEL_EXPORTER_OTLP_ENDPOINT` pointed at an unreachable host terminates cleanly on SIGTERM (exit 0, no error-level log)

Checklist

- The PR is against `master`: confirm this code is production-ready and can be included in future releases as soon as it gets merged

Generated with Claude Code