fix: do not fail daemon shutdown on OTel span flush errors#390

Merged
andreabadesso merged 2 commits into master from fix/daemon-tracing-shutdown-flush-noise
Apr 10, 2026

Conversation


andreabadesso (Collaborator) commented Apr 9, 2026

Motivation

When the OTLP endpoint is unreachable at shutdown time (e.g. Jaeger outage, network blip, pod eviction during a collector restart), sdk.shutdown() throws an AggregateError [ECONNREFUSED] from the exporter's final flush. The handler added in #383 logged it at error level and called process.exit(1), which has two bad consequences:

  1. The stderr log line containing the word error trips the existing errors-in-logs alert pattern, paging on-call for a transient collector outage.
  2. Exiting with a non-zero code marks the pod termination as unhealthy in k8s.

Losing a handful of buffered telemetry spans during shutdown is not a service failure. This surfaced in dev when the OTLP collector briefly went away; the same class of alert will fire in prod the first time Jaeger has any hiccup.

Acceptance Criteria

  • sdk.shutdown() failures during SIGTERM/SIGINT no longer cause the daemon to exit with code 1
  • The shutdown flush failure is still observable in logs, but logged at warn level with a message that does not contain the word error (so the errors-in-logs alert pattern is not triggered)
  • A daemon pod with OTEL_EXPORTER_OTLP_ENDPOINT pointed at an unreachable host terminates cleanly on SIGTERM (exit 0, no error-level log)

Checklist

  • If you are requesting a merge into master, confirm this code is production-ready and can be included in future releases as soon as it gets merged
  • Make sure the unit tests and/or the QA tests are capable of testing the new features
  • Make sure you do not include new dependencies in the project unless strictly necessary and do not include dev-dependencies as production ones. More dependencies increase the possibility of one of them being hijacked and affecting us.

Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Telemetry/exporter flush failures during shutdown are now treated as non-fatal: the shutdown sequence attempts to flush, logs a single warning if flush fails, and the process exits cleanly with a zero exit code to ensure graceful termination.

Losing buffered telemetry during shutdown is not a service failure.
When the OTLP endpoint is unreachable at shutdown time (e.g. Jaeger
outage, network blip), the exporter throws an AggregateError. The
previous handler logged it at error level and exited with code 1,
which both tripped the "errors-in-logs" alert pattern and marked the
pod's termination as unhealthy.

Swallow the flush failure, log it as a non-fatal warning, and always
exit 0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 810a06e and d9102a9.

📒 Files selected for processing (1)
  • packages/daemon/src/tracing.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/daemon/src/tracing.ts

📝 Walkthrough

Walkthrough

The daemon tracing module's shutdown handler was changed to always call process.exit(0) after attempting sdk.shutdown(). Shutdown failures no longer print the error or exit non‑zero; they now emit a single console.warn('OTel flush skipped during shutdown (non-fatal)').

Changes

Cohort / File(s) Summary
Shutdown handling
packages/daemon/src/tracing.ts
SDK shutdown errors are caught and treated as non-fatal: the raw error is no longer logged, a single warning is emitted, and the process always exits with code 0 instead of 1.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • pedroferreira1
  • luislhl

Poem

🐰 I tried to flush the traces tight,
A hiccup made them skip tonight,
I warn with grace, then hop away,
Exit zero — a calm, soft day. 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately describes the main change: treating OTel span flush errors as non-fatal during daemon shutdown instead of failing the process.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/daemon/src/tracing.ts`:
- Line 61: The current shutdown warn call passes the raw error object
(console.warn('OTel SDK: failed to flush spans during shutdown (non-fatal):',
err)), which can emit stack traces that trigger alert regexes; change the call
in the shutdown/flush handler to avoid logging the full error object—either log
a generic message only (e.g., "OTel SDK: failed to flush spans during shutdown
(non-fatal)") or log a sanitized value such as err?.message or err?.name (but
not the full err or stack) so console.warn does not output the raw error object.


📥 Commits

Reviewing files that changed from the base of the PR and between 5df7145 and 810a06e.

📒 Files selected for processing (1)
  • packages/daemon/src/tracing.ts

Comment thread on packages/daemon/src/tracing.ts (outdated)
The previous catch passed the raw `err` to console.warn, which emits
its stack trace. That stack typically contains strings like
"AggregateError" / "Error:" and would still be picked up by the
log-based alert regex, defeating the whole point of this handler.

Drop the binding entirely and log a bland, fixed-text message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
andreabadesso (Collaborator, Author) replied:

Good catch @coderabbitai — this is the exact failure mode that fired the original alert: the log line containing the stringified error object (AggregateError [ECONNREFUSED]: ...) is what the log-based alert regex matched on in the first place. Logging the raw err via console.warn would defeat the whole purpose of this fix.

Pushed d9102a9:

  • Dropped the catch binding entirely so there is no raw error object to accidentally format.
  • Log a fixed-text, bland message: OTel flush skipped during shutdown (non-fatal) — no stack, no error class name, no reason string.
  • Added a comment explaining why we intentionally do not include err.message / err.name either (AggregateError's name literally contains "Error", which would also match the regex).

Trade-off: we lose diagnostic detail at shutdown time. If an operator ever needs to debug why a shutdown flush is failing, the right path is to check the OTLP endpoint health rather than grep the daemon logs — telemetry-loss-on-shutdown is expected behaviour when the collector is down.


coderabbitai Bot commented Apr 9, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd 'tracing.ts' --exec cat {}

Length of output: 5155


@andreabadesso: The updated implementation looks exactly right. The relevant section:

try {
  await sdk.shutdown();
} catch {
  console.warn('OTel flush skipped during shutdown (non-fatal)');
}
process.exit(0);

A few things that stand out as well-handled:

  1. No catch binding: catch { (ES2019 optional catch binding) means there is zero chance of accidentally interpolating any part of the error object into the log line, not even err.message.
  2. The inline comment is unusually good — explicitly calling out that AggregateError's name field literally contains the substring "Error" is exactly the kind of tribal knowledge that prevents a future contributor from "helpfully" adding err.name back.
  3. process.exit(0) is unconditional — the exit is outside the try/catch, so no code path from this handler can produce a non-zero exit.

The trade-off you described (no shutdown-time diagnostics → check the OTLP endpoint health out-of-band) is a sound operational model and well-justified by the exportTimeoutMillis: 5000 guard already on BatchSpanProcessor.
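For context, the exportTimeoutMillis guard referenced above would typically be wired up like this. A sketch only, assuming the standard @opentelemetry/sdk-trace-base and OTLP HTTP exporter packages; the repo's actual tracing.ts wiring may differ:

```typescript
// Assumed setup: a BatchSpanProcessor whose final flush is time-capped, so a
// down collector cannot block shutdown indefinitely before the handler's
// catch path runs.
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const exporter = new OTLPTraceExporter(); // honours OTEL_EXPORTER_OTLP_ENDPOINT
const processor = new BatchSpanProcessor(exporter, {
  // Cap how long any export (including the shutdown flush) may block.
  exportTimeoutMillis: 5000,
});
```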


✏️ Learnings added
Learnt from: andreabadesso
URL: https://github.com/HathorNetwork/hathor-wallet-service/pull/390

Timestamp: 2026-04-09T20:12:33.101Z
Learning: In packages/daemon/src/tracing.ts, the OTel SDK shutdown handler intentionally uses `catch {` (no binding) and logs only a fixed-text `console.warn` message. Raw error objects (including `err.message` and `err.name`) must NOT be logged because `AggregateError`'s `.name` contains the substring "Error", which would match the log-based alert regex. The accepted trade-off is losing shutdown-flush diagnostics in exchange for clean, alert-free termination.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

andreabadesso self-assigned this Apr 9, 2026
andreabadesso added the enhancement label Apr 9, 2026
andreabadesso moved this from Todo to In Progress (Done) in Hathor Network Apr 9, 2026
github-project-automation moved this from In Progress (Done) to In Review (WIP) in Hathor Network Apr 10, 2026
andreabadesso merged commit 947e90f into master Apr 10, 2026
3 checks passed
github-project-automation moved this from In Review (WIP) to Waiting to be deployed in Hathor Network Apr 10, 2026
andreabadesso deleted the fix/daemon-tracing-shutdown-flush-noise branch Apr 10, 2026 12:52
andreabadesso moved this from Waiting to be deployed to Done in Hathor Network Apr 14, 2026
Labels

enhancement New feature or request

Projects

Status: Done

3 participants