fix: do not fail daemon shutdown on OTel span flush errors#390

Merged
andreabadesso merged 2 commits into master from fix/daemon-tracing-shutdown-flush-noise
Apr 10, 2026

Conversation


andreabadesso (Collaborator) commented Apr 9, 2026

Motivation

When the OTLP endpoint is unreachable at shutdown time (e.g. Jaeger outage, network blip, pod eviction during a collector restart), sdk.shutdown() throws an AggregateError [ECONNREFUSED] from the exporter's final flush. The handler added in #383 logged it at error level and called process.exit(1), which has two bad consequences:

  1. The stderr log line containing the word error trips the existing errors-in-logs alert pattern, paging on-call for a transient collector outage.
  2. Exiting with a non-zero code marks the pod termination as unhealthy in k8s.

Losing a handful of buffered telemetry spans during shutdown is not a service failure. This surfaced in dev when the OTLP collector briefly went away; the same class of alert will fire in prod the first time Jaeger has any hiccup.

Acceptance Criteria

  • sdk.shutdown() failures during SIGTERM/SIGINT no longer cause the daemon to exit with code 1
  • The shutdown flush failure is still observable in logs, but logged at warn level with a message that does not contain the word error (so the errors-in-logs alert pattern is not triggered)
  • A daemon pod with OTEL_EXPORTER_OTLP_ENDPOINT pointed at an unreachable host terminates cleanly on SIGTERM (exit 0, no error-level log)

Checklist

  • If you are requesting a merge into master, confirm this code is production-ready and can be included in future releases as soon as it gets merged
  • Make sure the unit tests and/or the QA tests are capable of testing the new features
  • Make sure you do not include new dependencies in the project unless strictly necessary and do not include dev-dependencies as production ones. More dependencies increase the possibility of one of them being hijacked and affecting us.

Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Telemetry/exporter flush failures during shutdown are now treated as non-fatal: the shutdown sequence attempts to flush, logs a single warning if flush fails, and the process exits cleanly with a zero exit code to ensure graceful termination.

Losing buffered telemetry during shutdown is not a service failure.
When the OTLP endpoint is unreachable at shutdown time (e.g. Jaeger
outage, network blip), the exporter throws an AggregateError. The
previous handler logged it at error level and exited with code 1,
which both tripped the "errors-in-logs" alert pattern and marked the
pod's termination as unhealthy.

Swallow the flush failure, log it as a non-fatal warning, and always
exit 0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 810a06e and d9102a9.

📒 Files selected for processing (1)
  • packages/daemon/src/tracing.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/daemon/src/tracing.ts

📝 Walkthrough

Walkthrough

The daemon tracing module's shutdown handler was changed to always call process.exit(0) after attempting sdk.shutdown(). Shutdown failures no longer print the error or exit non‑zero; they now emit a single console.warn('OTel flush skipped during shutdown (non-fatal)').

Changes

Cohort / File(s) Summary
Shutdown handling
packages/daemon/src/tracing.ts
SDK shutdown errors are caught and treated as non-fatal: the raw error is no longer logged, a single warning is emitted, and the process always exits with code 0 instead of 1.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • pedroferreira1
  • luislhl

Poem

🐰 I tried to flush the traces tight,
A hiccup made them skip tonight,
I warn with grace, then hop away,
Exit zero — a calm, soft day. 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately describes the main change: treating OTel span flush errors as non-fatal during daemon shutdown instead of failing the process.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/daemon/src/tracing.ts`:
- Line 61: The current shutdown warn call passes the raw error object
(console.warn('OTel SDK: failed to flush spans during shutdown (non-fatal):',
err)), which can emit stack traces that trigger alert regexes; change the call
in the shutdown/flush handler to avoid logging the full error object—either log
a generic message only (e.g., "OTel SDK: failed to flush spans during shutdown
(non-fatal)") or log a sanitized value such as err?.message or err?.name (but
not the full err or stack) so console.warn does not output the raw error object.


📥 Commits

Reviewing files that changed from the base of the PR and between 5df7145 and 810a06e.

📒 Files selected for processing (1)
  • packages/daemon/src/tracing.ts

Comment thread on packages/daemon/src/tracing.ts (outdated)
The previous catch passed the raw `err` to console.warn, which emits
its stack trace. That stack typically contains strings like
"AggregateError" / "Error:" and would still be picked up by the
log-based alert regex, defeating the whole point of this handler.

Drop the binding entirely and log a bland, fixed-text message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
andreabadesso (Collaborator, Author) replied:

Good catch @coderabbitai — this is the exact failure mode that fired the original alert: the log line containing the stringified error object (AggregateError [ECONNREFUSED]: ...) is what the log-based alert regex matched on in the first place. Logging the raw err via console.warn would defeat the whole purpose of this fix.

Pushed d9102a9:

  • Dropped the catch binding entirely so there is no raw error object to accidentally format.
  • Log a fixed-text, bland message: OTel flush skipped during shutdown (non-fatal) — no stack, no error class name, no reason string.
  • Added a comment explaining why we intentionally do not include err.message / err.name either (AggregateError's name literally contains "Error", which would also match the regex).

Trade-off: we lose diagnostic detail at shutdown time. If an operator ever needs to debug why a shutdown flush is failing, the right path is to check the OTLP endpoint health rather than grep the daemon logs — telemetry-loss-on-shutdown is expected behaviour when the collector is down.


coderabbitai Bot commented Apr 9, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd 'tracing.ts' --exec cat {}

Length of output: 5155


@andreabadesso: The updated implementation looks exactly right. The relevant section:

try {
  await sdk.shutdown();
} catch {
  console.warn('OTel flush skipped during shutdown (non-fatal)');
}
process.exit(0);

A few things that stand out as well-handled:

  1. No catch binding: catch { (ES2019 optional catch binding) means there is zero chance of accidentally interpolating any part of the error object into the log line, not even err.message.
  2. The inline comment is unusually good — explicitly calling out that AggregateError's name field literally contains the substring "Error" is exactly the kind of tribal knowledge that prevents a future contributor from "helpfully" adding err.name back.
  3. process.exit(0) is unconditional — the exit is outside the try/catch, so no code path from this handler can produce a non-zero exit.

The trade-off you described (no shutdown-time diagnostics → check the OTLP endpoint health out-of-band) is a sound operational model and well-justified by the exportTimeoutMillis: 5000 guard already on BatchSpanProcessor.
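For context, the exportTimeoutMillis guard referenced above would typically be wired up like this. A sketch only, assuming the standard @opentelemetry/sdk-trace-base and OTLP HTTP exporter packages; the repo's actual tracing.ts wiring may differ:

```typescript
// Assumed setup: a BatchSpanProcessor whose final flush is time-capped, so a
// down collector cannot block shutdown indefinitely before the handler's
// catch path runs.
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const exporter = new OTLPTraceExporter(); // honours OTEL_EXPORTER_OTLP_ENDPOINT
const processor = new BatchSpanProcessor(exporter, {
  // Cap how long any export (including the shutdown flush) may block.
  exportTimeoutMillis: 5000,
});
```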


✏️ Learnings added
Learnt from: andreabadesso
URL: https://github.com/HathorNetwork/hathor-wallet-service/pull/390

Timestamp: 2026-04-09T20:12:33.101Z
Learning: In packages/daemon/src/tracing.ts, the OTel SDK shutdown handler intentionally uses `catch {` (no binding) and logs only a fixed-text `console.warn` message. Raw error objects (including `err.message` and `err.name`) must NOT be logged because `AggregateError`'s `.name` contains the substring "Error", which would match the log-based alert regex. The accepted trade-off is losing shutdown-flush diagnostics in exchange for clean, alert-free termination.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

andreabadesso self-assigned this Apr 9, 2026
andreabadesso added the enhancement label Apr 9, 2026
andreabadesso moved this from Todo to In Progress (Done) in Hathor Network Apr 9, 2026
github-project-automation moved this from In Progress (Done) to In Review (WIP) in Hathor Network Apr 10, 2026
andreabadesso merged commit 947e90f into master Apr 10, 2026
3 checks passed
github-project-automation moved this from In Review (WIP) to Waiting to be deployed in Hathor Network Apr 10, 2026
andreabadesso deleted the fix/daemon-tracing-shutdown-flush-noise branch Apr 10, 2026 12:52
andreabadesso moved this from Waiting to be deployed to Done in Hathor Network Apr 14, 2026
Labels

enhancement New feature or request

Projects

Status: Done

3 participants