feat: add OpenTelemetry distributed tracing to daemon and Lambdas #383

Merged
andreabadesso merged 11 commits into master from feat/daemon-otel-tracing on Apr 9, 2026

Conversation

@andreabadesso (Collaborator) commented Mar 14, 2026

Implements RFC 0001-opentelemetry-observability for both the daemon (K8s) and wallet-service (Lambda) packages.

Summary

Daemon (packages/daemon)

  • OTel SDK init in tracing.ts with BatchSpanProcessor + OTLP HTTP exporter
  • Auto-instrumentation for mysql2, http, aws-sdk (fs/dns disabled)
  • Manual spans on all critical event handlers (handleVertexAccepted, handleVoidedTx, handleReorgStarted, etc.) via withSpan() helper
  • Root event.processing.* spans via tracedService() wrapper in SyncMachine — service-level spans become children automatically
  • Winston log correlation: trace_id and span_id injected into log output
  • Graceful shutdown on SIGTERM with span flush
  • Local dev setup: docker-compose.dev.yml with MySQL + Jaeger
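
For reference, a minimal sketch of the tracing.ts bootstrap described above, assuming current @opentelemetry/sdk-node option names (option names vary across SDK versions, and the real file may structure this differently):

```typescript
// Sketch only: the exact resource/exporter wiring in
// packages/daemon/src/tracing.ts may differ.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const endpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;

const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME ?? 'wallet-service-daemon',
  // Only attach the exporter when a collector endpoint is configured, so
  // tracing stays a no-op in environments without one.
  spanProcessors: endpoint
    ? [new BatchSpanProcessor(new OTLPTraceExporter({ url: endpoint }))]
    : [],
  instrumentations: [
    getNodeAutoInstrumentations({
      // fs and dns produce high-volume, low-value spans for this workload.
      '@opentelemetry/instrumentation-fs': { enabled: false },
      '@opentelemetry/instrumentation-dns': { enabled: false },
    }),
  ],
});

sdk.start();
```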

Wallet-service Lambdas (packages/wallet-service)

  • Manual SDK init (RFC Option B) — ADOT layer was too large for the 250MB Lambda size limit
  • Cherry-picked instrumentations (aws-lambda, aws-sdk, http, mysql, redis, winston) adding ~39MB vs ~120MB for the full meta-package
  • Webpack entries prepended with tracing.ts so auto-instrumentation patches libraries before import
  • OTEL_SDK_DISABLED=true kill switch to skip all OTel module loading
  • OTEL_CONSOLE_DEBUG=true to print spans to CloudWatch logs for debugging
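
A sketch of the kill-switch pattern under those constraints; the instrumentation list mirrors the bullets above, but the module layout, exporter wiring, and `addSpanProcessor` call are assumptions (the latter is the SDK 1.x API):

```typescript
// tracing.ts (Lambda) sketch. require() instead of top-level import so that,
// with the kill switch on, no OTel module is ever loaded.
if (process.env.OTEL_SDK_DISABLED !== 'true') {
  /* eslint-disable @typescript-eslint/no-var-requires */
  const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
  const { registerInstrumentations } = require('@opentelemetry/instrumentation');
  const { AwsLambdaInstrumentation } = require('@opentelemetry/instrumentation-aws-lambda');
  const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
  const { MySQLInstrumentation } = require('@opentelemetry/instrumentation-mysql');

  const provider = new NodeTracerProvider();

  if (process.env.OTEL_CONSOLE_DEBUG === 'true') {
    // Print finished spans to stdout (i.e. CloudWatch logs) for debugging.
    const { ConsoleSpanExporter, SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
    provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
  }

  provider.register();

  registerInstrumentations({
    instrumentations: [
      new AwsLambdaInstrumentation(),
      new HttpInstrumentation(),
      new MySQLInstrumentation(),
      // aws-sdk, redis, and winston instrumentations omitted for brevity.
    ],
  });
}
```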

Bug fix (included)

  • Fix lockExpires Joi validation in push notification handler (.valid(null).allow(null)) — fixes Opsgenie alert #50062
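
For context, a hedged sketch of the failure mode (the real handler schema and field type differ; this is not the actual code): Joi rejects null unless the field explicitly allows it.

```typescript
import Joi from 'joi';

// Hypothetical schema; the push-notification handler's real schema differs.
const bodySchema = Joi.object({
  // Without .allow(null), a null lockExpires fails validation and fires alerts.
  lockExpires: Joi.number().optional().allow(null),
});

const { error } = bodySchema.validate({ lockExpires: null });
console.log(error); // undefined: null is now accepted
```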

Cold Start Benchmark

A/B test on getLatestBlock (n=7 each, forced cold starts):

| Metric | WITHOUT OTel | WITH OTel | Delta |
| --- | --- | --- | --- |
| Avg Init | 1164 ms | 1870 ms | +706 ms (+60.7%) |

Warm handler overhead (n=10, console debug OFF):

| Metric | OTel OFF | OTel ON | Delta |
| --- | --- | --- | --- |
| Median | 9.6 ms | 15.9 ms | +6.3 ms |
| p95 | 16.1 ms | 199.9 ms | +184 ms (BatchSpanProcessor spikes) |

Full report: packages/wallet-service/docs/otel-cold-start-benchmark.md

Configuration

All tracing is opt-in per environment:

| Env var | Purpose |
| --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | Set to collector URL to enable trace export |
| OTEL_SDK_DISABLED=true | Kill switch — skips all OTel initialization |
| OTEL_CONSOLE_DEBUG=true | Print spans to stdout/CloudWatch for debugging |
| OTEL_SERVICE_NAME | Override service name (defaults provided) |

Test plan

  • Daemon: TypeScript compiles, ESLint passes
  • Wallet-service: serverless package succeeds under 250MB limit (243MB)
  • Lambda invoke returns 200 with OTel enabled (no runtime errors)
  • Spans visible in CloudWatch with OTEL_CONSOLE_DEBUG=true
  • MySQL auto-instrumentation captures db.statement attributes
  • Cold start benchmark validates overhead within RFC estimates (200-800ms)
  • Existing tests pass (CI green)

🤖 Generated with Claude Code


coderabbitai Bot commented Mar 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds OpenTelemetry tracing (SDK bootstrap, spans, attributes, error recording) and instruments many daemon service handlers; imports tracing early in startup. Adds development Docker Compose and env file, and five dev npm scripts plus OpenTelemetry dependencies in the daemon package manifest.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Package manifest & scripts: packages/daemon/package.json | Adds six OpenTelemetry dependencies and five dev-oriented npm scripts: dev:up, dev:down, dev:destroy, dev:migrate, dev:fetch-ids. |
| Tracing bootstrap: packages/daemon/src/tracing.ts | New OpenTelemetry NodeSDK initialization: Resource from env, optional OTLP HTTP exporter with BatchSpanProcessor, auto-instrumentations (FS/DNS disabled), SDK start and SIGTERM shutdown handling. |
| Startup import & tracer: packages/daemon/src/index.ts | Imports ./tracing early; creates tracer and emits span events on state transitions instead of raw logs. |
| Service handler instrumentation: packages/daemon/src/services/index.ts | Introduces tracer and withSpan helper (a plausible shape is sketched below); wraps many handlers and DB/async operations in spans, adds span attributes (tx.hash, tx.version, token.uid, reorg.size, etc.), records exceptions, sets span status on errors, and ensures spans end on all code paths. |
| Dev infra & env: packages/daemon/docker-compose.dev.yml, packages/daemon/env.development | Adds dev docker-compose (MySQL, Jaeger) and a development env file with fullnode, DB, AWS/service stub variables and optional OTEL_EXPORTER_OTLP_ENDPOINT. |
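
A plausible shape for that withSpan() helper, shown as a sketch (the real signature and attribute handling may differ):

```typescript
import { trace, SpanStatusCode, Span } from '@opentelemetry/api';

const tracer = trace.getTracer('wallet-service-daemon');

// Run fn inside an active span, record errors on it, and guarantee
// span.end() on every code path.
async function withSpan<T>(
  name: string,
  attributes: Record<string, string | number>,
  fn: (span: Span) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      span.setAttributes(attributes);
      return await fn(span);
    } catch (e) {
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(e) });
      throw e;
    } finally {
      span.end();
    }
  });
}
```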

Sequence Diagram

```mermaid
sequenceDiagram
    participant App as Application (daemon)
    participant Tracing as Tracing Module
    participant SDK as OpenTelemetry SDK
    participant Exporter as OTLP HTTP Exporter
    participant Handler as Instrumented Handler

    App->>Tracing: import './tracing' (early)
    activate Tracing
    Tracing->>SDK: create NodeSDK with Resource
    Tracing->>SDK: configure BatchSpanProcessor (if OTEL_EXPORTER_OTLP_ENDPOINT)
    Tracing->>SDK: enable auto-instrumentations (FS/DNS disabled)
    Tracing->>SDK: start()
    SDK->>Exporter: initialize with OTLP endpoint (when configured)
    deactivate Tracing

    Note over App: Module loading continues

    App->>Handler: invoke event handler
    Handler->>SDK: tracer.startActiveSpan()
    activate SDK
    SDK->>SDK: create span, attach attributes
    Handler->>Handler: execute handler logic
    alt Error occurs
        Handler->>SDK: span.setStatus(ERROR)
        Handler->>SDK: recordException()
    end
    Handler->>SDK: span.end()
    deactivate SDK
    SDK->>Exporter: batch export span
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • pedroferreira1
  • r4mmer

Poem

🐇 I nudged the trace on at the start,

I stitched bright spans into every part,
I hop through handlers, mark each trace,
Catching errors, closing the race,
A small rabbit keeping observant heart.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: add OpenTelemetry distributed tracing to daemon and Lambdas' directly describes the main change: adding OpenTelemetry tracing. It is concise, specific, and clearly summarizes the primary objective of the changeset. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/daemon/src/services/index.ts (1)

330-341: ⚠️ Potential issue | 🟡 Minor

Early return at Line 334 doesn't end the span, causing a span leak.

When dbTx exists and daemon uptime is below the grace period, the function returns early without calling span.end(). The span remains open indefinitely, consuming memory and potentially causing misleading trace data.

Proposed fix
```diff
     if (dbTx) {
       const daemonUptime = getDaemonUptime();
       // We do not log if the daemon has just started, because it's expected that
       // we receive an initial duplicate transaction from the fullnode in this case.
-      if (daemonUptime < DUPLICATE_TX_ALERT_GRACE_PERIOD) return;
+      if (daemonUptime < DUPLICATE_TX_ALERT_GRACE_PERIOD) {
+        span.end();
+        return;
+      }

       logger.error(`Transaction ${hash} already in the database and the daemon has not been recently restarted (uptime of ${daemonUptime} seconds). This is unexpected.`);

       // This might happen if the service has been recently restarted,
       // so we should raise the alert and just ignore the tx
+      span.end();
       return;
     }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/services/index.ts` around lines 330 - 341, When dbTx is
truthy and getDaemonUptime() reports uptime below
DUPLICATE_TX_ALERT_GRACE_PERIOD the code currently returns early without closing
the active span, leaking the trace; before any early return in that dbTx
handling block call span.end() to finish the span (and ensure any other
early-return branches in the same function also call span.end() before
returning), so locate the dbTx check and add span.end() just prior to the return
to close the span cleanly.
🧹 Nitpick comments (1)
packages/daemon/src/tracing.ts (1)

24-28: Consider adding exportTimeoutMillis for predictable shutdown behavior.

Without explicit exportTimeoutMillis, the default is 30 seconds. During SIGTERM, this may be too long for container orchestrators with shorter grace periods. Setting a shorter timeout (e.g., 5000ms) ensures faster shutdown.
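
A sketch of the suggested configuration, with illustrative values (the queue and batch sizes are assumptions, not the PR's actual settings):

```typescript
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Illustrative values; the point is bounding each export attempt so a
// SIGTERM flush cannot block shutdown past the orchestrator's grace period.
const processor = new BatchSpanProcessor(new OTLPTraceExporter(), {
  maxQueueSize: 2048,
  maxExportBatchSize: 512,
  scheduledDelayMillis: 5000,
  exportTimeoutMillis: 5000,
});
```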

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/tracing.ts` around lines 24 - 28, The BatchSpanProcessor
configuration object (the literal containing maxQueueSize, maxExportBatchSize,
scheduledDelayMillis) should include exportTimeoutMillis set to a shorter value
(e.g., 5000) so span exports don't block shutdown; update the config used when
constructing the BatchSpanProcessor/trace exporter in tracing.ts to add
exportTimeoutMillis: 5000 alongside maxQueueSize, maxExportBatchSize, and
scheduledDelayMillis.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/daemon/src/tracing.ts`:
- Around line 38-40: The SIGTERM handler calls sdk.shutdown() without awaiting
it, risking loss of pending spans (e.g., from BatchSpanProcessor); update the
handler that sets up sdk.start() to call and await sdk.shutdown() (or call a
small async wrapper that awaits sdk.shutdown() with a timeout guard) and ensure
this graceful shutdown is coordinated with other shutdown logic (such as
machine.onDone in index.ts) to avoid a synchronous process.exit(1) preempting
the flush; reference sdk.start, sdk.shutdown and BatchSpanProcessor when making
the change and add a timeout fallback if shutdown hangs.
- Around line 20-29: The span processor is always created with OTLPTraceExporter
which defaults to localhost when process.env.OTEL_EXPORTER_OTLP_ENDPOINT is
unset; change initialization so that the BatchSpanProcessor (spanProcessor)
using OTLPTraceExporter is only created when
process.env.OTEL_EXPORTER_OTLP_ENDPOINT is truthy—otherwise do not add the
BatchSpanProcessor (or use a NoopSpanProcessor). Locate the code that constructs
spanProcessor/ new BatchSpanProcessor(...) and wrap it with a conditional branch
checking process.env.OTEL_EXPORTER_OTLP_ENDPOINT before instantiating
OTLPTraceExporter, preserving the existing export options (maxQueueSize,
maxExportBatchSize, scheduledDelayMillis) when created.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c89fedf0-499e-43df-b592-b4fdebc9be50

📥 Commits

Reviewing files that changed from the base of the PR and between 1693af7 and 9ace1e0.

⛔ Files ignored due to path filters (1)
  • yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (4)
  • packages/daemon/package.json
  • packages/daemon/src/index.ts
  • packages/daemon/src/services/index.ts
  • packages/daemon/src/tracing.ts

@andreabadesso andreabadesso self-assigned this Mar 14, 2026
@andreabadesso andreabadesso added the enhancement label Mar 14, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/daemon/src/tracing.ts`:
- Around line 44-50: The SIGTERM handler uses process.on and awaits
sdk.shutdown() but never exits, leaving the process running; replace the
listener with process.once('SIGTERM', ...) to avoid duplicate handlers and after
awaiting sdk.shutdown() (and in the catch block) call process.exit(code) to
ensure termination whether shutdown succeeds or fails, updating the handler
around sdk.shutdown to always call process.exit(0) on success and
process.exit(1) on error.
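
A sketch of the handler this prompt describes, assuming sdk is the NodeSDK instance from tracing.ts and a 5s timeout guard (both assumptions):

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';

const sdk = new NodeSDK(); // stands in for the configured instance in tracing.ts

// process.once: the handler runs a single time, then the process exits either way.
process.once('SIGTERM', async () => {
  try {
    await Promise.race([
      sdk.shutdown(), // flushes pending spans from the BatchSpanProcessor
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('OTel shutdown timed out')), 5000)),
    ]);
    process.exit(0);
  } catch (e) {
    console.error('OTel shutdown failed', e);
    process.exit(1);
  }
});
```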

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2e166767-cf7b-4206-a639-f0ee679eee68

📥 Commits

Reviewing files that changed from the base of the PR and between 9ace1e0 and 613413c.

📒 Files selected for processing (1)
  • packages/daemon/src/tracing.ts

@andreabadesso (Collaborator, Author)

All CodeRabbit feedback addressed in 5109216:

  • Added exportTimeoutMillis: 5000 to BatchSpanProcessor for predictable shutdown
  • Changed process.on to process.once and added process.exit() after shutdown
  • Fixed span leak on early returns in handleVertexAccepted

@andreabadesso andreabadesso moved this from Todo to In Progress (Done) in Hathor Network Mar 14, 2026

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/daemon/src/services/index.ts (1)

1009-1070: ⚠️ Potential issue | 🟡 Minor

Missing error handling may leak span in handleReorgStarted.

Unlike other handlers that use try/catch/finally, this function calls span.end() only in the happy path (line 1068) and early-throw cases (lines 1012, 1018). If any addAlert() call (lines 1035-1065) throws an exception, the span will never be ended, causing a span leak.

Proposed fix to ensure span always ends
```diff
 export const handleReorgStarted = async (context: Context): Promise<void> => {
   return tracer.startActiveSpan('handleReorgStarted', async (span) => {
-  if (!context.event) {
-    span.end();
-    throw new Error('No event in context');
-  }
-
-  const fullNodeEvent = context.event;
-  if (fullNodeEvent.event.type !== FullNodeEventTypes.REORG_STARTED) {
-    span.end();
-    throw new Error('Invalid event type for REORG_STARTED');
-  }
-
-  const { reorg_size, previous_best_block, new_best_block, common_block } = fullNodeEvent.event.data;
-
-  span.setAttribute('reorg.size', reorg_size);
-  const { REORG_SIZE_INFO, REORG_SIZE_MINOR, REORG_SIZE_MAJOR, REORG_SIZE_CRITICAL } = getConfig();
-
-  const metadata = {
-    reorg_size,
-    previous_best_block,
-    new_best_block,
-    common_block,
-  };
-
-  if (reorg_size >= REORG_SIZE_CRITICAL) {
-    await addAlert(
-      'Critical Reorg Detected',
-      `A critical reorg of size ${reorg_size} has occurred.`,
-      Severity.CRITICAL,
-      metadata,
-      logger,
-    );
-  } else if (reorg_size >= REORG_SIZE_MAJOR) {
-    // ... remaining alert logic
-  }
-
-  span.end();
+  try {
+    if (!context.event) {
+      throw new Error('No event in context');
+    }
+
+    const fullNodeEvent = context.event;
+    if (fullNodeEvent.event.type !== FullNodeEventTypes.REORG_STARTED) {
+      throw new Error('Invalid event type for REORG_STARTED');
+    }
+
+    const { reorg_size, previous_best_block, new_best_block, common_block } = fullNodeEvent.event.data;
+
+    span.setAttribute('reorg.size', reorg_size);
+    const { REORG_SIZE_INFO, REORG_SIZE_MINOR, REORG_SIZE_MAJOR, REORG_SIZE_CRITICAL } = getConfig();
+
+    const metadata = {
+      reorg_size,
+      previous_best_block,
+      new_best_block,
+      common_block,
+    };
+
+    if (reorg_size >= REORG_SIZE_CRITICAL) {
+      await addAlert(
+        'Critical Reorg Detected',
+        `A critical reorg of size ${reorg_size} has occurred.`,
+        Severity.CRITICAL,
+        metadata,
+        logger,
+      );
+    } else if (reorg_size >= REORG_SIZE_MAJOR) {
+      // ... remaining alert logic
+    }
+  } catch (e) {
+    span.setStatus({ code: SpanStatusCode.ERROR, message: String(e) });
+    span.recordException(e as Error);
+    throw e;
+  } finally {
+    span.end();
+  }
   });
 };
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/services/index.ts` around lines 1009 - 1070,
handleReorgStarted can leak the OpenTelemetry span when addAlert or other code
throws because span.end() is only called on the happy path or early throws; wrap
the span'd work in a try/finally so span.end() always runs. Specifically, inside
the tracer.startActiveSpan('handleReorgStarted', async (span) => { ... })
callback put the existing logic into a try block and call span.end() in a
finally block (optionally catch errors to log via logger.error) so any exception
from addAlert, getConfig, or other code won't prevent span.end() from being
called; keep existing early throws (No event / Invalid event) but ensure they
are inside the try so finally executes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c3aa172b-1154-49e7-bc7a-dbb6d2bff39e

📥 Commits

Reviewing files that changed from the base of the PR and between 613413c and 2a51d0f.

📒 Files selected for processing (5)
  • packages/daemon/docker-compose.dev.yml
  • packages/daemon/env.development
  • packages/daemon/package.json
  • packages/daemon/src/services/index.ts
  • packages/daemon/src/tracing.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/daemon/env.development

@andreabadesso andreabadesso force-pushed the feat/daemon-otel-tracing branch from 81e8b39 to 24c1dd0 on March 14, 2026 12:18

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/daemon/src/services/index.ts (1)

1026-1085: ⚠️ Potential issue | 🟠 Major

handleReorgStarted can leak spans if await addAlert() throws.

The function lacks try/catch/finally; if any awaited addAlert() call throws, span.end() won't execute and error metadata won't be recorded. The manual span.end() calls only cover early validation paths.

Suggested fix
```diff
 export const handleReorgStarted = async (context: Context): Promise<void> => {
   return tracer.startActiveSpan('handleReorgStarted', async (span) => {
+  try {
     if (!context.event) {
-    span.end();
       throw new Error('No event in context');
     }

     const fullNodeEvent = context.event;
     if (fullNodeEvent.event.type !== FullNodeEventTypes.REORG_STARTED) {
-    span.end();
       throw new Error('Invalid event type for REORG_STARTED');
     }

     const { reorg_size, previous_best_block, new_best_block, common_block } = fullNodeEvent.event.data;

     span.setAttribute('reorg.size', reorg_size);
     const { REORG_SIZE_INFO, REORG_SIZE_MINOR, REORG_SIZE_MAJOR, REORG_SIZE_CRITICAL } = getConfig();
     const metadata = { reorg_size, previous_best_block, new_best_block, common_block };

     if (reorg_size >= REORG_SIZE_CRITICAL) {
       await addAlert('Critical Reorg Detected', `A critical reorg of size ${reorg_size} has occurred.`, Severity.CRITICAL, metadata, logger);
     } else if (reorg_size >= REORG_SIZE_MAJOR) {
       await addAlert('Major Reorg Detected', `A major reorg of size ${reorg_size} has occurred.`, Severity.MAJOR, metadata, logger);
     } else if (reorg_size >= REORG_SIZE_MINOR) {
       await addAlert('Minor Reorg Detected', `A minor reorg of size ${reorg_size} has occurred.`, Severity.MINOR, metadata, logger);
     } else if (reorg_size >= REORG_SIZE_INFO) {
       await addAlert('Reorg Detected', `A reorg of size ${reorg_size} has occurred.`, Severity.INFO, metadata, logger);
     }
+  } catch (e) {
+    span.setStatus({ code: SpanStatusCode.ERROR, message: String(e) });
+    span.recordException(e as Error);
+    throw e;
+  } finally {
-  span.end();
+    span.end();
+  }
   });
 };
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/services/index.ts` around lines 1026 - 1085, The span in
handleReorgStarted (the tracer.startActiveSpan callback using span and calling
addAlert) can leak if an awaited addAlert throws; wrap the main logic that uses
span and calls addAlert in try/catch/finally so you record exceptions to the
span (e.g., span.recordException(err) and span.setStatus({ code:
SpanStatusCode.ERROR, message: err.message })) in the catch, ensure span.end()
runs in finally, and rethrow the error after recording so existing error
handling upstream still applies.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/daemon/src/services/index.ts`:
- Around line 348-357: In handleVertexAccepted remove the early span.end() calls
that occur before returning when duplicate transactions and grace-period checks
trigger (the span.end() at the duplicate-tx branch and the one before the early
return when daemonUptime < DUPLICATE_TX_ALERT_GRACE_PERIOD) so that the span is
only closed in the existing finally block; locate the duplicate span.end()
invocations near the checks using DUPLICATE_TX_ALERT_GRACE_PERIOD and the
logging of Transaction ${hash} and delete those early endings, leaving final
span cleanup to the finally block.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2419a4a3-9843-4f32-aa37-8133b58f3ef4

📥 Commits

Reviewing files that changed from the base of the PR and between 2a51d0f and 81e8b39.

📒 Files selected for processing (3)
  • packages/daemon/docker-compose.dev.yml
  • packages/daemon/src/index.ts
  • packages/daemon/src/services/index.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/daemon/docker-compose.dev.yml


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
packages/daemon/src/index.ts (1)

18-19: Unused tracer variable.

The tracer variable is created but never used in this file. Only trace.getActiveSpan() is called directly. Consider removing the unused variable.

🔧 Proposed fix
```diff
-const tracer = trace.getTracer('wallet-service-daemon');
-
```

If you intend to use it later for manual spans, keep it but add a comment explaining the intended use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/index.ts` around lines 18 - 19, The variable tracer
created by calling trace.getTracer('wallet-service-daemon') is unused; either
remove the declaration of tracer to eliminate dead code, or if you plan to use
it for manual spans later, keep it and add a short inline comment next to the
trace.getTracer('wallet-service-daemon') call explaining its intended future use
(e.g., for creating manual spans), ensuring all current references rely on
trace.getActiveSpan() only.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88b60704-7f0b-4c5a-9104-682eb4a5d780

📥 Commits

Reviewing files that changed from the base of the PR and between 81e8b39 and a90389a.

⛔ Files ignored due to path filters (1)
  • yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (6)
  • packages/daemon/docker-compose.dev.yml
  • packages/daemon/env.development
  • packages/daemon/package.json
  • packages/daemon/src/index.ts
  • packages/daemon/src/services/index.ts
  • packages/daemon/src/tracing.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/daemon/env.development
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/daemon/docker-compose.dev.yml

@andreabadesso andreabadesso moved this from In Progress (Done) to In Progress (WIP) in Hathor Network Mar 19, 2026
@andreabadesso andreabadesso changed the title feat(daemon): add OpenTelemetry distributed tracing feat: add OpenTelemetry distributed tracing to daemon and Lambdas Mar 30, 2026
@andreabadesso andreabadesso force-pushed the feat/daemon-otel-tracing branch 2 times, most recently from c92fd9c to 9177a07 on March 30, 2026 12:19
@andreabadesso andreabadesso force-pushed the feat/daemon-otel-tracing branch from 9177a07 to b698bf6 on March 30, 2026 13:01
andreabadesso and others added 2 commits March 30, 2026 10:23
Implements RFC 0001-opentelemetry-observability for the daemon package.

- SDK init in tracing.ts with BatchSpanProcessor and OTLP HTTP exporter
- Auto-instrumentation for mysql2, http, and aws-sdk (fs/dns disabled)
- Manual spans on all critical event handlers via withSpan() helper
- Root event.processing.* spans via tracedService() wrapper in SyncMachine
- Winston log correlation: trace_id and span_id injected into log output
- Tracing is a no-op when OTEL_EXPORTER_OTLP_ENDPOINT is not set
- Graceful shutdown on SIGTERM with span flush
- Local dev setup: docker-compose.dev.yml with MySQL + Jaeger

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Manual SDK init (RFC Option B) with cherry-picked instrumentations to
stay under Lambda's 250MB unzipped size limit (+39MB vs +120MB for the
full auto-instrumentations-node meta-package).

Instrumentations: aws-lambda, aws-sdk, http, mysql, redis, winston.
Webpack entries prepended with tracing.ts for auto-instrumentation.

- OTEL_SDK_DISABLED=true to skip all OTel module loading (kill switch)
- OTEL_CONSOLE_DEBUG=true to print spans to CloudWatch logs
- OTEL_EXPORTER_OTLP_ENDPOINT to enable OTLP export

Cold start benchmark (A/B, n=7 each on getLatestBlock):
  - Init overhead: +706ms (1164ms -> 1870ms)
  - Warm handler median overhead: +6.3ms
  - Warm handler p95: periodic 60-200ms spikes from BatchSpanProcessor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andreabadesso andreabadesso force-pushed the feat/daemon-otel-tracing branch from b698bf6 to 79a6ee8 on March 30, 2026 13:23
@andreabadesso andreabadesso moved this from In Progress (WIP) to In Progress (Done) in Hathor Network Mar 30, 2026
pedroferreira1 previously approved these changes Apr 6, 2026
@andreabadesso andreabadesso moved this from In Progress (Done) to In Review (WIP) in Hathor Network Apr 6, 2026
FULLNODE_HOST should be explicitly set, not default to hathor-dynasty.network.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pedroferreira1 previously approved these changes Apr 6, 2026
andreabadesso and others added 7 commits April 8, 2026 15:14
Allow Lambdas to pick up parent trace context from callers (e.g. daemon
invoking Lambdas via AWS SDK), enabling end-to-end distributed traces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
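
One plausible way to wire this up (an assumption; the commit may use a different carrier or extractor): the daemon injects W3C trace headers into the invoke payload, and the Lambda's aws-lambda instrumentation extracts them as the parent context.

```typescript
import { context as otelContext, propagation, Context } from '@opentelemetry/api';
import { AwsLambdaInstrumentation } from '@opentelemetry/instrumentation-aws-lambda';

// Sketch: read W3C trace headers (traceparent/tracestate) off a hypothetical
// `headers` field in the invoke payload and use them as the parent context.
const lambdaInstrumentation = new AwsLambdaInstrumentation({
  eventContextExtractor: (event): Context =>
    propagation.extract(otelContext.active(), event?.headers ?? {}),
});
```
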
Each service function already creates its own span via startActiveSpan
internally, so the tracedService wrapper in SyncMachine produced
duplicate spans for every event. Register services directly instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use only HTTP, MySQL, and Winston instrumentations instead of loading
all available auto-instrumentations. Reduces memory footprint and avoids
instrumenting unused libraries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Read the version from package.json at startup instead of defaulting to
'unknown', so traces always have a meaningful version without requiring
Kubernetes manifest changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SIGINT handler alongside SIGTERM so that spans in the export queue
are flushed when the process is interrupted (e.g. Ctrl+C in dev).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap the entire OTel initialization in an OTEL_SDK_DISABLED check,
matching the Lambda tracing pattern. Uses require() so all OTel modules
are fully skipped when tracing is disabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andreabadesso andreabadesso merged commit 4c9d606 into master Apr 9, 2026
3 checks passed
@github-project-automation github-project-automation Bot moved this from In Review (WIP) to Waiting to be deployed in Hathor Network Apr 9, 2026
@andreabadesso andreabadesso deleted the feat/daemon-otel-tracing branch April 9, 2026 12:05
@andreabadesso andreabadesso moved this from Waiting to be deployed to Done in Hathor Network Apr 14, 2026

Labels

enhancement (New feature or request)

Projects

Status: Done
