feat(diagnostics-otel): OpenTelemetry diagnostics with GenAI semantic conventions #7

Closed
Baukebrenninkmeijer wants to merge 3711 commits into main from otel-diagnostics-fixes

@Baukebrenninkmeijer
Collaborator

Summary

Upgrades the @openclaw/diagnostics-otel exporter to produce structured, per-call telemetry aligned with OpenTelemetry GenAI semantic conventions.

  • Run-level parent span (openclaw.agent.turn) per agent turn
  • Per-inference spans for each LLM call (initial, post-tool followups, loops)
  • Tool execution spans with gen_ai.tool.* attributes
  • Opt-in content capture (diagnostics.otel.captureContent) for messages and tool I/O
  • GenAI metrics: gen_ai.client.operation.duration, gen_ai.client.time_to_first_token, gen_ai.client.token.usage
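The opt-in content capture listed above can be pictured as a small attribute builder. This is a minimal sketch under stated assumptions, not the exporter's actual code: `InferenceCall`, `buildInferenceAttributes`, and all field names are hypothetical, and the `gen_ai.*` keys are written from the OpenTelemetry GenAI semantic conventions as understood here, so they should be checked against the convention version being targeted.

```typescript
// Hypothetical shape of one LLM call; not taken from the exporter source.
interface InferenceCall {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  inputMessages: unknown[];
  outputMessages: unknown[];
}

type SpanAttributes = Record<string, string | number>;

// Builds span attributes for one inference call.
// Timings and usage are always recorded; message content only when opted in.
function buildInferenceAttributes(
  call: InferenceCall,
  captureContent: boolean,
): SpanAttributes {
  const attributes: SpanAttributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": call.provider,
    "gen_ai.request.model": call.model,
    "gen_ai.usage.input_tokens": call.inputTokens,
    "gen_ai.usage.output_tokens": call.outputTokens,
  };
  if (captureContent) {
    // Content is serialized to JSON strings so it fits a flat attribute map.
    attributes["gen_ai.input.messages"] = JSON.stringify(call.inputMessages);
    attributes["gen_ai.output.messages"] = JSON.stringify(call.outputMessages);
  }
  return attributes;
}
```

With `captureContent` set to `false`, the returned map carries no message keys at all, which mirrors the intent that disabled capture still leaves usage and model metadata on the span.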

Event model

Replaces the monolithic model.usage event with a structured lifecycle:

  1. run.started — agent turn begins
  2. model.inference.started — LLM call begins (captures input messages, system instructions, tool definitions)
  3. model.inference — LLM call ends (duration, TTFT, usage, output messages)
  4. tool.execution — tool call (duration, errors, optional I/O)
  5. run.completed — agent turn ends (aggregate usage, cost, duration)
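The lifecycle above maps naturally onto a discriminated union. The sketch below is illustrative only: the five event names come from this description, while every field (`runId`, `callId`, `durationMs`, and so on) and the `summarizeRun` helper are hypothetical placeholders rather than the types added in this PR.

```typescript
// Event names come from the PR description; all fields are hypothetical.
type DiagnosticEvent =
  | { type: "run.started"; runId: string }
  | { type: "model.inference.started"; runId: string; callId: string }
  | {
      type: "model.inference";
      runId: string;
      callId: string;
      durationMs: number;
      inputTokens: number;
      outputTokens: number;
    }
  | { type: "tool.execution"; runId: string; toolName: string; durationMs: number; error?: string }
  | { type: "run.completed"; runId: string; durationMs: number };

interface RunSummary {
  inferenceCalls: number;
  toolCalls: number;
  toolErrors: number;
  inputTokens: number;
  outputTokens: number;
  completed: boolean;
}

// Folds the events of one run into aggregate counts.
function summarizeRun(events: readonly DiagnosticEvent[], runId: string): RunSummary {
  const summary: RunSummary = {
    inferenceCalls: 0,
    toolCalls: 0,
    toolErrors: 0,
    inputTokens: 0,
    outputTokens: 0,
    completed: false,
  };
  for (const event of events) {
    if (event.runId !== runId) continue;
    switch (event.type) {
      case "model.inference":
        summary.inferenceCalls += 1;
        summary.inputTokens += event.inputTokens;
        summary.outputTokens += event.outputTokens;
        break;
      case "tool.execution":
        summary.toolCalls += 1;
        if (event.error !== undefined) summary.toolErrors += 1;
        break;
      case "run.completed":
        summary.completed = true;
        break;
      default:
        // run.started and model.inference.started carry no totals in this sketch.
        break;
    }
  }
  return summary;
}
```

Switching on `type` lets the compiler narrow each branch, so a per-call consumer can read `inputTokens` only where the event actually carries usage.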

Key design decisions

  • Input messages captured at the actual model-call boundary (not from streaming state)
  • Content capture gated behind diagnostics.otel.captureContent — when disabled, spans still include timings/usage/errors
  • Provider names normalized to GenAI enum (openai, anthropic, gcp.gemini, etc.)
  • Symbol.for() used for the global diagnostic state key (one registry-wide symbol that every module copy resolves to, without colliding with string-keyed globals)
  • Recursion guard (dispatchDepth) retained for diagnostic event dispatch safety
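Two of the decisions above, the `Symbol.for()` state key and the `dispatchDepth` recursion guard, can be sketched together. This is an assumption-laden illustration, not the code in this PR: the key string, the depth limit, and the names `getState` and `emitDiagnosticEvent` are hypothetical; only `Symbol.for()` and the `dispatchDepth` counter are named in the description.

```typescript
type Listener = (event: { type: string }) => void;

interface DiagnosticState {
  listeners: Set<Listener>;
  dispatchDepth: number;
}

// Hypothetical key and limit. Symbol.for() returns the same symbol for the
// same string everywhere in the runtime, so duplicate copies of this module
// end up sharing a single state object.
const STATE_KEY = Symbol.for("example.diagnostics.state");
const MAX_DISPATCH_DEPTH = 10;

function getState(): DiagnosticState {
  const store = globalThis as unknown as { [key: symbol]: DiagnosticState | undefined };
  let state = store[STATE_KEY];
  if (!state) {
    state = { listeners: new Set<Listener>(), dispatchDepth: 0 };
    store[STATE_KEY] = state;
  }
  return state;
}

// Delivers an event to all listeners. Returns false when the event is dropped
// because listeners re-emitted too deeply.
function emitDiagnosticEvent(event: { type: string }): boolean {
  const state = getState();
  if (state.dispatchDepth >= MAX_DISPATCH_DEPTH) return false;
  state.dispatchDepth += 1;
  try {
    // Copy the set so listeners may unsubscribe while being notified.
    for (const listener of [...state.listeners]) listener(event);
  } finally {
    state.dispatchDepth -= 1;
  }
  return true;
}
```

The `finally` block matters here: a listener that throws would otherwise leave the depth counter raised and silently suppress later events.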

Test plan

  • pnpm vitest run extensions/diagnostics-otel/src/service.test.ts
  • pnpm vitest run extensions/diagnostics-otel/src/service.metrics.test.ts
  • pnpm vitest run extensions/diagnostics-otel/src/service.spans.test.ts
  • pnpm vitest run src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.emits-diagnostic-tool-execution-events.test.ts
  • pnpm vitest run src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.emits-diagnostic-sessionkey.test.ts
  • pnpm vitest run src/commands/agent.diagnostics.test.ts
  • npx tsc --noEmit reports only pre-existing e2e test errors

🤖 Generated with Claude Code

steipete and others added 30 commits February 19, 2026 07:45
steipete and others added 29 commits February 19, 2026 15:29
* changelog: add security deepMerge prototype-pollution fix entry

* update: refresh gateway service env during update restart

* test(cli): fix daemon install mock assertion

* test(cli): guard update restart false path
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 1beca3a
Co-authored-by: mbelinky <132747814+mbelinky@users.noreply.github.com>
Co-authored-by: mbelinky <132747814+mbelinky@users.noreply.github.com>
Reviewed-by: @mbelinky
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 31a27b0
Co-authored-by: mbelinky <132747814+mbelinky@users.noreply.github.com>
Co-authored-by: mbelinky <132747814+mbelinky@users.noreply.github.com>
Reviewed-by: @mbelinky
…s to user context (openclaw#20597)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 175919a
Co-authored-by: anisoptera <768771+anisoptera@users.noreply.github.com>
Co-authored-by: mbelinky <132747814+mbelinky@users.noreply.github.com>
Reviewed-by: @mbelinky
…aw#21226)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 7705a77
Co-authored-by: mbelinky <132747814+mbelinky@users.noreply.github.com>
Co-authored-by: mbelinky <132747814+mbelinky@users.noreply.github.com>
Reviewed-by: @mbelinky
* fix(docker): pin base images to SHA256 digests for supply chain security

Pin all 9 Dockerfiles to immutable SHA256 digests to prevent supply chain
attacks where a compromised upstream image could be silently pulled into
production builds.

Also add Docker ecosystem to Dependabot configuration for automated
digest updates.

Images pinned:
- node:22-bookworm@sha256:cd7bcd2e7a1e6f72052feb023c7f6b722205d3fcab7bbcbd2d1bfdab10b1e935
- node:22-bookworm-slim@sha256:3cfe526ec8dd62013b8843e8e5d4877e297b886e5aace4a59fec25dc20736e45
- debian:bookworm-slim@sha256:98f4b71de414932439ac6ac690d7060df1f27161073c5036a7553723881bffbe
- ubuntu:24.04@sha256:cd1dba651b3080c3686ecf4e3c4220f026b521fb76978881737d24f200828b2b

Fixes openclaw#7731

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test(docker): add digest pinning regression coverage

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
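A digest-pinning regression check of the kind that commit mentions could look roughly like the sketch below. It is not the test added in the commit: `findUnpinnedBaseImages` is a hypothetical helper, and it assumes single-line `FROM` instructions with no build-arg substitution in the image reference.

```typescript
// A pinned reference ends in @sha256: followed by 64 lowercase hex characters.
const DIGEST_REF = /@sha256:[0-9a-f]{64}$/;

// Returns the base images in a Dockerfile that are not pinned to a digest.
// References to earlier build stages and to "scratch" are not reported.
function findUnpinnedBaseImages(dockerfile: string): string[] {
  const knownStages = new Set<string>(["scratch"]);
  const unpinned: string[] = [];
  for (const rawLine of dockerfile.split(/\r?\n/)) {
    const tokens = rawLine.trim().split(/\s+/);
    if (tokens[0]?.toUpperCase() !== "FROM") continue;
    // Drop flags such as --platform=linux/amd64.
    const args = tokens.slice(1).filter((token) => !token.startsWith("--"));
    const image = args[0];
    if (!image) continue;
    const isEarlierStage = knownStages.has(image.toLowerCase());
    if (!isEarlierStage && !DIGEST_REF.test(image)) unpinned.push(image);
    // Record the stage alias only after checking this line's own image.
    if (args[1]?.toUpperCase() === "AS" && args[2]) {
      knownStages.add(args[2].toLowerCase());
    }
  }
  return unpinned;
}
```

A test can then read each Dockerfile in the repository and assert that the returned list is empty.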
…nclaw#21086)

* fix: treat HTTP 503 as failover-eligible for LLM provider errors

When LLM SDKs wrap 503 responses, the leading "503" prefix is lost
(e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a
numeric prefix). The existing isTransientHttpError only matches
messages starting with "503 ...", so these wrapped errors silently
skip failover — no profile rotation, no model fallback.

This patch closes that gap:

- resolveFailoverReasonFromError: map HTTP status 503 → rate_limit
  (covers structured error objects with a status field)
- ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable",
  "high demand" (covers message-only classification when the leading
  status prefix is absent)

Existing isTransientHttpError behavior is unchanged; these additions
are complementary and only fire for errors that previously fell
through unclassified.

* fix: address review feedback — drop /\b503\b/ pattern, add test coverage

- Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the
  semantic inconsistency noted by reviewers: `isTransientHttpError`
  already handles messages prefixed with "503" (→ "timeout"), so a
  redundant overloaded pattern would classify the same class of errors
  differently depending on message formatting.

- Keep "service unavailable" and "high demand" patterns — these are the
  real gap-fillers for SDK-rewritten messages that lack a numeric prefix.

- Add test case for JSON-wrapped 503 error body containing "overloaded"
  to strengthen coverage.

* fix: unify 503 classification — status 503 → timeout (consistent with isTransientHttpError)

resolveFailoverReasonFromError previously mapped status 503 → "rate_limit",
while the string-based isTransientHttpError mapped "503 ..." → "timeout".

Align both paths: structured {status: 503} now also returns "timeout",
matching the existing transient-error convention. Both reasons are
failover-eligible, so runtime behavior is unchanged.

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
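The end state described across the three 503 commits above can be summarized in a short sketch. The function and pattern names here are hypothetical stand-ins (the commits refer to `resolveFailoverReasonFromError`, `isTransientHttpError`, and `ERROR_PATTERNS.overloaded`, whose real signatures are not shown), and case-insensitive matching is an assumption.

```typescript
type FailoverReason = "timeout" | "overloaded" | null;

// Message fragments that signal overload when no numeric prefix survives.
// Case-insensitive matching is an assumption of this sketch.
const OVERLOADED_PATTERNS: RegExp[] = [/service unavailable/i, /high demand/i, /overloaded/i];

// Hypothetical classifier combining the structured and message-only paths.
function classifyProviderError(error: { status?: number; message?: string }): FailoverReason {
  // Structured status 503 maps to "timeout", matching the string-based path.
  if (error.status === 503) return "timeout";
  const message = (error.message ?? "").trim();
  // Messages that still start with "503" follow the transient-error convention.
  if (/^503\b/.test(message)) return "timeout";
  // SDK-rewritten messages without the prefix fall back to text patterns.
  if (OVERLOADED_PATTERNS.some((pattern) => pattern.test(message))) return "overloaded";
  return null;
}
```

Both non-null reasons are failover-eligible per the commit text, so the distinction affects reporting rather than whether failover happens.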
… conventions

- Align OTel spans with GenAI semantic conventions (gen_ai.* attributes, metrics, tool spans)
- Add inference spans, content capture gating, and tool execution diagnostic events
- Add OTel trace lifecycle to followup runner and subagent runs
- Split oversized service into otel-event-handlers, otel-metrics, and otel-utils modules
- Fix trace header leak, conversation ID consistency, and span context validation
- Include cached tokens in gen_ai.usage.input_tokens calculation
- Guard inner exporter to ensure resultCallback is always invoked
- Add comprehensive test coverage for spans, metrics, and diagnostic events

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add missing trace-context-propagator.ts source module (PR review comment)
- Add DiagnosticModelUsageEvent type to diagnostic-events.ts union
- Add stateDir to OTel test createTestCtx() helpers (new required field)
- Fix stop() signature to accept ctx parameter in OTel service
- Add sessionKey/sessionId/channel to ToolHandlerParams Pick type
- Remove duplicate sessionKey from SubscribeEmbeddedPiSessionParams
- Fix type casts for mock .attributes/.kind access in test files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>