feat(diagnostics-otel): OpenTelemetry diagnostics with GenAI semantic conventions #7
Closed
Baukebrenninkmeijer wants to merge 3711 commits into
* changelog: add security deepMerge prototype-pollution fix entry
* update: refresh gateway service env during update restart
* test(cli): fix daemon install mock assertion
* test(cli): guard update restart false path
…s to user context (openclaw#20597)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 175919a
Co-authored-by: anisoptera <768771+anisoptera@users.noreply.github.com>
Co-authored-by: mbelinky <132747814+mbelinky@users.noreply.github.com>
Reviewed-by: @mbelinky
* fix(docker): pin base images to SHA256 digests for supply chain security

Pin all 9 Dockerfiles to immutable SHA256 digests to prevent supply chain attacks where a compromised upstream image could be silently pulled into production builds. Also add Docker ecosystem to Dependabot configuration for automated digest updates.

Images pinned:
- node:22-bookworm@sha256:cd7bcd2e7a1e6f72052feb023c7f6b722205d3fcab7bbcbd2d1bfdab10b1e935
- node:22-bookworm-slim@sha256:3cfe526ec8dd62013b8843e8e5d4877e297b886e5aace4a59fec25dc20736e45
- debian:bookworm-slim@sha256:98f4b71de414932439ac6ac690d7060df1f27161073c5036a7553723881bffbe
- ubuntu:24.04@sha256:cd1dba651b3080c3686ecf4e3c4220f026b521fb76978881737d24f200828b2b

Fixes openclaw#7731

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test(docker): add digest pinning regression coverage

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
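The commit does not show how its digest pinning regression coverage works. As a rough sketch of what such a check might look like, the hypothetical helper below (`findUnpinnedBaseImages` is not a name from the repository) scans Dockerfile text and reports `FROM` lines whose image reference lacks a `@sha256:` digest. References to earlier build stages and `scratch` are skipped.

```typescript
// Hypothetical helper, not taken from the repository: list FROM lines that
// reference a base image without an immutable sha256 digest.
function findUnpinnedBaseImages(dockerfile: string): string[] {
  const unpinned: string[] = [];
  const stageNames = new Set<string>();

  for (const rawLine of dockerfile.split(/\r?\n/)) {
    const line = rawLine.trim();
    // FROM [--platform=...] <image> [AS <stage>]
    const match = /^FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?/i.exec(line);
    if (!match) continue;

    const image = match[1];
    const alias = match[2];

    // "scratch" and earlier build stages are not pullable images, so they cannot be pinned.
    const isStageRef = image !== undefined && stageNames.has(image.toLowerCase());
    if (image !== undefined && image !== "scratch" && !isStageRef) {
      if (!/@sha256:[0-9a-f]{64}$/.test(image)) unpinned.push(line);
    }

    // Remember the stage alias so later "FROM <alias>" lines are recognised.
    if (alias !== undefined) stageNames.add(alias.toLowerCase());
  }
  return unpinned;
}
```

A test could read each Dockerfile from disk and assert that the returned list is empty. This sketch ignores `ARG`-substituted image names, which would need separate handling.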
…nclaw#21086)

* fix: treat HTTP 503 as failover-eligible for LLM provider errors

When LLM SDKs wrap 503 responses, the leading "503" prefix is lost (e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a numeric prefix). The existing isTransientHttpError only matches messages starting with "503 ...", so these wrapped errors silently skip failover: no profile rotation, no model fallback.

This patch closes that gap:
- resolveFailoverReasonFromError: map HTTP status 503 → rate_limit (covers structured error objects with a status field)
- ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable", "high demand" (covers message-only classification when the leading status prefix is absent)

Existing isTransientHttpError behavior is unchanged; these additions are complementary and only fire for errors that previously fell through unclassified.

* fix: address review feedback; drop /\b503\b/ pattern, add test coverage

- Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the semantic inconsistency noted by reviewers: `isTransientHttpError` already handles messages prefixed with "503" (→ "timeout"), so a redundant overloaded pattern would classify the same class of errors differently depending on message formatting.
- Keep "service unavailable" and "high demand" patterns, which are the real gap-fillers for SDK-rewritten messages that lack a numeric prefix.
- Add test case for JSON-wrapped 503 error body containing "overloaded" to strengthen coverage.

* fix: unify 503 classification; status 503 → timeout (consistent with isTransientHttpError)

resolveFailoverReasonFromError previously mapped status 503 → "rate_limit", while the string-based isTransientHttpError mapped "503 ..." → "timeout". Align both paths: structured {status: 503} now also returns "timeout", matching the existing transient-error convention. Both reasons are failover-eligible, so runtime behavior is unchanged.

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
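The final state described by these three commits can be sketched as a single classifier. This is an illustration under assumptions, not the repository's code: `classifyProviderError`, `ProviderErrorLike`, and the `"overloaded"` return label are made up for the example, and the source does not say which failover reason the overloaded patterns ultimately resolve to.

```typescript
// Illustrative only: names and the "overloaded" label are assumptions.
type FailoverReason = "timeout" | "overloaded";

interface ProviderErrorLike {
  status?: number;
  message?: string;
}

// Only the two message fragments named in the commit are listed here.
const OVERLOADED_PATTERNS: readonly string[] = ["service unavailable", "high demand"];

function classifyProviderError(err: ProviderErrorLike): FailoverReason | null {
  // Structured path: an explicit HTTP status of 503 is treated as transient.
  if (err.status === 503) return "timeout";

  const message = (err.message ?? "").trim().toLowerCase();

  // String path: the message still carries its numeric prefix, e.g. "503 Service Unavailable".
  if (/^503\b/.test(message)) return "timeout";

  // Message-only path: SDK-rewritten errors that lost the numeric prefix.
  if (OVERLOADED_PATTERNS.some((pattern) => message.includes(pattern))) return "overloaded";

  return null;
}
```

Checking the numeric forms before the text patterns keeps a message such as "503 Service Unavailable" on the same classification as a structured `{ status: 503 }` error, which is the consistency the last commit asks for.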
… conventions

- Align OTel spans with GenAI semantic conventions (gen_ai.* attributes, metrics, tool spans)
- Add inference spans, content capture gating, and tool execution diagnostic events
- Add OTel trace lifecycle to followup runner and subagent runs
- Split oversized service into otel-event-handlers, otel-metrics, and otel-utils modules
- Fix trace header leak, conversation ID consistency, and span context validation
- Include cached tokens in gen_ai.usage.input_tokens calculation
- Guard inner exporter to ensure resultCallback is always invoked
- Add comprehensive test coverage for spans, metrics, and diagnostic events

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
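The "guard inner exporter" item can be illustrated with a small wrapper. The sketch below uses simplified local stand-ins for the OpenTelemetry exporter types (`ExportResultLike` and `ExporterLike` are assumptions, with `0` meaning success and `1` meaning failure) and a hypothetical `guardExporter` function. It covers two failure modes: the inner exporter throwing synchronously, and the inner exporter invoking the callback more than once.

```typescript
// Simplified stand-ins for the OpenTelemetry export types (assumed shapes).
interface ExportResultLike {
  code: 0 | 1; // 0 = success, 1 = failed
  error?: Error;
}

interface ExporterLike<T> {
  export(items: T[], resultCallback: (result: ExportResultLike) => void): void;
}

// Hypothetical wrapper: the caller's callback fires exactly once per export call,
// even if the inner exporter throws or reports a result twice.
function guardExporter<T>(inner: ExporterLike<T>): ExporterLike<T> {
  return {
    export(items, resultCallback) {
      let settled = false;
      const settle = (result: ExportResultLike): void => {
        if (settled) return;
        settled = true;
        resultCallback(result);
      };
      try {
        inner.export(items, settle);
      } catch (err) {
        // Convert a synchronous throw into a failed export result.
        settle({ code: 1, error: err instanceof Error ? err : new Error(String(err)) });
      }
    },
  };
}
```

An inner exporter that neither throws nor calls back would still leave the caller waiting; handling that case would need a timeout, which this sketch leaves out.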
- Add missing trace-context-propagator.ts source module (PR review comment)
- Add DiagnosticModelUsageEvent type to diagnostic-events.ts union
- Add stateDir to OTel test createTestCtx() helpers (new required field)
- Fix stop() signature to accept ctx parameter in OTel service
- Add sessionKey/sessionId/channel to ToolHandlerParams Pick type
- Remove duplicate sessionKey from SubscribeEmbeddedPiSessionParams
- Fix type casts for mock .attributes/.kind access in test files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Upgrades the `@openclaw/diagnostics-otel` exporter to produce structured, per-call telemetry aligned with OpenTelemetry GenAI semantic conventions.

- … (`openclaw.agent.turn`) per agent turn
- … `gen_ai.tool.*` attributes
- … (`diagnostics.otel.captureContent`) for messages and tool I/O
- … `gen_ai.client.operation.duration`, `gen_ai.client.time_to_first_token`, `gen_ai.client.token.usage`

Event model
Replaces the monolithic `model.usage` event with a structured lifecycle:

- `run.started`: agent turn begins
- `model.inference.started`: LLM call begins (captures input messages, system instructions, tool definitions)
- `model.inference`: LLM call ends (duration, TTFT, usage, output messages)
- `tool.execution`: tool call (duration, errors, optional I/O)
- `run.completed`: agent turn ends (aggregate usage, cost, duration)

Key design decisions
- … `diagnostics.otel.captureContent`: when disabled, spans still include timings/usage/errors
- … (`openai`, `anthropic`, `gcp.gemini`, etc.)
- `Symbol.for()` used for global diagnostic state key (better cross-module isolation)
- … (`dispatchDepth`) retained for diagnostic event dispatch safety

Test plan
- `pnpm vitest run extensions/diagnostics-otel/src/service.test.ts`
- `pnpm vitest run extensions/diagnostics-otel/src/service.metrics.test.ts`
- `pnpm vitest run extensions/diagnostics-otel/src/service.spans.test.ts`
- `pnpm vitest run src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.emits-diagnostic-tool-execution-events.test.ts`
- `pnpm vitest run src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.emits-diagnostic-sessionkey.test.ts`
- `pnpm vitest run src/commands/agent.diagnostics.test.ts`
- `npx tsc --noEmit` passes (only pre-existing e2e test errors)

🤖 Generated with Claude Code
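To make the event lifecycle and the content-capture gate from the description more concrete, here is a minimal sketch. The event names come from the PR text. Every field name, the `inferenceSpanAttributes` helper, and the choice of `gen_ai.output.messages` as the content attribute are assumptions for illustration only.

```typescript
// Event names are from the PR description; all field names are illustrative assumptions.
type DiagnosticEvent =
  | { type: "run.started"; runId: string }
  | { type: "model.inference.started"; runId: string; inputMessages?: string[] }
  | {
      type: "model.inference";
      runId: string;
      durationMs: number;
      inputTokens: number;
      outputTokens: number;
      outputMessages?: string[];
      error?: string;
    }
  | { type: "tool.execution"; runId: string; toolName: string; durationMs: number; error?: string }
  | { type: "run.completed"; runId: string; durationMs: number };

type InferenceEvent = Extract<DiagnosticEvent, { type: "model.inference" }>;
type SpanAttributes = Record<string, string | number>;

// Hypothetical mapping: usage and errors are always recorded,
// message content only when capture is explicitly enabled.
function inferenceSpanAttributes(event: InferenceEvent, captureContent: boolean): SpanAttributes {
  const attrs: SpanAttributes = {
    "gen_ai.usage.input_tokens": event.inputTokens,
    "gen_ai.usage.output_tokens": event.outputTokens,
  };
  if (event.error !== undefined) attrs["error.type"] = event.error;
  if (captureContent && event.outputMessages !== undefined) {
    attrs["gen_ai.output.messages"] = JSON.stringify(event.outputMessages);
  }
  return attrs;
}
```

Modelling the events as a discriminated union lets a handler switch on `type` and get field-level type checking per event. Timing is assumed to come from the span's own start and end timestamps, so it is not duplicated as an attribute here.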
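The `Symbol.for()` state key and the `dispatchDepth` guard listed under the design decisions can be sketched together. `Symbol.for()` returns the same symbol for the same key string anywhere in a realm, so separately loaded copies of a module can find one shared state object. The key string, the depth limit, and the function names below are hypothetical and not taken from the PR.

```typescript
type DiagnosticListener = (event: { type: string }) => void;

interface DiagnosticState {
  listeners: Set<DiagnosticListener>;
  dispatchDepth: number;
}

// Hypothetical key string and limit; the PR does not state the actual values.
const STATE_KEY = Symbol.for("example.diagnostics.state");
const MAX_DISPATCH_DEPTH = 8;

// Look up (or lazily create) the shared state stored on globalThis.
function getDiagnosticState(): DiagnosticState {
  const store = globalThis as unknown as Record<symbol, DiagnosticState | undefined>;
  let state = store[STATE_KEY];
  if (state === undefined) {
    state = { listeners: new Set<DiagnosticListener>(), dispatchDepth: 0 };
    store[STATE_KEY] = state;
  }
  return state;
}

// Dispatch an event; returns false when dropped because listeners are
// re-emitting recursively beyond the depth limit.
function emitDiagnosticEvent(event: { type: string }): boolean {
  const state = getDiagnosticState();
  if (state.dispatchDepth >= MAX_DISPATCH_DEPTH) return false;
  state.dispatchDepth += 1;
  try {
    for (const listener of Array.from(state.listeners)) listener(event);
  } finally {
    // Always restore the depth, even if a listener throws.
    state.dispatchDepth -= 1;
  }
  return true;
}
```

The depth counter stops a listener that emits events from inside its own handler from recursing without bound, while still allowing a bounded amount of nested dispatch.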