feat(diagnostics-otel): nested message spans + conversation.id fix #3
…d for conversation.id
- Create root span on message.queued, end it on message.processed instead of creating a separate orphan span. This ensures the full message lifecycle (queue → agent turn → LLM calls → tools → done) appears as a single nested trace rather than disconnected spans.
- Parent agent.turn spans under the root message span via activeTraces map keyed by sessionKey.
- Use sessionId (stable conversation identifier that resets on /reset) for gen_ai.conversation.id instead of sessionKey (composite key including channel and agent info that never changes). This aligns with the OTEL GenAI semantic convention for thread/session correlation.
- Clean up activeTraces on service stop.
handleAgentStart now resolves sessionKey via getAgentRunContext(runId) and includes it in the run.started diagnostic event. This allows the OTEL plugin to parent agent.turn spans under the root openclaw.message span via direct sessionKey lookup, since message.queued events use sessionKey as the correlation key.
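The correlation described in these two commits can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `SketchSpan`, `registerRunContext`, `onMessageQueued`, `onRunStarted`, `onMessageProcessed`, and the `"completed"`/`"error"` outcome values are hypothetical stand-ins, not the real OpenClaw handlers or the OpenTelemetry API. Only the `activeTraces` map keyed by `sessionKey` and the span names come from the description above.

```typescript
// Hypothetical, simplified span model; the real plugin uses OpenTelemetry spans.
interface SketchSpan {
  name: string;
  parent?: SketchSpan;
  ended: boolean;
  status: "UNSET" | "OK" | "ERROR";
}

// Stand-in for the run context registry (registerAgentRunContext / getAgentRunContext).
const runContexts = new Map<string, { sessionKey?: string }>();

function registerRunContext(runId: string, ctx: { sessionKey?: string }): void {
  runContexts.set(runId, ctx);
}

// Root spans keyed by sessionKey, mirroring the activeTraces map.
const activeTraces = new Map<string, SketchSpan>();

// message.queued: open the root span if a correlation key is available.
function onMessageQueued(sessionKey: string | undefined): SketchSpan | undefined {
  if (!sessionKey) return undefined; // no key, so no root span is created
  const root: SketchSpan = { name: "openclaw.message", ended: false, status: "UNSET" };
  activeTraces.set(sessionKey, root);
  return root;
}

// run.started: resolve sessionKey from the run context and parent the turn span.
function onRunStarted(runId: string): SketchSpan {
  const sessionKey = runContexts.get(runId)?.sessionKey;
  const parent = sessionKey ? activeTraces.get(sessionKey) : undefined;
  return { name: "openclaw.agent.turn", parent, ended: false, status: "UNSET" };
}

// message.processed: end the root span instead of creating a new one.
function onMessageProcessed(sessionKey: string | undefined, outcome: "completed" | "error"): void {
  if (!sessionKey) return;
  const root = activeTraces.get(sessionKey);
  if (!root) return; // unmatched event: nothing to end, and nothing thrown
  root.status = outcome === "error" ? "ERROR" : "OK";
  root.ended = true;
  activeTraces.delete(sessionKey);
}
```

In this model, a run whose context has no `sessionKey` simply yields an unparented `openclaw.agent.turn` span, which is the disconnected-trace symptom the core change is meant to remove.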
@rogeriochaves: Great PR! The architectural approach is solid. Here's a comprehensive review covering code quality, error handling, test coverage, and comment accuracy.
Critical Issues (2 found)
1.
| Priority | Gap |
|---|---|
| P0 | Root span → agent.turn parent-child nesting not verified |
| P0 | message.processed ending the root span (.end() called) not verified |
| P1 | Error outcome path (SpanStatusCode.ERROR) not verified |
| P1 | message.processed without matching message.queued |
| P2 | stop() cleanup of orphaned root traces |
| P2 | message.queued without sessionKey (no span created) |
The existing test infrastructure already supports all of these (the trace.setSpan mock, _parentCtx capture, etc.).
Strengths
- The architectural approach is sound: nesting the full message lifecycle under a root span eliminates the disconnected orphan traces problem
- The core fix in `pi-embedded-subscribe.handlers.lifecycle.ts` is clean and minimal (3 lines)
- Using `sessionId` for `gen_ai.conversation.id` correctly aligns with OTEL GenAI semantic conventions
- Comments explaining the "why" (especially the `gen_ai.conversation.id` rationale and the lifecycle sequence in `recordMessageQueued`) are well-written
- Metrics recording is preserved regardless of trace state
Recommended Action
- Fix the two critical issues (memory leak + incomplete `conversation.id` migration) before merge
- Address the important issues (silent trace drop, missing attributes, test mock, non-null assertion)
- Add test coverage for at minimum the P0 gaps (nesting verification + span ending)
- Suggestions can be addressed as follow-ups
- Add TTL cleanup for activeTraces map (prevents unbounded memory leak)
- Add outcome/duration/messageId attributes to root span before ending
- Add debug log when message.processed has no matching root trace
- Add debug log when getAgentRunContext returns no sessionKey
- Fix SpanStatusCode.OK/UNSET missing from test mock
- Remove non-null assertion on evt.sessionKey
- Fix comment accuracy for message lifecycle hierarchy
- Add P0 test coverage: nesting verification, span ending, error path, and graceful handling of unmatched message.processed
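As an illustration of the TTL cleanup item, here is a minimal sketch of a map that evicts and ends root spans that never received a matching `message.processed`. The `ActiveTraces` class, the `EndableSpan` interface, and the explicit `now` parameters are assumptions made for this example; the commit itself may implement the cleanup differently.

```typescript
// Hypothetical minimal span shape; a real OpenTelemetry Span also exposes end().
interface EndableSpan {
  end(): void;
}

interface TraceEntry<S extends EndableSpan> {
  span: S;
  createdAt: number; // millisecond timestamp of message.queued
}

class ActiveTraces<S extends EndableSpan> {
  private entries = new Map<string, TraceEntry<S>>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Record a root span for a sessionKey.
  set(sessionKey: string, span: S, now: number = Date.now()): void {
    this.entries.set(sessionKey, { span, createdAt: now });
  }

  // Remove and return the root span, e.g. when message.processed arrives.
  take(sessionKey: string): S | undefined {
    const entry = this.entries.get(sessionKey);
    this.entries.delete(sessionKey);
    return entry?.span;
  }

  // End and drop every entry older than the TTL; returns how many were removed.
  sweep(now: number = Date.now()): number {
    let removed = 0;
    this.entries.forEach((entry, key) => {
      if (now - entry.createdAt > this.ttlMs) {
        entry.span.end();
        this.entries.delete(key);
        removed++;
      }
    });
    return removed;
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A periodic timer could call `sweep()`, and the service stop path could call it with a far-future `now` to flush everything; both are design options rather than a description of the merged code.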
Review feedback addressed (fd83273)
Pushed a commit addressing the review feedback:
Fixed
New tests (4 P0 tests added, now 27 total)
Not addressed
e9ce4d2 into orq-ai:otel-diagnostics-fixes
Summary
Two improvements to the OTEL diagnostics plugin's trace structure and semantic compliance, plus a core fix to ensure proper span nesting.
1. Nested message lifecycle spans (no more orphan `message.processed` traces)
Problem: The `message.processed` diagnostic event was creating a standalone OTEL span with no parent context. This resulted in two separate traces appearing in observability platforms for every single message: one for the actual agent work (`openclaw.agent.turn` with nested LLM/tool spans) and a disconnected orphan for `openclaw.message.processed`.
Root cause: In OpenClaw, the message lifecycle is a sequence of internal diagnostic events:
The `message.queued`/`message.processed` pair represents the outer envelope: the full lifecycle from when a message enters the queue to when processing completes (including queue wait time). The `run.*` and `model.inference.*` events represent the inner work: the actual agent execution.
From an OTEL semantic perspective, these are not independent operations; they are nested levels of the same unit of work. A message being processed IS the agent turn; the queue wait + agent execution + completion are phases of one trace, not separate ones. Creating a standalone span for `message.processed` is like having both a function call AND its return statement as separate traces: it double-counts and breaks the parent-child hierarchy.
Fix:
- `message.queued` now creates a root span (`openclaw.message`) that becomes the parent for all subsequent spans in that session
- `agent.turn` spans (created by `ensureRunSpan`) are automatically parented under this root span via `activeTraces` lookup by `sessionKey`
- `message.processed` ends the root span instead of creating a new one, setting OK/ERROR status based on the outcome
The resulting trace hierarchy:
2. Propagate `sessionKey` to `run.started` diagnostic event (core fix)
Problem: The `run.started` event only carried `sessionId` (from the agent session) but not `sessionKey` (from the message queue layer). Since the OTEL plugin uses `sessionKey` to correlate the root `openclaw.message` span (created at `message.queued` time) with child spans, the parent-child nesting silently failed, producing two separate traces instead of one nested hierarchy.
Root cause: `handleAgentStart` in `pi-embedded-subscribe.handlers.lifecycle.ts` emitted `run.started` without `sessionKey` because the subscribe params didn't include it. However, `registerAgentRunContext(runId, { sessionKey })` is already called before the run starts (in `agent-runner-execution.ts`), making `sessionKey` available via `getAgentRunContext(runId)`.
Fix: `handleAgentStart` now resolves `sessionKey` via `getAgentRunContext(runId)` and includes it in the `run.started` event. This is a 3-line change to core that enables clean parent-child span nesting without any workarounds in the plugin.
3. Use `sessionId` for `gen_ai.conversation.id` (OTEL GenAI semantic convention)
Problem: `gen_ai.conversation.id` was set to `sessionKey` (e.g., `telegram:myagent:12345`), which is a composite key that includes channel and agent info and never changes across `/reset` commands. This meant all messages in a channel were permanently grouped as the same "conversation" in downstream platforms.
Fix: Changed to use `sessionId` instead, which is the stable conversation identifier that resets when the user starts a new session (e.g., via `/reset`). This aligns with the OTEL GenAI semantic convention where `gen_ai.conversation.id` is defined as "the unique identifier for a conversation (session, thread), used to store and correlate messages within this conversation."
Test plan
- … `openclaw.message`
- … `SpanStatusCode.ERROR`
- `recordMessageProcessed` with no matching root trace doesn't throw
- `gen_ai.conversation.id` uses `sessionId` (not `sessionKey`)
- `/reset` → send message → verify new `thread_id` in LangWatch
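The `/reset` check in the test plan relies on `gen_ai.conversation.id` following `sessionId` rather than `sessionKey`. A minimal sketch of that attribute selection follows. The `SessionRef` type, the `conversationAttributes` helper, the `openclaw.session_key` attribute name, and the sample `session-a`/`session-b` ids are hypothetical and only illustrate the idea; only `gen_ai.conversation.id` and the example `sessionKey` value come from the description above.

```typescript
// Hypothetical shape of the identifiers carried by a diagnostic event.
interface SessionRef {
  sessionKey: string; // composite key, e.g. "telegram:myagent:12345"; unchanged by /reset
  sessionId?: string; // conversation identifier; a new one is issued after /reset
}

// Build span attributes so that gen_ai.conversation.id tracks sessionId only.
function conversationAttributes(ref: SessionRef): Record<string, string> {
  const attrs: Record<string, string> = {
    // Hypothetical attribute name, used here only to keep the composite key visible.
    "openclaw.session_key": ref.sessionKey,
  };
  if (ref.sessionId) {
    attrs["gen_ai.conversation.id"] = ref.sessionId;
  }
  return attrs;
}

// Same channel and agent, before and after a /reset (sample ids are made up).
const beforeReset = conversationAttributes({
  sessionKey: "telegram:myagent:12345",
  sessionId: "session-a",
});
const afterReset = conversationAttributes({
  sessionKey: "telegram:myagent:12345",
  sessionId: "session-b",
});
```

With this selection, the two attribute sets share the same composite key but carry different `gen_ai.conversation.id` values, so a downstream platform would group them as separate conversations.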