-
Notifications
You must be signed in to change notification settings - Fork 2.3k
feat: tidy otel and implement http and MCP propagation #5151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
On the first pass this seems as a nice touch up. Thank you! |
|
@Kvadratni thanks and feedback welcome. I'm quite new at rust so any suggestions are welcome! |
|
putting this back in draft because the integration test isn't working. I will be out next week so bear this in mind. happy to get feedback meanwhile if you like, or you can wait until I pull it out of draft again |
Signed-off-by: Adrian Cole <[email protected]>
9f70920 to
4a5c925
Compare
|
Thank you so much for working on this. Just wanted to make sure - have you linked to the proper related issue? Just so it closes or updates what is related <3 |
|
@taniandjerry thanks for the feedback, so this topic is fairly ancient back to the python exchange days and most recently #203 and the polishing is a follow up to my old PR in python #75. Would it help for me to also make an issue which captures the motivation parts of the PR description then link this back to it? |
|
This PR is unrelated to hacktoberfest. @taniandjerry No worries on issue number @codefromthecrypt !! |
|
cool will retouch this up once #5165 merges so I make sure I get my code fashion together ;) |
|
@codefromthecrypt I'd love to have a completed version of this one. Do you have bandwidth to update it? |
|
sorry yeah.. I got distracted on a side-quest about the claude provider and MCP based approval integration with goose. lemme refocus on this now |
|
yesterday escaped me.. doing it now |
|
update: I'm working hard on this, I need to stop for the day. I don't want to raise this again without 100pct verified integration test approach for validating propagtion. This is somewhat new ground as the only thing similar is the MCP replay tests. I prefer to delay another day to do this right than not, so hopefully understandable. |
|
@codefromthecrypt of course - let us know whenever ready for a review! It's a big feature and will be huge to get it right |
|
I'm sorry, but I'm unable to complete this PR and have to step back from it. The code affected by instrumentation has diverged from main since I started, and my initial understanding was already limited, making it hard to catch up. I've struggled to implement robust tests that would prevent future drift. My limited Rust experience is a big factor here—this section involves intricate instrumentation across multiple components, without sufficient end-to-end checks to ensure spans are properly initiated, terminated, and parented. While I've built similar features in Java, Go, and Python (where CLI testing feels more flexible), the setup in this project has proven challenging for me. I've also tried using agents to assist, but they haven't been effective for this level of complexity. Ultimately, this has consumed more time than anticipated, leading to frustration and delaying other priorities. With upcoming travel, I don't foresee gaining the necessary proficiency in the next few weeks. I believe this would be better handled by someone more familiar with the project's quirks, stronger in Rust, and able to introduce a new fixture for comprehensive end-to-end side-effect testing—such as running the Goose CLI and/or Playwright with OTEL environment variables against mocked OpenAI and MCP services, then verifying both traces and propagated data. I'd be happy to sketch more what that would look like, but I have to call bankruptcy on salvaging this PR. |
|
PS I had an LLM reverse engineer a guide from my local changes. hope it helps the next person (and doens't have a hallucination I missed!) Async + OpenTelemetryLibraries in UseAsync Runtime
Tracing/OpenTelemetry
How OpenTelemetry Span Context WorksThread-Local Storage MechanismOpenTelemetry in Rust uses thread-local storage via tracing's subscriber system to track the "current span": // tracing internally uses thread_local!
thread_local! {
static CURRENT_SPAN: RefCell<Option<SpanContext>>;
}The Critical Problem with AsyncThread-local storage does NOT propagate across When an async task yields at
The Bridge: tracing-opentelemetryThe Key insight: There are two different span types:
They're connected via the Alternative: Task-Local Storagetokio::task_local! does propagate across task_local! {
pub static SESSION_ID: Option<String>;
}Used for session_id propagation (see commit 0515c76). Correct Instrumentation PatternsPattern 1: Regular Async Functions with #[instrument]Use #[instrument(name = "operation", skip(param1, param2), fields(session.id = %session_id))]
async fn operation(param1: String, param2: &Data, session_id: &str) -> Result<Output> {
// Span is active for entire function execution
// Propagates correctly across all .await points
let result = child_operation().await?;
Ok(result)
}How it works:
When to use:
Pattern 2: Manual Span CreationCreate spans manually when you need fine-grained control: fn create_chat_span(model_name: &str) -> tracing::Span {
tracing::info_span!(
"chat", // Span name for tracing
otel.name = format!("chat {}", model_name), // OTel display name
gen_ai.request.model = %model_name,
gen_ai.system = "openai",
gen_ai.operation.name = "chat"
)
}
async fn complete(...) -> Result<Message> {
let span = create_chat_span(&model_name);
let _guard = span.enter();
// All work happens under this span
let response = self.client.post(url).send().await?;
Ok(response)
}When to use instead of #[instrument]:
Why manual creation:
Pattern 3: Spawned TasksUse // Variant 1: Capture current span
let handle = tokio::spawn(
async move {
// work happens here with current span as parent
}.in_current_span()
);
// Variant 2: Propagate specific span
let span = tracing::Span::current();
let handle = tokio::spawn(
async move {
// work happens here with captured span as parent
}.instrument(span)
);How it works:
When to use each:
Pattern 4: Lazy Streams (async_stream::try_stream!)Capture span before stream creation, enter when stream executes: async fn create_stream(...) -> Result<BoxStream<'_, Result<Item>>> {
// 1. Capture current span before creating lazy stream
let span = tracing::Span::current();
Ok(Box::pin(async_stream::try_stream! {
// 2. Enter the span when stream executes
// IMPORTANT: Use underscore-prefixed name to keep guard alive
let _guard = span.enter();
// 3. All operations now have correct parent
loop {
let item = do_work().await?;
yield item;
}
// Guard drops here, exiting span
}))
}CRITICAL: Span Guard Lifetime // WRONG - guard drops immediately, span not active
let _ = span.enter();
// CORRECT - guard lives until end of scope
let _guard = span.enter();
let _span_guard = span.enter(); // Also correctThe underscore prefix ( Why this works:
Why NOT use .in_current_span():
OpenTelemetry Span HierarchySpan Context ComponentsEach span has:
Parent-Child RelationshipsWhen creating a span, OpenTelemetry:
Validation RulesA span context is valid only if:
Invalid span contexts:
Propagation Methods
Cross-Process Trace Context PropagationArchitecture OverviewW3C Trace Context FormatComponent Meanings:
State Transformation PipelineThe span context goes through several transformations when crossing process boundaries: HTTP PropagationInjection Point: The trace context is injected during HTTP request building, after the request is constructed but before it's sent. Example HTTP Request: POST /v1/chat/completions HTTP/1.1
Host: api.openai.com
Authorization: Bearer sk-...
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Content-Type: application/json
{ "model": "gpt-4o", "messages": [...] }When propagation is skipped:
MCP PropagationThe Extensions Mechanism: MCP uses a type-erased Example MCP Request: {
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {
"name": "read_file",
"arguments": { "path": "/tmp/file.txt" }
},
"_meta": {
"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
"session_id": "20251106_42"
}
}Key insights:
Remote Parent RelationshipsWhen trace context crosses a process boundary, the span hierarchy is maintained through ID linkage: Critical concept:
Receiver Implementation Pattern (Abstract)When a service receives a request with trace context: Propagation TimingThe trace context is extracted and injected at a specific point in the request lifecycle: Critical point: Propagation (t3) happens while the span is still active. The span context is captured at request-build time, not at span-creation time. Error ScenariosOpenTelemetry Semantic ConventionsGenAI Semantic ConventionsFor LLM operations, we follow the OpenTelemetry GenAI semantic conventions: Required attributes:
Span naming:
Why manual span creation:
Example: fn create_chat_span(model_name: &str) -> tracing::Span {
tracing::info_span!(
"chat", // Internal tracing name
otel.name = format!("chat {}", model_name), // OTel display name
gen_ai.request.model = %model_name,
gen_ai.system = "openai",
gen_ai.operation.name = "chat"
)
}Tool Execution SpansTool calls create spans with:
Pattern: fn tool_stream<S, F>(
rx: S,
done: F,
tool_name: Option<String>
) -> ToolStream {
Box::pin(async_stream::stream! {
// Create and enter span if tool name provided
let tool_span = tool_name.as_ref().map(|name| {
tracing::info_span!("gen_ai.tool.call", tool.name = %name)
});
let _span_guard = tool_span.as_ref().map(|span| span.enter());
// Tool execution happens here under span
// ...
// Span ends when guard drops
})
}CLI Execution SpansThe CLI creates root spans for recipe and run execution: Recipe execution: let span = tracing::info_span!(
"recipe.execute",
session.id = %session_id,
recipe.name = %recipe_name
);
let _enter = span.enter();
// Recipe execution happens hereRun execution: let span = tracing::info_span!(
"run.execute",
session.id = %session_id
);
let _enter = span.enter();
// Run execution happens hereThese root spans become the parent for all agent operations, creating a complete trace hierarchy. Example PatternsExample 1: LLM Provider Span Patternfn create_chat_span(model_name: &str) -> tracing::Span {
tracing::info_span!(
"chat",
otel.name = format!("chat {}", model_name),
gen_ai.request.model = %model_name,
gen_ai.system = "openai",
gen_ai.operation.name = "chat"
)
}
async fn stream(...) -> Result<MessageStream, ProviderError> {
// Create span before lazy stream
let span = create_chat_span(&self.model.model_name);
Ok(Box::pin(try_stream! {
// Enter span when stream executes
let _span_guard = span.enter();
// HTTP request happens here with correct parent span
// The API client will inject traceparent header automatically
let response = self.client.post(url).send().await?;
// Stream response processing
// ...
// Span ends when guard drops (stream completes)
}))
}Uses the lazy stream pattern with manual span creation for GenAI conventions. Example 2: Agent Reply Pattern#[instrument(name = "agent.reply", skip(self, user_message, session_config),
fields(user_message, session.id = %session_config.id))]
pub async fn reply(
&self,
user_message: Message,
session_config: &SessionConfig,
) -> Result<AgentStream> {
// #[instrument] creates span for this function
self.reply_internal(user_message, session_config).await
}
async fn reply_internal(
&self,
user_message: Message,
session_config: &SessionConfig,
) -> Result<BoxStream<'_, Result<AgentEvent>>> {
// Capture span from reply() before creating lazy stream
let reply_span = tracing::Span::current();
Ok(Box::pin(async_stream::try_stream! {
// Enter span when stream executes
let _reply_span_guard = reply_span.enter();
loop {
// LLM calls here have correct parent span (agent.reply)
let mut stream = Self::stream_response_from_provider(...).await?;
// Tool calls here also have correct parent span
let tool_result = self.dispatch_tool_call(...).await;
// All nested operations inherit the agent.reply span as parent
}
}))
}Two-level pattern:
Testing Span HierarchiesExpected Hierarchy ExampleConceptual Validation PatternWhen testing distributed traces, verify: 1. Parent-Child Relationships: // Child's parent_span_id should match parent's span_id
assert_eq!(
child_span.parent_span_id,
parent_span.span_context.span_id(),
"Child should link to parent"
);2. Shared Trace ID: // All spans in same trace share trace_id
assert_eq!(
child_span.span_context.trace_id(),
parent_span.span_context.trace_id(),
"All spans must share trace_id"
);3. Temporal Enclosure: // Parent must strictly enclose child in time
assert!(parent.start_time <= child.start_time);
assert!(parent.end_time >= child.end_time);4. Cross-Process Propagation: // Extract traceparent from request
let traceparent = /* extract from _meta or header */;
// Parse W3C format
let parts: Vec<&str> = traceparent.split('-').collect();
assert_eq!(parts.len(), 4, "Invalid W3C format");
assert_eq!(parts[0], "00", "Invalid version");
assert_eq!(parts[1].len(), 32, "Invalid trace_id length");
assert_eq!(parts[2].len(), 16, "Invalid span_id length");
// Verify span context is valid (not all zeros)
assert_ne!(parts[2], "0000000000000000", "Span ID must be valid");Testing Infrastructure Patterns (Abstract)For testing distributed tracing, you need: 1. Trace Collector:
2. Fake Servers:
3. Validation:
4. Test Flow: Debugging and TroubleshootingCommon Issues1. Spans not appearing in traces:
2. Broken span hierarchy (orphaned spans):
3. Missing trace context in remote calls:
4. Span guard lifetime issues:
Verification Techniques1. Check active span: let span = tracing::Span::current();
eprintln!("Current span: {:?}", span.metadata().map(|m| m.name()));2. Verify span context validity: use opentelemetry::trace::TraceContextExt;
use tracing_opentelemetry::OpenTelemetrySpanExt;
let context = tracing::Span::current().context();
let span = context.span();
let span_context = span.span_context();
eprintln!("Span context valid: {}", span_context.is_valid());
eprintln!("Trace ID: {:032x}", span_context.trace_id());
eprintln!("Span ID: {:016x}", span_context.span_id());3. Inspect propagated traceparent: // For HTTP (in request builder)
eprintln!("Injecting traceparent: {}", traceparent_header);
// For MCP (in extension injection)
eprintln!("Meta contains traceparent: {}", meta.get("traceparent").is_some());SummaryPattern Selection Guide
When to Use Manual Span CreationUse manual span creation instead of
Key Principles
|
Summary
This improves the OpenTelemetry implementation such that it correctly creates distributed traces.
This focuses on trace propagation, both to LLM providers (HTTP headers) and MCP servers (request metadata).
This also prefers standard otel environment configuration and defaults instead of hard-coding, allowing drop-in usage with the same ENV as python or other language agents.
Type of Change
Testing
$ docker run --rm --name aigw -p 1975:1975 \ -e OPENAI_BASE_URL=http://host.docker.internal:11434/v1 -e OPENAI_API_KEY=unused \ -e OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318 -e OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf \ -e OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta \ -e OTEL_METRIC_EXPORT_INTERVAL=100 -e OTEL_BSP_SCHEDULE_DELAY=100 \ envoyproxy/ai-gateway-cli run --mcp-json '{"mcpServers":{"kiwi":{"type":"http","url":"https://mcp.kiwi.com"}}}'Details
version: 1.0.0 title: Flight Search Assistant description: | This recipe helps users find flights from New York to Los Angeles using the MCP gateway. It searches for flights on a specified date and returns the top 3 cheapest options in a structured JSON format. instructions: | You are a helpful assistant that helps users find flights from New York to Los Angeles.prompt: |
Use the search-flight tool to search for flights from New York to Los Angeles on {{flight_date}}.
When specifying the departureDate parameter, use the format dd/mm/yyyy.
Display the first 3 results with the
final_outputtool in a specific JSON format.The output is in the following format where price includes the currency symbol:
ANY FLIGHT IS FINE. DO NOT CONSIDER ANY USER PREFERENCES.*
extensions:
type: streamable_http
uri: http://127.0.0.1:1975/mcp
parameters:
input_type: string
requirement: required
description: flight date in DD/MM/YYYY format
default: "31/12/2025"
settings:
Remove this if using GPT-5 series (e.g., gpt-5, gpt-5-mini, gpt-5-nano)
temperature: 0.0
response:
json_schema:
type: object
additionalProperties: false
required:
- contents
properties:
contents:
type: array
description: "First 3 flight results in a consistent, parseable structure."
minItems: 1
maxItems: 3
items:
type: object
additionalProperties: false
required:
- airline
- flight_number
- price
properties:
airline:
type: string
description: "Airline name"
flight_number:
type: string
description: "Flight number"
price:
type: string
description: "Price with currency symbol"
Related Issues
#75
Screenshots/Demos (for UX changes)