fix: distributed tracing propagation for TCP transport and SGLang integration#5122

Closed

ishandhanani wants to merge 6 commits into main from ishan/tracingupdate

Conversation


@ishandhanani ishandhanani commented Dec 31, 2025

Summary

This PR fixes distributed tracing by addressing several issues:

  • SGLang Backend: Update to use new external_trace_header API from SGLang PR #15814
  • HTTP Frontend: Fix parent context linking so http-request spans are properly linked to incoming trace context
  • TCP Transport: Fix trace header propagation - the root cause of broken tracing

Root Cause

The TCP transport was silently dropping all trace headers. Only x-endpoint-path was being used for routing, but traceparent, tracestate, x-request-id, etc. were never included in the wire protocol. This caused frontend and worker spans to appear disconnected in Tempo/Grafana.

NATS transport worked correctly because it has native header support.

Changes

SGLang Backend (components/src/dynamo/sglang/)

  • Replace _propagate_trace_context_to_sglang() with _get_trace_header()
  • Pass trace headers directly to async_generate(external_trace_header=...) instead of using global state
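
The header construction can be sketched as follows. This is an illustrative Python version, not the actual code in `handler_base.py`; the function name and the trailing sampled-flag byte are assumptions:

```python
from typing import Dict, Optional


def build_traceparent(trace_id: Optional[str], span_id: Optional[str]) -> Optional[Dict[str, str]]:
    """Build a W3C traceparent header dict, or None when no trace is active.

    The real _get_trace_header reads trace_id/span_id from the runtime
    Context object and passes the result to
    engine.async_generate(external_trace_header=...).
    """
    if not trace_id or not span_id:
        return None
    # version "00", sampled flag "01" (flag value is an assumption)
    return {"traceparent": f"00-{trace_id}-{span_id}-01"}
```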

HTTP Frontend (lib/runtime/src/logging.rs)

  • Add extract_otel_context_from_http_headers() to extract OTEL context from incoming requests
  • Update make_request_span() to call span.set_parent(context) for proper parent linking
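
On the receiving side, the extraction step boils down to parsing the W3C `traceparent` header back into ids. A minimal Python sketch of that idea (the actual Rust helper in `logging.rs` additionally builds an OpenTelemetry `Context` and passes it to `span.set_parent`):

```python
from typing import Optional, Tuple


def parse_traceparent(headers: dict) -> Optional[Tuple[str, str]]:
    """Parse a W3C traceparent header into (trace_id, parent_span_id).

    Returns None when the header is missing or malformed, in which case
    the request span starts a new trace instead of linking to a parent.
    """
    value = headers.get("traceparent")
    if value is None:
        return None
    parts = value.split("-")
    # layout: version - trace_id (32 hex) - span_id (16 hex) - flags
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return parts[1], parts[2]
```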

TCP Transport Wire Protocol (lib/runtime/src/pipeline/network/)

  • Add headers: HashMap<String, String> field to TcpRequestMessage
  • Update wire format: endpoint_path_len | endpoint_path | headers_len | headers_json | payload_len | payload
  • Update tcp_client.rs to include headers when sending
  • Update shared_tcp_endpoint.rs to extract headers and create properly linked spans
  • Add make_handle_payload_span_from_tcp_headers() helper function
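
The updated frame layout can be illustrated with a small encoder/decoder pair. This is a Python sketch of the wire format described above, not the actual Rust codec; the length widths match the sizes the review notes for the codec (u16 big-endian for the path and headers sections, u32 big-endian for the payload):

```python
import json
import struct
from typing import Dict, Tuple


def encode_tcp_request(endpoint_path: str, headers: Dict[str, str], payload: bytes) -> bytes:
    """Frame layout:
    endpoint_path_len (u16 BE) | endpoint_path | headers_len (u16 BE) |
    headers_json | payload_len (u32 BE) | payload
    """
    path_bytes = endpoint_path.encode()
    headers_bytes = json.dumps(headers).encode()
    return (
        struct.pack(">H", len(path_bytes)) + path_bytes
        + struct.pack(">H", len(headers_bytes)) + headers_bytes
        + struct.pack(">I", len(payload)) + payload
    )


def decode_tcp_request(buf: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Inverse of encode_tcp_request: walk the frame section by section."""
    off = 0
    (path_len,) = struct.unpack_from(">H", buf, off); off += 2
    path = buf[off:off + path_len].decode(); off += path_len
    (headers_len,) = struct.unpack_from(">H", buf, off); off += 2
    headers = json.loads(buf[off:off + headers_len]); off += headers_len
    (payload_len,) = struct.unpack_from(">I", buf, off); off += 4
    payload = buf[off:off + payload_len]
    return path, headers, payload
```

Because the headers section sits between the path and the payload, any reader that skips it (as the pre-existing mock servers in the tests do) will misread the headers bytes as a payload length.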

Test plan

  • Build succeeds
  • Aggregated serving with --enable-otel works
  • Verify traces appear correctly linked in Tempo/Grafana
  • Test disaggregated serving

Dependencies

Requires SGLang with PR #15814 merged or installed from that branch.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Enhanced distributed tracing with improved OpenTelemetry context propagation across HTTP, TCP, and NATS communication protocols.
    • Added trace header support in TCP request messages for better trace context flow.
  • Refactor

    • Optimized trace context handling and Prometheus client initialization timing.
    • Simplified internal trace propagation logic to use header-based approach.


ishandhanani and others added 2 commits December 31, 2025 22:15
Move prometheus_client imports from module level to inside functions
to ensure they occur AFTER SGLang's set_prometheus_multiproc_dir() is called.

Problem:
- prometheus_client was imported at module level in publisher.py and prometheus.py
- This happened before sgl.Engine() called set_prometheus_multiproc_dir()
- prometheus_client initialized in single-process mode, ignoring PROMETHEUS_MULTIPROC_DIR
- TokenizerMetricsCollector metrics were stored in memory only, not mmap'd files
- MultiProcessCollector couldn't find them when scraping /metrics

Solution:
- Move imports inside functions that are called after engine initialization
- publisher.py: import in setup_prometheus_registry()
- prometheus.py: import in get_prometheus_expfmt()

Affected metrics now correctly exposed:
- sglang:prompt_tokens_total
- sglang:generation_tokens_total
- sglang:time_to_first_token_seconds
- sglang:e2e_request_latency_seconds
- sglang:inter_token_latency_seconds
- sglang:num_requests_total
- sglang:cached_tokens_total
- sglang:num_retractions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
fix: distributed tracing propagation for TCP transport and SGLang integration

This PR fixes distributed tracing by:

1. SGLang Backend: Update to use new `external_trace_header` API from SGLang PR #15814
   - Replace `_propagate_trace_context_to_sglang()` with `_get_trace_header()`
   - Pass trace headers directly to `async_generate()` instead of using global state

2. HTTP Frontend: Fix parent context linking for incoming requests
   - Add `extract_otel_context_from_http_headers()` to extract OTEL context
   - Update `make_request_span()` to call `span.set_parent(context)`

3. TCP Transport: Fix trace header propagation (root cause of broken tracing)
   - Add `headers` field to `TcpRequestMessage` wire protocol
   - Include trace headers (traceparent, tracestate, etc.) in TCP messages
   - Add `make_handle_payload_span_from_tcp_headers()` for worker-side span creation

The TCP transport was silently dropping all trace headers - only using
`x-endpoint-path` for routing. This caused frontend and worker spans to be
disconnected. NATS transport worked correctly due to native header support.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ishandhanani ishandhanani requested review from a team as code owners December 31, 2025 23:16
@github-actions github-actions bot added the fix label Dec 31, 2025
@ishandhanani ishandhanani changed the base branch from main to fix/sglang-tokenizer-metrics-prometheus December 31, 2025 23:16

coderabbitai bot commented Dec 31, 2025

Walkthrough

This change defers prometheus_client imports after engine initialization and refactors trace context handling in SGLang request handlers from a propagation-based approach to header-based. Simultaneously, it adds OpenTelemetry context extraction from HTTP/TCP/NATS headers and extends the TCP wire protocol to carry trace headers in request messages.

Changes

Cohort / File(s) Summary
Prometheus initialization deferral
components/src/dynamo/common/utils/prometheus.py, components/src/dynamo/sglang/publisher.py
Moved top-level imports of prometheus_client classes to lazy, in-function imports within setup functions; added TYPE_CHECKING block in publisher.py; updated return type annotation to forward reference string.
Trace context refactoring (Python handlers)
components/src/dynamo/sglang/request_handlers/handler_base.py, components/src/dynamo/sglang/request_handlers/llm/decode_handler.py, components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py
Replaced _propagate_trace_context_to_sglang invocations with _get_trace_header method that returns a simple traceparent header dict or None; integrated header into external_trace_header parameter for engine calls.
OpenTelemetry context extraction (Rust logging)
lib/runtime/src/logging.rs
Added helpers to extract OpenTelemetry context from HTTP/TCP/NATS headers; introduced public make_handle_payload_span_from_tcp_headers function and private extraction helpers; integrated OTEL context as parent in span creation.
TCP message protocol & headers propagation
lib/runtime/src/pipeline/network/codec.rs, lib/runtime/src/pipeline/network/egress/tcp_client.rs, lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
Extended TcpRequestMessage to include headers field; added with_headers constructor and updated JSON serialization/deserialization in codec; TCP client now passes headers to message construction; ingress endpoint reads and decodes headers from wire protocol and uses them for span creation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Traces now hop through headers bright,
Prometheus defers with all its might,
OpenTelemetry spreads its wings,
TCP carries all these things!
Context flows where it should be,
Protocol changes, clean and free! 🌟

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 76.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Description check ✅ Passed — The PR description is comprehensive and well-structured, covering root cause, changes, dependencies, and test plan.
  • Title check ✅ Passed — The title accurately summarizes the main changes: distributed tracing propagation for TCP transport and SGLang integration, aligning with the modifications across multiple files and components.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)

662-680: Tests don't account for the new headers section in the wire protocol.

The mock server reads path_len, path, then immediately payload_len and payload, but the new wire format includes headers_len and headers between path and payload. This will cause the test to misinterpret headers data as the payload length.

The same issue exists in test_concurrent_requests_single_connection (lines 720-741) and test_connection_pool_reuse (lines 819-834).

🔎 Proposed fix for test_connection_health_check
             let mut path_buf = vec![0u8; path_len];
             read_half.read_exact(&mut path_buf).await.unwrap();

+            // Read headers length (2 bytes)
+            let mut headers_len_buf = [0u8; 2];
+            read_half.read_exact(&mut headers_len_buf).await.unwrap();
+            let headers_len = u16::from_be_bytes(headers_len_buf) as usize;
+
+            // Read headers
+            let mut headers_buf = vec![0u8; headers_len];
+            read_half.read_exact(&mut headers_buf).await.unwrap();
+
             let mut len_buf = [0u8; 4];
             read_half.read_exact(&mut len_buf).await.unwrap();
             let payload_len = u32::from_be_bytes(len_buf) as usize;
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)

119-141: Fix Black formatting to pass CI.

The pipeline indicates Black formatting failed on these lines. Run black components/src/dynamo/sglang/request_handlers/llm/decode_handler.py to fix the formatting issues before merging.

🧹 Nitpick comments (3)
lib/runtime/src/pipeline/network/codec.rs (1)

70-107: JSON encoding for headers is acceptable but adds overhead.

Using JSON for headers serialization is straightforward and flexible. For high-throughput scenarios, consider whether a more compact binary format might be beneficial in the future.

lib/runtime/src/logging.rs (1)

301-334: Consider consolidating the three Extractor implementations.

HttpHeaderExtractor, TcpHeaderExtractor, and NatsHeaderExtractor follow nearly identical patterns. A generic extractor or shared trait implementation could reduce duplication.

🔎 Potential consolidation approach
// Generic extractor that works with any header source
struct GenericHeaderExtractor<F>(F)
where
    F: Fn(&str) -> Option<&str>;

impl<F> Extractor for GenericHeaderExtractor<F>
where
    F: Fn(&str) -> Option<&str>,
{
    fn get(&self, key: &str) -> Option<&str> {
        (self.0)(key)
    }

    fn keys(&self) -> Vec<&str> {
        vec!["traceparent", "tracestate"]
            .into_iter()
            .filter(|&key| self.get(key).is_some())
            .collect()
    }
}

Also applies to: 424-465, 467-508

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)

122-122: Trace header integration looks correct.

The trace header retrieval and passing to engine.async_generate is correctly implemented in both disaggregated and aggregated paths, consistent with the new header-based tracing approach.

💡 Optional: Consider extracting repeated trace header logic

The conditional trace_header = self._get_trace_header(context) if self.enable_trace else None appears in both paths. You could optionally extract this to reduce duplication:

def _get_trace_header_if_enabled(self, context: Context) -> Optional[Dict[str, str]]:
    """Get trace header if tracing is enabled."""
    return self._get_trace_header(context) if self.enable_trace else None

Then use: trace_header = self._get_trace_header_if_enabled(context)

However, given the simplicity and clarity of the current pattern, this may not be necessary.

Also applies to: 131-131, 142-142, 148-148

📜 Review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b33c1d and d484c44.

📒 Files selected for processing (9)
  • components/src/dynamo/common/utils/prometheus.py
  • components/src/dynamo/sglang/publisher.py
  • components/src/dynamo/sglang/request_handlers/handler_base.py
  • components/src/dynamo/sglang/request_handlers/llm/decode_handler.py
  • components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py
  • lib/runtime/src/logging.rs
  • lib/runtime/src/pipeline/network/codec.rs
  • lib/runtime/src/pipeline/network/egress/tcp_client.rs
  • lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-19T07:32:44.210Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 0
File: :0-0
Timestamp: 2025-09-19T07:32:44.210Z
Learning: The skip_tokenizer_init=True path in SGLang backend bypasses tokenization but has array slicing overhead in _process_token_stream that creates O(n) memory copying on every stream chunk, potentially causing quadratic behavior for long sequences.

Applied to files:

  • components/src/dynamo/sglang/request_handlers/llm/decode_handler.py
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

  • lib/runtime/src/pipeline/network/codec.rs
  • lib/runtime/src/pipeline/network/egress/tcp_client.rs
  • lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
🧬 Code graph analysis (5)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)
components/src/dynamo/sglang/request_handlers/handler_base.py (1)
  • _get_trace_header (143-156)
lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)
lib/runtime/src/pipeline/network/codec.rs (1)
  • with_headers (46-56)
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (1)
lib/runtime/src/logging.rs (1)
  • make_handle_payload_span_from_tcp_headers (380-422)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1)
components/src/dynamo/sglang/request_handlers/handler_base.py (1)
  • _get_trace_header (143-156)
components/src/dynamo/sglang/request_handlers/handler_base.py (1)
lib/bindings/python/src/dynamo/_core.pyi (3)
  • Context (275-360)
  • trace_id (333-340)
  • span_id (343-350)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/5122/merge) by ishandhanani.
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py

[error] 119-141: Black formatting failed: 1 file reformatted by this hook. The pre-commit run modified the file(s); please review and commit the changes. Re-run the pre-commit hook (e.g., 'pre-commit run --all-files').

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: vllm (amd64)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/bindings/python)
🔇 Additional comments (11)
components/src/dynamo/common/utils/prometheus.py (1)

120-145: LGTM! Lazy import pattern correctly addresses multi-process Prometheus initialization.

The deferred import of generate_latest ensures prometheus_client initializes after PROMETHEUS_MULTIPROC_DIR is set, preventing metrics collection failures in SGLang's multi-process architecture.

components/src/dynamo/sglang/publisher.py (1)

14-16: LGTM! Consistent lazy import pattern for prometheus_client.

The TYPE_CHECKING guard with forward reference annotation correctly preserves type hints while deferring the runtime import until after SGLang engine initialization.

Also applies to: 227-250

lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)

327-329: LGTM! Headers now propagated through TCP transport.

The change from TcpRequestMessage::new() to TcpRequestMessage::with_headers() correctly includes trace headers in the encoded message, enabling distributed tracing context propagation.

lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (2)

269-276: LGTM! Headers extraction from TCP wire protocol.

Correctly reads the headers section (length prefix + JSON data) from the updated wire format.


377-389: LGTM! Trace-aware span creation from TCP headers.

The span is now created using make_handle_payload_span_from_tcp_headers with headers extracted from the TCP message, enabling proper distributed trace context propagation and parent linking.

lib/runtime/src/pipeline/network/codec.rs (2)

30-56: LGTM! TcpRequestMessage extended with headers support.

The dual constructor pattern (new() for backward compatibility, with_headers() for trace propagation) is clean. Empty headers map for new() maintains compatibility with existing code.


241-264: LGTM! Decoder correctly handles the new wire format.

The peeking logic properly accounts for the headers section when determining if a complete message is available, calculating header_size as 2 + endpoint_len + 2 + headers_len + 4.

lib/runtime/src/logging.rs (2)

281-298: LGTM! HTTP request spans now properly linked to parent trace context.

Extracting OTEL context from incoming HTTP headers and calling span.set_parent(context) ensures frontend spans correctly connect to upstream traces in Tempo/Grafana.


379-422: LGTM! TCP header span creation mirrors NATS pattern.

The function correctly extracts trace context from TCP headers and creates properly-linked spans for distributed tracing through the TCP transport.

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1)

116-116: LGTM! Clean integration with the new trace header approach.

The conditional trace header retrieval and passing to engine.async_generate correctly implements the new header-based distributed tracing mechanism, replacing the previous propagation approach.

Also applies to: 125-125

components/src/dynamo/sglang/request_handlers/handler_base.py (1)

143-156: Add validation or document that trace_id and span_id conform to W3C traceparent format.

SGLang 0.5.6.post2 does support the external_trace_header parameter. However, the _get_trace_header() method constructs a W3C traceparent header without validating that context.trace_id is 32 lowercase hex characters and context.span_id is 16 lowercase hex characters as required by the W3C Trace Context specification. Either add format validation before constructing the traceparent header, or document that the Context object guarantees these values conform to the specification.
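
A validation helper of the kind this comment suggests could look like the following sketch; the function name is hypothetical, and the constraints are taken from the W3C Trace Context requirements quoted above:

```python
import re

# W3C Trace Context: trace-id is 32 lowercase hex chars, parent-id is 16
_TRACE_ID_RE = re.compile(r"^[0-9a-f]{32}$")
_SPAN_ID_RE = re.compile(r"^[0-9a-f]{16}$")


def is_valid_trace_ids(trace_id: str, span_id: str) -> bool:
    """Return True only when both ids satisfy the W3C traceparent format.

    All-zero ids are explicitly invalid per the specification.
    """
    if not _TRACE_ID_RE.match(trace_id) or not _SPAN_ID_RE.match(span_id):
        return False
    return trace_id != "0" * 32 and span_id != "0" * 16
```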

@ishandhanani

Code Duplication Analysis

Reviewed the PR for potential duplication between NATS and TCP request plane implementations.

Identified Duplication

There are 3 sets of nearly identical functions in lib/runtime/src/logging.rs:

| Function | Header Type | Lines |
| --- | --- | --- |
| extract_otel_context_from_http_headers | http::HeaderMap | 302-333 |
| extract_otel_context_from_tcp_headers | HashMap<String, String> | 425-465 |
| extract_otel_context_from_nats_headers | async_nats::HeaderMap | 468-508 |

| Function | Header Type | Lines |
| --- | --- | --- |
| make_handle_payload_span_from_tcp_headers | HashMap<String, String> | 380-422 |
| make_handle_payload_span | async_nats::HeaderMap | 337-377 |

Analysis

The duplication exists because each header type has a different API:

  • NATS: headers.get(key).map(|v| v.as_str())
  • HashMap: headers.get(key).map(|s| s.as_str())
  • HTTP: headers.get(key).and_then(|v| v.to_str().ok())

Recommendation

The duplication is acceptable for now because:

  1. Type safety: Each function is strongly typed to its header type, preventing misuse
  2. Minimal overhead: The Extractor structs are defined inline and zero-cost
  3. Different contexts: HTTP is for frontend, NATS for legacy transport, TCP for new transport
  4. Trait abstraction cost: Creating a HeaderMap trait would require:
    • A new trait definition
    • Implementing it for 3 external types (orphan rules may apply)
    • Generic functions with trait bounds
    • Potentially more complex error handling

If we wanted to refactor (future work)

We could create a simple trait:

trait HeaderAccessor {
    fn get_header(&self, key: &str) -> Option<&str>;
}

And implement it for each type, then have a single generic:

fn extract_otel_context<H: HeaderAccessor>(headers: &H) -> (Option<Context>, Option<String>, Option<String>)

But this adds complexity for ~40 lines of duplicated code that's unlikely to change frequently.

Verdict

The current code is fine. The duplication is localized, the functions are small, and the pattern is clear. Refactoring would add abstraction overhead without significant benefit. If we add more transport types in the future, we should consider the trait approach.

@ishandhanani ishandhanani changed the title from "fix: distributed tracing propagation for TCP transport and SGLang integration" to "[MERGE AFTER SGLANG 0.5.7] fix: distributed tracing propagation for TCP transport and SGLang integration" Dec 31, 2025
When spawn_prefill_task uses tokio::spawn, the spawned task loses the
current span context. This causes get_distributed_tracing_context() to
return None, preventing trace headers from being injected into prefill
requests.

Changes:
- Move tracing::Instrument import to top of file
- Capture current span and use .instrument(span) to propagate trace
  context to the spawned task
- Add prefill_routing span to track prefill routing timing
- Add kv_find_best_match span to track KV worker selection time

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@Swipe4057

@ishandhanani

Thanks @Swipe4057 - I'll probably make the bump to 0.5.7 on main branch and then merge this one in

Base automatically changed from fix/sglang-tokenizer-metrics-prometheus to main January 3, 2026 19:45
@ishandhanani ishandhanani changed the base branch from main to update-sglang-057 January 3, 2026 20:54
@ishandhanani ishandhanani changed the title from "[MERGE AFTER SGLANG 0.5.7] fix: distributed tracing propagation for TCP transport and SGLang integration" to "fix: distributed tracing propagation for TCP transport and SGLang integration" Jan 3, 2026
Base automatically changed from update-sglang-057 to main January 8, 2026 09:00
Copy link
Contributor

@nnshah1 nnshah1 left a comment


Looks good to me in general but would like @jh-nv to review and comment.

@ishandhanani

#5346

@dagil-nvidia

Auto-linked to DIS-1228
