
fix: distributed tracing propagation for TCP transport#5283

Merged
jh-nv merged 6 commits into main from jihao/tcp_tracing_DYN-1720
Jan 9, 2026

Conversation

@jh-nv
Contributor

@jh-nv jh-nv commented Jan 8, 2026

Overview:

Propagate OTEL tracing context for TCP transport.

Details:

The TCP transport was silently dropping all trace headers. Only x-endpoint-path was being used for routing, but traceparent, tracestate, x-request-id, etc. were never included in the wire protocol. This caused frontend and worker spans to appear disconnected in Tempo/Grafana.
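For context, the traceparent header being dropped here follows the W3C Trace Context format: `version-traceid-parentid-flags`. A minimal std-only sketch of pulling the trace and parent IDs out of it (an illustrative helper, not the crate's actual parser):

```rust
// Sketch: parse a W3C `traceparent` header of the form
//   "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
// Returns (trace_id, parent_span_id) when the header is well-formed.
fn parse_traceparent(value: &str) -> Option<(String, String)> {
    let parts: Vec<&str> = value.split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (trace_id, parent_id) = (parts[1], parts[2]);
    let all_hex = |s: &str| s.chars().all(|c| c.is_ascii_hexdigit());
    // Enforce the fixed field widths from the spec.
    if trace_id.len() == 32 && parent_id.len() == 16 && all_hex(trace_id) && all_hex(parent_id) {
        Some((trace_id.to_string(), parent_id.to_string()))
    } else {
        None
    }
}
```

In the real code the OTEL propagator does this extraction; the sketch only shows why a missing or malformed traceparent means no parent context can be attached.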

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Extract OpenTelemetry trace context from HTTP, TCP, and NATS headers for end-to-end tracing.
    • TCP wire protocol now includes header blocks so trace and request IDs travel with messages.
    • Request handling uses header-derived trace spans for improved observability across transports.
    • Trace-related IDs (trace_id, parent_id, tracestate, x_request_id) are propagated into spans for cross-service correlation.


jh-nv and others added 3 commits January 8, 2026 13:12
This PR fixes distributed tracing by:

1. HTTP Frontend: Fix parent context linking for incoming requests
   - Add `extract_otel_context_from_http_headers()` to extract OTEL context
   - Update `make_request_span()` to call `span.set_parent(context)`

2. TCP Transport: Fix trace header propagation (root cause of broken tracing)
   - Add `headers` field to `TcpRequestMessage` wire protocol
   - Include trace headers (traceparent, tracestate, etc.) in TCP messages
   - Add `make_handle_payload_span_from_tcp_headers()` for worker-side span creation

The TCP transport was silently dropping all trace headers - only using
`x-endpoint-path` for routing. This caused frontend and worker spans to be
disconnected. NATS transport worked correctly due to native header support.
@coderabbitai
Contributor

coderabbitai bot commented Jan 8, 2026

Walkthrough

Adds OpenTelemetry context extraction and propagation across HTTP, TCP, and NATS; extends the TCP wire protocol to carry JSON-encoded headers; and uses extracted context to create/attach parent spans for request handling.

Changes

Cohort / File(s) / Summary

  • Tracing & OTEL helpers (lib/runtime/src/logging.rs):
    Added a centralized TRACE_PROPAGATOR, functions to extract OTEL context from HTTP/TCP/NATS headers (extract_otel_context_from_http_headers, extract_otel_context_from_tcp_headers, extract_otel_context_from_nats_headers), and a new span factory make_handle_payload_span_from_tcp_headers that attaches extracted context as a parent.
  • TCP wire format / codec (lib/runtime/src/pipeline/network/codec.rs):
    TcpRequestMessage now includes pub headers: HashMap<String, String> and with_headers(...); encoder/decoder updated to serialize/parse headers_len (u16) and JSON-encoded headers between endpoint_path and payload, with updated size calculations and error handling.
  • TCP egress / tests (lib/runtime/src/pipeline/network/egress/tcp_client.rs, tests/mocks...):
    TcpConnection::send_request now constructs requests via TcpRequestMessage::with_headers(...); tests and the mock server reads/writes updated to expect the additional headers segment.
  • TCP ingress / request handling (lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs):
    The read loop and message reconstruction now parse the headers segment; the per-request task uses make_handle_payload_span_from_tcp_headers(...) to create an instrumenting span (replacing the static info_span), and buffer capacity calculations are updated to include headers.
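The extended frame layout from the codec summary (path_len as big-endian u16, path, headers_len as u16, JSON-encoded headers, payload_len as u32, payload) can be sketched with plain-std framing helpers. The function names and error strings below are illustrative, and `headers_json` stands in for the serde_json-encoded header map the real codec produces:

```rust
use std::convert::{TryFrom, TryInto};

// Sketch of the extended wire format:
// [path_len: u16][path][headers_len: u16][headers JSON][payload_len: u32][payload]
fn encode_frame(path: &str, headers_json: &[u8], payload: &[u8]) -> Result<Vec<u8>, String> {
    let path_len = u16::try_from(path.len()).map_err(|_| "path exceeds u16::MAX".to_string())?;
    // The u16 length prefix is why the codec must reject oversized header maps.
    let headers_len =
        u16::try_from(headers_json.len()).map_err(|_| "headers exceed u16::MAX".to_string())?;
    let payload_len =
        u32::try_from(payload.len()).map_err(|_| "payload exceeds u32::MAX".to_string())?;
    let mut buf = Vec::with_capacity(2 + path.len() + 2 + headers_json.len() + 4 + payload.len());
    buf.extend_from_slice(&path_len.to_be_bytes());
    buf.extend_from_slice(path.as_bytes());
    buf.extend_from_slice(&headers_len.to_be_bytes());
    buf.extend_from_slice(headers_json);
    buf.extend_from_slice(&payload_len.to_be_bytes());
    buf.extend_from_slice(payload);
    Ok(buf)
}

fn decode_frame(buf: &[u8]) -> Result<(String, Vec<u8>, Vec<u8>), String> {
    // Bounds-checked cursor over the frame; errors instead of panicking on truncation.
    fn take<'a>(buf: &'a [u8], off: &mut usize, n: usize) -> Result<&'a [u8], String> {
        let s = buf.get(*off..*off + n).ok_or_else(|| "truncated frame".to_string())?;
        *off += n;
        Ok(s)
    }
    let mut off = 0;
    let path_len = u16::from_be_bytes(take(buf, &mut off, 2)?.try_into().unwrap()) as usize;
    let path = String::from_utf8(take(buf, &mut off, path_len)?.to_vec())
        .map_err(|_| "path is not UTF-8".to_string())?;
    let headers_len = u16::from_be_bytes(take(buf, &mut off, 2)?.try_into().unwrap()) as usize;
    let headers = take(buf, &mut off, headers_len)?.to_vec();
    let payload_len = u32::from_be_bytes(take(buf, &mut off, 4)?.try_into().unwrap()) as usize;
    let payload = take(buf, &mut off, payload_len)?.to_vec();
    Ok((path, headers, payload))
}
```

This framing also shows why the old mock servers break: a reader that skips the headers segment interprets headers_len bytes as the start of payload_len.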

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I nibble headers, stitch a thread so neat,
Traces hop along each TCP beat,
Parent span snug in a context bed,
From HTTP, NATS, the path is fed,
Rabbit hops—observability is sweet. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 69.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately describes the main change: adding distributed tracing propagation for TCP transport, the primary focus of the changeset.
  • Description check: ✅ Passed. The description covers Overview and Details with clear context about the problem (trace headers being dropped) and the solution. However, the "Where should the reviewer start?" section is empty, and the related issues section has a placeholder.



📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc074e0 and 8ff07a5.

📒 Files selected for processing (2)
  • lib/runtime/src/logging.rs
  • lib/runtime/src/pipeline/network/egress/tcp_client.rs
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

  • lib/runtime/src/pipeline/network/egress/tcp_client.rs
🧬 Code graph analysis (1)
lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)
lib/runtime/src/pipeline/network/codec.rs (1)
  • with_headers (46-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (.)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (7)
lib/runtime/src/logging.rs (5)

1078-1081: LGTM! Good optimization for propagator reuse.

The static TRACE_PROPAGATOR centralizes W3C Trace Context propagation and avoids repeated allocations. This is used consistently across all transport types (HTTP, TCP, NATS).


301-335: LGTM! Robust OTEL context extraction for HTTP.

The implementation properly handles edge cases:

  • Returns None if traceparent header is missing or empty
  • Validates the extracted span context before returning
  • Follows the same pattern as the NATS extraction function

380-423: LGTM! TCP header span creation mirrors NATS implementation.

The function correctly:

  • Extracts OTEL context and trace identifiers from TCP headers
  • Creates spans with full trace context when available
  • Falls back to spans without trace context gracefully
  • Sets the parent context for proper distributed trace linking

425-465: LGTM! TCP context extraction follows established patterns.

The implementation:

  • Correctly extracts OTEL context from HashMap headers
  • Matches the pattern used in extract_otel_context_from_nats_headers
  • Validates the span context before returning
  • Returns the tuple format expected by the span creation function

281-296: LGTM! HTTP request spans now properly link to parent traces.

The change extracts OTEL context from HTTP headers and sets it as the span's parent when available. This enables distributed tracing across HTTP boundaries.

lib/runtime/src/pipeline/network/egress/tcp_client.rs (2)

669-675: LGTM! Test mock servers properly updated for new wire format.

All three test cases correctly handle the new wire protocol that includes headers:

  1. Read 2-byte headers length
  2. Read headers payload
  3. Continue with existing payload reading

The consistent pattern across all tests ensures the protocol change is properly validated.

Also applies to: 741-750, 850-859


327-329: The change correctly propagates trace headers through TCP transport via headers.clone() on line 328.

The implementation is sound for current usage. Headers are a simple HashMap<String, String> that currently remain small (typically just x-endpoint-path and trace headers). The with_headers method requires ownership of the HashMap, making the clone at the call site necessary. While the clone happens on the hot path, the small header size (1-2 entries in current usage) makes this negligible. This remains efficient for the foreseeable future.
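As a sketch of what ends up in that small map, a hypothetical helper that selects only the trace-related headers to hand to `with_headers(...)` (the header names are taken from the PR description; the helper itself is not in the PR):

```rust
use std::collections::HashMap;

// Sketch: keep only the headers that should travel on the TCP wire
// alongside x-endpoint-path (W3C Trace Context plus a request-id convention).
fn trace_headers(all: &HashMap<String, String>) -> HashMap<String, String> {
    const KEEP: [&str; 3] = ["traceparent", "tracestate", "x-request-id"];
    all.iter()
        .filter(|(k, _)| KEEP.contains(&k.as_str()))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}
```

With only one to three entries ever present, cloning this map per request is cheap, which is the reviewer's point above.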



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)

662-675: Tests use outdated wire format - will fail with new header-aware protocol.

The mock server reads path_len → path → payload_len → payload, but the new wire format is path_len → path → headers_len → headers → payload_len → payload. The test will read headers_len bytes as payload_len, causing incorrect behavior or hangs.

This affects test_connection_health_check, test_concurrent_requests_single_connection, and test_connection_pool_reuse.

🔧 Proposed fix for test_connection_health_check mock server
             // Read request
             let mut len_buf = [0u8; 2];
             read_half.read_exact(&mut len_buf).await.unwrap();
             let path_len = u16::from_be_bytes(len_buf) as usize;

             let mut path_buf = vec![0u8; path_len];
             read_half.read_exact(&mut path_buf).await.unwrap();

+            // Read headers length and headers (new wire format)
+            let mut headers_len_buf = [0u8; 2];
+            read_half.read_exact(&mut headers_len_buf).await.unwrap();
+            let headers_len = u16::from_be_bytes(headers_len_buf) as usize;
+
+            let mut headers_buf = vec![0u8; headers_len];
+            read_half.read_exact(&mut headers_buf).await.unwrap();
+
             let mut len_buf = [0u8; 4];
             read_half.read_exact(&mut len_buf).await.unwrap();
             let payload_len = u32::from_be_bytes(len_buf) as usize;
🧹 Nitpick comments (2)
lib/runtime/src/pipeline/network/codec.rs (1)

571-582: Consider adding tests for header propagation.

The existing tests use TcpRequestMessage::new() which creates empty headers. Consider adding tests that verify:

  • with_headers() constructor works correctly
  • Non-empty headers encode/decode round-trip
  • Header size limit validation (> u16::MAX)
📝 Example test for headers
#[test]
fn test_tcp_request_with_headers() {
    let mut headers = std::collections::HashMap::new();
    headers.insert("traceparent".to_string(), "00-abc123-def456-01".to_string());
    headers.insert("x-request-id".to_string(), "req-123".to_string());
    
    let msg = TcpRequestMessage::with_headers(
        "test.endpoint".to_string(),
        headers,
        Bytes::from(vec![1, 2, 3]),
    );

    let encoded = msg.encode().unwrap();
    let decoded = TcpRequestMessage::decode(&encoded).unwrap();

    assert_eq!(decoded.endpoint_path, "test.endpoint");
    assert_eq!(decoded.headers.get("traceparent"), Some(&"00-abc123-def456-01".to_string()));
    assert_eq!(decoded.headers.get("x-request-id"), Some(&"req-123".to_string()));
    assert_eq!(decoded.payload, Bytes::from(vec![1, 2, 3]));
}
lib/runtime/src/logging.rs (1)

426-510: Consider consolidating header extractors to reduce duplication.

The three extraction functions (extract_otel_context_from_http_headers, extract_otel_context_from_tcp_headers, extract_otel_context_from_nats_headers) follow identical patterns. Consider a generic approach using the existing GenericHeaders trait.

♻️ Possible consolidation approach
fn extract_otel_context_from_headers<H: GenericHeaders>(
    headers: &H,
) -> (Option<opentelemetry::Context>, Option<String>, Option<String>) {
    let traceparent_value = match headers.get("traceparent") {
        Some(value) => value,
        None => return (None, None, None),
    };

    let (trace_id, parent_span_id) = parse_traceparent(traceparent_value);

    struct GenericExtractor<'a, H>(&'a H);

    impl<'a, H: GenericHeaders> Extractor for GenericExtractor<'a, H> {
        fn get(&self, key: &str) -> Option<&str> {
            self.0.get(key)
        }
        fn keys(&self) -> Vec<&str> {
            vec!["traceparent", "tracestate"]
                .into_iter()
                .filter(|&key| self.0.get(key).is_some())
                .collect()
        }
    }

    let extractor = GenericExtractor(headers);
    let propagator = opentelemetry_sdk::propagation::TraceContextPropagator::new();
    let otel_context = propagator.extract(&extractor);

    let context_with_trace = if otel_context.span().span_context().is_valid() {
        Some(otel_context)
    } else {
        None
    };

    (context_with_trace, trace_id, parent_span_id)
}

Note: This would require extending GenericHeaders to support HashMap<String, String> or creating a wrapper.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfb95df and dc074e0.

📒 Files selected for processing (4)
  • lib/runtime/src/logging.rs
  • lib/runtime/src/pipeline/network/codec.rs
  • lib/runtime/src/pipeline/network/egress/tcp_client.rs
  • lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

  • lib/runtime/src/pipeline/network/egress/tcp_client.rs
  • lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
🧬 Code graph analysis (2)
lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)
lib/runtime/src/pipeline/network/codec.rs (1)
  • with_headers (46-56)
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (1)
lib/runtime/src/logging.rs (1)
  • make_handle_payload_span_from_tcp_headers (382-424)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (.)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
🔇 Additional comments (14)
lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)

327-329: LGTM! Proper header propagation for distributed tracing.

The change correctly switches to with_headers to include trace headers (traceparent, tracestate, x-request-id, etc.) in the TCP wire protocol, enabling end-to-end trace continuity across frontend and worker services.

lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (3)

269-276: LGTM! Correct wire protocol extension for headers.

The header reading logic correctly follows the new wire format, reading headers_len (2 bytes) followed by the headers payload after the endpoint path.


305-312: LGTM! Correct capacity calculation and message reconstruction.

The capacity calculation properly accounts for all segments including the new headers field, and the message reconstruction correctly assembles the complete buffer for decoding.


377-389: LGTM! Dynamic span creation enables distributed tracing across TCP transport.

The span creation correctly extracts OTEL context from TCP headers and links it as the parent span, enabling trace continuity between frontend and worker services.

lib/runtime/src/pipeline/network/codec.rs (6)

21-35: LGTM! Clear wire format documentation and struct definition.

The wire format documentation is well-structured, and the headers field addition with HashMap<String, String> is appropriate for propagating trace headers.


37-56: LGTM! Backward-compatible API with new header support.

The new() constructor maintains backward compatibility by initializing an empty headers map, while with_headers() provides explicit header propagation when needed.


70-117: LGTM! Proper header encoding with size validation.

The JSON serialization approach is flexible and the u16::MAX size check ensures the headers fit within the wire format constraints. The capacity calculation and write order are correct.


149-212: LGTM! Robust header decoding with proper bounds checking.

The decode logic correctly reads headers_len, validates buffer bounds, and parses the JSON headers. Error messages are descriptive and offset tracking is accurate throughout.


241-322: LGTM! Codec Decoder correctly handles new wire format.

The peek logic, header size calculation, and decoding flow are all correctly updated to handle the headers segment in the wire protocol.


339-400: LGTM! Codec Encoder properly handles headers.

The encoder correctly serializes headers to JSON, validates size constraints, and writes them in the proper position within the wire format.

lib/runtime/src/logging.rs (4)

275-299: LGTM! HTTP request span now properly linked to parent trace context.

The make_request_span function correctly extracts OTEL context from HTTP headers and sets it as the parent context for the span, enabling trace continuity for incoming HTTP requests.


301-336: LGTM! Proper OTEL context extraction from HTTP headers.

The implementation correctly uses the W3C Trace Context propagator, validates the traceparent header, and ensures only valid span contexts are returned. The pattern aligns with the existing NATS header extraction.


381-424: LGTM! Proper span creation from TCP headers for distributed tracing.

The function correctly mirrors the NATS make_handle_payload_span pattern, extracting OTEL context and creating appropriately linked spans for TCP payload handling.


426-467: LGTM! TCP header OTEL context extraction follows established patterns.

The implementation correctly extracts W3C Trace Context from HashMap<String, String> headers, returning both the OTEL context and individual trace identifiers for logging purposes.

@nnshah1 nnshah1 requested a review from ishandhanani January 8, 2026 18:59
@jh-nv
Contributor Author

jh-nv commented Jan 8, 2026

@coderabbitai incremental review

@coderabbitai
Contributor

coderabbitai bot commented Jan 8, 2026

@jh-nv, I'll review the changes in this PR.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@jh-nv jh-nv merged commit ace35a8 into main Jan 9, 2026
37 of 38 checks passed
@jh-nv jh-nv deleted the jihao/tcp_tracing_DYN-1720 branch January 9, 2026 00:12
jh-nv added a commit that referenced this pull request Jan 9, 2026
Co-authored-by: Ishan Dhanani <ishandhanani@gmail.com>
nv-anants pushed a commit that referenced this pull request Jan 9, 2026
Co-authored-by: Ishan Dhanani <ishandhanani@gmail.com>
