fix: distributed tracing propagation for TCP transport #5283
Conversation
This PR fixes distributed tracing by:

1. HTTP frontend: fix parent context linking for incoming requests
   - Add `extract_otel_context_from_http_headers()` to extract the OTEL context
   - Update `make_request_span()` to call `span.set_parent(context)`
2. TCP transport: fix trace header propagation (the root cause of the broken tracing)
   - Add a `headers` field to the `TcpRequestMessage` wire protocol
   - Include trace headers (`traceparent`, `tracestate`, etc.) in TCP messages
   - Add `make_handle_payload_span_from_tcp_headers()` for worker-side span creation

The TCP transport was silently dropping all trace headers, using only `x-endpoint-path` for routing. This caused frontend and worker spans to be disconnected. The NATS transport worked correctly due to its native header support.
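The extended wire format can be sketched end to end in std-only Rust. Names like `encode_frame`/`decode_frame` are illustrative, not the PR's actual API: the real codec lives in `lib/runtime/src/pipeline/network/codec.rs` and JSON-encodes a `HashMap<String, String>`; here the headers bytes are treated as opaque.

```rust
// Illustrative sketch of the extended TCP frame layout (assumed from the PR):
//   path_len (u16 BE) | path | headers_len (u16 BE) | headers | payload_len (u32 BE) | payload
fn encode_frame(path: &str, headers_json: &[u8], payload: &[u8]) -> Result<Vec<u8>, String> {
    if path.len() > u16::MAX as usize {
        return Err("endpoint path too long".into());
    }
    if headers_json.len() > u16::MAX as usize {
        return Err("headers exceed u16::MAX bytes".into());
    }
    let mut buf = Vec::with_capacity(2 + path.len() + 2 + headers_json.len() + 4 + payload.len());
    buf.extend_from_slice(&(path.len() as u16).to_be_bytes());
    buf.extend_from_slice(path.as_bytes());
    buf.extend_from_slice(&(headers_json.len() as u16).to_be_bytes());
    buf.extend_from_slice(headers_json);
    buf.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    buf.extend_from_slice(payload);
    Ok(buf)
}

fn decode_frame(buf: &[u8]) -> Result<(String, Vec<u8>, Vec<u8>), String> {
    // Bounds-checked cursor over the frame; errors instead of panicking on truncation.
    fn take(buf: &[u8], off: &mut usize, n: usize) -> Result<Vec<u8>, String> {
        let end = *off + n;
        if end > buf.len() {
            return Err("truncated frame".into());
        }
        let out = buf[*off..end].to_vec();
        *off = end;
        Ok(out)
    }
    let mut off = 0usize;
    let path_len = u16::from_be_bytes(take(buf, &mut off, 2)?.try_into().unwrap()) as usize;
    let path = String::from_utf8(take(buf, &mut off, path_len)?).map_err(|e| e.to_string())?;
    let headers_len = u16::from_be_bytes(take(buf, &mut off, 2)?.try_into().unwrap()) as usize;
    let headers = take(buf, &mut off, headers_len)?;
    let payload_len = u32::from_be_bytes(take(buf, &mut off, 4)?.try_into().unwrap()) as usize;
    let payload = take(buf, &mut off, payload_len)?;
    Ok((path, headers, payload))
}

fn main() {
    let frame = encode_frame("test.endpoint", br#"{"traceparent":"00-abc-def-01"}"#, b"hi").unwrap();
    let (path, headers_json, payload) = decode_frame(&frame).unwrap();
    println!("path={path} headers={} payload_len={}", String::from_utf8_lossy(&headers_json), payload.len());
}
```

The old format was the same minus the `headers_len`/`headers` segment, which is why the existing mock-server tests misread the new frames.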
Walkthrough: Adds OpenTelemetry context extraction and propagation across HTTP, TCP, and NATS; extends the TCP wire protocol to carry JSON-encoded headers; and uses the extracted context to create and attach parent spans for request handling.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
📜 Recent review details: Configuration used: Path: .coderabbit.yaml · Review profile: CHILL · Plan: Pro
📒 Files selected for processing (2)
🧰 Additional context used: 🧠 Learnings (1) · 📚 Learning: 2025-09-11T03:24:47.820Z
🧬 Code graph analysis (1): lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
🔇 Additional comments (7)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)
662-675: Tests use the outdated wire format and will fail with the new header-aware protocol.

The mock server reads `path_len → path → payload_len → payload`, but the new wire format is `path_len → path → headers_len → headers → payload_len → payload`. The test will read `headers_len` bytes as `payload_len`, causing incorrect behavior or hangs.

This affects `test_connection_health_check`, `test_concurrent_requests_single_connection`, and `test_connection_pool_reuse`.

🔧 Proposed fix for the `test_connection_health_check` mock server
```diff
     // Read request
     let mut len_buf = [0u8; 2];
     read_half.read_exact(&mut len_buf).await.unwrap();
     let path_len = u16::from_be_bytes(len_buf) as usize;

     let mut path_buf = vec![0u8; path_len];
     read_half.read_exact(&mut path_buf).await.unwrap();

+    // Read headers length and headers (new wire format)
+    let mut headers_len_buf = [0u8; 2];
+    read_half.read_exact(&mut headers_len_buf).await.unwrap();
+    let headers_len = u16::from_be_bytes(headers_len_buf) as usize;
+
+    let mut headers_buf = vec![0u8; headers_len];
+    read_half.read_exact(&mut headers_buf).await.unwrap();
+
     let mut len_buf = [0u8; 4];
     read_half.read_exact(&mut len_buf).await.unwrap();
     let payload_len = u32::from_be_bytes(len_buf) as usize;
```
🧹 Nitpick comments (2)
lib/runtime/src/pipeline/network/codec.rs (1)
571-582: Consider adding tests for header propagation.

The existing tests use `TcpRequestMessage::new()`, which creates empty headers. Consider adding tests that verify:

- The `with_headers()` constructor works correctly
- Non-empty headers encode/decode round-trip
- Header size limit validation (> `u16::MAX`)
📝 Example test for headers
```rust
#[test]
fn test_tcp_request_with_headers() {
    let mut headers = std::collections::HashMap::new();
    headers.insert("traceparent".to_string(), "00-abc123-def456-01".to_string());
    headers.insert("x-request-id".to_string(), "req-123".to_string());

    let msg = TcpRequestMessage::with_headers(
        "test.endpoint".to_string(),
        headers,
        Bytes::from(vec![1, 2, 3]),
    );

    let encoded = msg.encode().unwrap();
    let decoded = TcpRequestMessage::decode(&encoded).unwrap();

    assert_eq!(decoded.endpoint_path, "test.endpoint");
    assert_eq!(decoded.headers.get("traceparent"), Some(&"00-abc123-def456-01".to_string()));
    assert_eq!(decoded.headers.get("x-request-id"), Some(&"req-123".to_string()));
    assert_eq!(decoded.payload, Bytes::from(vec![1, 2, 3]));
}
```

lib/runtime/src/logging.rs (1)
426-510: Consider consolidating header extractors to reduce duplication.

The three extraction functions (`extract_otel_context_from_http_headers`, `extract_otel_context_from_tcp_headers`, `extract_otel_context_from_nats_headers`) follow identical patterns. Consider a generic approach using the existing `GenericHeaders` trait.

♻️ Possible consolidation approach
```rust
fn extract_otel_context_from_headers<H: GenericHeaders>(
    headers: &H,
) -> (Option<opentelemetry::Context>, Option<String>, Option<String>) {
    let traceparent_value = match headers.get("traceparent") {
        Some(value) => value,
        None => return (None, None, None),
    };

    let (trace_id, parent_span_id) = parse_traceparent(traceparent_value);

    struct GenericExtractor<'a, H>(&'a H);
    impl<'a, H: GenericHeaders> Extractor for GenericExtractor<'a, H> {
        fn get(&self, key: &str) -> Option<&str> {
            self.0.get(key)
        }
        fn keys(&self) -> Vec<&str> {
            vec!["traceparent", "tracestate"]
                .into_iter()
                .filter(|&key| self.0.get(key).is_some())
                .collect()
        }
    }

    let extractor = GenericExtractor(headers);
    let propagator = opentelemetry_sdk::propagation::TraceContextPropagator::new();
    let otel_context = propagator.extract(&extractor);

    let context_with_trace = if otel_context.span().span_context().is_valid() {
        Some(otel_context)
    } else {
        None
    };

    (context_with_trace, trace_id, parent_span_id)
}
```

Note: This would require extending `GenericHeaders` to support `HashMap<String, String>` or creating a wrapper.
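That extension is small. As a hedged sketch (the actual `GenericHeaders` trait lives in `lib/runtime/src/logging.rs`, and its exact shape is assumed here), a direct impl for `HashMap<String, String>` would let the TCP transport's decoded headers feed the same generic extractor as the HTTP and NATS header types:

```rust
use std::collections::HashMap;

// Assumed shape of the GenericHeaders trait; the real definition in
// logging.rs may differ in method names or signatures.
trait GenericHeaders {
    fn get(&self, key: &str) -> Option<&str>;
}

// Lets HashMap<String, String> (the TCP wire headers) satisfy the trait.
impl GenericHeaders for HashMap<String, String> {
    fn get(&self, key: &str) -> Option<&str> {
        // Fully qualified call to avoid recursing into the trait method.
        HashMap::get(self, key).map(String::as_str)
    }
}

fn main() {
    let mut headers = HashMap::new();
    headers.insert("traceparent".to_string(), "00-abc-def-01".to_string());

    // Dispatch through the trait, as a generic extractor would.
    let h: &dyn GenericHeaders = &headers;
    println!("{:?}", h.get("traceparent"));
}
```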
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- lib/runtime/src/logging.rs
- lib/runtime/src/pipeline/network/codec.rs
- lib/runtime/src/pipeline/network/egress/tcp_client.rs
- lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.
Applied to files:
- lib/runtime/src/pipeline/network/egress/tcp_client.rs
- lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
🧬 Code graph analysis (2)
lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)
lib/runtime/src/pipeline/network/codec.rs (1)
with_headers(46-56)
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (1)
lib/runtime/src/logging.rs (1)
make_handle_payload_span_from_tcp_headers(382-424)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: Build and Test - dynamo
- GitHub Check: tests (.)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (.)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (launch/dynamo-run)
🔇 Additional comments (14)
lib/runtime/src/pipeline/network/egress/tcp_client.rs (1)
327-329: LGTM! Proper header propagation for distributed tracing. The change correctly switches to `with_headers` to include trace headers (traceparent, tracestate, x-request-id, etc.) in the TCP wire protocol, enabling end-to-end trace continuity across frontend and worker services.

lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (3)
269-276: LGTM! Correct wire protocol extension for headers. The header reading logic correctly follows the new wire format, reading `headers_len` (2 bytes) followed by the headers payload after the endpoint path.
305-312: LGTM! Correct capacity calculation and message reconstruction. The capacity calculation properly accounts for all segments, including the new headers field, and the message reconstruction correctly assembles the complete buffer for decoding.
377-389: LGTM! Dynamic span creation enables distributed tracing across the TCP transport. The span creation correctly extracts OTEL context from TCP headers and links it as the parent span, enabling trace continuity between frontend and worker services.
lib/runtime/src/pipeline/network/codec.rs (6)
21-35: LGTM! Clear wire format documentation and struct definition. The wire format documentation is well-structured, and the `headers` field addition with `HashMap<String, String>` is appropriate for propagating trace headers.
37-56: LGTM! Backward-compatible API with new header support. The `new()` constructor maintains backward compatibility by initializing an empty headers map, while `with_headers()` provides explicit header propagation when needed.
70-117: LGTM! Proper header encoding with size validation. The JSON serialization approach is flexible, and the `u16::MAX` size check ensures the headers fit within the wire format constraints. The capacity calculation and write order are correct.
149-212: LGTM! Robust header decoding with proper bounds checking. The decode logic correctly reads `headers_len`, validates buffer bounds, and parses the JSON headers. Error messages are descriptive and offset tracking is accurate throughout.
241-322: LGTM! Codec Decoder correctly handles the new wire format. The peek logic, header size calculation, and decoding flow are all correctly updated to handle the headers segment in the wire protocol.
339-400: LGTM! Codec Encoder properly handles headers. The encoder correctly serializes headers to JSON, validates size constraints, and writes them in the proper position within the wire format.
lib/runtime/src/logging.rs (4)
275-299: LGTM! HTTP request span now properly linked to the parent trace context. The `make_request_span` function correctly extracts OTEL context from HTTP headers and sets it as the parent context for the span, enabling trace continuity for incoming HTTP requests.
301-336: LGTM! Proper OTEL context extraction from HTTP headers. The implementation correctly uses the W3C Trace Context propagator, validates the traceparent header, and ensures only valid span contexts are returned. The pattern aligns with the existing NATS header extraction.
381-424: LGTM! Proper span creation from TCP headers for distributed tracing. The function correctly mirrors the NATS `make_handle_payload_span` pattern, extracting OTEL context and creating appropriately linked spans for TCP payload handling.
426-467: LGTM! TCP header OTEL context extraction follows established patterns. The implementation correctly extracts W3C Trace Context from `HashMap<String, String>` headers, returning both the OTEL context and individual trace identifiers for logging purposes.
@coderabbitai incremental review |
@jh-nv, I'll review the changes in this PR.

✅ Actions performed: Review triggered.
Co-authored-by: Ishan Dhanani <ishandhanani@gmail.com>
Overview:
Propagate OTEL tracing context for TCP transport.
Details:
The TCP transport was silently dropping all trace headers. Only x-endpoint-path was being used for routing, but traceparent, tracestate, x-request-id, etc. were never included in the wire protocol. This caused frontend and worker spans to appear disconnected in Tempo/Grafana.
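The `traceparent` header that was being dropped follows the W3C Trace Context format (`version-trace_id-parent_id-flags`). A hedged, std-only sketch of parsing it into the `(trace_id, parent_span_id)` pair that the review mentions `parse_traceparent` returning (the actual function in `lib/runtime/src/logging.rs` may differ):

```rust
// Sketch of W3C traceparent parsing, e.g.:
//   00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
// Returns (trace_id, parent_span_id), or (None, None) on malformed input.
fn parse_traceparent(value: &str) -> (Option<String>, Option<String>) {
    let parts: Vec<&str> = value.split('-').collect();
    if parts.len() != 4 {
        return (None, None);
    }
    let is_hex = |s: &str, len: usize| s.len() == len && s.chars().all(|c| c.is_ascii_hexdigit());
    // version(2) - trace_id(32) - parent_id(16) - flags(2), all lowercase hex per spec
    if !is_hex(parts[0], 2) || !is_hex(parts[1], 32) || !is_hex(parts[2], 16) || !is_hex(parts[3], 2) {
        return (None, None);
    }
    // All-zero trace-id or parent-id is invalid per the W3C spec.
    if parts[1].chars().all(|c| c == '0') || parts[2].chars().all(|c| c == '0') {
        return (None, None);
    }
    (Some(parts[1].to_string()), Some(parts[2].to_string()))
}

fn main() {
    let tp = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01";
    println!("{:?}", parse_traceparent(tp));
}
```

Once the TCP wire protocol carries this header, the worker side can rebuild the parent span context instead of starting a fresh, disconnected trace.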
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit