feat: updates to structured logging #2061

nnshah1 · 2025-07-22T23:29:55Z

Overview:

Updates the JSONL log format to:

- include span start and close events
- inject the trace ID and span ID
- enable capturing the distributed tracing context
- report durations in microseconds

Updates the HTTP handlers to demonstrate the implementation.

Todo: Follow with instrumentation in frontend and request handling
Todo: Investigate if metrics can be collected easily on span creation

Example:

These instrumentations:

#[tracing::instrument(
    skip_all,
    fields(
        span_id = "abd16e319329445f",
        trace_id = "2adfd24468724599bb9a4990dc342288"
    )
)]
async fn parent() {
    tracing::Span::current().record("trace_id", "invalid");
    tracing::Span::current().record("span_id", "invalid");
    tracing::Span::current().record("span_name", "invalid");
    tracing::trace!(message = "parent!");
    if let Some(my_ctx) = get_distributed_tracing_context() {
        tracing::info!(my_trace_id = my_ctx.trace_id);
    }
    child().await;
}

#[tracing::instrument(skip_all)]
async fn child() {
    tracing::trace!(message = "child");
    if let Some(my_ctx) = get_distributed_tracing_context() {
        tracing::info!(my_trace_id = my_ctx.trace_id);
    }
    grandchild().await;
}

#[tracing::instrument(skip_all)]
async fn grandchild() {
    tracing::trace!(message = "grandchild");
    if let Some(my_ctx) = get_distributed_tracing_context() {
        tracing::info!(my_trace_id = my_ctx.trace_id);
    }
}

Generate these logs:

{"file":"lib/runtime/src/logging.rs","level":"INFO","line":727,"message":"SPAN_CREATED","span_id":"abd16e319329445f","span_name":"parent","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262300Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":740,"message":"","my_trace_id":"2adfd24468724599bb9a4990dc342288","span_id":"abd16e319329445f","span_name":"parent","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262362Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":745,"message":"SPAN_CREATED","parent_id":"abd16e319329445f","span_id":"ee39eaf09c814019","span_name":"child","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262408Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":749,"message":"","my_trace_id":"2adfd24468724599bb9a4990dc342288","parent_id":"abd16e319329445f","span_id":"ee39eaf09c814019","span_name":"child","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262420Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":754,"message":"SPAN_CREATED","parent_id":"ee39eaf09c814019","span_id":"659eaedf97104d1a","span_name":"grandchild","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262432Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":758,"message":"","my_trace_id":"2adfd24468724599bb9a4990dc342288","parent_id":"ee39eaf09c814019","span_id":"659eaedf97104d1a","span_name":"grandchild","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262441Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":754,"message":"SPAN_CLOSED","parent_id":"ee39eaf09c814019","span_id":"659eaedf97104d1a","span_name":"grandchild","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262451Z","time.busy_us":9,"time.duration_us":18,"time.idle_us":9,"trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":745,"message":"SPAN_CLOSED","parent_id":"abd16e319329445f","span_id":"ee39eaf09c814019","span_name":"child","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262915Z","time.busy_us":495,"time.duration_us":506,"time.idle_us":11,"trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":727,"message":"SPAN_CLOSED","span_id":"abd16e319329445f","span_name":"parent","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262932Z","time.busy_us":577,"time.duration_us":640,"time.idle_us":63,"trace_id":"2adfd24468724599bb9a4990dc342288"}

where log lines have the following schema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Runtime Log Line",
  "type": "object",
  "required": [
    "file",
    "level",
    "line",
    "message",
    "target",
    "time"
  ],
  "properties": {
    "file":      { "type": "string" },
    "level":     { "type": "string", "enum": ["ERROR", "WARN", "INFO", "DEBUG", "TRACE"] },
    "line":      { "type": "integer" },
    "message":   { "type": "string" },
    "target":    { "type": "string" },
    "time":      { "type": "string", "format": "date-time" },
    "span_id":   { "type": "string", "pattern": "^[a-f0-9]{16}$" },
    "parent_id": { "type": "string", "pattern": "^[a-f0-9]{16}$" },
    "trace_id":  { "type": "string", "pattern": "^[a-f0-9]{32}$" },
    "span_name": { "type": "string" },
    "time.busy_us":     { "type": "integer" },
    "time.duration_us": { "type": "integer" },
    "time.idle_us":     { "type": "integer" }
  },
  "additionalProperties": true
}
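The ID patterns in the schema can also be checked without a JSON Schema validator. The following is a minimal sketch of that check; the helper names (`is_lower_hex`, `is_valid_span_id`, `is_valid_trace_id`) are hypothetical and are not taken from the PR.

```rust
/// True when `s` is exactly `len` lowercase hexadecimal characters,
/// mirroring the schema patterns `^[a-f0-9]{16}$` and `^[a-f0-9]{32}$`.
/// Hypothetical helper for illustration only.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Matches the schema pattern for `span_id` and `parent_id`.
fn is_valid_span_id(s: &str) -> bool {
    is_lower_hex(s, 16)
}

/// Matches the schema pattern for `trace_id`.
fn is_valid_trace_id(s: &str) -> bool {
    is_lower_hex(s, 32)
}

fn main() {
    // IDs copied from the example log lines in the PR description.
    println!("{}", is_valid_trace_id("2adfd24468724599bb9a4990dc342288"));
    println!("{}", is_valid_span_id("abd16e319329445f"));
    // The literal "invalid" recorded in the example does not match either pattern.
    println!("{}", is_valid_trace_id("invalid"));
}
```

Because the schema sets `additionalProperties` to true, a check like this only constrains the known ID fields and says nothing about extra fields such as `my_trace_id`.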

Details:

Where should the reviewer start?

logging.rs and http_server.rs

Ignore others - will add as separate PR

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

New Features
- Integrated distributed tracing and context propagation throughout the HTTP server, including support for W3C-compliant trace and span IDs.
- Enhanced JSON log output with distributed tracing fields and improved formatting, including microsecond precision and structured data handling.
- Added public instance identifier to push endpoints for improved traceability.
Bug Fixes
- Improved trace context handling and propagation for health and metrics endpoints.
Tests
- Added comprehensive tests for distributed tracing, log schema validation, and HTTP endpoint traceability.
Chores
- Added new development dependencies for testing and validation.

copy-pr-bot · 2025-07-22T23:29:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-07-24T21:02:31Z

Walkthrough

This set of changes introduces distributed tracing and enhanced observability across several Rust modules, particularly within the HTTP server, logging, and pipeline network components. New tracing instrumentation, context propagation, and structured logging are implemented, along with development dependency additions and minor builder enhancements. Test coverage for tracing and logging is also expanded.

Changes

| File(s) | Change Summary |
| --- | --- |
| lib/bindings/python/rust/engine.rs | Added `#[tracing::instrument(skip_all)]` to the `generate` async method and instrumented spawned tasks for tracing context propagation in Python async generator stream handling. |
| lib/runtime/Cargo.toml | Added `stdio-override` and `jsonschema` as development dependencies. |
| lib/runtime/src/component/endpoint.rs | Appended `.instance_id(lease_id)` to the `PushEndpoint` builder chain in `EndpointConfigBuilder::start`. |
| lib/runtime/src/http_server.rs | Integrated distributed tracing context propagation into HTTP routes and handlers. Updated handler signatures to accept tracing parameters, instrumented fallback handler, and expanded tests to validate tracing and logging. |
| lib/runtime/src/logging.rs | Major enhancements: implemented distributed tracing support, W3C trace/span ID generation, `DistributedTraceIdLayer` for span context propagation, `TraceParent` extractor for HTTP, extended JSON log formatter with tracing fields and duration parsing, improved structured logging, and added comprehensive async tests for tracing and log schema validation. |
| lib/runtime/src/pipeline/network/ingress/push_endpoint.rs | Added public `instance_id: LeaseId` field to `PushEndpoint`. Updated tracing logs to use `instance_id` instead of the removed `worker_id` variable. |
| lib/runtime/src/pipeline/network/ingress/push_handler.rs | Added `#[tracing::instrument(skip_all)]` to the `handle_payload` async method for enhanced tracing instrumentation. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTPServer
    participant TraceParent
    participant Handler
    participant Logger

    Client->>HTTPServer: Sends HTTP request (with trace headers)
    HTTPServer->>TraceParent: Extracts trace context from headers
    HTTPServer->>Handler: Calls handler with trace context
    Handler->>Logger: Emits logs with trace/span IDs
    Handler-->>HTTPServer: Returns response
    HTTPServer-->>Client: Sends HTTP response

sequenceDiagram
    participant RustAsync
    participant Tracing
    participant Task

    RustAsync->>Tracing: Enters instrumented async function
    RustAsync->>Task: Spawns async/blocking task .in_current_span()
    Task->>Tracing: Executes within current tracing span

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

In the garden of logs where the trace-ids bloom,
Spans now wander from server to room.
With context and color, each hop leaves a mark,
As rabbits observe in the code’s glowing dark.
JSON lines sparkle, the network's in tune—
Distributed carrots, harvested soon!
🥕✨

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

lib/runtime/Cargo.toml (1)
73-74: Fix formatting inconsistency in dependency declarations.

The new dependencies lack spaces around = which is inconsistent with the rest of the file.
-stdio-override = {version= "0.2.0"}
-jsonschema = {version = "0.17"}
+stdio-override = { version = "0.2.0" }
+jsonschema = { version = "0.17" }
lib/runtime/src/http_server.rs (1)

420-479: Test implementation is incomplete.

The test sets up tracing but doesn't verify that trace IDs are actually propagated or logged correctly. The TODO comment on lines 439-440 acknowledges this gap.

Would you like me to help implement the trace ID verification logic? This could include:

Capturing and parsing the JSONL logs

Verifying trace_id and span_id fields are present and properly formatted

Checking parent-child span relationships

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c642fd and 093fb4e.

⛔ Files ignored due to path filters (2)

Cargo.lock is excluded by !**/*.lock
lib/bindings/python/Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (7)

lib/bindings/python/rust/engine.rs (4 hunks)
lib/runtime/Cargo.toml (1 hunks)
lib/runtime/src/component/endpoint.rs (1 hunks)
lib/runtime/src/http_server.rs (9 hunks)
lib/runtime/src/logging.rs (12 hunks)
lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (2 hunks)
lib/runtime/src/pipeline/network/ingress/push_handler.rs (1 hunks)

🧰 Additional context used

🧠 Learnings (7)

📓 Common learnings

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

lib/runtime/Cargo.toml (1)

Learnt from: ryanolson, PR: #1919 (the AsyncEngineContextProvider learning listed under Common learnings above)

lib/runtime/src/pipeline/network/ingress/push_handler.rs (1)

Learnt from: ryanolson, PR: #1919 (the AsyncEngineContextProvider learning listed under Common learnings above)

lib/bindings/python/rust/engine.rs (5)

Learnt from: ryanolson, PR: #1919 (the AsyncEngineContextProvider learning listed under Common learnings above)

Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.

Learnt from: PeaBrane
PR: #1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.

Learnt from: t-ob
PR: #1290
File: launch/dynamo-run/src/subprocess/sglang_inc.py:80-110
Timestamp: 2025-06-03T10:17:51.711Z
Learning: The sglang async_encode method does not support streaming options, so collecting all embeddings before yielding is the correct approach for embedding requests.

lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (2)

Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

lib/runtime/src/http_server.rs (3)

Learnt from: ryanolson, PR: #1919 (the AsyncEngineContextProvider learning listed under Common learnings above)

Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

lib/runtime/src/logging.rs (2)

Learnt from: ryanolson, PR: #1919 (the AsyncEngineContextProvider learning listed under Common learnings above)

Learnt from: kthui, PR: #1424 (the async-nats learning listed under lib/runtime/src/http_server.rs above)

🔇 Additional comments (15)

lib/runtime/src/logging.rs (2)

645-654: LGTM! Smart approach to preserve structured data.

The JSON deserialization attempt for non-message fields is a clever way to preserve structured data in logs while keeping the message field as a plain string.

793-985: Excellent comprehensive test coverage!

The test thoroughly validates distributed tracing functionality including JSON schema compliance, trace propagation, span relationships, and timing consistency.

lib/runtime/src/component/endpoint.rs (1)

106-110: LGTM! Proper instance identification for tracing.

The addition of instance_id(lease_id) correctly passes the lease ID to the PushEndpoint for use in distributed tracing logs.

lib/runtime/src/pipeline/network/ingress/push_handler.rs (1)

25-26: LGTM! Appropriate tracing instrumentation.

Good use of skip_all to avoid recording potentially large or sensitive payload data in traces.
lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1)

94-99: Fix field reference in tracing macros.

The instance_id field references are missing the self. prefix.
-                    tracing::trace!(self.instance_id, "handling new request");
+                    tracing::trace!(instance_id = self.instance_id, "handling new request");
                     let result = ingress.handle_payload(req.message.payload).await;
                     match result {
                         Ok(_) => {
-                            tracing::trace!(self.instance_id, "request handled successfully");
+                            tracing::trace!(instance_id = self.instance_id, "request handled successfully");
                         }
Likely an incorrect or invalid review comment.
lib/bindings/python/rust/engine.rs (4)

23-23: Import looks good.

The Instrument trait import is necessary for the .in_current_span() calls added to propagate tracing context.

152-152: Tracing instrumentation is correctly applied.

Using skip_all is appropriate here to avoid logging potentially large request objects.

187-187: Span propagation for blocking task is correct.

Properly propagates the tracing context into the blocking task that handles Python GIL operations.

266-267: Span propagation for async task is properly implemented.

Correctly propagates the tracing context into the spawned task that processes the Python async generator stream.

lib/runtime/src/http_server.rs (6)

17-17: Import additions are correct.

The TraceParent and Instrument imports are necessary for the distributed tracing implementation.

Also applies to: 29-29

126-127: Route handler updates are consistent.

Both health endpoints correctly pass the tracing context and route identifier for observability.

Also applies to: 133-134

140-140: Metrics handler update is correct.

Consistent with other handlers in accepting tracing context.

143-154: Fallback handler tracing implementation is well done.

Properly creates a traced span with all relevant identifiers from the TraceParent context.

190-195: Handler instrumentation is excellently implemented.

The tracing attribute properly skips the state while explicitly logging route and trace identifiers at the appropriate level.

226-231: Metrics handler instrumentation matches the established pattern.

Consistent implementation with health_handler for uniform observability.

lib/runtime/src/logging.rs

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Neelay Shah <[email protected]>

nnshah1 · 2025-07-24T23:27:42Z

@coderabbitai summary

coderabbitai · 2025-07-24T23:30:19Z

Tip

For best results, initiate chat on the files or code changes.

@nnshah1 Here's a regenerated summary of this PR:

Summary

This pull request introduces comprehensive distributed tracing and structured logging enhancements to improve observability across the Dynamo system. The changes implement W3C-compliant trace context propagation, JSONL log formatting with microsecond precision timing, and span lifecycle tracking.

Key Enhancements

Distributed Tracing Infrastructure

W3C Trace Context Support: Implements standard 32-character trace IDs and 16-character span IDs with validation
Context Propagation: New TraceParent extractor for HTTP requests to parse traceparent and x-request-id headers
Span Lifecycle Management: Automatic tracking of span creation, duration, and closure with microsecond precision
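To make the `traceparent` format concrete, here is a minimal sketch of parsing the W3C header layout `version-trace_id-parent_id-flags`. It is not the PR's `TraceParent` extractor; the `ParsedTraceParent` struct and `parse_traceparent` function are hypothetical names, and the sketch only handles the four-field layout.

```rust
/// Hypothetical result type for this sketch (not the PR's `TraceParent`).
#[derive(Debug, PartialEq)]
struct ParsedTraceParent {
    trace_id: String,
    parent_id: String,
    sampled: bool,
}

/// True when `s` is exactly `len` lowercase hexadecimal characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parse a `traceparent` value such as
/// `00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`.
/// Returns `None` for anything that is not well formed.
fn parse_traceparent(header: &str) -> Option<ParsedTraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    // This sketch accepts exactly four fields.
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    let well_formed = is_lower_hex(version, 2)
        && version != "ff" // version ff is reserved as invalid
        && is_lower_hex(trace_id, 32)
        && is_lower_hex(parent_id, 16)
        && is_lower_hex(flags, 2);
    // All-zero trace and parent IDs are invalid in the W3C format.
    let all_zero = |s: &str| s.bytes().all(|b| b == b'0');
    if !well_formed || all_zero(trace_id) || all_zero(parent_id) {
        return None;
    }
    let flag_bits = u8::from_str_radix(flags, 16).ok()?;
    Some(ParsedTraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        sampled: (flag_bits & 0x01) == 1, // lowest bit is the sampled flag
    })
}

fn main() {
    let header = "00-2adfd24468724599bb9a4990dc342288-abd16e319329445f-01";
    println!("{:?}", parse_traceparent(header));
}
```

Rejecting malformed headers with `None` lets a caller fall back to generating fresh IDs, which is one plausible way to handle requests that arrive without trace context.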

Structured Logging Improvements

Enhanced JSON Output: Logs now include trace_id, span_id, parent_id, and x_request_id fields
Timing Metrics: Duration fields converted to microseconds (time.busy_us, time.idle_us, time.duration_us)
Schema Compliance: JSON Schema draft-07 validation with comprehensive async test coverage
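The conversion to microseconds can be pictured with a small sketch. It assumes the upstream timing values arrive as text with a unit suffix (for example `18µs` or `1.5ms`); that input format and the `duration_to_micros` helper are assumptions for illustration, not the formatter code from the PR.

```rust
/// Convert a duration string with a unit suffix into whole microseconds.
/// Hypothetical helper; the accepted suffixes are an assumption of this sketch.
fn duration_to_micros(text: &str) -> Option<u64> {
    let text = text.trim();
    // Two-letter suffixes come first so "ms" is not mistaken for "s".
    let units: [(&str, f64); 5] = [
        ("ns", 0.001),
        ("\u{00B5}s", 1.0), // micro sign followed by "s"
        ("us", 1.0),
        ("ms", 1_000.0),
        ("s", 1_000_000.0),
    ];
    for (suffix, micros_per_unit) in units {
        if let Some(number) = text.strip_suffix(suffix) {
            let value: f64 = number.trim().parse().ok()?;
            if !value.is_finite() || value < 0.0 {
                return None;
            }
            // Round to the nearest microsecond so the result is an integer,
            // matching the integer type of the `*_us` fields in the schema.
            return Some((value * micros_per_unit).round() as u64);
        }
    }
    None
}

fn main() {
    for text in ["18\u{00B5}s", "1.5ms", "2s", "fast"] {
        println!("{text} -> {:?}", duration_to_micros(text));
    }
}
```

Emitting plain integers with a fixed unit means log consumers can compare and sum the timing fields without parsing unit suffixes themselves.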

HTTP Server Integration

Route Instrumentation: All HTTP handlers (/health, /live, /metrics) now accept trace context
Fallback Handler: Non-existent routes instrumented with tracing spans
Test Coverage: New integration tests validate tracing behavior across HTTP endpoints

Pipeline & Engine Instrumentation

Async Task Propagation: .in_current_span() applied to spawned tasks for context continuity
Python Bindings: Server streaming engine instrumented with #[tracing::instrument(skip_all)]
Push Endpoints: Instance ID tracking added to replace generic worker identifiers

Technical Implementation

The implementation centers around a new DistributedTraceIdLayer that automatically assigns or propagates trace identifiers across spans. The JSON log formatter has been significantly enhanced to normalize span lifecycle messages and parse tracing durations into standardized microsecond metrics.
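The assign-or-propagate behavior can be sketched as follows. This is not the `DistributedTraceIdLayer` code; `SpanIds`, `IdSource`, and `ids_for_new_span` are hypothetical names, and a counter stands in for the random ID generation a real implementation would presumably use.

```rust
/// IDs attached to one span (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct SpanIds {
    trace_id: String,
    span_id: String,
    parent_id: Option<String>,
}

/// Stand-in ID source: a counter instead of a random generator,
/// so the example stays deterministic and dependency-free.
struct IdSource {
    next: u64,
}

impl IdSource {
    /// 16 lowercase hex characters, the span ID width used in the logs.
    fn next_span_id(&mut self) -> String {
        self.next += 1;
        format!("{:016x}", self.next)
    }

    /// 32 lowercase hex characters, the trace ID width used in the logs.
    fn next_trace_id(&mut self) -> String {
        self.next += 1;
        format!("{:032x}", self.next as u128)
    }
}

/// A child span inherits the parent's trace ID and records the parent's
/// span ID as `parent_id`; a root span gets a fresh trace ID.
fn ids_for_new_span(parent: Option<&SpanIds>, ids: &mut IdSource) -> SpanIds {
    match parent {
        Some(p) => SpanIds {
            trace_id: p.trace_id.clone(),
            span_id: ids.next_span_id(),
            parent_id: Some(p.span_id.clone()),
        },
        None => SpanIds {
            trace_id: ids.next_trace_id(),
            span_id: ids.next_span_id(),
            parent_id: None,
        },
    }
}

fn main() {
    let mut ids = IdSource { next: 0 };
    let root = ids_for_new_span(None, &mut ids);
    let child = ids_for_new_span(Some(&root), &mut ids);
    println!("{root:?}");
    println!("{child:?}");
}
```

This matches the shape of the example logs in the PR description, where `child` and `grandchild` share the parent's `trace_id` and each carries its direct parent's `span_id` as `parent_id`.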

Testing & Validation

Comprehensive test suite added including:

Nested span validation with parent-child relationships
JSON schema compliance verification
Duration consistency checks between related spans
HTTP endpoint tracing integration tests
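As an illustration of what such checks might look like, the sketch below verifies two properties on the `SPAN_CLOSED` values from the PR description: that busy plus idle time equals the total duration, and that a child span does not last longer than its parent. The `ClosedSpan` type and `check_spans` function are hypothetical, and the exact assertions used in the PR's tests may differ.

```rust
use std::collections::HashMap;

/// Timing fields from one SPAN_CLOSED log line (all values in microseconds).
/// Hypothetical type for this sketch.
struct ClosedSpan {
    span_id: &'static str,
    parent_id: Option<&'static str>,
    busy_us: u64,
    idle_us: u64,
    duration_us: u64,
}

/// Returns a description of every violated check; empty means consistent.
fn check_spans(spans: &[ClosedSpan]) -> Vec<String> {
    let by_id: HashMap<&str, &ClosedSpan> = spans.iter().map(|s| (s.span_id, s)).collect();
    let mut problems = Vec::new();
    for span in spans {
        // Check 1: busy and idle time should add up to the total duration.
        if span.busy_us + span.idle_us != span.duration_us {
            problems.push(format!("{}: busy + idle != duration", span.span_id));
        }
        // Check 2: the parent must exist and last at least as long as the child.
        if let Some(parent_id) = span.parent_id {
            match by_id.get(parent_id) {
                None => problems.push(format!("{}: unknown parent {}", span.span_id, parent_id)),
                Some(parent) if parent.duration_us < span.duration_us => {
                    problems.push(format!("{}: outlives its parent", span.span_id))
                }
                Some(_) => {}
            }
        }
    }
    problems
}

/// Values copied from the SPAN_CLOSED lines in the PR description.
fn example_spans() -> Vec<ClosedSpan> {
    vec![
        ClosedSpan { span_id: "659eaedf97104d1a", parent_id: Some("ee39eaf09c814019"), busy_us: 9, idle_us: 9, duration_us: 18 },
        ClosedSpan { span_id: "ee39eaf09c814019", parent_id: Some("abd16e319329445f"), busy_us: 495, idle_us: 11, duration_us: 506 },
        ClosedSpan { span_id: "abd16e319329445f", parent_id: None, busy_us: 577, idle_us: 63, duration_us: 640 },
    ]
}

fn main() {
    let problems = check_spans(&example_spans());
    println!("problems found: {}", problems.len());
}
```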

This foundation enables future instrumentation of frontend components and request handling workflows, with potential for span creation metrics collection.

🧠 Learnings used

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

keivenchang · 2025-07-25T21:56:57Z

Thanks for adding this, looks good. How does one turn it on/off (via env var)? Also, it would be nice to add a test to check the log output (at the "// TODO Add proper testing for... " line), but since it's integration, it's not critical... maybe in the next PR.

nnshah1 · 2025-07-25T22:38:50Z

> Thanks for adding this, looks good. How does one turn it on/off (via env var)? Also, it would be nice to add a test to check the log output (at the "// TODO Add proper testing for... " line), but since it's integration, it's not critical... maybe in the next PR.

We have a comment in the logging file on how to enable it, but let's add that to the observability doc we're starting; we can add a section on logging.

There is a test in the logging module, but not yet in the HTTP service. I ran into issues with capturing the output, so I need to work that out. Manual inspection looks good, though.

lib/runtime/src/http_server.rs

Co-authored-by: Olga Andreeva <[email protected]> Signed-off-by: Neelay Shah <[email protected]>

lib/runtime/src/logging.rs

Co-authored-by: Olga Andreeva <[email protected]> Signed-off-by: Neelay Shah <[email protected]>

keivenchang

LGTM

nnshah1 added 11 commits July 8, 2025 16:11

updates

2d3a969

updates for testing

1be8e32

Merge remote-tracking branch 'origin/main' into neelays/structured_lo…

87e34f3

…gging

updted

1538c18

basic span id and trace id

dac8fa7

updated

0e28621

updating locally incremental

a1bfe2d

updated

38d5637

Merge branch 'main' into neelays/structured_logging

36b1d9d

incremental updates

78d32b9

Merge branch 'main' into neelays/structured_logging

74a6204

pull-request-size bot added the size/L label Jul 22, 2025

nnshah1 added 4 commits July 22, 2025 16:33

removing additional crates for now

80dbcf4

updating

9f42070

Merge branch 'main' into neelays/structured_logging

481fb67

updating - will add a few more tests

68eb52e

pull-request-size bot added size/XL and removed size/L labels Jul 24, 2025

nnshah1 added 2 commits July 24, 2025 13:41

updating with instrumentation in http server to show

cf29cb0

updates for clippy

093fb4e

nnshah1 changed the title ~~Neelays/structured logging~~ feat: updates to structured logging Jul 24, 2025

github-actions bot added the feat label Jul 24, 2025

nnshah1 marked this pull request as ready for review July 24, 2025 20:58

nnshah1 requested a review from a team as a code owner July 24, 2025 20:58

coderabbitai bot reviewed Jul 24, 2025

nnshah1 and others added 3 commits July 24, 2025 15:45

reverting instrumentation will add back in seperate PR

bb3f197

merge with main lock

2870637

Update lib/runtime/src/logging.rs

79728c1

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Neelay Shah <[email protected]>

nnshah1 added 6 commits July 24, 2025 16:30

updates

003174a

updated

1dc7a76

updating to remove unused imports

3f571de

updated to condense imports

ba40fe5

updated with fewer changes to imports

664bb18

change whitespace

9ca914b

nnshah1 enabled auto-merge (squash) July 25, 2025 23:29