@nnshah1 nnshah1 commented Jul 22, 2025

Overview:

Updates the JSONL log format to:

  1. emit span start and close events
  2. inject the trace id and span id
  3. enable capturing the tracing context for distribution
  4. report durations in microseconds
  5. update the HTTP handlers to demonstrate the implementation

Todo: Follow with instrumentation in frontend and request handling
Todo: Investigate if metrics can be collected easily on span creation

Example:

These instrumentations:

    #[tracing::instrument(
        skip_all,
        fields(
            span_id = "abd16e319329445f",
            trace_id = "2adfd24468724599bb9a4990dc342288"
        )
    )]
    async fn parent() {
        // Attempt to overwrite the injected fields; the logs below still
        // show the original trace_id, span_id, and span_name.
        tracing::Span::current().record("trace_id", "invalid");
        tracing::Span::current().record("span_id", "invalid");
        tracing::Span::current().record("span_name", "invalid");
        tracing::trace!(message = "parent!");
        if let Some(my_ctx) = get_distributed_tracing_context() {
            tracing::info!(my_trace_id = my_ctx.trace_id);
        }
        child().await;
    }

    #[tracing::instrument(skip_all)]
    async fn child() {
        tracing::trace!(message = "child");
        if let Some(my_ctx) = get_distributed_tracing_context() {
            tracing::info!(my_trace_id = my_ctx.trace_id);
        }
        grandchild().await;
    }

    #[tracing::instrument(skip_all)]
    async fn grandchild() {
        tracing::trace!(message = "grandchild");
        if let Some(my_ctx) = get_distributed_tracing_context() {
            tracing::info!(my_trace_id = my_ctx.trace_id);
        }
    }

Generate these logs:

{"file":"lib/runtime/src/logging.rs","level":"INFO","line":727,"message":"SPAN_CREATED","span_id":"abd16e319329445f","span_name":"parent","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262300Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":740,"message":"","my_trace_id":"2adfd24468724599bb9a4990dc342288","span_id":"abd16e319329445f","span_name":"parent","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262362Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":745,"message":"SPAN_CREATED","parent_id":"abd16e319329445f","span_id":"ee39eaf09c814019","span_name":"child","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262408Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":749,"message":"","my_trace_id":"2adfd24468724599bb9a4990dc342288","parent_id":"abd16e319329445f","span_id":"ee39eaf09c814019","span_name":"child","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262420Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":754,"message":"SPAN_CREATED","parent_id":"ee39eaf09c814019","span_id":"659eaedf97104d1a","span_name":"grandchild","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262432Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":758,"message":"","my_trace_id":"2adfd24468724599bb9a4990dc342288","parent_id":"ee39eaf09c814019","span_id":"659eaedf97104d1a","span_name":"grandchild","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262441Z","trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":754,"message":"SPAN_CLOSED","parent_id":"ee39eaf09c814019","span_id":"659eaedf97104d1a","span_name":"grandchild","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262451Z","time.busy_us":9,"time.duration_us":18,"time.idle_us":9,"trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":745,"message":"SPAN_CLOSED","parent_id":"abd16e319329445f","span_id":"ee39eaf09c814019","span_name":"child","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262915Z","time.busy_us":495,"time.duration_us":506,"time.idle_us":11,"trace_id":"2adfd24468724599bb9a4990dc342288"}
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":727,"message":"SPAN_CLOSED","span_id":"abd16e319329445f","span_name":"parent","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262932Z","time.busy_us":577,"time.duration_us":640,"time.idle_us":63,"trace_id":"2adfd24468724599bb9a4990dc342288"}

where log lines have the following schema:

    {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "title": "Runtime Log Line",
      "type": "object",
      "required": [
        "file",
        "level",
        "line",
        "message",
        "target",
        "time"
      ],
      "properties": {
        "file":      { "type": "string" },
        "level":     { "type": "string", "enum": ["ERROR", "WARN", "INFO", "DEBUG", "TRACE"] },
        "line":      { "type": "integer" },
        "message":   { "type": "string" },
        "target":    { "type": "string" },
        "time":      { "type": "string", "format": "date-time" },
        "span_id":   { "type": "string", "pattern": "^[a-f0-9]{16}$" },
        "parent_id": { "type": "string", "pattern": "^[a-f0-9]{16}$" },
        "trace_id":  { "type": "string", "pattern": "^[a-f0-9]{32}$" },
        "span_name": { "type": "string" },
        "time.busy_us":     { "type": "integer" },
        "time.duration_us": { "type": "integer" },
        "time.idle_us":     { "type": "integer" }
      },
      "additionalProperties": true
    }
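
As a rough illustration of what the `span_id`, `parent_id`, and `trace_id` patterns in the schema accept, the following std-only sketch checks an ID against the same rule (a fixed number of lowercase hexadecimal characters). The helper names are hypothetical and are not taken from `logging.rs`; the PR itself validates log lines with the `jsonschema` crate.

```rust
/// Returns true if `id` is exactly `len` lowercase hexadecimal characters,
/// mirroring the `^[a-f0-9]{N}$` patterns in the schema.
fn is_lower_hex(id: &str, len: usize) -> bool {
    id.len() == len && id.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Hypothetical helper: 16 hex characters, as required for `span_id` and `parent_id`.
fn is_valid_span_id(id: &str) -> bool {
    is_lower_hex(id, 16)
}

/// Hypothetical helper: 32 hex characters, as required for `trace_id`.
fn is_valid_trace_id(id: &str) -> bool {
    is_lower_hex(id, 32)
}

fn main() {
    // IDs copied from the example log lines.
    assert!(is_valid_span_id("abd16e319329445f"));
    assert!(is_valid_trace_id("2adfd24468724599bb9a4990dc342288"));
    // The value the example tries to record does not match the pattern.
    assert!(!is_valid_span_id("invalid"));
    println!("all ids match the schema patterns");
}
```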

Details:

Where should the reviewer start?

logging.rs and http_server.rs

Ignore others - will add as separate PR

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Integrated distributed tracing and context propagation throughout the HTTP server, including support for W3C-compliant trace and span IDs.
    • Enhanced JSON log output with distributed tracing fields and improved formatting, including microsecond precision and structured data handling.
    • Added public instance identifier to push endpoints for improved traceability.
  • Bug Fixes

    • Improved trace context handling and propagation for health and metrics endpoints.
  • Tests

    • Added comprehensive tests for distributed tracing, log schema validation, and HTTP endpoint traceability.
  • Chores

    • Added new development dependencies for testing and validation.

copy-pr-bot bot commented Jul 22, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nnshah1 nnshah1 changed the title from "Neelays/structured logging" to "feat: updates to structured logging" Jul 24, 2025
@github-actions github-actions bot added the feat label Jul 24, 2025
@nnshah1 nnshah1 marked this pull request as ready for review July 24, 2025 20:58
@nnshah1 nnshah1 requested a review from a team as a code owner July 24, 2025 20:58
coderabbitai bot commented Jul 24, 2025

Walkthrough

This set of changes introduces distributed tracing and enhanced observability across several Rust modules, particularly within the HTTP server, logging, and pipeline network components. New tracing instrumentation, context propagation, and structured logging are implemented, along with development dependency additions and minor builder enhancements. Test coverage for tracing and logging is also expanded.

Changes

File(s) | Change Summary
--- | ---
lib/bindings/python/rust/engine.rs | Added #[tracing::instrument(skip_all)] to the generate async method and instrumented spawned tasks for tracing context propagation in Python async generator stream handling.
lib/runtime/Cargo.toml | Added stdio-override and jsonschema as development dependencies.
lib/runtime/src/component/endpoint.rs | Appended .instance_id(lease_id) to the PushEndpoint builder chain in EndpointConfigBuilder::start.
lib/runtime/src/http_server.rs | Integrated distributed tracing context propagation into HTTP routes and handlers. Updated handler signatures to accept tracing parameters, instrumented fallback handler, and expanded tests to validate tracing and logging.
lib/runtime/src/logging.rs | Major enhancements: implemented distributed tracing support, W3C trace/span ID generation, DistributedTraceIdLayer for span context propagation, TraceParent extractor for HTTP, extended JSON log formatter with tracing fields and duration parsing, improved structured logging, and added comprehensive async tests for tracing and log schema validation.
lib/runtime/src/pipeline/network/ingress/push_endpoint.rs | Added public instance_id: LeaseId field to PushEndpoint. Updated tracing logs to use instance_id instead of the removed worker_id variable.
lib/runtime/src/pipeline/network/ingress/push_handler.rs | Added #[tracing::instrument(skip_all)] to the handle_payload async method for enhanced tracing instrumentation.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTPServer
    participant TraceParent
    participant Handler
    participant Logger

    Client->>HTTPServer: Sends HTTP request (with trace headers)
    HTTPServer->>TraceParent: Extracts trace context from headers
    HTTPServer->>Handler: Calls handler with trace context
    Handler->>Logger: Emits logs with trace/span IDs
    Handler-->>HTTPServer: Returns response
    HTTPServer-->>Client: Sends HTTP response
sequenceDiagram
    participant RustAsync
    participant Tracing
    participant Task

    RustAsync->>Tracing: Enters instrumented async function
    RustAsync->>Task: Spawns async/blocking task .in_current_span()
    Task->>Tracing: Executes within current tracing span

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

In the garden of logs where the trace-ids bloom,
Spans now wander from server to room.
With context and color, each hop leaves a mark,
As rabbits observe in the code’s glowing dark.
JSON lines sparkle, the network's in tune—
Distributed carrots, harvested soon!
🥕✨

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
lib/runtime/Cargo.toml (1)

73-74: Fix formatting inconsistency in dependency declarations.

The new dependencies lack spaces around = which is inconsistent with the rest of the file.

-stdio-override = {version= "0.2.0"}
-jsonschema = {version = "0.17"}
+stdio-override = { version = "0.2.0" }
+jsonschema = { version = "0.17" }
lib/runtime/src/http_server.rs (1)

420-479: Test implementation is incomplete.

The test sets up tracing but doesn't verify that trace IDs are actually propagated or logged correctly. The TODO comment on lines 439-440 acknowledges this gap.

Would you like me to help implement the trace ID verification logic? This could include:

  1. Capturing and parsing the JSONL logs
  2. Verifying trace_id and span_id fields are present and properly formatted
  3. Checking parent-child span relationships
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c642fd and 093fb4e.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • lib/bindings/python/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • lib/bindings/python/rust/engine.rs (4 hunks)
  • lib/runtime/Cargo.toml (1 hunks)
  • lib/runtime/src/component/endpoint.rs (1 hunks)
  • lib/runtime/src/http_server.rs (9 hunks)
  • lib/runtime/src/logging.rs (12 hunks)
  • lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (2 hunks)
  • lib/runtime/src/pipeline/network/ingress/push_handler.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
lib/bindings/python/rust/engine.rs (5)

Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.

Learnt from: PeaBrane
PR: #1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.

Learnt from: t-ob
PR: #1290
File: launch/dynamo-run/src/subprocess/sglang_inc.py:80-110
Timestamp: 2025-06-03T10:17:51.711Z
Learning: The sglang async_encode method does not support streaming options, so collecting all embeddings before yielding is the correct approach for embedding requests.

lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (2)

Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

lib/runtime/src/http_server.rs (3)

Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

🔇 Additional comments (15)
lib/runtime/src/logging.rs (2)

645-654: LGTM! Smart approach to preserve structured data.

The JSON deserialization attempt for non-message fields is a clever way to preserve structured data in logs while keeping the message field as a plain string.


793-985: Excellent comprehensive test coverage!

The test thoroughly validates distributed tracing functionality including JSON schema compliance, trace propagation, span relationships, and timing consistency.

lib/runtime/src/component/endpoint.rs (1)

106-110: LGTM! Proper instance identification for tracing.

The addition of instance_id(lease_id) correctly passes the lease ID to the PushEndpoint for use in distributed tracing logs.

lib/runtime/src/pipeline/network/ingress/push_handler.rs (1)

25-26: LGTM! Appropriate tracing instrumentation.

Good use of skip_all to avoid recording potentially large or sensitive payload data in traces.

lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1)

94-99: Fix field reference in tracing macros.

The instance_id field references are missing the self. prefix.

-                    tracing::trace!(self.instance_id, "handling new request");
+                    tracing::trace!(instance_id = self.instance_id, "handling new request");
                     let result = ingress.handle_payload(req.message.payload).await;
                     match result {
                         Ok(_) => {
-                            tracing::trace!(self.instance_id, "request handled successfully");
+                            tracing::trace!(instance_id = self.instance_id, "request handled successfully");
                         }

Likely an incorrect or invalid review comment.

lib/bindings/python/rust/engine.rs (4)

23-23: Import looks good.

The Instrument trait import is necessary for the .in_current_span() calls added to propagate tracing context.


152-152: Tracing instrumentation is correctly applied.

Using skip_all is appropriate here to avoid logging potentially large request objects.


187-187: Span propagation for blocking task is correct.

Properly propagates the tracing context into the blocking task that handles Python GIL operations.


266-267: Span propagation for async task is properly implemented.

Correctly propagates the tracing context into the spawned task that processes the Python async generator stream.

lib/runtime/src/http_server.rs (6)

17-17: Import additions are correct.

The TraceParent and Instrument imports are necessary for the distributed tracing implementation.

Also applies to: 29-29


126-127: Route handler updates are consistent.

Both health endpoints correctly pass the tracing context and route identifier for observability.

Also applies to: 133-134


140-140: Metrics handler update is correct.

Consistent with other handlers in accepting tracing context.


143-154: Fallback handler tracing implementation is well done.

Properly creates a traced span with all relevant identifiers from the TraceParent context.


190-195: Handler instrumentation is excellently implemented.

The tracing attribute properly skips the state while explicitly logging route and trace identifiers at the appropriate level.


226-231: Metrics handler instrumentation matches the established pattern.

Consistent implementation with health_handler for uniform observability.

nnshah1 and others added 3 commits July 24, 2025 15:45
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Neelay Shah <[email protected]>
nnshah1 commented Jul 24, 2025

@coderabbitai summary

coderabbitai bot commented Jul 24, 2025

@nnshah1 Here's a regenerated summary of this PR:

Summary

This pull request introduces comprehensive distributed tracing and structured logging enhancements to improve observability across the Dynamo system. The changes implement W3C-compliant trace context propagation, JSONL log formatting with microsecond precision timing, and span lifecycle tracking.

Key Enhancements

Distributed Tracing Infrastructure

  • W3C Trace Context Support: Implements standard 32-character trace IDs and 16-character span IDs with validation
  • Context Propagation: New TraceParent extractor for HTTP requests to parse traceparent and x-request-id headers
  • Span Lifecycle Management: Automatic tracking of span creation, duration, and closure with microsecond precision
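
To make the header format concrete, here is a minimal std-only sketch of parsing a W3C `traceparent` value of the form `version-trace_id-parent_id-flags`. It is not the PR's `TraceParent` extractor: the `ParsedTraceParent` type and `parse_traceparent` function are hypothetical names, and as a simplification the sketch accepts only version `00` and ignores `x-request-id`.

```rust
/// Hypothetical result type; the PR's actual extractor may expose different fields.
#[derive(Debug, PartialEq)]
struct ParsedTraceParent {
    trace_id: String,
    parent_id: String,
    sampled: bool,
}

/// True if `s` is exactly `len` lowercase hexadecimal characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parses a version-00 `traceparent` header value, returning None if it is malformed.
fn parse_traceparent(header: &str) -> Option<ParsedTraceParent> {
    let mut parts = header.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;

    // Simplification: only version 00, which has exactly four fields.
    if version != "00" || parts.next().is_some() {
        return None;
    }
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // All-zero trace and parent IDs are invalid in W3C Trace Context.
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }

    let flags = u8::from_str_radix(flags, 16).ok()?;
    Some(ParsedTraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        // The lowest bit of the flags byte is the "sampled" flag.
        sampled: (flags & 0x01) == 0x01,
    })
}

fn main() {
    let header = "00-2adfd24468724599bb9a4990dc342288-abd16e319329445f-01";
    match parse_traceparent(header) {
        Some(ctx) => println!(
            "trace_id={} parent_id={} sampled={}",
            ctx.trace_id, ctx.parent_id, ctx.sampled
        ),
        None => println!("malformed traceparent header"),
    }
}
```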

Structured Logging Improvements

  • Enhanced JSON Output: Logs now include trace_id, span_id, parent_id, and x_request_id fields
  • Timing Metrics: Duration fields converted to microseconds (time.busy_us, time.idle_us, time.duration_us)
  • Schema Compliance: JSON Schema draft-07 validation with comprehensive async test coverage

HTTP Server Integration

  • Route Instrumentation: All HTTP handlers (/health, /live, /metrics) now accept trace context
  • Fallback Handler: Non-existent routes instrumented with tracing spans
  • Test Coverage: New integration tests validate tracing behavior across HTTP endpoints

Pipeline & Engine Instrumentation

  • Async Task Propagation: .in_current_span() applied to spawned tasks for context continuity
  • Python Bindings: Server streaming engine instrumented with #[tracing::instrument(skip_all)]
  • Push Endpoints: Instance ID tracking added to replace generic worker identifiers

Technical Implementation

The implementation centers around a new DistributedTraceIdLayer that automatically assigns or propagates trace identifiers across spans. The JSON log formatter has been significantly enhanced to normalize span lifecycle messages and parse tracing durations into standardized microsecond metrics.
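
A sketch of what the duration normalization might look like, assuming the span-close durations arrive as human-readable strings such as `9.00µs` or `1.50ms` (a number followed by `ns`, `µs`, `ms`, or `s`). Both that input format and the function name `parse_duration_us` are assumptions for illustration; the PR's formatter may handle this differently.

```rust
/// Converts a duration string like "495µs" or "1.50ms" into whole microseconds.
/// Returns None if the unit is unknown or the number cannot be parsed.
fn parse_duration_us(text: &str) -> Option<u64> {
    // Longer suffixes come first so that "ms" or "ns" is not mistaken for "s".
    const UNITS: [(&str, f64); 4] = [
        ("ns", 0.001),
        ("µs", 1.0),
        ("ms", 1_000.0),
        ("s", 1_000_000.0),
    ];

    for &(suffix, us_per_unit) in UNITS.iter() {
        if let Some(number) = text.trim().strip_suffix(suffix) {
            let value: f64 = number.trim().parse().ok()?;
            if !value.is_finite() || value < 0.0 {
                return None;
            }
            // Round to the nearest microsecond.
            return Some((value * us_per_unit).round() as u64);
        }
    }
    None
}

fn main() {
    for sample in ["9.00µs", "495µs", "1.50ms", "2.00s", "soon"] {
        println!("{:>8} -> {:?}", sample, parse_duration_us(sample));
    }
}
```

Emitting integers under the `_us` keys keeps the fields compatible with the `"type": "integer"` entries in the log schema.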

Testing & Validation

Comprehensive test suite added including:

  • Nested span validation with parent-child relationships
  • JSON schema compliance verification
  • Duration consistency checks between related spans
  • HTTP endpoint tracing integration tests
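
One plausible reading of the duration consistency check, sketched with the values from the example logs in the PR description: each span's busy and idle time should add up to its duration, and a nested span should not outlast its parent. The `SpanTiming` type and its methods are hypothetical, and real timings may only satisfy the first property approximately because of rounding.

```rust
/// Timing fields from a SPAN_CLOSED log line, in microseconds.
#[derive(Debug, Clone, Copy)]
struct SpanTiming {
    busy_us: u64,
    idle_us: u64,
    duration_us: u64,
}

impl SpanTiming {
    /// Busy plus idle time should account for the whole span duration.
    fn is_self_consistent(&self) -> bool {
        self.busy_us + self.idle_us == self.duration_us
    }

    /// A nested span should not run longer than the span that encloses it.
    fn fits_within(&self, parent: &SpanTiming) -> bool {
        self.duration_us <= parent.duration_us
    }
}

fn main() {
    // Values copied from the SPAN_CLOSED lines in the PR description.
    let grandchild = SpanTiming { busy_us: 9, idle_us: 9, duration_us: 18 };
    let child = SpanTiming { busy_us: 495, idle_us: 11, duration_us: 506 };
    let parent = SpanTiming { busy_us: 577, idle_us: 63, duration_us: 640 };

    for span in [grandchild, child, parent] {
        assert!(span.is_self_consistent());
    }
    assert!(grandchild.fits_within(&child));
    assert!(child.fits_within(&parent));
    println!("timings are consistent");
}
```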

This foundation enables future instrumentation of frontend components and request handling workflows, with potential for span creation metrics collection.


keivenchang commented Jul 25, 2025

Thanks for adding this, looks good. How does one turn it on/off (via env var)? Also it would be nice to add a test to check the log output (at the "// TODO Add proper testing for... " line), but since it's integration, it's not critical... maybe in the next PR.

nnshah1 commented Jul 25, 2025

> Thanks for adding this, looks good. How does one turn it on/off (via env var)? Also it would be nice to add a test to check the log output (at the "// TODO Add proper testing for... " line), but since it's integration, it's not critical... maybe in the next PR.

We have a comment in the logging file on how to enable it, but let's add that to the observability doc we are starting; we can add a section on logging.

There is a test in the logging module, but not yet in the HTTP service. I ran into issues capturing the output, so that still needs to be worked out. Manual inspection looks good, though.

@nnshah1 nnshah1 enabled auto-merge (squash) July 25, 2025 23:29
Co-authored-by: Olga Andreeva <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
nnshah1 and others added 2 commits July 25, 2025 16:49
Co-authored-by: Olga Andreeva <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Co-authored-by: Olga Andreeva <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
@nnshah1 nnshah1 requested a review from oandreeva-nv July 26, 2025 01:41
@keivenchang keivenchang left a comment

LGTM

@nnshah1 nnshah1 merged commit 0cb01b3 into main Jul 28, 2025
10 checks passed
@nnshah1 nnshah1 deleted the neelays/structured_logging branch July 28, 2025 19:31
5 participants