-
Notifications
You must be signed in to change notification settings - Fork 693
feat: updates to structured logging #2061
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis set of changes introduces distributed tracing and enhanced observability across several Rust modules, particularly within the HTTP server, logging, and pipeline network components. New tracing instrumentation, context propagation, and structured logging are implemented, along with development dependency additions and minor builder enhancements. Test coverage for tracing and logging is also expanded. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant HTTPServer
participant TraceParent
participant Handler
participant Logger
Client->>HTTPServer: Sends HTTP request (with trace headers)
HTTPServer->>TraceParent: Extracts trace context from headers
HTTPServer->>Handler: Calls handler with trace context
Handler->>Logger: Emits logs with trace/span IDs
Handler-->>HTTPServer: Returns response
HTTPServer-->>Client: Sends HTTP response
sequenceDiagram
participant RustAsync
participant Tracing
participant Task
RustAsync->>Tracing: Enters instrumented async function
RustAsync->>Task: Spawns async/blocking task .in_current_span()
Task->>Tracing: Executes within current tracing span
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (2)
lib/runtime/Cargo.toml (1)
73-74: Fix formatting inconsistency in dependency declarations.The new dependencies lack spaces around
=which is inconsistent with the rest of the file.-stdio-override = {version= "0.2.0"} -jsonschema = {version = "0.17"} +stdio-override = { version = "0.2.0" } +jsonschema = { version = "0.17" }lib/runtime/src/http_server.rs (1)
420-479: Test implementation is incomplete.The test sets up tracing but doesn't verify that trace IDs are actually propagated or logged correctly. The TODO comment on lines 439-440 acknowledges this gap.
Would you like me to help implement the trace ID verification logic? This could include:
- Capturing and parsing the JSONL logs
- Verifying trace_id and span_id fields are present and properly formatted
- Checking parent-child span relationships
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.locklib/bindings/python/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (7)
lib/bindings/python/rust/engine.rs(4 hunks)lib/runtime/Cargo.toml(1 hunks)lib/runtime/src/component/endpoint.rs(1 hunks)lib/runtime/src/http_server.rs(9 hunks)lib/runtime/src/logging.rs(12 hunks)lib/runtime/src/pipeline/network/ingress/push_endpoint.rs(2 hunks)lib/runtime/src/pipeline/network/ingress/push_handler.rs(1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
lib/runtime/Cargo.toml (1)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
lib/runtime/src/pipeline/network/ingress/push_handler.rs (1)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
lib/bindings/python/rust/engine.rs (5)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: oandreeva-nv
PR: #1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets None. The start_batching_publisher function in lib/llm/tests/block_manager.rs demonstrates this pattern: when the KVBMDynamoRuntimeComponent is dropped, its batch_tx sender is dropped, causing rx.recv() to return None, which triggers cleanup and task termination.
Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.
Learnt from: PeaBrane
PR: #1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Learnt from: t-ob
PR: #1290
File: launch/dynamo-run/src/subprocess/sglang_inc.py:80-110
Timestamp: 2025-06-03T10:17:51.711Z
Learning: The sglang async_encode method does not support streaming options, so collecting all embeddings before yielding is the correct approach for embedding requests.
lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (2)
Learnt from: PeaBrane
PR: #1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/runtime/src/http_server.rs (3)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
lib/runtime/src/logging.rs (2)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
🔇 Additional comments (15)
lib/runtime/src/logging.rs (2)
645-654: LGTM! Smart approach to preserve structured data.The JSON deserialization attempt for non-message fields is a clever way to preserve structured data in logs while keeping the message field as a plain string.
793-985: Excellent comprehensive test coverage!The test thoroughly validates distributed tracing functionality including JSON schema compliance, trace propagation, span relationships, and timing consistency.
lib/runtime/src/component/endpoint.rs (1)
106-110: LGTM! Proper instance identification for tracing.The addition of
instance_id(lease_id)correctly passes the lease ID to the PushEndpoint for use in distributed tracing logs.lib/runtime/src/pipeline/network/ingress/push_handler.rs (1)
25-26: LGTM! Appropriate tracing instrumentation.Good use of
skip_allto avoid recording potentially large or sensitive payload data in traces.lib/runtime/src/pipeline/network/ingress/push_endpoint.rs (1)
94-99: Fix field reference in tracing macros.The
instance_idfield references are missing theself.prefix.- tracing::trace!(self.instance_id, "handling new request"); + tracing::trace!(instance_id = self.instance_id, "handling new request"); let result = ingress.handle_payload(req.message.payload).await; match result { Ok(_) => { - tracing::trace!(self.instance_id, "request handled successfully"); + tracing::trace!(instance_id = self.instance_id, "request handled successfully"); }Likely an incorrect or invalid review comment.
lib/bindings/python/rust/engine.rs (4)
23-23: Import looks good.The
Instrumenttrait import is necessary for the.in_current_span()calls added to propagate tracing context.
152-152: Tracing instrumentation is correctly applied.Using
skip_allis appropriate here to avoid logging potentially large request objects.
187-187: Span propagation for blocking task is correct.Properly propagates the tracing context into the blocking task that handles Python GIL operations.
266-267: Span propagation for async task is properly implemented.Correctly propagates the tracing context into the spawned task that processes the Python async generator stream.
lib/runtime/src/http_server.rs (6)
17-17: Import additions are correct.The
TraceParentandInstrumentimports are necessary for the distributed tracing implementation.Also applies to: 29-29
126-127: Route handler updates are consistent.Both health endpoints correctly pass the tracing context and route identifier for observability.
Also applies to: 133-134
140-140: Metrics handler update is correct.Consistent with other handlers in accepting tracing context.
143-154: Fallback handler tracing implementation is well done.Properly creates a traced span with all relevant identifiers from the TraceParent context.
190-195: Handler instrumentation is excellently implemented.The tracing attribute properly skips the state while explicitly logging route and trace identifiers at the appropriate level.
226-231: Metrics handler instrumentation matches the established pattern.Consistent implementation with health_handler for uniform observability.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Neelay Shah <[email protected]>
@coderabbitai summary |
|
Tip For best results, initiate chat on the files or code changes. @nnshah1 Here's a regenerated summary of this PR: SummaryThis pull request introduces comprehensive distributed tracing and structured logging enhancements to improve observability across the Dynamo system. The changes implement W3C-compliant trace context propagation, JSONL log formatting with microsecond precision timing, and span lifecycle tracking. Key EnhancementsDistributed Tracing Infrastructure
Structured Logging Improvements
HTTP Server Integration
Pipeline & Engine Instrumentation
Technical ImplementationThe implementation centers around a new Testing & ValidationComprehensive test suite added including:
This foundation enables future instrumentation of frontend components and request handling workflows, with potential for span creation metrics collection. 🧠 Learnings used |
|
Thanks for adding this, looks good. How does one turn it on/off (via env var)? Also it would be nice to add a test to check the log output (at the "// TODO Add proper testing for... " line) , but since it's integration, it's not critical... maybe in the next PR. |
we have a comment in the logging file on how to enable - but let's add that to the observability doc we start - we can add a section on logging there is a test in the logging module - but not yet in the http service - ran into issues with capturing the output so need to work that out - manual inspection looks good though. |
Co-authored-by: Olga Andreeva <[email protected]> Signed-off-by: Neelay Shah <[email protected]>
Co-authored-by: Olga Andreeva <[email protected]> Signed-off-by: Neelay Shah <[email protected]>
Co-authored-by: Olga Andreeva <[email protected]> Signed-off-by: Neelay Shah <[email protected]>
keivenchang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Overview:
Updates JSONL formats to include
Todo: Follow with instrumentation in frontend and request handling
Todo: Investigate if metrics can be collected easily on span creation
Example:
These insrumentations:
Generate these logs:
{"file":"lib/runtime/src/logging.rs","level":"INFO","line":727,"message":"SPAN_CREATED","span_id":"abd16e319329445f","span_name":"parent","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262300Z","trace_id":"2adfd24468724599bb9a4990dc342288"} {"file":"lib/runtime/src/logging.rs","level":"INFO","line":740,"message":"","my_trace_id":"2adfd24468724599bb9a4990dc342288","span_id":"abd16e319329445f","span_name":"parent","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262362Z","trace_id":"2adfd24468724599bb9a4990dc342288"} {"file":"lib/runtime/src/logging.rs","level":"INFO","line":745,"message":"SPAN_CREATED","parent_id":"abd16e319329445f","span_id":"ee39eaf09c814019","span_name":"child","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262408Z","trace_id":"2adfd24468724599bb9a4990dc342288"} {"file":"lib/runtime/src/logging.rs","level":"INFO","line":749,"message":"","my_trace_id":"2adfd24468724599bb9a4990dc342288","parent_id":"abd16e319329445f","span_id":"ee39eaf09c814019","span_name":"child","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262420Z","trace_id":"2adfd24468724599bb9a4990dc342288"} {"file":"lib/runtime/src/logging.rs","level":"INFO","line":754,"message":"SPAN_CREATED","parent_id":"ee39eaf09c814019","span_id":"659eaedf97104d1a","span_name":"grandchild","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262432Z","trace_id":"2adfd24468724599bb9a4990dc342288"} {"file":"lib/runtime/src/logging.rs","level":"INFO","line":758,"message":"","my_trace_id":"2adfd24468724599bb9a4990dc342288","parent_id":"ee39eaf09c814019","span_id":"659eaedf97104d1a","span_name":"grandchild","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262441Z","trace_id":"2adfd24468724599bb9a4990dc342288"} {"file":"lib/runtime/src/logging.rs","level":"INFO","line":754,"message":"SPAN_CLOSED","parent_id":"ee39eaf09c814019","span_id":"659eaedf97104d1a","span_name":"grandchild","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262451Z","time.busy_us":9,"time.duration_us":18,"time.idle_us":9,"trace_id":"2adfd24468724599bb9a4990dc342288"} {"file":"lib/runtime/src/logging.rs","level":"INFO","line":745,"message":"SPAN_CLOSED","parent_id":"abd16e319329445f","span_id":"ee39eaf09c814019","span_name":"child","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262915Z","time.busy_us":495,"time.duration_us":506,"time.idle_us":11,"trace_id":"2adfd24468724599bb9a4990dc342288"} {"file":"lib/runtime/src/logging.rs","level":"INFO","line":727,"message":"SPAN_CLOSED","span_id":"abd16e319329445f","span_name":"parent","target":"dynamo_runtime::logging::tests","time":"2025-07-24T20:46:46.262932Z","time.busy_us":577,"time.duration_us":640,"time.idle_us":63,"trace_id":"2adfd24468724599bb9a4990dc342288"}where log lines have the following schema:
{ "$schema": "http://json-schema.org/draft-07/schema#", "title": "Runtime Log Line", "type": "object", "required": [ "file", "level", "line", "message", "target", "time" ], "properties": { "file": { "type": "string" }, "level": { "type": "string", "enum": ["ERROR", "WARN", "INFO", "DEBUG", "TRACE"] }, "line": { "type": "integer" }, "message": { "type": "string" }, "target": { "type": "string" }, "time": { "type": "string", "format": "date-time" }, "span_id": { "type": "string", "pattern": "^[a-f0-9]{16}$" }, "parent_id": { "type": "string", "pattern": "^[a-f0-9]{16}$" }, "trace_id": { "type": "string", "pattern": "^[a-f0-9]{32}$" }, "span_name": { "type": "string" }, "time.busy_us": { "type": "integer" }, "time.duration_us": { "type": "integer" }, "time.idle_us": { "type": "integer" } }, "additionalProperties": true }Details:
Where should the reviewer start?
logging.rs and http_server.rs
Ignore others - will add as separate PR
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores