feat: add more metrics to rust frontend #1315
Conversation
Walkthrough
This update introduces new optional token-related metadata fields (chunk_tokens, input_tokens, output_tokens) on the Annotated response type, and adds input/output sequence length, time-to-first-token (TTFT), and inter-token latency (ITL) histogram metrics to the HTTP service.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTP_Service
    participant Preprocessor
    participant Engine
    participant Metrics
    Client->>HTTP_Service: Sends completion/chat request
    HTTP_Service->>Preprocessor: Forwards request
    Preprocessor->>Engine: Sends processed request
    Engine-->>Preprocessor: Streams Annotated response chunks (with token fields)
    Preprocessor-->>HTTP_Service: Streams Annotated responses (with token fields)
    HTTP_Service->>Metrics: Updates TTFT, ITL, input/output token metrics
    HTTP_Service-->>Client: Streams SSE events
```
Actionable comments posted: 1
🧹 Nitpick comments (2)
lib/llm/src/http/service/metrics.rs (1)
128-177
: Review bucket configurations for production appropriateness. The histogram bucket configurations look reasonable, but consider whether they align with expected token ranges and latencies in your production environment.
Consider validating these bucket ranges against expected production metrics:
- ISL buckets: 0-128K tokens might be appropriate for most use cases
- OSL buckets: 0-32K tokens seems reasonable for typical responses
- TTFT buckets: 0-480 seconds covers a wide range
- ITL buckets: 0-2 seconds should capture most inter-token delays
You may want to adjust based on your specific model performance characteristics.
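For reference, here is a minimal sketch of how a histogram with explicit bucket boundaries can be declared with the prometheus crate; the metric name, label, and boundary values are illustrative assumptions rather than the exact configuration in this PR:

```rust
use prometheus::{HistogramOpts, HistogramVec, Registry};

// Registers an ISL histogram with hand-picked bucket boundaries (example values).
fn register_isl_histogram(registry: &Registry) -> prometheus::Result<HistogramVec> {
    // Adjusting this list is how the resolution/range trade-off gets tuned.
    let buckets = vec![64.0, 256.0, 1024.0, 4096.0, 16384.0, 65536.0, 131072.0];
    let isl = HistogramVec::new(
        HistogramOpts::new("input_sequence_tokens", "Input sequence length in tokens")
            .buckets(buckets),
        &["model"],
    )?;
    registry.register(Box::new(isl.clone()))?;
    Ok(isl)
}
```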
lib/llm/src/http/service/openai.rs (1)
447-489
: Consider potential performance impact of frequent mutex locking. The process_event_converter function correctly handles event conversion and metrics updates, but consider the performance impact of locking the mutex on every event. If performance becomes an issue with high-throughput streams, consider:
- Batching metrics updates
- Using lock-free atomic operations for simple counters
- Measuring the actual impact before optimizing
The current implementation prioritizes correctness and is likely acceptable for most use cases.
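As context for this comment, a minimal sketch of the per-event locking pattern being discussed; the guard's fields and the function name are simplified assumptions, not the code from this PR:

```rust
use std::sync::{Arc, Mutex};

// Stand-in for the per-request guard that owns the metric state.
struct InflightGuard {
    output_tokens: usize,
}

impl InflightGuard {
    fn observe_response(&mut self, chunk_tokens: usize) {
        self.output_tokens += chunk_tokens;
    }
}

// Called once per SSE event: each call takes the lock briefly, updates the
// counters, and releases it before the event is forwarded to the client.
fn process_event(guard: &Arc<Mutex<InflightGuard>>, chunk_tokens: usize) {
    let mut inflight = guard.lock().expect("metrics mutex poisoned");
    inflight.observe_response(chunk_tokens);
}
```

Since the critical section only touches a few counters, the lock is held very briefly, which is why measuring before optimizing is the sensible default here.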
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- lib/engines/mistralrs/src/lib.rs (2 hunks)
- lib/llm/src/engines.rs (2 hunks)
- lib/llm/src/http/service/metrics.rs (7 hunks)
- lib/llm/src/http/service/openai.rs (10 hunks)
- lib/llm/src/preprocessor.rs (3 hunks)
- lib/llm/src/protocols/codec.rs (1 hunk)
- lib/llm/src/protocols/openai.rs (1 hunk)
- lib/llm/src/protocols/openai/chat_completions/aggregator.rs (2 hunks)
- lib/llm/src/protocols/openai/chat_completions/delta.rs (1 hunk)
- lib/llm/src/protocols/openai/completions/aggregator.rs (2 hunks)
- lib/llm/src/protocols/openai/completions/delta.rs (1 hunk)
- lib/runtime/src/protocols/annotated.rs (6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
lib/llm/src/protocols/openai.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)
  - get_isl (216-218)
lib/llm/src/protocols/openai/completions/delta.rs (1)
  - get_isl (131-133)
lib/llm/src/protocols/openai/completions/delta.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)
  - get_isl (216-218)
lib/llm/src/protocols/openai.rs (1)
  - get_isl (312-312)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
lib/llm/src/protocols/openai/completions/delta.rs (1)
  - get_isl (131-133)
lib/llm/src/protocols/openai.rs (1)
  - get_isl (312-312)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (26)
lib/llm/src/protocols/codec.rs (1)
121-123
: LGTM! Clean token metadata infrastructure addition. The addition of the three token-related fields (chunk_tokens, input_tokens, output_tokens) initialized to None is consistent with the broader effort to add stream mode metrics. This non-breaking change properly extends the Annotated struct for token tracking.
lib/llm/src/protocols/openai/chat_completions/aggregator.rs (1)
287-289
: LGTM! Test cases properly updated for new token fields. The test helper functions correctly initialize the new token metadata fields to None, maintaining consistency with the production code changes. This ensures proper test coverage for the enhanced Annotated struct.
Also applies to: 433-435
lib/llm/src/protocols/openai.rs (1)
311-312
: LGTM! Well-designed trait extension for ISL metrics. The addition of the get_isl() method to the DeltaGeneratorExt trait provides a clean interface for accessing Input Sequence Length metrics. The method signature and documentation are clear, and based on the relevant code snippets, implementations correctly return prompt_tokens from usage data.
lib/engines/mistralrs/src/lib.rs (1)
410-412
: LGTM! Engine implementation consistently updated for token tracking. Both the chat completion and text completion generation paths correctly include the new token metadata fields initialized to None. This maintains consistency across the engine implementations and prepares the infrastructure for token metrics collection in the processing pipeline.
Also applies to: 572-574
lib/llm/src/protocols/openai/completions/aggregator.rs (1)
208-210
: LGTM! Consistent test data structure updates. The addition of the new optional token-related fields (chunk_tokens, input_tokens, output_tokens) initialized to None in test data structures is consistent with the updated Annotated struct definition. This ensures test compatibility without introducing any functional changes.
Also applies to: 320-322
lib/llm/src/engines.rs (1)
205-205
: LGTM! Consistent struct initialization in echo engine. The explicit initialization of the new token-related fields to None is appropriate for the EchoEngineFull implementation, which simulates responses without tracking actual token metrics. This maintains consistency with the updated Annotated struct definition.
Also applies to: 213-213, 237-237, 241-241
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)
216-218
: LGTM! Clean implementation of ISL metric retrieval. The get_isl method implementation correctly returns the prompt tokens count as the input sequence length (ISL) metric. The implementation is clean and aligns with the trait definition in lib/llm/src/protocols/openai.rs.
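For illustration, a simplified (non-generic) sketch of what the trait method and one implementation could look like; the struct layout and field name are assumptions:

```rust
/// Extension trait for delta generators; get_isl reports the input sequence length.
trait DeltaGeneratorExt {
    /// Returns the number of prompt tokens (ISL), if known.
    fn get_isl(&self) -> Option<u32>;
}

// Hypothetical delta generator that tracks usage counters for a stream.
struct DeltaGenerator {
    prompt_tokens: u32,
}

impl DeltaGeneratorExt for DeltaGenerator {
    fn get_isl(&self) -> Option<u32> {
        // ISL is simply the prompt token count from the usage data.
        Some(self.prompt_tokens)
    }
}
```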
lib/runtime/src/protocols/annotated.rs (6)
40-44
: LGTM! Well-structured token metadata addition. The three new optional token fields are properly defined with appropriate serde attributes for optional serialization.
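A minimal sketch of how such optional, skip-when-None fields could look on the struct; only the payload and the three token fields are shown, and the serde attribute choices are assumptions:

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Annotated<T> {
    /// The wrapped response payload, if any.
    pub data: Option<T>,

    /// Tokens produced in this chunk, if known.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub chunk_tokens: Option<u32>,

    /// Input sequence length for the request, if known.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub input_tokens: Option<u32>,

    /// Cumulative output tokens observed so far, if known.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub output_tokens: Option<u32>,
}
```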
56-58
: LGTM! Consistent field initialization. The new token fields are properly initialized to None in the from_error constructor, maintaining consistency with the struct design.
69-71
: LGTM! Consistent field initialization. The new token fields are properly initialized to None in the from_data constructor.
87-89
: LGTM! Consistent field initialization. The new token fields are properly initialized to None in the from_annotation constructor.
125-127
: LGTM! Proper field propagation. The transfer method correctly propagates the token fields from the source to the target Annotated instance.
143-145
: LGTM! Proper field propagation in transformation. The map_data method correctly propagates the token fields during data transformation, ensuring metadata is preserved.
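Building on the Annotated sketch above, a simplified illustration of the propagation being described (the real method signature may differ, for example by returning a Result):

```rust
impl<T> Annotated<T> {
    /// Transforms the payload while carrying the token metadata across unchanged.
    fn map_data<U>(self, f: impl FnOnce(T) -> U) -> Annotated<U> {
        Annotated {
            data: self.data.map(f),
            chunk_tokens: self.chunk_tokens,
            input_tokens: self.input_tokens,
            output_tokens: self.output_tokens,
        }
    }
}
```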
lib/llm/src/preprocessor.rs (4)
196-196
: LGTM! Appropriate field addition for token tracking. The cumulative_output_tokens field is correctly added to track output tokens across response chunks.
204-204
: LGTM! Proper field initialization. The cumulative_output_tokens field is correctly initialized to zero.
254-256
: LGTM! Proper token metadata attachment. The token counts are correctly attached to the response, providing comprehensive token usage information.
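A sketch of how the preprocessor might attach these counts to each chunk, reusing the Annotated sketch above; cumulative_output_tokens comes from the review, the remaining names are illustrative:

```rust
// Hypothetical per-stream state kept by the preprocessor.
struct StreamState {
    isl: Option<u32>,
    cumulative_output_tokens: u32,
}

fn annotate_tokens<T>(
    state: &mut StreamState,
    mut response: Annotated<T>,
    chunk_tokens: u32,
) -> Annotated<T> {
    state.cumulative_output_tokens += chunk_tokens;
    // Attach per-chunk, input, and cumulative output token counts to the chunk.
    response.chunk_tokens = Some(chunk_tokens);
    response.input_tokens = state.isl;
    response.output_tokens = Some(state.cumulative_output_tokens);
    response
}
```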
225-236
: ✅ Verification successful
Verify the get_isl() method availability. The token calculation logic looks correct, but ensure that the get_isl() method is properly implemented in the DeltaGeneratorExt trait.
🏁 Script executed:
```shell
#!/bin/bash
# Description: Verify that get_isl() method exists in DeltaGeneratorExt trait
# Expected: Find the method definition and implementations

# Search for the get_isl method definition in the trait
ast-grep --pattern $'trait DeltaGeneratorExt<$_> { $$$ fn get_isl($_) -> $_ { $$$ } $$$ }'

# Also search for get_isl method implementations
rg -A 3 "fn get_isl"
```

Length of output: 913
get_isl() implementation verified
- The DeltaGeneratorExt trait in lib/llm/src/protocols/openai.rs declares fn get_isl(&self) -> Option<u32>;
- Concrete implementations exist in:
  - lib/llm/src/protocols/openai/completions/delta.rs
  - lib/llm/src/protocols/openai/chat_completions/delta.rs
- No missing or unimplemented usages of get_isl() were found.

All checks pass; no further changes needed.
lib/llm/src/http/service/metrics.rs (4)
31-34
: LGTM! Comprehensive token metrics addition. The four new histogram metrics provide excellent coverage for token-related performance monitoring: ISL, OSL, TTFT, and ITL.
48-50
: LGTM! Appropriate fields for token tracking. The new fields in InflightGuard properly support token metrics calculation and timing.
299-301
: LGTM! Proper field initialization. The new fields in InflightGuard are correctly initialized with appropriate default values.
317-352
: LGTM! Well-implemented metrics observation methods. The observe_response method properly handles TTFT and ITL calculations, with appropriate logic for first token detection and per-token ITL recording.
lib/llm/src/http/service/openai.rs (5)
156-160
: LGTM! Proper thread-safe wrapping of InflightGuard. The Arc<Mutex<_>> wrapping enables safe sharing of the inflight guard across async tasks for metrics updates.
180-182
: LGTM! Efficient stream processing with metrics. The stream mapping correctly uses the new process_event_converter function to handle both event conversion and metrics updates.
277-281
: LGTM! Consistent thread-safe implementation. The chat completions handler uses the same thread-safe pattern as the completions handler.
303-305
: LGTM! Consistent stream processing. The chat completions stream processing follows the same pattern as completions.
469-480
: LGTM! Proper conditional metrics updates. The metrics updates are correctly gated by the presence of token data, ensuring metrics are only recorded when relevant information is available.
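A sketch of that gating, reusing the InflightGuard and Annotated sketches from earlier in this review; the function name is illustrative:

```rust
use std::sync::{Arc, Mutex};

// Record metrics only when the event actually carries token data; annotation-only
// events pass through without touching the histograms.
fn maybe_update_metrics<T>(guard: &Arc<Mutex<InflightGuard>>, event: &Annotated<T>) {
    if let Some(chunk_tokens) = event.chunk_tokens {
        let mut inflight = guard.lock().expect("metrics mutex poisoned");
        inflight.observe_response(chunk_tokens as usize);
    }
}
```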
Co-authored-by: Ryan McCormick <[email protected]>
Signed-off-by: Hongkuan Zhou <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (2)
lib/llm/src/http/service/metrics.rs (2)
313-313
: Address model name sanitization concern. This line records metrics using the raw model name. As mentioned in past reviews, consider sanitizing the model name to prevent issues with metric labels.
Apply sanitization to the model name before using it in metric labels:
```diff
- .with_label_values(&[&self.model])
+ .with_label_values(&[&self.model.to_lowercase().replace('-', "_")])
```
335-335
: TODO comment matches previous review feedback. This TODO about publishing ISL as soon as tokenization completes aligns with previous review feedback. The current implementation publishes ISL with the first token response, which may not be optimal for timing accuracy.
Would you like me to help design a solution to publish ISL earlier in the pipeline, such as after request preprocessing?
🧹 Nitpick comments (2)
lib/llm/src/http/service/metrics.rs (2)
48-54
: Good field documentation, but consider boolean naming convention. The comments clearly explain the purpose of each field. However, based on past feedback about boolean naming, consider renaming is_first_token to has_first_token for clarity.

```diff
-    // we use is_first_token to distinguish TTFT from ITL. It is false by default and
-    // flip to true when the first token is returned and TTFT is published.
-    is_first_token: bool,
+    // we use has_first_token to distinguish TTFT from ITL. It is true by default and
+    // flips to false when the first token is returned and TTFT is published.
+    has_first_token: bool,
```
346-352
: ITL calculation logic is correct but could be optimized. The logic correctly calculates inter-token latency and records it once per token. However, the loop could be replaced with a single observation with the count parameter if Prometheus supports it.
Consider if this can be optimized (depending on Prometheus library capabilities):

```diff
- for _ in 0..num_tokens {
-     self.metrics
-         .inter_token_latency
-         .with_label_values(&[&self.model])
-         .observe(itl);
- }
+ let histogram = self.metrics
+     .inter_token_latency
+     .with_label_values(&[&self.model]);
+ for _ in 0..num_tokens {
+     histogram.observe(itl);
+ }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- lib/llm/src/http/service/metrics.rs (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (3)
lib/llm/src/http/service/metrics.rs (3)
6-9
: LGTM: Import additions are appropriate. The additional imports for Duration and Instant are necessary for the new timing functionality.
31-34
: LGTM: New metric fields follow established patterns. The four new histogram fields are well-named and consistent with the existing metrics structure.
132-180
: Verify histogram bucket ranges for production workloads. The bucket ranges look reasonable, but please confirm they align with expected production workloads:
- ISL buckets go up to 128k tokens (good for long context models)
- OSL buckets max at 32k tokens
- TTFT buckets range from 1ms to 8 minutes
- ITL buckets range from 1ms to 2 seconds
Consider validating these ranges against your typical workload patterns to ensure good histogram resolution.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/llm/src/http/service/metrics.rs (1)
434-446
: ITL calculation methodology is reasonable but consider documenting the averaging approach. The implementation calculates average inter-token latency across all tokens in a response chunk, which is a practical approximation given the available timing data. However, this approach may not capture true per-token latency variations within a chunk.
Consider adding a comment to clarify this design choice:
```diff
+ // Calculate average ITL for all tokens in this response chunk
+ // This is an approximation since individual token timing isn't available
  let response_duration = current_duration - last_response_time;
  let itl = response_duration.as_secs_f64() / num_tokens as f64;
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- lib/llm/src/http/service/metrics.rs (8 hunks)
- lib/llm/src/http/service/openai.rs (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/llm/src/http/service/openai.rs
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (7)
lib/llm/src/http/service/metrics.rs (7)
6-9
: LGTM! Proper imports for timing functionality. The additional imports for Duration and Instant are correctly added to support the new timing-based metrics.
31-34
: LGTM! New histogram metrics properly declared. The four new histogram metrics are correctly added to the Metrics struct for tracking input/output sequence lengths and timing metrics.
78-90
: Well-designed ResponseMetricCollector struct. The struct properly encapsulates all necessary state for tracking response metrics (a sketch follows after this list):
- Uses Arc<Metrics> for thread-safe sharing
- Tracks timing state with start_time and last_response_time
- Uses a boolean flag to distinguish the first token (TTFT) from subsequent tokens (ITL)
- Maintains a running count of output sequence length
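A sketch of a collector with this shape; the histogram field names and exact types are assumptions based on the surrounding comments:

```rust
use std::sync::Arc;
use std::time::Instant;

use prometheus::HistogramVec;

// Histogram handles assumed to be registered elsewhere; field names are illustrative.
struct Metrics {
    input_sequence_tokens: HistogramVec,
    output_sequence_tokens: HistogramVec,
    time_to_first_token: HistogramVec,
    inter_token_latency: HistogramVec,
}

struct ResponseMetricCollector {
    // Shared handle to the registered Prometheus metrics.
    metrics: Arc<Metrics>,
    // Label applied to every observation for this request.
    model: String,
    // Timing state: request start and the arrival time of the previous chunk.
    start_time: Instant,
    last_response_time: Option<Instant>,
    // True until the first token arrives, so TTFT is published exactly once.
    is_first_token: bool,
    // Running output sequence length, published once when the collector is dropped.
    osl: usize,
}
```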
139-187
: Excellent bucket configurations for different metric types. The histogram buckets are well-chosen for each metric:
- Input/output sequence length: Exponential buckets from 0 to 128K tokens covering typical use cases
- TTFT: Fine-grained buckets for sub-second measurements up to 8 minutes
- ITL: Very fine-grained buckets for measuring inter-token delays in milliseconds
The bucket ranges align well with expected LLM performance characteristics.
285-296
: Good integration of model name sanitization. The to_lowercase() conversion addresses the previous review comment about model name sanitization and ensures consistent metric labeling.
407-430
: Robust implementation with proper edge case handling. The method correctly does the following (see the sketch after this list):
- Guards against division by zero with early return
- Distinguishes first token (TTFT) from subsequent tokens (ITL)
- Publishes ISL only once on first token
- Includes the TODO comment as requested in previous reviews
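As referenced above, a sketch of how these four points could fit together in observe_response, building on the ResponseMetricCollector sketch earlier in this review; it approximates the described logic rather than reproducing the exact code:

```rust
use std::time::Instant;

impl ResponseMetricCollector {
    /// Records timing metrics for one chunk carrying `num_tokens` new tokens.
    fn observe_response(&mut self, isl: Option<u32>, num_tokens: usize) {
        // Early return guards the division by num_tokens below.
        if num_tokens == 0 {
            return;
        }
        let now = Instant::now();
        self.osl += num_tokens;

        if self.is_first_token {
            // First chunk: publish TTFT, and ISL exactly once.
            let ttft = now.duration_since(self.start_time).as_secs_f64();
            self.metrics
                .time_to_first_token
                .with_label_values(&[&self.model])
                .observe(ttft);
            if let Some(isl) = isl {
                self.metrics
                    .input_sequence_tokens
                    .with_label_values(&[&self.model])
                    .observe(isl as f64);
            }
            self.is_first_token = false;
        } else if let Some(last) = self.last_response_time {
            // Later chunks: average the gap across the chunk's tokens, recorded per token.
            let itl = now.duration_since(last).as_secs_f64() / num_tokens as f64;
            for _ in 0..num_tokens {
                self.metrics
                    .inter_token_latency
                    .with_label_values(&[&self.model])
                    .observe(itl);
            }
        }
        self.last_response_time = Some(now);
    }
}
```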
449-457
: Proper RAII pattern for final metric publication. The Drop implementation ensures OSL is always published when the collector is destroyed, preventing metric loss even if the response processing is interrupted.
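A sketch of that Drop implementation, again building on the ResponseMetricCollector sketch above:

```rust
impl Drop for ResponseMetricCollector {
    fn drop(&mut self) {
        // Publish the output sequence length exactly once, even if the stream
        // ends early or the response task is cancelled.
        self.metrics
            .output_sequence_tokens
            .with_label_values(&[&self.model])
            .observe(self.osl as f64);
    }
}
```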
Add these metrics to rust frontend (stream mode only):
Special thanks to @jthomson04 for the help in rust implementation!
Summary by CodeRabbit