Conversation

@tedzhouhk (Contributor) commented May 31, 2025

Add the following metrics to the Rust frontend (stream mode only):

  • ISL (input sequence length)
  • OSL (output sequence length)
  • TTFT (time to first token)
  • ITL (inter-token latency)

Special thanks to @jthomson04 for the help with the Rust implementation!

Summary by CodeRabbit

  • New Features
    • Added detailed token usage metrics to streaming responses, including per-chunk tokens, input tokens, and cumulative output tokens.
    • Enhanced API responses to include richer metadata about token counts for better usage insights.
    • Introduced new metrics for input/output sequence length, time to first token, and inter-token latency, improving observability and monitoring.
    • Added input sequence length retrieval method to token generators for improved token count tracking.
    • Extended annotated response structures with new optional token count fields.
    • Improved event processing in streaming endpoints to update response metrics in real time.
  • Bug Fixes
    • Improved thread safety and reliability in metrics tracking during streaming responses.
  • Documentation
    • Updated API to reflect new optional token-related fields in responses.

copy-pr-bot bot commented May 31, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented May 31, 2025

Walkthrough

This update introduces new optional token-related metadata fields—chunk_tokens, input_tokens, and output_tokens—to the Annotated struct and propagates them throughout the LLM pipeline, including engines, protocol codecs, preprocessors, and HTTP service metrics. It also adds new Prometheus metrics and thread-safe handling for inflight metric tracking, with event processing updated to record token-level metrics.
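
For orientation, here is a minimal sketch of what the extended Annotated wrapper might look like. Only the three token fields come from this PR's description; the payload field, derives, and serde attributes shown here are assumptions for illustration, and the real struct carries additional fields omitted here.

use serde::{Deserialize, Serialize};

// Hypothetical sketch of the extended wrapper; not the verbatim source.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Annotated<R> {
    // The wrapped response payload, if any.
    pub data: Option<R>,

    // New optional token metadata; defaulting to None keeps the change
    // non-breaking for existing producers and consumers.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub chunk_tokens: Option<u32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub input_tokens: Option<u32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub output_tokens: Option<u32>,
}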

Changes

  • lib/runtime/src/protocols/annotated.rs: Added chunk_tokens, input_tokens, and output_tokens fields to Annotated<R>; updated constructors and methods to handle them.
  • lib/llm/src/protocols/codec.rs: Set the new token fields to None in the MessageAnnotated<T> conversion.
  • lib/llm/src/protocols/openai.rs, lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/protocols/openai/completions/delta.rs: Added a get_isl() method to the DeltaGeneratorExt trait and its implementations for prompt token count retrieval.
  • lib/llm/src/protocols/openai/chat_completions/aggregator.rs, lib/llm/src/protocols/openai/completions/aggregator.rs: Added the new token fields to test data in Annotated instances.
  • lib/engines/mistralrs/src/lib.rs, lib/llm/src/engines.rs: Set the new token fields to None in Annotated construction in the engine implementations.
  • lib/llm/src/preprocessor.rs: Tracks per-chunk and cumulative output tokens in OpenAIPreprocessor; attaches token metrics to responses.
  • lib/llm/src/http/service/metrics.rs: Added Prometheus histograms for input/output sequence length, TTFT, and ITL; added ResponseMetricCollector for detailed tracking; updated InflightGuard to lowercase model names.
  • lib/llm/src/http/service/openai.rs: Refactored event processing to use a process_event_converter function that updates response metrics; replaced TryFrom<EventConverter>; made InflightGuard mutable and thread-safe.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTP_Service
    participant Preprocessor
    participant Engine
    participant Metrics

    Client->>HTTP_Service: Sends completion/chat request
    HTTP_Service->>Preprocessor: Forwards request
    Preprocessor->>Engine: Sends processed request
    Engine-->>Preprocessor: Streams Annotated response chunks (with token fields)
    Preprocessor-->>HTTP_Service: Streams Annotated responses (with token fields)
    HTTP_Service->>Metrics: Updates TTFT, ITL, input/output token metrics
    HTTP_Service-->>Client: Streams SSE events

Poem

In fields of code where tokens grow,
Three new seeds now softly sow—
Chunk, input, output, counted true,
Metrics bloom where once were few.
Thread-safe bunnies guard the gate,
Prometheus watches, annotate!
Hop along, the data flows anew.

((\
( -.-)
o_(")(")


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c1dc774 and 664bb54.

📒 Files selected for processing (1)
  • lib/llm/src/http/service/openai.rs (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/llm/src/http/service/openai.rs
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: Build and Test - vllm

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
lib/llm/src/http/service/metrics.rs (1)

128-177: Review bucket configurations for production appropriateness.

The histogram bucket configurations look reasonable, but consider whether they align with the token ranges and latencies expected in your production environment.

Consider validating these bucket ranges against expected production metrics:

  • ISL buckets: 0-128K tokens might be appropriate for most use cases
  • OSL buckets: 0-32K tokens seems reasonable for typical responses
  • TTFT buckets: 0-480 seconds covers a wide range
  • ITL buckets: 0-2 seconds should capture most inter-token delays

You may want to adjust based on your specific model performance characteristics.
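
If the ranges ever need tuning, bucket boundaries are supplied when the histogram is registered. Below is a hedged sketch using the prometheus crate; the metric name, help text, label, and bucket values are illustrative assumptions, not the ones used in this PR.

use prometheus::{exponential_buckets, HistogramOpts, HistogramVec, Registry};

fn register_isl_histogram(registry: &Registry) -> prometheus::Result<HistogramVec> {
    // Illustrative buckets: powers of two from 1 up to 131072 (128K) tokens.
    let buckets = exponential_buckets(1.0, 2.0, 18)?;
    let opts = HistogramOpts::new(
        "input_sequence_tokens", // hypothetical metric name
        "Input sequence length in tokens per request",
    )
    .buckets(buckets);
    let histogram = HistogramVec::new(opts, &["model"])?;
    registry.register(Box::new(histogram.clone()))?;
    Ok(histogram)
}

Swapping the bucket vector here (or in the PR's equivalent metric constructors) is all that is needed to retune histogram resolution for a different workload.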

lib/llm/src/http/service/openai.rs (1)

447-489: Consider potential performance impact of frequent mutex locking.

The process_event_converter function correctly handles event conversion and metrics updates, but consider the performance impact of locking the mutex on every event.

If performance becomes an issue with high-throughput streams, consider:

  • Batching metrics updates
  • Using lock-free atomic operations for simple counters
  • Measuring the actual impact before optimizing

The current implementation prioritizes correctness and is likely acceptable for most use cases.
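
To illustrate the lock-free option mentioned above, simple counters can be accumulated in atomics on the hot path and folded into Prometheus when the stream ends. This is a sketch of the idea only, not code from the PR; latency histograms (TTFT/ITL) still need timing state, so this helps only the counter-style metrics.

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical lock-free accumulator for per-stream counters.
#[derive(Default)]
struct StreamCounters {
    output_tokens: AtomicU64,
    chunks: AtomicU64,
}

impl StreamCounters {
    // Called from the per-event hot path; no mutex required.
    fn record_chunk(&self, tokens: u64) {
        self.output_tokens.fetch_add(tokens, Ordering::Relaxed);
        self.chunks.fetch_add(1, Ordering::Relaxed);
    }

    // Called once when the stream ends to publish the totals.
    fn totals(&self) -> (u64, u64) {
        (
            self.output_tokens.load(Ordering::Relaxed),
            self.chunks.load(Ordering::Relaxed),
        )
    }
}

fn main() {
    let counters = Arc::new(StreamCounters::default());
    counters.record_chunk(3);
    counters.record_chunk(5);
    let (tokens, chunks) = counters.totals();
    println!("tokens={tokens}, chunks={chunks}");
}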

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c939da0 and a0d2711.

📒 Files selected for processing (12)
  • lib/engines/mistralrs/src/lib.rs (2 hunks)
  • lib/llm/src/engines.rs (2 hunks)
  • lib/llm/src/http/service/metrics.rs (7 hunks)
  • lib/llm/src/http/service/openai.rs (10 hunks)
  • lib/llm/src/preprocessor.rs (3 hunks)
  • lib/llm/src/protocols/codec.rs (1 hunks)
  • lib/llm/src/protocols/openai.rs (1 hunks)
  • lib/llm/src/protocols/openai/chat_completions/aggregator.rs (2 hunks)
  • lib/llm/src/protocols/openai/chat_completions/delta.rs (1 hunks)
  • lib/llm/src/protocols/openai/completions/aggregator.rs (2 hunks)
  • lib/llm/src/protocols/openai/completions/delta.rs (1 hunks)
  • lib/runtime/src/protocols/annotated.rs (6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
lib/llm/src/protocols/openai.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)
  • get_isl (216-218)
lib/llm/src/protocols/openai/completions/delta.rs (1)
  • get_isl (131-133)
lib/llm/src/protocols/openai/completions/delta.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (1)
  • get_isl (216-218)
lib/llm/src/protocols/openai.rs (1)
  • get_isl (312-312)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
lib/llm/src/protocols/openai/completions/delta.rs (1)
  • get_isl (131-133)
lib/llm/src/protocols/openai.rs (1)
  • get_isl (312-312)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (26)
lib/llm/src/protocols/codec.rs (1)

121-123: LGTM! Clean token metadata infrastructure addition.

The addition of the three token-related fields (chunk_tokens, input_tokens, output_tokens) initialized to None is consistent with the broader effort to add stream mode metrics. This non-breaking change properly extends the Annotated struct for token tracking.

lib/llm/src/protocols/openai/chat_completions/aggregator.rs (1)

287-289: LGTM! Test cases properly updated for new token fields.

The test helper functions correctly initialize the new token metadata fields to None, maintaining consistency with the production code changes. This ensures proper test coverage for the enhanced Annotated struct.

Also applies to: 433-435

lib/llm/src/protocols/openai.rs (1)

311-312: LGTM! Well-designed trait extension for ISL metrics.

The addition of the get_isl() method to the DeltaGeneratorExt trait provides a clean interface for accessing input sequence length (ISL) metrics. The method signature and documentation are clear, and based on the relevant code snippets, implementations correctly return prompt_tokens from usage data.
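
The signature fn get_isl(&self) -> Option<u32> is confirmed by the verification output later in this review; the surrounding trait shape and the implementation below are assumptions sketched for illustration.

// Sketch of the trait extension; the trait's other methods are omitted.
pub trait DeltaGeneratorExt<ResponseType> {
    // Returns the input sequence length (ISL), i.e. the prompt token count,
    // when usage data is available.
    fn get_isl(&self) -> Option<u32>;
}

// Hypothetical implementation mirroring the description above: the generator
// exposes the prompt token count from its usage accounting.
struct DeltaGenerator {
    prompt_tokens: u32, // assumed field; the real generator tracks usage elsewhere
}

impl DeltaGeneratorExt<()> for DeltaGenerator {
    fn get_isl(&self) -> Option<u32> {
        Some(self.prompt_tokens)
    }
}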

lib/engines/mistralrs/src/lib.rs (1)

410-412: LGTM! Engine implementation consistently updated for token tracking.

Both the chat completion and text completion generation paths correctly include the new token metadata fields initialized to None. This maintains consistency across the engine implementations and prepares the infrastructure for token metrics collection in the processing pipeline.

Also applies to: 572-574

lib/llm/src/protocols/openai/completions/aggregator.rs (1)

208-210: LGTM! Consistent test data structure updates.

The addition of the new optional token-related fields (chunk_tokens, input_tokens, output_tokens) initialized to None in test data structures is consistent with the updated Annotated struct definition. This ensures test compatibility without introducing any functional changes.

Also applies to: 320-322

lib/llm/src/engines.rs (1)

205-205: LGTM! Consistent struct initialization in echo engine.

The explicit initialization of the new token-related fields to None is appropriate for the EchoEngineFull implementation, which simulates responses without tracking actual token metrics. This maintains consistency with the updated Annotated struct definition.

Also applies to: 213-213, 237-237, 241-241

lib/llm/src/protocols/openai/chat_completions/delta.rs (1)

216-218: LGTM! Clean implementation of ISL metric retrieval.

The get_isl method implementation correctly returns the prompt tokens count as the input sequence length (ISL) metric. The implementation is clean and aligns with the trait definition in lib/llm/src/protocols/openai.rs.

lib/runtime/src/protocols/annotated.rs (6)

40-44: LGTM! Well-structured token metadata addition.

The three new optional token fields are properly defined with appropriate serde attributes for optional serialization.


56-58: LGTM! Consistent field initialization.

The new token fields are properly initialized to None in the from_error constructor, maintaining consistency with the struct design.


69-71: LGTM! Consistent field initialization.

The new token fields are properly initialized to None in the from_data constructor.


87-89: LGTM! Consistent field initialization.

The new token fields are properly initialized to None in the from_annotation constructor.


125-127: LGTM! Proper field propagation.

The transfer method correctly propagates the token fields from the source to the target Annotated instance.


143-145: LGTM! Proper field propagation in transformation.

The map_data method correctly propagates the token fields during data transformation, ensuring metadata is preserved.

lib/llm/src/preprocessor.rs (4)

196-196: LGTM! Appropriate field addition for token tracking.

The cumulative_output_tokens field is correctly added to track output tokens across response chunks.


204-204: LGTM! Proper field initialization.

The cumulative_output_tokens field is correctly initialized to zero.


254-256: LGTM! Proper token metadata attachment.

The token counts are correctly attached to the response, providing comprehensive token usage information.
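
Building on the Annotated sketch earlier in this conversation, the attachment pattern being described might look roughly like the following. The TokenTracking type and annotate helper are hypothetical names; cumulative_output_tokens and the three Annotated fields are the names discussed in this review.

// Hypothetical per-stream state kept by the preprocessor.
struct TokenTracking {
    cumulative_output_tokens: u32,
}

impl TokenTracking {
    // Attach token metadata to one streamed chunk. `chunk_token_count` would be
    // the number of tokens in this delta and `isl` the value of get_isl().
    fn annotate<R>(
        &mut self,
        mut response: Annotated<R>,
        chunk_token_count: u32,
        isl: Option<u32>,
    ) -> Annotated<R> {
        self.cumulative_output_tokens += chunk_token_count;
        response.chunk_tokens = Some(chunk_token_count);
        response.input_tokens = isl;
        response.output_tokens = Some(self.cumulative_output_tokens);
        response
    }
}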


225-236:

✅ Verification successful

Verify the get_isl() method availability.

The token calculation logic looks correct, but ensure that the get_isl() method is properly implemented in the DeltaGeneratorExt trait.


🏁 Script executed:

#!/bin/bash
# Description: Verify that get_isl() method exists in DeltaGeneratorExt trait
# Expected: Find the method definition and implementations

# Search for the get_isl method definition in the trait
ast-grep --pattern $'trait DeltaGeneratorExt<$_> {
  $$$
  fn get_isl($_) -> $_ {
    $$$
  }
  $$$
}'

# Also search for get_isl method implementations
rg -A 3 "fn get_isl"

Length of output: 913


get_isl() implementation verified

  • The DeltaGeneratorExt trait in lib/llm/src/protocols/openai.rs declares fn get_isl(&self) -> Option<u32>;
  • Concrete implementations exist in:
    • lib/llm/src/protocols/openai/completions/delta.rs
    • lib/llm/src/protocols/openai/chat_completions/delta.rs
  • No missing or unimplemented usages of get_isl() were found.

All checks pass—no further changes needed.

lib/llm/src/http/service/metrics.rs (4)

31-34: LGTM! Comprehensive token metrics addition.

The four new histogram metrics provide excellent coverage for token-related performance monitoring: ISL, OSL, TTFT, and ITL.


48-50: LGTM! Appropriate fields for token tracking.

The new fields in InflightGuard properly support token metrics calculation and timing.


299-301: LGTM! Proper field initialization.

The new fields in InflightGuard are correctly initialized with appropriate default values.


317-352: LGTM! Well-implemented metrics observation methods.

The observe_response method properly handles TTFT and ITL calculations, with appropriate logic for first token detection and per-token ITL recording.
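
To make the first-token logic concrete, here is a hedged sketch of how TTFT and ITL observations can be split; the timing fields mirror those named in this review, but this is not the PR's observe_response implementation.

use std::time::Instant;

// Hypothetical timing state; the PR keeps equivalent fields on its collector.
struct Timing {
    start_time: Instant,
    last_response_time: Instant,
    is_first_token: bool, // false until the first token is returned
}

impl Timing {
    // Returns (ttft_secs, itl_secs_per_token) for a chunk carrying `num_tokens` tokens.
    fn observe(&mut self, num_tokens: usize) -> (Option<f64>, Option<f64>) {
        if num_tokens == 0 {
            return (None, None); // nothing to attribute latency to
        }
        let now = Instant::now();
        let result = if self.is_first_token {
            // Subsequent chunks: average the gap since the previous chunk over its tokens.
            let gap = now.duration_since(self.last_response_time).as_secs_f64();
            (None, Some(gap / num_tokens as f64))
        } else {
            // First chunk: the gap since request start is the time to first token.
            self.is_first_token = true;
            (Some(now.duration_since(self.start_time).as_secs_f64()), None)
        };
        self.last_response_time = now;
        result
    }
}

The caller would then observe the returned ITL value once per token in the chunk, as noted above.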

lib/llm/src/http/service/openai.rs (5)

156-160: LGTM! Proper thread-safe wrapping of InflightGuard.

The Arc<Mutex<_>> wrapping enables safe sharing of the inflight guard across async tasks for metrics updates.


180-182: LGTM! Efficient stream processing with metrics.

The stream mapping correctly uses the new process_event_converter function to handle both event conversion and metrics updates.


277-281: LGTM! Consistent thread-safe implementation.

The chat completions handler uses the same thread-safe pattern as the completions handler.


303-305: LGTM! Consistent stream processing.

The chat completions stream processing follows the same pattern as completions.


469-480: LGTM! Proper conditional metrics updates.

The metrics updates are correctly gated by the presence of token data, ensuring metrics are only recorded when relevant information is available.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
lib/llm/src/http/service/metrics.rs (2)

313-313: Address model name sanitization concern.

This line records metrics using the raw model name. As mentioned in past reviews, consider sanitizing the model name to prevent issues with metric labels.

Apply sanitization to the model name before using it in metric labels:

-            .with_label_values(&[&self.model])
+            .with_label_values(&[&self.model.to_lowercase().replace('-', "_")])

335-335: TODO comment matches previous review feedback.

This TODO about publishing ISL as soon as tokenization completes aligns with previous review feedback. The current implementation publishes ISL with the first token response, which may not be optimal for timing accuracy.

Would you like me to help design a solution to publish ISL earlier in the pipeline, such as after request preprocessing?

🧹 Nitpick comments (2)
lib/llm/src/http/service/metrics.rs (2)

48-54: Good field documentation, but consider boolean naming convention.

The comments clearly explain the purpose of each field. However, based on past feedback about boolean naming, consider renaming is_first_token to has_first_token for clarity.

-    // we use is_first_token to distinguish TTFT from ITL. It is false by default and
-    // flip to true when the first token is returned and TTFT is published.
-    is_first_token: bool,
+    // we use has_first_token to distinguish TTFT from ITL. It is false by default and
+    // flips to true when the first token is returned and TTFT is published.
+    has_first_token: bool,

346-352: ITL calculation logic is correct but could be optimized.

The logic correctly calculates inter-token latency and records it once per token. However, the loop could be replaced with a single observation with the count parameter if Prometheus supports it.

Consider if this can be optimized (depending on Prometheus library capabilities):

-            for _ in 0..num_tokens {
-                self.metrics
-                    .inter_token_latency
-                    .with_label_values(&[&self.model])
-                    .observe(itl);
-            }
+            let histogram = self.metrics
+                .inter_token_latency
+                .with_label_values(&[&self.model]);
+            for _ in 0..num_tokens {
+                histogram.observe(itl);
+            }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c332158 and f650358.

📒 Files selected for processing (1)
  • lib/llm/src/http/service/metrics.rs (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (3)
lib/llm/src/http/service/metrics.rs (3)

6-9: LGTM: Import additions are appropriate.

The additional imports for Duration and Instant are necessary for the new timing functionality.


31-34: LGTM: New metric fields follow established patterns.

The four new histogram fields are well-named and consistent with existing metrics structure.


132-180: Verify histogram bucket ranges for production workloads.

The bucket ranges look reasonable, but please confirm they align with expected production workloads:

  • ISL buckets go up to 128k tokens (good for long context models)
  • OSL buckets max at 32k tokens
  • TTFT buckets range from 1ms to 8 minutes
  • ITL buckets range from 1ms to 2 seconds

Consider validating these ranges against your typical workload patterns to ensure good histogram resolution.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/llm/src/http/service/metrics.rs (1)

434-446: ITL calculation methodology is reasonable but consider documenting the averaging approach.

The implementation calculates average inter-token latency across all tokens in a response chunk, which is a practical approximation given the available timing data. However, this approach may not capture true per-token latency variations within a chunk.

Consider adding a comment to clarify this design choice:

+        // Calculate average ITL for all tokens in this response chunk
+        // This is an approximation since individual token timing isn't available
         let response_duration = current_duration - last_response_time;
         let itl = response_duration.as_secs_f64() / num_tokens as f64;
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68912c5 and d2a1b63.

📒 Files selected for processing (2)
  • lib/llm/src/http/service/metrics.rs (8 hunks)
  • lib/llm/src/http/service/openai.rs (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/llm/src/http/service/openai.rs
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (7)
lib/llm/src/http/service/metrics.rs (7)

6-9: LGTM! Proper imports for timing functionality.

The additional imports for Duration and Instant are correctly added to support the new timing-based metrics.


31-34: LGTM! New histogram metrics properly declared.

The four new histogram metrics are correctly added to the Metrics struct for tracking input/output sequence lengths and timing metrics.


78-90: Well-designed ResponseMetricCollector struct.

The struct properly encapsulates all the state needed to track response metrics (a rough sketch follows the list):

  • Uses Arc<Metrics> for thread-safe sharing
  • Tracks timing state with start_time and last_response_time
  • Uses boolean flag to distinguish first token (TTFT) vs subsequent tokens (ITL)
  • Maintains running count of output sequence length
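
As a rough sketch, such a collector might hold the following state; start_time, last_response_time, and is_first_token are named in this review, while the remaining field names are assumptions.

use std::sync::Arc;
use std::time::Instant;

// Placeholder for the registered Prometheus handles added in this PR.
struct Metrics;

// Sketch of the per-response collector: created when a stream starts,
// dropped when it ends.
struct ResponseMetricCollector {
    metrics: Arc<Metrics>,         // shared, thread-safe handle to the histograms
    model: String,                 // metric label value, lowercased per the review
    start_time: Instant,           // request start, used for TTFT
    last_response_time: Instant,   // previous chunk time, used for ITL
    is_first_token: bool,          // distinguishes TTFT from ITL
    output_sequence_length: usize, // running OSL, published when the collector is dropped
}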

139-187: Excellent bucket configurations for different metric types.

The histogram buckets are well-chosen for each metric:

  • Input/output sequence length: Exponential buckets from 0 to 128K tokens covering typical use cases
  • TTFT: Fine-grained buckets for sub-second measurements up to 8 minutes
  • ITL: Very fine-grained buckets for measuring inter-token delays in milliseconds

The bucket ranges align well with expected LLM performance characteristics.


285-296: Good integration of model name sanitization.

The to_lowercase() conversion addresses the previous review comment about model name sanitization and ensures consistent metric labeling.


407-430: Robust implementation with proper edge case handling.

The method correctly:

  • Guards against division by zero with early return
  • Distinguishes first token (TTFT) from subsequent tokens (ITL)
  • Publishes ISL only once on first token
  • Includes the TODO comment as requested in previous reviews

449-457: Proper RAII pattern for final metric publication.

The Drop implementation ensures OSL is always published when the collector is destroyed, preventing metric loss even if the response processing is interrupted.
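
A sketch of that RAII pattern, building on the collector sketch above; publish_osl is a hypothetical stand-in for the real histogram observation.

impl Drop for ResponseMetricCollector {
    // Publish the final output sequence length exactly once, even if the
    // stream is cancelled or response processing returns early.
    fn drop(&mut self) {
        publish_osl(&self.metrics, &self.model, self.output_sequence_length);
    }
}

// Hypothetical helper standing in for the Prometheus histogram observation.
fn publish_osl(_metrics: &Metrics, model: &str, osl: usize) {
    println!("osl[{model}] = {osl}");
}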
