feat: LLM metrics for non-streaming requests in frontend #2427
Conversation
Walkthrough

Non-streaming completions and chat completions now pass responses through a new helper that extracts LLM metric annotations and updates response metrics, then returns the original response. Streaming paths and public APIs remain unchanged.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant OpenAIService
    participant Provider
    participant Metrics as ResponseMetricCollector
    Client->>OpenAIService: completions/chat (non-streaming)
    OpenAIService->>Provider: send request
    Provider-->>OpenAIService: responses[]
    loop each response
        OpenAIService->>Metrics: process_metrics_only(annotated)
        Metrics-->>OpenAIService: metrics observed
        OpenAIService-->>OpenAIService: keep original response
    end
    OpenAIService-->>Client: aggregated response
```
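To make the flow above concrete, here is a minimal, self-contained sketch of the pattern the walkthrough and diagram describe. The `Annotated`, `LlmMetrics`, and `MetricCollector` types and the token counts are simplified stand-ins, not the project's real definitions; only the shape of the flow (tap each chunk for metrics, pass it through unchanged, then aggregate) mirrors the PR.

```rust
// Illustrative sketch only: stand-in types, not the dynamo codebase.
use futures::executor::block_on;
use futures::stream::{self, StreamExt};

// Stand-in for an annotated response chunk carrying optional LLM metrics.
struct Annotated<T> {
    data: T,
    metrics: Option<LlmMetrics>,
}

#[derive(Clone, Copy)]
struct LlmMetrics {
    input_tokens: usize,
    output_tokens: usize,
}

// Stand-in for ResponseMetricCollector: observes metrics, never alters items.
#[derive(Default)]
struct MetricCollector {
    input_tokens: usize,
    output_tokens: usize,
}

impl MetricCollector {
    fn observe<T>(&mut self, item: &Annotated<T>) {
        if let Some(m) = item.metrics {
            self.input_tokens += m.input_tokens;
            self.output_tokens += m.output_tokens;
        }
    }
}

fn main() {
    let mut collector = MetricCollector::default();
    let chunks = stream::iter(vec![
        Annotated { data: "Hello", metrics: Some(LlmMetrics { input_tokens: 4, output_tokens: 1 }) },
        Annotated { data: " world", metrics: Some(LlmMetrics { input_tokens: 0, output_tokens: 1 }) },
    ]);

    // Tap each chunk for metrics and return it unchanged (the role of process_metrics_only).
    let tapped = chunks.map(|chunk| {
        collector.observe(&chunk);
        chunk
    });

    // Stand-in for from_annotated_stream: fold chunks into the final aggregated text.
    let full_text = block_on(tapped.fold(String::new(), |mut acc, chunk| async move {
        acc.push_str(chunk.data);
        acc
    }));

    assert_eq!(full_text, "Hello world");
    assert_eq!((collector.input_tokens, collector.output_tokens), (4, 2));
}
```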
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/llm/src/http/service/openai.rs (3)
293-299: Prefer StreamExt::inspect to tap metrics without altering stream items

Using map here returns the original item unmodified. inspect expresses the intent better and avoids unnecessary moves.
Two options:
- Option A (preferred): switch to inspect.
- Option B: if NvCreateCompletionResponse::from_annotated_stream does not already ignore LLM metrics annotations, keep map but also chomp the metrics event (parity with streaming) to prevent internal annotations from leaking into the aggregated JSON.
Option A:
```diff
-        // Process the stream to collect metrics for non-streaming requests
-        let stream = stream.map(move |response| {
-            // Process metrics but return the original response for aggregation
-            process_metrics_only(&response, &mut response_collector);
-            response
-        });
+        // Tap the stream to collect metrics for non-streaming requests without altering items
+        let stream = stream.inspect(move |response| {
+            process_metrics_only(response, &mut response_collector);
+        });
```

Option B (parity with streaming; remove metrics annotations from items):
```diff
-        // Process the stream to collect metrics for non-streaming requests
-        let stream = stream.map(move |response| {
-            // Process metrics but return the original response for aggregation
-            process_metrics_only(&response, &mut response_collector);
-            response
-        });
+        // Process metrics and chomp metrics annotations to avoid leaking internal events
+        let stream = stream.map(move |mut response| {
+            process_metrics_only(&response, &mut response_collector);
+            if response.event.as_deref() == Some(crate::preprocessor::ANNOTATION_LLM_METRICS) {
+                response.event = None;
+                response.comment = None;
+            }
+            response
+        });
```

Follow-up: Please confirm whether from_annotated_stream discards ANNOTATION_LLM_METRICS events so we can safely use Option A. If not, Option B maintains user-visible parity with the streaming path.
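As a standalone illustration of Option A's core point, here is a short sketch (assuming only the futures crate; the stream contents and token counts are invented) showing that inspect observes each item by reference for a side effect and yields it to the next stage untouched:

```rust
use futures::executor::block_on;
use futures::stream::{self, StreamExt};

fn main() {
    let mut total_tokens = 0usize;

    // Stand-in for the annotated response stream: per-chunk token counts.
    let responses = stream::iter(vec![3usize, 5, 2]);

    // inspect taps each item by reference; no move, no transformation.
    let tapped = responses.inspect(|tokens| {
        total_tokens += *tokens;
    });

    let collected: Vec<usize> = block_on(tapped.collect());
    assert_eq!(collected, vec![3, 5, 2]); // items reach aggregation unchanged
    assert_eq!(total_tokens, 10);         // metrics were still observed
}
```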
524-530: Mirror the non-streaming tap with inspect and consider chomping metrics annotations

Same rationale as in completions: inspect better conveys tapping for metrics; chomp if the folder doesn’t already ignore LLM metrics annotations.
Option A:
```diff
-        // Process the stream to collect metrics for non-streaming requests
-        let stream = stream.map(move |response| {
-            // Process metrics but return the original response for aggregation
-            process_metrics_only(&response, &mut response_collector);
-            response
-        });
+        // Tap the stream to collect metrics for non-streaming requests without altering items
+        let stream = stream.inspect(move |response| {
+            process_metrics_only(response, &mut response_collector);
+        });
```

Option B (parity with streaming; remove metrics annotations from items):
```diff
-        let stream = stream.map(move |response| {
-            // Process metrics but return the original response for aggregation
-            process_metrics_only(&response, &mut response_collector);
-            response
-        });
+        let stream = stream.map(move |mut response| {
+            process_metrics_only(&response, &mut response_collector);
+            if response.event.as_deref() == Some(crate::preprocessor::ANNOTATION_LLM_METRICS) {
+                response.event = None;
+                response.comment = None;
+            }
+            response
+        });
```

Please verify whether NvCreateChatCompletionResponse::from_annotated_stream intentionally drops internal metrics events; if not, Option B prevents them from surfacing in the final JSON.
926-936: Helper looks good; add a brief doc comment (and optionally #[inline])

Small readability nit: document intent and parity with streaming, since this is now shared across endpoints.
```diff
-fn process_metrics_only<T>(
+/// Tap LLMMetricAnnotation events to update response metrics without mutating or filtering the item.
+/// Used in non-streaming paths to collect token counts while preserving original stream items.
+#[inline]
+fn process_metrics_only<T>(
     annotated: &Annotated<T>,
     response_collector: &mut ResponseMetricCollector,
 ) {
     // update metrics
     if let Ok(Some(metrics)) = LLMMetricAnnotation::from_annotation(annotated) {
         response_collector.observe_current_osl(metrics.output_tokens);
         response_collector.observe_response(metrics.input_tokens, metrics.chunk_tokens);
     }
 }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/llm/src/http/service/openai.rs (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
@tedzhouhk What do you think about the CodeRabbit nit about inspect?
Good call! I didn't know we have inspect; let me modify and test.
Signed-off-by: Hannah Zhang <[email protected]>