feat: LLM metrics for non-streaming requests in frontend #2427
Conversation
Walkthrough

Non-streaming completions and chat completions now pass responses through a new helper that extracts LLM metric annotations and updates response metrics, then returns the original response. Streaming paths and public APIs remain unchanged.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant OpenAIService
    participant Provider
    participant Metrics as ResponseMetricCollector
    Client->>OpenAIService: completions/chat (non-streaming)
    OpenAIService->>Provider: send request
    Provider-->>OpenAIService: responses[]
    loop each response
        OpenAIService->>Metrics: process_metrics_only(annotated)
        Metrics-->>OpenAIService: metrics observed
        OpenAIService-->>OpenAIService: keep original response
    end
    OpenAIService-->>Client: aggregated response
```
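To make the flow above concrete, here is a minimal, self-contained sketch of the pattern the walkthrough and diagram describe. The `Annotated`, `LlmMetrics`, and `MetricCollector` types and the token counts are simplified stand-ins, not the project's real definitions; only the shape of the flow (tap each chunk for metrics, pass it through unchanged, then aggregate) mirrors the PR.

```rust
// Illustrative sketch only: stand-in types, not the dynamo codebase.
use futures::executor::block_on;
use futures::stream::{self, StreamExt};

// Stand-in for an annotated response chunk carrying optional LLM metrics.
struct Annotated<T> {
    data: T,
    metrics: Option<LlmMetrics>,
}

#[derive(Clone, Copy)]
struct LlmMetrics {
    input_tokens: usize,
    output_tokens: usize,
}

// Stand-in for ResponseMetricCollector: observes metrics, never alters items.
#[derive(Default)]
struct MetricCollector {
    input_tokens: usize,
    output_tokens: usize,
}

impl MetricCollector {
    fn observe<T>(&mut self, item: &Annotated<T>) {
        if let Some(m) = item.metrics {
            self.input_tokens += m.input_tokens;
            self.output_tokens += m.output_tokens;
        }
    }
}

fn main() {
    let mut collector = MetricCollector::default();
    let chunks = stream::iter(vec![
        Annotated { data: "Hello", metrics: Some(LlmMetrics { input_tokens: 4, output_tokens: 1 }) },
        Annotated { data: " world", metrics: Some(LlmMetrics { input_tokens: 0, output_tokens: 1 }) },
    ]);

    // Tap each chunk for metrics and return it unchanged (the role of process_metrics_only).
    let tapped = chunks.map(|chunk| {
        collector.observe(&chunk);
        chunk
    });

    // Stand-in for from_annotated_stream: fold chunks into the final aggregated text.
    let full_text = block_on(tapped.fold(String::new(), |mut acc, chunk| async move {
        acc.push_str(chunk.data);
        acc
    }));

    assert_eq!(full_text, "Hello world");
    assert_eq!((collector.input_tokens, collector.output_tokens), (4, 2));
}
```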
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/llm/src/http/service/openai.rs (3)
293-299: Prefer StreamExt::inspect to tap metrics without altering stream items

Using map here returns the original item unmodified. inspect expresses the intent better and avoids unnecessary moves.
Two options:
- Option A (preferred): switch to inspect.
- Option B: if NvCreateCompletionResponse::from_annotated_stream does not already ignore LLM metrics annotations, keep map but also chomp the metrics event (parity with streaming) to prevent internal annotations from leaking into the aggregated JSON.
Option A:
```diff
-        // Process the stream to collect metrics for non-streaming requests
-        let stream = stream.map(move |response| {
-            // Process metrics but return the original response for aggregation
-            process_metrics_only(&response, &mut response_collector);
-            response
-        });
+        // Tap the stream to collect metrics for non-streaming requests without altering items
+        let stream = stream.inspect(move |response| {
+            process_metrics_only(response, &mut response_collector);
+        });
```

Option B (parity with streaming; remove metrics annotations from items):
```diff
-        // Process the stream to collect metrics for non-streaming requests
-        let stream = stream.map(move |response| {
-            // Process metrics but return the original response for aggregation
-            process_metrics_only(&response, &mut response_collector);
-            response
-        });
+        // Process metrics and chomp metrics annotations to avoid leaking internal events
+        let stream = stream.map(move |mut response| {
+            process_metrics_only(&response, &mut response_collector);
+            if response.event.as_deref() == Some(crate::preprocessor::ANNOTATION_LLM_METRICS) {
+                response.event = None;
+                response.comment = None;
+            }
+            response
+        });
```

Follow-up: Please confirm whether from_annotated_stream discards ANNOTATION_LLM_METRICS events so we can safely use Option A. If not, Option B maintains user-visible parity with the streaming path.
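As a standalone illustration of Option A's core point, here is a short sketch (assuming only the futures crate; the stream contents and token counts are invented) showing that inspect observes each item by reference for a side effect and yields it to the next stage untouched:

```rust
use futures::executor::block_on;
use futures::stream::{self, StreamExt};

fn main() {
    let mut total_tokens = 0usize;

    // Stand-in for the annotated response stream: per-chunk token counts.
    let responses = stream::iter(vec![3usize, 5, 2]);

    // inspect taps each item by reference; no move, no transformation.
    let tapped = responses.inspect(|tokens| {
        total_tokens += *tokens;
    });

    let collected: Vec<usize> = block_on(tapped.collect());
    assert_eq!(collected, vec![3, 5, 2]); // items reach aggregation unchanged
    assert_eq!(total_tokens, 10);         // metrics were still observed
}
```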
524-530: Mirror the non-streaming tap with inspect and consider chomping metrics annotations

Same rationale as in completions: inspect better conveys tapping for metrics; chomp if the folder doesn’t already ignore LLM metrics annotations.
Option A:
```diff
-        // Process the stream to collect metrics for non-streaming requests
-        let stream = stream.map(move |response| {
-            // Process metrics but return the original response for aggregation
-            process_metrics_only(&response, &mut response_collector);
-            response
-        });
+        // Tap the stream to collect metrics for non-streaming requests without altering items
+        let stream = stream.inspect(move |response| {
+            process_metrics_only(response, &mut response_collector);
+        });
```

Option B (parity with streaming; remove metrics annotations from items):
```diff
-        let stream = stream.map(move |response| {
-            // Process metrics but return the original response for aggregation
-            process_metrics_only(&response, &mut response_collector);
-            response
-        });
+        let stream = stream.map(move |mut response| {
+            process_metrics_only(&response, &mut response_collector);
+            if response.event.as_deref() == Some(crate::preprocessor::ANNOTATION_LLM_METRICS) {
+                response.event = None;
+                response.comment = None;
+            }
+            response
+        });
```

Please verify whether NvCreateChatCompletionResponse::from_annotated_stream intentionally drops internal metrics events; if not, Option B prevents them from surfacing in the final JSON.
926-936: Helper looks good; add a brief doc comment (and optionally #[inline])

Small readability nit: document intent and parity with streaming, since this is now shared across endpoints.
```diff
-fn process_metrics_only<T>(
+/// Tap LLMMetricAnnotation events to update response metrics without mutating or filtering the item.
+/// Used in non-streaming paths to collect token counts while preserving original stream items.
+#[inline]
+fn process_metrics_only<T>(
     annotated: &Annotated<T>,
     response_collector: &mut ResponseMetricCollector,
 ) {
     // update metrics
     if let Ok(Some(metrics)) = LLMMetricAnnotation::from_annotation(annotated) {
         response_collector.observe_current_osl(metrics.output_tokens);
         response_collector.observe_response(metrics.input_tokens, metrics.chunk_tokens);
     }
 }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/llm/src/http/service/openai.rs (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
@tedzhouhk What do you think about the CodeRabbit nit about inspect?
Good call! I didn't know we have inspect; let me modify and test.
Signed-off-by: Hannah Zhang <[email protected]>