Conversation


@hakkapelli commented Sep 10, 2025

Overview:

Adds request template support to the /v1/completions endpoint to match the existing /v1/chat/completions functionality. Templates now provide default values for model, temperature, and max_tokens when these are not specified in completion requests.

Details:

  • Modified handler_completions to parse JSON before deserialization and apply template values
  • Updated completions_router to accept template parameter
  • Added --request-template CLI argument to Python frontend
  • Template values applied before JSON validation, enabling minimal requests like {"prompt": "hello"} (see the example template below)
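
A minimal sketch of a template file (the model name and values are hypothetical; the keys mirror the RequestTemplate fields used in the diff: model, temperature, max_completion_tokens):

{
  "model": "my-model",
  "temperature": 0.7,
  "max_completion_tokens": 256
}

With such a template loaded, a minimal body like {"prompt": "hello"} would be filled out to {"prompt": "hello", "model": "my-model", "temperature": 0.7, "max_tokens": 256} before validation.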

Where should the reviewer start?

  • lib/llm/src/http/service/openai.rs - Main template application logic in handler_completions
  • components/frontend/src/dynamo/frontend/main.py - CLI argument addition

Related Issues:

Summary by CodeRabbit

  • New Features

    • Optional --request-template flag to load default request parameters (model, temperature, max tokens) from a JSON file.
    • Server now applies sensible defaults for missing or zeroed completion parameters.
    • Streaming responses include usage details for reasoning tokens.
  • Bug Fixes

    • Clearer 400 errors for invalid JSON and invalid request formats in completions requests.

- Add reasoning_token_count field to DeltaGenerator struct
- Implement update_reasoning_tokens() method for accurate token counting
- Populate completion_tokens_details.reasoning_tokens in usage statistics
- Support both streaming and non-streaming responses
- Enable observability of reasoning token usage for reasoning models

Fixes ai-dynamo#2941

Signed-off-by: Harshith Akkapelli <[email protected]>
Implements request template functionality for /v1/completions endpoint
to match existing /v1/chat/completions behavior. Templates now provide
default values for model, temperature, and max_tokens when not specified
in completion requests.

Changes:
- Modified handler_completions to parse JSON before deserialization
- Added template application logic for completion requests
- Updated completions_router to accept template parameter
- Added --request-template CLI argument to Python frontend
- Template values are applied before JSON validation, allowing
  minimal requests like {"prompt": "hello"} to work with templates

Fixes ai-dynamo#2494

Signed-off-by: Harshith Akkapelli <[email protected]>
@hakkapelli requested review from a team as code owners September 10, 2025 09:25

copy-pr-bot bot commented Sep 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi hakkapelli! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

github-actions bot added the external-contribution (Pull request is from an external contributor) label Sep 10, 2025
Signed-off-by: Harshith Akkapelli <[email protected]>
pull-request-size bot added size/M and removed size/L labels Sep 10, 2025

coderabbitai bot commented Sep 10, 2025

Walkthrough

Adds an optional --request-template CLI flag to pass a template file into engine startup, threads RequestTemplate through HTTP routing for completions/chat/responses, applies template defaults to completion requests with improved input error handling, and augments chat streaming with internal reasoning token counting surfaced in usage details.

Changes

  • Frontend CLI templating flag (components/frontend/src/dynamo/frontend/main.py): Adds --request-template (pathlib.Path) to parse_args; passed as the template_file kwarg to EntrypointArgs for optional default request parameters.
  • HTTP routing & request templating for OpenAI endpoints (lib/llm/src/http/service/openai.rs, lib/llm/src/http/service/service_v2.rs): Threads RequestTemplate through the routers: updated signatures for completions_router, chat_completions_router, and responses_router; stores (state, template) in router state. The completions handler now reads the raw body, parses JSON, applies template defaults (model/temperature/max_tokens), improves 400 error texts, logs receipt, and forwards the template into processing.
  • Reasoning token observability in chat streaming (lib/llm/src/protocols/openai/chat_completions/delta.rs): Adds reasoning_token_count to DeltaGenerator; introduces a private update_reasoning_tokens(...); updates the count during streaming/post-processing when reasoning content is present; includes reasoning_tokens in completion usage details when > 0.
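
To illustrate the (state, template) tuple-state wiring summarized above, here is a minimal hypothetical sketch in the axum style this service uses; service_v2::State, RequestTemplate, and handler_completions are names taken from the PR, and everything else is illustrative rather than the actual implementation:

use std::sync::Arc;
use axum::{routing::post, Router};

// Hypothetical sketch: the completions router carries (state, template) as its
// combined state so handler_completions can extract both via State(..).
fn completions_router(
    state: Arc<service_v2::State>,
    template: Option<RequestTemplate>,
) -> Router {
    Router::new()
        .route("/v1/completions", post(handler_completions))
        .with_state((state, template))
}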

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as Frontend CLI
  participant Svc as service_v2 Router
  participant OAI as OpenAI Routers
  participant H as Completions Handler
  participant T as RequestTemplate
  participant Eng as Engine Task

  User->>CLI: run with --request-template=path.json (optional)
  CLI->>Svc: start service with template_file (if provided)
  Svc->>OAI: build routers with (state, template)
  User->>OAI: POST /v1/completions (JSON body)
  OAI->>H: invoke handler with raw body + template from state
  H->>H: parse JSON or return 400 "Invalid JSON: …"
  H->>T: apply defaults (model, temperature, max_tokens) when missing/zero
  H->>H: deserialize to NvCreateCompletionRequest or 400 "Invalid request format: …"
  H->>Eng: spawn/await completions task
  Eng-->>H: result or error
  H-->>User: response
  note right of H: Logs "Received completions request: …"
sequenceDiagram
  autonumber
  participant Gen as DeltaGenerator
  participant Stream as Stream Loop
  participant Usage as Usage Details

  Stream->>Gen: emit delta with content + optional reasoning_text
  alt reasoning content present
    Gen->>Gen: update_reasoning_tokens(reasoning_text, total_text, token_ids)
  end
  Stream->>Usage: assemble completion_tokens_details
  alt reasoning_token_count > 0
    Usage->>Usage: include reasoning_tokens
  end
  Usage-->>Stream: usage snapshot per choice

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks (3 passed, 2 warnings)

❌ Failed checks (2 warnings)
  • Title Check: ⚠️ Warning. The current title "Feat/completion request template 2494" includes an issue number and uses unconventional slash notation, which makes it less clear and concise than the standard imperative sentence form expected for PR titles. It references the feature but does not succinctly describe the primary change from the developer's perspective. Resolution: Rename the title to a clear, concise imperative sentence that highlights the main change, for example "Add request template support to /v1/completions endpoint."
  • Out of Scope Changes Check: ⚠️ Warning. The PR introduces changes in chat_completions/delta.rs to add reasoning token observability, which is unrelated to request_template support for the completions endpoint and falls outside the scope defined by issue #2494. Resolution: Extract the reasoning token observability additions into a separate PR so this one remains focused solely on implementing request templates for the completions endpoint.
✅ Passed checks (3 passed)
  • Linked Issues Check: ✅ Passed. The PR implements all coding requirements from issue #2494 by updating handler_completions to apply template defaults, modifying completions_router signatures to accept a template parameter, and adding the --request-template CLI flag in the Python frontend, thereby honoring request templates in the /v1/completions endpoint.
  • Description Check: ✅ Passed. The description follows the repository template by including an Overview, Details, review entry points, and a Related Issues section with the appropriate action keyword "Fixes #2494," thereby covering all required sections and providing clear information for reviewers.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

Poem

A rabbit taps keys with a twitch of delight,
Templates now whisper defaults in the night.
Completions comply, with tidy attire,
Reasoning tokens dance by the fire.
Routers pass notes, handlers make sense—
Hop! The requests are now less intense.
Thump-thump: shipped with confidence.


@coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (6)
components/frontend/src/dynamo/frontend/main.py (1)

203-208: Validate template path and clarify help text

  • Add a validator to ensure the file exists and is JSON.
  • Clarify “max_tokens vs max_completion_tokens” to avoid confusion across endpoints.
-    parser.add_argument(
-        "--request-template",
-        type=pathlib.Path,
-        default=None,
-        help="Path to JSON file containing default request parameters (model, temperature, max_completion_tokens).",
-    )
+    parser.add_argument(
+        "--request-template",
+        type=validate_request_template_path,
+        default=None,
+        help="Path to JSON file with default request parameters (model, temperature, max_tokens/max_completion_tokens).",
+    )

Add this helper (outside the shown hunk) to mirror existing validators:

def validate_request_template_path(value: str) -> pathlib.Path:
    # argparse passes the raw CLI string; convert and validate it here.
    p = pathlib.Path(value)
    if not p.is_file():
        raise argparse.ArgumentTypeError(f"--request-template must point to an existing file, got: {value}")
    if p.suffix.lower() != ".json":
        raise argparse.ArgumentTypeError(f"--request-template must be a .json file, got: {value}")
    return p
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)

138-158: Make token accounting safer (avoid overflows, skip zero-token deltas)

  • Skip computation when token_ids is empty to save work.
  • Use saturating_add for defensive accumulation.
-    fn update_reasoning_tokens(&mut self, reasoning_text: &str, total_text: &str, token_ids: &[u32]) {
-        if !reasoning_text.is_empty() && !total_text.is_empty() {
+    fn update_reasoning_tokens(&mut self, reasoning_text: &str, total_text: &str, token_ids: &[u32]) {
+        if token_ids.is_empty() { return; }
+        if !reasoning_text.is_empty() && !total_text.is_empty() {
             // Estimate reasoning tokens based on the proportion of reasoning text to total text
             // This is more accurate than word counting alone
             let reasoning_chars = reasoning_text.chars().count() as f32;
             let total_chars = total_text.chars().count() as f32;
 
             if total_chars > 0.0 {
                 let reasoning_proportion = reasoning_chars / total_chars;
                 let estimated_reasoning_tokens = (token_ids.len() as f32 * reasoning_proportion).ceil() as u32;
-                self.reasoning_token_count += estimated_reasoning_tokens;
+                self.reasoning_token_count = self.reasoning_token_count.saturating_add(estimated_reasoning_tokens);
             }
         }
     }
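
As a quick sanity check of the proportional estimate above (numbers hypothetical):

// 120 reasoning chars out of 200 total chars, 50 tokens in the delta:
// ceil(50 * (120 / 200)) = ceil(30.0) = 30 estimated reasoning tokens.
let estimated = (50_f32 * (120.0 / 200.0)).ceil() as u32;
assert_eq!(estimated, 30);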

378-383: Micro-guard to avoid unnecessary work when no tokens emitted

Saves char counting on empty-token deltas.

-        if let Some(ref reasoning_text) = reasoning_content {
-            let total_text = delta.text.as_deref().unwrap_or("");
-            self.update_reasoning_tokens(reasoning_text, total_text, &delta.token_ids);
-        }
+        if let Some(ref reasoning_text) = reasoning_content {
+            if !delta.token_ids.is_empty() {
+                let total_text = delta.text.as_deref().unwrap_or("");
+                self.update_reasoning_tokens(reasoning_text, total_text, &delta.token_ids);
+            }
+        }
lib/llm/src/http/service/openai.rs (3)

214-217: Prefer Bytes extractor to avoid extra copy and enable zero-copy JSON parsing

Using String forces an extra allocation and UTF-8 validation. Bytes + from_slice is leaner.

-async fn handler_completions(
-    State((state, template)): State<(Arc<service_v2::State>, Option<RequestTemplate>)>,
-    headers: HeaderMap,
-    body: String,
-) -> Result<Response, ErrorResponse> {
+async fn handler_completions(
+    State((state, template)): State<(Arc<service_v2::State>, Option<RequestTemplate>)>,
+    headers: HeaderMap,
+    body: axum::body::Bytes,
+) -> Result<Response, ErrorResponse> {

221-231: Parse from bytes, not string

Follow-up to the Bytes change: use serde_json::from_slice.

-    let mut json_value: serde_json::Value = serde_json::from_str(&body)
+    let mut json_value: serde_json::Value = serde_json::from_slice(&body)
         .map_err(|e| {
             (
                 StatusCode::BAD_REQUEST,
                 Json(ErrorMessage {
                     error: format!("Invalid JSON: {}", e),
                 }),
             )
         })?;

292-292: Unused template in completions()

_template isn’t used. Either drop the parameter and the corresponding argument at the call site, or leave as-is if you plan to use it soon. Minor clarity win to remove now.

-async fn completions(
-    state: Arc<service_v2::State>,
-    _template: Option<RequestTemplate>,
+async fn completions(
+    state: Arc<service_v2::State>,
     request: Context<NvCreateCompletionRequest>,
     stream_handle: ConnectionHandle,
 ) -> Result<Response, ErrorResponse> {

And at the call site:

-    let response = tokio::spawn(completions(state, template, request, stream_handle).in_current_span())
+    let response = tokio::spawn(completions(state, request, stream_handle).in_current_span())
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 905c920 and 4c62751.

📒 Files selected for processing (4)
  • components/frontend/src/dynamo/frontend/main.py (2 hunks)
  • lib/llm/src/http/service/openai.rs (4 hunks)
  • lib/llm/src/http/service/service_v2.rs (1 hunks)
  • lib/llm/src/protocols/openai/chat_completions/delta.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions/delta.rs
🧬 Code graph analysis (2)
components/frontend/src/dynamo/frontend/main.py (1)
lib/llm/src/local_model.rs (3)
  • default (67-89)
  • request_template (154-157)
  • request_template (350-352)
lib/llm/src/http/service/service_v2.rs (2)
lib/llm/src/http/service/openai.rs (1)
  • completions_router (1067-1079)
lib/llm/src/local_model.rs (2)
  • request_template (154-157)
  • request_template (350-352)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2984/merge) by hakkapelli.
lib/llm/src/http/service/openai.rs

[error] 228-236: Trailing whitespace detected by pre-commit 'trailing-whitespace' hook; file was auto-fixed. CI exited with code 1.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (6)
components/frontend/src/dynamo/frontend/main.py (1)

271-272: No change needed for template_file. PyO3's Option<PathBuf> parameters accept Python str, os.PathLike, and pathlib.Path natively, so passing flags.request_template directly is safe and casting to str is unnecessary. (pyo3.rs)

lib/llm/src/protocols/openai/chat_completions/delta.rs (1)

60-61: LGTM: field added and initialized

The dedicated reasoning_token_count with zero-init is straightforward and contained.

Also applies to: 123-127

lib/llm/src/http/service/service_v2.rs (1)

371-375: LGTM – RequestTemplate passed through completions_router for parity. Matches chat/responses routing and enables templated defaults on /v1/completions; no remaining outdated call sites found.

lib/llm/src/http/service/openai.rs (3)

253-262: Clear 400 mapping for deserialization errors — looks good

The “Invalid request format: …” 400 is appropriate after templating and before engine invocation.


273-281: Tokio spawn with span propagation — LGTM

Spawning the task with in_current_span() and a tailored await error message is correct.


1069-1077: Router state wiring with template — LGTM

Passing (state, template) and setting a body limit here aligns with the handler signature change.

Comment on lines +232 to +250
    // Apply template values to JSON before deserialization
    if let Some(template) = &template {
        if let Some(obj) = json_value.as_object_mut() {
            // Apply model from template if not present or empty
            if !obj.contains_key("model") || obj.get("model").and_then(|v| v.as_str()).map_or(true, |s| s.is_empty()) {
                obj.insert("model".to_string(), serde_json::Value::String(template.model.clone()));
            }

            // Apply temperature from template if not present or zero
            if !obj.contains_key("temperature") || obj.get("temperature").and_then(|v| v.as_f64()).map_or(true, |t| t == 0.0) {
                obj.insert("temperature".to_string(), serde_json::Value::Number(serde_json::Number::from_f64(template.temperature as f64).unwrap()));
            }

            // Apply max_tokens from template if not present or zero
            if !obj.contains_key("max_tokens") || obj.get("max_tokens").and_then(|v| v.as_u64()).map_or(true, |t| t == 0) {
                obj.insert("max_tokens".to_string(), serde_json::Value::Number(serde_json::Number::from(template.max_completion_tokens)));
            }
        }
    }

💡 Verification agent

❓ Verification inconclusive

Template application: avoid overriding explicit values and remove potential panic

  • Don’t override user-specified temperature/max_tokens when explicitly set to 0; 0.0 can be a deliberate choice. Current chat path treats 0.0 as “unset”; confirm if that coupling is intentional. If not, only apply when the field is missing or null.
  • Number::from_f64(...).unwrap() can panic for NaN/Infinity. Handle None defensively.
-        if let Some(obj) = json_value.as_object_mut() {
-            // Apply model from template if not present or empty
-            if !obj.contains_key("model") || obj.get("model").and_then(|v| v.as_str()).map_or(true, |s| s.is_empty()) {
-                obj.insert("model".to_string(), serde_json::Value::String(template.model.clone()));
-            }
-            
-            // Apply temperature from template if not present or zero
-            if !obj.contains_key("temperature") || obj.get("temperature").and_then(|v| v.as_f64()).map_or(true, |t| t == 0.0) {
-                obj.insert("temperature".to_string(), serde_json::Value::Number(serde_json::Number::from_f64(template.temperature as f64).unwrap()));
-            }
-            
-            // Apply max_tokens from template if not present or zero
-            if !obj.contains_key("max_tokens") || obj.get("max_tokens").and_then(|v| v.as_u64()).map_or(true, |t| t == 0) {
-                obj.insert("max_tokens".to_string(), serde_json::Value::Number(serde_json::Number::from(template.max_completion_tokens)));
-            }
-        }
+        if let Some(obj) = json_value.as_object_mut() {
+            // model: apply if missing, null, or blank
+            if obj
+                .get("model")
+                .and_then(|v| v.as_str())
+                .map(|s| s.trim().is_empty())
+                .unwrap_or(true)
+            {
+                obj.insert("model".into(), serde_json::Value::String(template.model.clone()));
+            }
+
+            // temperature: apply only if missing or null (do not override explicit 0.0)
+            if obj.get("temperature").map(|v| v.is_null()).unwrap_or(true) {
+                if let Some(num) = serde_json::Number::from_f64(template.temperature as f64) {
+                    obj.insert("temperature".into(), serde_json::Value::Number(num));
+                } else {
+                    tracing::warn!("Ignoring non-finite template.temperature={}", template.temperature);
+                }
+            }
+
+            // max_tokens: apply only if missing or null (do not override explicit 0)
+            if obj.get("max_tokens").map(|v| v.is_null()).unwrap_or(true) {
+                obj.insert(
+                    "max_tokens".into(),
+                    serde_json::Value::Number(serde_json::Number::from(template.max_completion_tokens as u64)),
+                );
+            }
+        }

🤖 Prompt for AI Agents
lib/llm/src/http/service/openai.rs around lines 232-250: change the template
application logic so temperature and max_tokens are only applied when the JSON
object does not contain the field or the field is null (do not treat explicit
0.0 or 0 as absent), and replace the serde_json::Number::from_f64(...).unwrap()
with a safe handling pattern (e.g. if let Some(num) = Number::from_f64(...) {
insert num } else { log or return an error/skip insertion }) to avoid panics on
NaN/Infinity.
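
To make the zero-handling concern concrete with a hypothetical request: a body of {"prompt": "hi", "temperature": 0.0} currently gets its explicit 0.0 overwritten by the template's temperature, because the map_or(true, |t| t == 0.0) check treats zero as unset.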

    check_ready(&state)?;

    let request_id = request.id().to_string();
    tracing::trace!("Received completions request: {:?}", request.inner);

⚠️ Potential issue

Avoid logging full request bodies (PII/secrets risk)

{:?} request.inner can leak prompts and user data. Log only safe metadata (e.g., request_id/model/stream flags) as done in chat with content()-style logging.

-    tracing::trace!("Received completions request: {:?}", request.inner);
+    // Avoid logging prompts/PII; keep metadata-only.
+    tracing::trace!("Received completions request: request_id={}", request_id);

If you want model/stream info too, move this log below the model and streaming derivations and log:

tracing::trace!("Received completions request: request_id={}, model={}, streaming={}", request_id, model, streaming);
🤖 Prompt for AI Agents
In lib/llm/src/http/service/openai.rs around line 300, the current trace log
prints the full request body (tracing::trace!("Received completions request:
{:?}", request.inner)), which can leak prompts/PII; replace this by removing the
full request dump and instead log only safe metadata after you derive
request_id, model, and streaming: move the trace call to below those derivations
and emit a trace like "Received completions request: request_id={}, model={},
streaming={}" (populate with the local variables) so no request.inner or prompt
content is logged.

Comment on lines 285 to 291
            // Update completion_tokens_details with reasoning tokens if we have any
            if self.reasoning_token_count > 0 {
                // Preserve existing completion_tokens_details if they exist, or create new ones
                let mut details = usage.completion_tokens_details.unwrap_or_default();
                details.reasoning_tokens = Some(self.reasoning_token_count);
                usage.completion_tokens_details = Some(details);
            }

🛠️ Refactor suggestion

Clamp reasoning_tokens to completion_tokens

Prevents reporting more reasoning_tokens than completion_tokens due to proportional rounding across chunks.

-            // Update completion_tokens_details with reasoning tokens if we have any
-            if self.reasoning_token_count > 0 {
-                // Preserve existing completion_tokens_details if they exist, or create new ones
-                let mut details = usage.completion_tokens_details.unwrap_or_default();
-                details.reasoning_tokens = Some(self.reasoning_token_count);
-                usage.completion_tokens_details = Some(details);
-            }
+            // Update completion_tokens_details with reasoning tokens if we have any
+            if self.reasoning_token_count > 0 {
+                let mut details = usage.completion_tokens_details.unwrap_or_default();
+                let clamped = self.reasoning_token_count.min(usage.completion_tokens);
+                details.reasoning_tokens = Some(clamped);
+                usage.completion_tokens_details = Some(details);
+            }
🤖 Prompt for AI Agents
In lib/llm/src/protocols/openai/chat_completions/delta.rs around lines 285 to
291, clamp reasoning_tokens so it never exceeds completion_tokens: when building
or updating completion_tokens_details, read completion_tokens from usage
(fallback 0 if None), compute the minimum of self.reasoning_token_count and that
completion_tokens value, and assign that min value to details.reasoning_tokens
before storing details back on usage.

     State((state, template)): State<(Arc<service_v2::State>, Option<RequestTemplate>)>,
     headers: HeaderMap,
-    Json(request): Json<NvCreateCompletionRequest>,
+    body: String,
@grahamking commented Sep 10, 2025

Why this change?

Could you use the same approach as chat completions, modifying request.inner.temperature etc?

kwargs["namespace"] = flags.namespace
if flags.request_template:
kwargs["template_file"] = flags.request_template


In case the rest of the PR doesn't land, I would be happy for us to take just this file. It brings the python CLI closer to dynamo-run (https://github.com/ai-dynamo/dynamo/blob/main/launch/dynamo-run/src/flags.rs#L139), and that flag is already useful for chat completions.

@github-actions

This PR is stale because it has been open 30 days with no activity. Remove the Stale label or add a comment, or this will be closed in 5 days.

github-actions bot added the Stale label Oct 11, 2025
@github-actions

This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information.

github-actions bot closed this Oct 16, 2025

Labels

external-contribution (Pull request is from an external contributor), size/M, Stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: request_template should be applied to completion endpoint as well

2 participants