Feat/completion request template 2494 #2984
Conversation
- Add reasoning_token_count field to DeltaGenerator struct
- Implement update_reasoning_tokens() method for accurate token counting
- Populate completion_tokens_details.reasoning_tokens in usage statistics
- Support both streaming and non-streaming responses
- Enable observability of reasoning token usage for reasoning models

Fixes ai-dynamo#2941
Signed-off-by: Harshith Akkapelli <[email protected]>
Implements request template functionality for /v1/completions endpoint
to match existing /v1/chat/completions behavior. Templates now provide
default values for model, temperature, and max_tokens when not specified
in completion requests.
Changes:
- Modified handler_completions to parse JSON before deserialization
- Added template application logic for completion requests
- Updated completions_router to accept template parameter
- Added --request-template CLI argument to Python frontend
- Template values are applied before JSON validation, allowing
minimal requests like {"prompt": "hello"} to work with templates
Fixes ai-dynamo#2494
Signed-off-by: Harshith Akkapelli [email protected]
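As a rough illustration of the intended behavior, the sketch below merges template defaults into a minimal completion request with serde_json. It is not the PR's code: the field names (model, temperature, max_completion_tokens) come from the diff under review, but the real handler also treats empty or zero values as unset.

```rust
use serde_json::{json, Value};

// Hypothetical stand-in for the template loaded from --request-template.
struct RequestTemplate {
    model: String,
    temperature: f32,
    max_completion_tokens: u32,
}

// Fill in model/temperature/max_tokens only when the request omits them.
fn apply_template(mut body: Value, template: &RequestTemplate) -> Value {
    if let Some(obj) = body.as_object_mut() {
        obj.entry("model")
            .or_insert_with(|| Value::String(template.model.clone()));
        obj.entry("temperature")
            .or_insert_with(|| json!(template.temperature));
        obj.entry("max_tokens")
            .or_insert_with(|| json!(template.max_completion_tokens));
    }
    body
}

fn main() {
    let template = RequestTemplate {
        model: "example-model".to_string(),
        temperature: 0.7,
        max_completion_tokens: 256,
    };
    // A minimal request like {"prompt": "hello"} picks up the template defaults.
    let merged = apply_template(json!({"prompt": "hello"}), &template);
    println!("{merged}");
}
```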
👋 Hi hakkapelli! Thank you for contributing to ai-dynamo/dynamo.
Signed-off-by: Harshith Akkapelli <[email protected]>
Walkthrough
Adds an optional --request-template CLI flag to pass a template file into engine startup, threads RequestTemplate through HTTP routing for completions/chat/responses, applies template defaults to completion requests with improved input error handling, and augments chat streaming with internal reasoning token counting surfaced in usage details.

Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
actor User
participant CLI as Frontend CLI
participant Svc as service_v2 Router
participant OAI as OpenAI Routers
participant H as Completions Handler
participant T as RequestTemplate
participant Eng as Engine Task
User->>CLI: run with --request-template=path.json (optional)
CLI->>Svc: start service with template_file (if provided)
Svc->>OAI: build routers with (state, template)
User->>OAI: POST /v1/completions (JSON body)
OAI->>H: invoke handler with raw body + template from state
H->>H: parse JSON or return 400 "Invalid JSON: …"
H->>T: apply defaults (model, temperature, max_tokens) when missing/zero
H->>H: deserialize to NvCreateCompletionRequest or 400 "Invalid request format: …"
H->>Eng: spawn/await completions task
Eng-->>H: result or error
H-->>User: response
note right of H: Logs "Received completions request: …"
sequenceDiagram
autonumber
participant Gen as DeltaGenerator
participant Stream as Stream Loop
participant Usage as Usage Details
Stream->>Gen: emit delta with content + optional reasoning_text
alt reasoning content present
Gen->>Gen: update_reasoning_tokens(reasoning_text, total_text, token_ids)
end
Stream->>Usage: assemble completion_tokens_details
alt reasoning_token_count > 0
Usage->>Usage: include reasoning_tokens
end
Usage-->>Stream: usage snapshot per choice
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Pre-merge checks (3 passed, 2 warnings)
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 3
🧹 Nitpick comments (6)
components/frontend/src/dynamo/frontend/main.py (1)
203-208: Validate template path and clarify help text
- Add a validator to ensure the file exists and is JSON.
- Clarify “max_tokens vs max_completion_tokens” to avoid confusion across endpoints.
- parser.add_argument(
-     "--request-template",
-     type=pathlib.Path,
-     default=None,
-     help="Path to JSON file containing default request parameters (model, temperature, max_completion_tokens).",
- )
+ parser.add_argument(
+     "--request-template",
+     type=validate_request_template_path,
+     default=None,
+     help="Path to JSON file with default request parameters (model, temperature, max_tokens/max_completion_tokens).",
+ )

Add this helper (outside the shown hunk) to mirror existing validators:
def validate_request_template_path(value: pathlib.Path):
    p = pathlib.Path(value)
    if not p.is_file():
        raise argparse.ArgumentTypeError(f"--request-template must point to an existing file, got: {value}")
    if p.suffix.lower() != ".json":
        raise argparse.ArgumentTypeError(f"--request-template must be a .json file, got: {value}")
    return p

lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
138-158: Make token accounting safer (avoid overflows, skip zero-token deltas)
- Skip computation when token_ids is empty to save work.
- Use saturating_add for defensive accumulation.
- fn update_reasoning_tokens(&mut self, reasoning_text: &str, total_text: &str, token_ids: &[u32]) {
-     if !reasoning_text.is_empty() && !total_text.is_empty() {
+ fn update_reasoning_tokens(&mut self, reasoning_text: &str, total_text: &str, token_ids: &[u32]) {
+     if token_ids.is_empty() {
+         return;
+     }
+     if !reasoning_text.is_empty() && !total_text.is_empty() {
          // Estimate reasoning tokens based on the proportion of reasoning text to total text
          // This is more accurate than word counting alone
          let reasoning_chars = reasoning_text.chars().count() as f32;
          let total_chars = total_text.chars().count() as f32;
          if total_chars > 0.0 {
              let reasoning_proportion = reasoning_chars / total_chars;
              let estimated_reasoning_tokens = (token_ids.len() as f32 * reasoning_proportion).ceil() as u32;
-             self.reasoning_token_count += estimated_reasoning_tokens;
+             self.reasoning_token_count = self.reasoning_token_count.saturating_add(estimated_reasoning_tokens);
          }
      }
  }
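To illustrate the proportional estimate: a delta carrying 10 token IDs whose reasoning text is 120 of 200 total characters would add ceil(10 × 0.6) = 6 to reasoning_token_count.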
378-383: Micro-guard to avoid unnecessary work when no tokens emitted
Saves char counting on empty-token deltas.
- if let Some(ref reasoning_text) = reasoning_content {
-     let total_text = delta.text.as_deref().unwrap_or("");
-     self.update_reasoning_tokens(reasoning_text, total_text, &delta.token_ids);
- }
+ if let Some(ref reasoning_text) = reasoning_content {
+     if !delta.token_ids.is_empty() {
+         let total_text = delta.text.as_deref().unwrap_or("");
+         self.update_reasoning_tokens(reasoning_text, total_text, &delta.token_ids);
+     }
+ }

lib/llm/src/http/service/openai.rs (3)
214-217: Prefer Bytes extractor to avoid extra copy and enable zero-copy JSON parsing
Using String forces an extra allocation and UTF-8 validation. Bytes + from_slice is leaner.

- async fn handler_completions(
-     State((state, template)): State<(Arc<service_v2::State>, Option<RequestTemplate>)>,
-     headers: HeaderMap,
-     body: String,
- ) -> Result<Response, ErrorResponse> {
+ async fn handler_completions(
+     State((state, template)): State<(Arc<service_v2::State>, Option<RequestTemplate>)>,
+     headers: HeaderMap,
+     body: axum::body::Bytes,
+ ) -> Result<Response, ErrorResponse> {
221-231: Parse from bytes, not string
Follow-up to the Bytes change: use serde_json::from_slice.

-     let mut json_value: serde_json::Value = serde_json::from_str(&body)
+     let mut json_value: serde_json::Value = serde_json::from_slice(&body)
          .map_err(|e| {
              (
                  StatusCode::BAD_REQUEST,
                  Json(ErrorMessage {
                      error: format!("Invalid JSON: {}", e),
                  }),
              )
          })?;
292-292: Unused template in completions()
_template isn't used. Either drop the parameter and the corresponding argument at the call site, or leave as-is if you plan to use it soon. Minor clarity win to remove now.

- async fn completions(
-     state: Arc<service_v2::State>,
-     _template: Option<RequestTemplate>,
+ async fn completions(
+     state: Arc<service_v2::State>,
      request: Context<NvCreateCompletionRequest>,
      stream_handle: ConnectionHandle,
  ) -> Result<Response, ErrorResponse> {

And at the call site:

- let response = tokio::spawn(completions(state, template, request, stream_handle).in_current_span())
+ let response = tokio::spawn(completions(state, request, stream_handle).in_current_span())
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- components/frontend/src/dynamo/frontend/main.py (2 hunks)
- lib/llm/src/http/service/openai.rs (4 hunks)
- lib/llm/src/http/service/service_v2.rs (1 hunks)
- lib/llm/src/protocols/openai/chat_completions/delta.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/delta.rs
🧬 Code graph analysis (2)
components/frontend/src/dynamo/frontend/main.py (1)
lib/llm/src/local_model.rs (3)
- default (67-89)
- request_template (154-157)
- request_template (350-352)
lib/llm/src/http/service/service_v2.rs (2)
lib/llm/src/http/service/openai.rs (1)
- completions_router (1067-1079)

lib/llm/src/local_model.rs (2)
- request_template (154-157)
- request_template (350-352)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2984/merge) by hakkapelli.
lib/llm/src/http/service/openai.rs
[error] 228-236: Trailing whitespace detected by pre-commit 'trailing-whitespace' hook; file was auto-fixed. CI exited with code 1.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (6)
components/frontend/src/dynamo/frontend/main.py (1)
271-272: No change needed for template_file
Pyo3's Option<PathBuf> parameters accept Python str, os.PathLike, and pathlib.Path natively, so passing flags.request_template directly is safe and casting to str is unnecessary. (pyo3.rs)

lib/llm/src/protocols/openai/chat_completions/delta.rs (1)
60-61: LGTM: field added and initialized
The dedicated reasoning_token_count with zero-init is straightforward and contained.
Also applies to: 123-127
lib/llm/src/http/service/service_v2.rs (1)
371-375: LGTM – RequestTemplate passed through completions_router for parity. Matches chat/responses routing and enables templated defaults on /v1/completions; no remaining outdated call sites found.

lib/llm/src/http/service/openai.rs (3)
253-262: Clear 400 mapping for deserialization errors — looks good
The "Invalid request format: …" 400 is appropriate after templating and before engine invocation.
273-281: Tokio spawn with span propagation — LGTM
Spawning the task with in_current_span() and a tailored await error message is correct.
1069-1077: Router state wiring with template — LGTM
Passing (state, template) and setting a body limit here aligns with the handler signature change.
// Apply template values to JSON before deserialization
if let Some(template) = &template {
    if let Some(obj) = json_value.as_object_mut() {
        // Apply model from template if not present or empty
        if !obj.contains_key("model") || obj.get("model").and_then(|v| v.as_str()).map_or(true, |s| s.is_empty()) {
            obj.insert("model".to_string(), serde_json::Value::String(template.model.clone()));
        }

        // Apply temperature from template if not present or zero
        if !obj.contains_key("temperature") || obj.get("temperature").and_then(|v| v.as_f64()).map_or(true, |t| t == 0.0) {
            obj.insert("temperature".to_string(), serde_json::Value::Number(serde_json::Number::from_f64(template.temperature as f64).unwrap()));
        }

        // Apply max_tokens from template if not present or zero
        if !obj.contains_key("max_tokens") || obj.get("max_tokens").and_then(|v| v.as_u64()).map_or(true, |t| t == 0) {
            obj.insert("max_tokens".to_string(), serde_json::Value::Number(serde_json::Number::from(template.max_completion_tokens)));
        }
    }
}
💡 Verification agent
❓ Verification inconclusive
Template application: avoid overriding explicit values and remove potential panic
- Don't override user-specified temperature/max_tokens when explicitly set to 0; 0.0 can be a deliberate choice. The current chat path treats 0.0 as "unset"; confirm whether that coupling is intentional. If not, only apply when the field is missing or null.
- Number::from_f64(...).unwrap() can panic for NaN/Infinity. Handle None defensively.
- if let Some(obj) = json_value.as_object_mut() {
- // Apply model from template if not present or empty
- if !obj.contains_key("model") || obj.get("model").and_then(|v| v.as_str()).map_or(true, |s| s.is_empty()) {
- obj.insert("model".to_string(), serde_json::Value::String(template.model.clone()));
- }
-
- // Apply temperature from template if not present or zero
- if !obj.contains_key("temperature") || obj.get("temperature").and_then(|v| v.as_f64()).map_or(true, |t| t == 0.0) {
- obj.insert("temperature".to_string(), serde_json::Value::Number(serde_json::Number::from_f64(template.temperature as f64).unwrap()));
- }
-
- // Apply max_tokens from template if not present or zero
- if !obj.contains_key("max_tokens") || obj.get("max_tokens").and_then(|v| v.as_u64()).map_or(true, |t| t == 0) {
- obj.insert("max_tokens".to_string(), serde_json::Value::Number(serde_json::Number::from(template.max_completion_tokens)));
- }
- }
+ if let Some(obj) = json_value.as_object_mut() {
+ // model: apply if missing, null, or blank
+ if obj
+ .get("model")
+ .and_then(|v| v.as_str())
+ .map(|s| s.trim().is_empty())
+ .unwrap_or(true)
+ {
+ obj.insert("model".into(), serde_json::Value::String(template.model.clone()));
+ }
+
+ // temperature: apply only if missing or null (do not override explicit 0.0)
+ if obj.get("temperature").map(|v| v.is_null()).unwrap_or(true) {
+ if let Some(num) = serde_json::Number::from_f64(template.temperature as f64) {
+ obj.insert("temperature".into(), serde_json::Value::Number(num));
+ } else {
+ tracing::warn!("Ignoring non-finite template.temperature={}", template.temperature);
+ }
+ }
+
+ // max_tokens: apply only if missing or null (do not override explicit 0)
+ if obj.get("max_tokens").map(|v| v.is_null()).unwrap_or(true) {
+ obj.insert(
+ "max_tokens".into(),
+ serde_json::Value::Number(serde_json::Number::from(template.max_completion_tokens as u64)),
+ );
+ }
+ }

Avoid overriding explicit zero values and guard potential panic
- Apply temperature and max_tokens only when the fields are missing or null; an explicit 0.0 or 0 may be a deliberate choice.
- Replace .unwrap() on serde_json::Number::from_f64(…) (which returns None for NaN/Infinity) with an if let Some(num) = … or equivalent fallback, and log or error on non-finite values to prevent panics. (docs.rs)
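As a quick standalone illustration of that panic risk (not project code):

```rust
fn main() {
    // serde_json::Number::from_f64 returns None for non-finite input,
    // so unwrapping it panics on NaN or infinity.
    assert!(serde_json::Number::from_f64(0.7).is_some());
    assert!(serde_json::Number::from_f64(f64::NAN).is_none());
    assert!(serde_json::Number::from_f64(f64::INFINITY).is_none());
}
```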
🤖 Prompt for AI Agents
lib/llm/src/http/service/openai.rs around lines 232-250: change the template
application logic so temperature and max_tokens are only applied when the JSON
object does not contain the field or the field is null (do not treat explicit
0.0 or 0 as absent), and replace the serde_json::Number::from_f64(...).unwrap()
with a safe handling pattern (e.g. if let Some(num) = Number::from_f64(...) {
insert num } else { log or return an error/skip insertion }) to avoid panics on
NaN/Infinity.
check_ready(&state)?;

let request_id = request.id().to_string();
tracing::trace!("Received completions request: {:?}", request.inner);
Avoid logging full request bodies (PII/secrets risk)
{:?} request.inner can leak prompts and user data. Log only safe metadata (e.g., request_id/model/stream flags) as done in chat with content()-style logging.
- tracing::trace!("Received completions request: {:?}", request.inner);
+ // Avoid logging prompts/PII; keep metadata-only.
+ tracing::trace!("Received completions request: request_id={}", request_id);If you want model/stream info too, move this log below the model and streaming derivations and log:
tracing::trace!("Received completions request: request_id={}, model={}, streaming={}", request_id, model, streaming);🤖 Prompt for AI Agents
In lib/llm/src/http/service/openai.rs around line 300, the current trace log
prints the full request body (tracing::trace!("Received completions request:
{:?}", request.inner)), which can leak prompts/PII; replace this by removing the
full request dump and instead log only safe metadata after you derive
request_id, model, and streaming: move the trace call to below those derivations
and emit a trace like "Received completions request: request_id={}, model={},
streaming={}" (populate with the local variables) so no request.inner or prompt
content is logged.
// Update completion_tokens_details with reasoning tokens if we have any
if self.reasoning_token_count > 0 {
    // Preserve existing completion_tokens_details if they exist, or create new ones
    let mut details = usage.completion_tokens_details.unwrap_or_default();
    details.reasoning_tokens = Some(self.reasoning_token_count);
    usage.completion_tokens_details = Some(details);
}
🛠️ Refactor suggestion
Clamp reasoning_tokens to completion_tokens
Prevents reporting more reasoning_tokens than completion_tokens due to proportional rounding across chunks.
- // Update completion_tokens_details with reasoning tokens if we have any
- if self.reasoning_token_count > 0 {
- // Preserve existing completion_tokens_details if they exist, or create new ones
- let mut details = usage.completion_tokens_details.unwrap_or_default();
- details.reasoning_tokens = Some(self.reasoning_token_count);
- usage.completion_tokens_details = Some(details);
- }
+ // Update completion_tokens_details with reasoning tokens if we have any
+ if self.reasoning_token_count > 0 {
+ let mut details = usage.completion_tokens_details.unwrap_or_default();
+ let clamped = self.reasoning_token_count.min(usage.completion_tokens);
+ details.reasoning_tokens = Some(clamped);
+ usage.completion_tokens_details = Some(details);
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Current:

// Update completion_tokens_details with reasoning tokens if we have any
if self.reasoning_token_count > 0 {
    // Preserve existing completion_tokens_details if they exist, or create new ones
    let mut details = usage.completion_tokens_details.unwrap_or_default();
    details.reasoning_tokens = Some(self.reasoning_token_count);
    usage.completion_tokens_details = Some(details);
}

Suggested:

// Update completion_tokens_details with reasoning tokens if we have any
if self.reasoning_token_count > 0 {
    let mut details = usage.completion_tokens_details.unwrap_or_default();
    let clamped = self.reasoning_token_count.min(usage.completion_tokens);
    details.reasoning_tokens = Some(clamped);
    usage.completion_tokens_details = Some(details);
}
🤖 Prompt for AI Agents
In lib/llm/src/protocols/openai/chat_completions/delta.rs around lines 285 to
291, clamp reasoning_tokens so it never exceeds completion_tokens: when building
or updating completion_tokens_details, read completion_tokens from usage
(fallback 0 if None), compute the minimum of self.reasoning_token_count and that
completion_tokens value, and assign that min value to details.reasoning_tokens
before storing details back on usage.
  State((state, template)): State<(Arc<service_v2::State>, Option<RequestTemplate>)>,
  headers: HeaderMap,
- Json(request): Json<NvCreateCompletionRequest>,
+ body: String,
Why this change?
Could you use the same approach as chat completions, modifying request.inner.temperature etc?
| kwargs["namespace"] = flags.namespace | ||
| if flags.request_template: | ||
| kwargs["template_file"] = flags.request_template | ||
|
|
In case the rest of the PR doesn't land, I would be happy for us to take just this file. It brings the python CLI closer to dynamo-run (https://github.com/ai-dynamo/dynamo/blob/main/launch/dynamo-run/src/flags.rs#L139), and that flag is already useful for chat completions.
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information. |
Overview:
Adds request template support to the /v1/completions endpoint to match existing /v1/chat/completions functionality. Templates now provide default values for model, temperature, and max_tokens when not specified in completion requests.

Details:
- Modified handler_completions to parse JSON before deserialization and apply template values
- Updated completions_router to accept a template parameter
- Added --request-template CLI argument to the Python frontend
- Template values are applied before JSON validation, allowing minimal requests like {"prompt": "hello"} to work with templates

Where should the reviewer start?
- lib/llm/src/http/service/openai.rs - Main template application logic in handler_completions
- components/frontend/src/dynamo/frontend/main.py - CLI argument addition

Related Issues:
- ai-dynamo#2494
Summary by CodeRabbit
New Features
Bug Fixes