fix(sglang): use incremental streaming output for completions #7752
weireweire wants to merge 5 commits into ai-dynamo:main from
Conversation
👋 Hi weireweire! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough
Two changes across streaming configuration and token usage tracking. The first change modifies SGLang argument parsing to use
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 passed
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
🧹 Nitpick comments (1)

lib/llm/src/protocols/openai/completions/delta.rs (1)

Lines 283-291: Propagate `completion_tokens_details` when using backend-provided completion usage. Line 286 now trusts the backend's `completion_tokens`, but only `prompt_tokens_details` is copied. If the backend sends `completion_tokens_details`, it is currently dropped.

Suggested patch:

```diff
 if let Some(completion_usage) = delta.completion_usage.as_ref() {
     // Update prompt_tokens from worker if provided (e.g., for embeddings)
     self.usage.prompt_tokens = completion_usage.prompt_tokens;
     self.usage.completion_tokens = completion_usage.completion_tokens;
+    // Propagate completion token details if provided
+    if let Some(completion_details) = completion_usage.completion_tokens_details.as_ref() {
+        self.usage.completion_tokens_details = Some(completion_details.clone());
+    }
+    // Propagate prompt token details if provided
     if let Some(prompt_details) = completion_usage.prompt_tokens_details.as_ref() {
         self.usage.prompt_tokens_details = Some(prompt_details.clone());
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/protocols/openai/completions/delta.rs` around lines 283 - 291, The code updates self.usage from delta.completion_usage but only propagates prompt_tokens_details; also copy completion_tokens_details when completion_usage provides it. In the block handling delta.completion_usage (look for delta.completion_usage.as_ref(), self.usage.prompt_tokens, self.usage.completion_tokens and the prompt_tokens_details branch), add an analogous branch that sets self.usage.completion_tokens_details = Some(completion_tokens_details.clone()) when completion_usage.completion_tokens_details.is_some(), ensuring backend-provided completion token detail objects are not dropped.
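The merge pattern the review asks for can be sketched in isolation. The structs below are simplified, hypothetical stand-ins for the usage types in delta.rs (field names mirror the OpenAI usage schema, but nothing here is the actual Dynamo API); the point is only that optional detail objects from the backend are copied over rather than dropped.

```rust
// Hypothetical, simplified stand-ins for the usage types; illustrative only.
#[derive(Clone, Debug, Default, PartialEq)]
struct CompletionTokensDetails {
    reasoning_tokens: u32,
}

#[derive(Debug, Default)]
struct Usage {
    prompt_tokens: u32,
    completion_tokens: u32,
    completion_tokens_details: Option<CompletionTokensDetails>,
}

/// Merge backend-provided usage into the aggregated usage,
/// propagating the optional detail object instead of dropping it.
fn merge_usage(agg: &mut Usage, backend: &Usage) {
    agg.prompt_tokens = backend.prompt_tokens;
    agg.completion_tokens = backend.completion_tokens;
    if let Some(details) = backend.completion_tokens_details.as_ref() {
        agg.completion_tokens_details = Some(details.clone());
    }
}

fn main() {
    let mut agg = Usage::default();
    let backend = Usage {
        prompt_tokens: 10,
        completion_tokens: 5,
        completion_tokens_details: Some(CompletionTokensDetails { reasoning_tokens: 2 }),
    };
    merge_usage(&mut agg, &backend);
    assert_eq!(agg.completion_tokens, 5);
    assert_eq!(
        agg.completion_tokens_details,
        Some(CompletionTokensDetails { reasoning_tokens: 2 })
    );
}
```

Note that when the backend omits the details, the aggregated value is left untouched rather than overwritten with `None`, matching the conditional shape of the suggested patch.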
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 71a1f921-dace-43db-9c66-f0fe7a4504c0
📒 Files selected for processing (2)
- components/src/dynamo/sglang/args.py
- lib/llm/src/protocols/openai/completions/delta.rs
Hi @weireweire, a similar PR was merged last night: #7642. Is this one still needed? Do you want to make a simpler PR for the completion_token detail usage change?
Superseded by #7642. We can close this.
I think we can still merge this, as it's better to keep all the compatibility logic in _compat.py.
Force-pushed from bdad4c2 to af37781 (Compare)
rebased, please review |
@rmccorm4 could you review this? thanks |
Hi @weireweire, please fix the failing checks |
/ok to test 4c873b4 |
Signed-off-by: Weiliangl User <weiliangl@login-node.hosted.internal>
Force-pushed from 984b518 to e9239fc (Compare)
@rmccorm4 fixed |
/ok to test e9239fc |
Summary

Fix Dynamo's SGLang /v1/completions streaming assumptions on current SGLang main:
- Set incremental_streaming_output = True
- Trust completion_usage.completion_tokens when building final completions usage

Why

Dynamo still set server_args.stream_output = True, but current SGLang gates disjoint streaming chunks behind incremental_streaming_output.

Relevant SGLang change: Rename --stream-output to --incremental-streaming-output (#20614)

Related SGLang follow-up that makes the incremental/cumulative split explicit: Scope streaming backlog coalescing to incremental_streaming_output mode (#21037)

Without this update, Dynamo can mishandle cumulative streaming output on /v1/completions, which can in turn skew usage.completion_tokens and downstream benchmark metrics.
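The token-count skew described above can be illustrated with a toy example. The chunk data and the whitespace "tokenizer" below are invented for illustration and have nothing to do with SGLang's actual tokenization; the point is that summing per-chunk token counts is only correct when chunks are disjoint deltas.

```python
# Toy illustration: why treating cumulative streaming chunks as
# disjoint deltas overcounts completion tokens.
def count_tokens(text: str) -> int:
    # Stand-in tokenizer: whitespace-separated words.
    return len(text.split())

# Cumulative mode: each chunk repeats everything emitted so far.
cumulative_chunks = ["Hello", "Hello world", "Hello world again"]
# Incremental mode: each chunk carries only the new tokens.
incremental_chunks = ["Hello", "world", "again"]

# Summing per-chunk counts double-counts under cumulative streaming.
wrong_total = sum(count_tokens(c) for c in cumulative_chunks)   # 1 + 2 + 3 = 6
right_total = sum(count_tokens(c) for c in incremental_chunks)  # 1 + 1 + 1 = 3

print(wrong_total, right_total)  # 6 3
```

This is why the streaming-mode flag matters for usage accounting: a consumer that assumes incremental chunks will inflate completion-token totals whenever the backend actually streams cumulatively.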
Validation

- python3 -m py_compile components/src/dynamo/sglang/args.py
- cargo fmt --check --manifest-path lib/llm/Cargo.toml --all

Notes

I also attempted an end-to-end mounted run against the local dynamo checkout, but the container-side local source install path still needs extra environment work unrelated to these code changes.

Summary by CodeRabbit