
fix(sglang): use incremental streaming output for completions#7752

Open
weireweire wants to merge 5 commits into ai-dynamo:main from weireweire:fix/sglang-incremental-completions-usage

Conversation

@weireweire
Contributor

@weireweire weireweire commented Apr 1, 2026

Summary

Fix Dynamo's SGLang /v1/completions streaming assumptions on current SGLang main.

  • switch the SGLang integration to set incremental_streaming_output = True
  • prefer the backend-reported final completion_usage.completion_tokens when building final completions usage
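The first bullet amounts to a small compatibility shim: set whichever streaming flag the installed SGLang version understands. A minimal Python sketch of the idea (the helper name and the `SimpleNamespace` stand-in are illustrative, not the actual args.py code):

```python
from types import SimpleNamespace

def enable_incremental_streaming(server_args):
    """Set SGLang's streaming flag under whichever name this version exposes."""
    if hasattr(server_args, "incremental_streaming_output"):
        # SGLang after the --stream-output rename (#20614)
        server_args.incremental_streaming_output = True
    else:
        # Older SGLang releases still expose the legacy flag name
        server_args.stream_output = True
    return server_args

# Stand-in for SGLang's ServerArgs on a post-rename version
args = enable_incremental_streaming(
    SimpleNamespace(incremental_streaming_output=False)
)
print(args.incremental_streaming_output)  # → True
```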

Why

Dynamo still sets server_args.stream_output = True, but current SGLang gates disjoint streaming chunks behind incremental_streaming_output.

Relevant SGLang change:

  • Rename --stream-output to --incremental-streaming-output (#20614)

Related SGLang follow-up that makes the incremental/cumulative split explicit:

  • Scope streaming backlog coalescing to incremental_streaming_output mode (#21037)

Without this update, Dynamo can mis-handle cumulative streaming output on /v1/completions, which can in turn skew usage.completion_tokens and downstream benchmark metrics.
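A toy illustration of the skew (not Dynamo code; the chunk contents are made up): if the backend actually streams cumulative chunks but the client treats each chunk as a disjoint increment, token counts get inflated.

```python
# Each element is one streamed chunk, as a list of tokens. In cumulative
# mode, every chunk repeats all previously streamed tokens.
cumulative_chunks = [
    ["Hello"],
    ["Hello", ","],
    ["Hello", ",", "world"],
]

# Correct cumulative handling: the final chunk is the full completion.
correct_count = len(cumulative_chunks[-1])

# Buggy handling: summing chunks as if each were a disjoint increment.
skewed_count = sum(len(chunk) for chunk in cumulative_chunks)

print(correct_count, skewed_count)  # → 3 6
```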

Validation

  • python3 -m py_compile components/src/dynamo/sglang/args.py
  • cargo fmt --check --manifest-path lib/llm/Cargo.toml --all

Notes

I also attempted an end-to-end mounted run against the local dynamo checkout, but the container-side local source install path still needs extra environment work unrelated to these code changes.

Summary by CodeRabbit

  • Bug Fixes
    • Improved accuracy of token counting in completion responses by properly updating completion token metrics from the worker instead of relying solely on accumulated token data.

@weireweire weireweire requested review from a team as code owners April 1, 2026 06:27
@copy-pr-bot

copy-pr-bot bot commented Apr 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

👋 Hi weireweire! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added fix external-contribution Pull request is from an external contributor backend::sglang Relates to the sglang backend frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` labels Apr 1, 2026
@coderabbitai
Contributor

coderabbitai bot commented Apr 1, 2026

Walkthrough

Two changes across streaming configuration and token usage tracking. First change modifies SGLang argument parsing to use incremental_streaming_output instead of stream_output. Second change updates completion usage tracking to source both token types from worker-provided data when available.

Changes

Cohort / File(s) Summary
SGLang Streaming Configuration
components/src/dynamo/sglang/args.py
Modified argument parsing to set incremental_streaming_output = True instead of stream_output = True for controlling SGLang's disjoint-segment streaming behavior.
Token Usage Tracking
lib/llm/src/protocols/openai/completions/delta.rs
Enhanced completion usage handling to update both prompt_tokens and completion_tokens from worker-provided completion_usage when available, instead of deriving completion_tokens solely from accumulated token IDs.
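The second change can be sketched in Python as follows (assumed data shapes; the real implementation lives in the Rust delta aggregator, and the names here are illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Usage:
    prompt_tokens: int = 0
    completion_tokens: int = 0

def finalize_usage(accumulated_token_ids: List[int],
                   worker_usage: Optional[Usage]) -> Usage:
    """Prefer backend-reported counts; fall back to locally accumulated tokens."""
    if worker_usage is not None:
        # Trust the worker's final counts when it reports them
        return Usage(worker_usage.prompt_tokens, worker_usage.completion_tokens)
    # Otherwise derive completion_tokens from the accumulated token IDs
    return Usage(completion_tokens=len(accumulated_token_ids))

print(finalize_usage([1, 2, 3], Usage(10, 7)).completion_tokens)  # → 7
print(finalize_usage([1, 2, 3], None).completion_tokens)  # → 3
```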

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The title clearly and specifically describes the main change: switching SGLang to use incremental streaming output for completions, which aligns with the primary objective of the PR.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check — ✅ Passed: The pull request description covers all required template sections: Overview/Summary, Details of changes, file callouts, and related issues with references.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
lib/llm/src/protocols/openai/completions/delta.rs (1)

283-291: Propagate completion_tokens_details when using backend-provided completion usage.

Line 286 now trusts backend completion_tokens, but only prompt_tokens_details is copied. If backend sends completion_tokens_details, it’s currently dropped.

Suggested patch
         if let Some(completion_usage) = delta.completion_usage.as_ref() {
             // Update prompt_tokens from worker if provided (e.g., for embeddings)
             self.usage.prompt_tokens = completion_usage.prompt_tokens;
             self.usage.completion_tokens = completion_usage.completion_tokens;
 
+            // Propagate completion token details if provided
+            if let Some(completion_details) = completion_usage.completion_tokens_details.as_ref() {
+                self.usage.completion_tokens_details = Some(completion_details.clone());
+            }
+
             // Propagate prompt token details if provided
             if let Some(prompt_details) = completion_usage.prompt_tokens_details.as_ref() {
                 self.usage.prompt_tokens_details = Some(prompt_details.clone());
             }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/protocols/openai/completions/delta.rs` around lines 283 - 291,
The code updates self.usage from delta.completion_usage but only propagates
prompt_tokens_details; also copy completion_tokens_details when completion_usage
provides it. In the block handling delta.completion_usage (look for
delta.completion_usage.as_ref(), self.usage.prompt_tokens,
self.usage.completion_tokens and the prompt_tokens_details branch), add an
analogous branch that sets self.usage.completion_tokens_details =
Some(completion_tokens_details.clone()) when
completion_usage.completion_tokens_details.is_some(), ensuring backend-provided
completion token detail objects are not dropped.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL


📥 Commits

Reviewing files that changed from the base of the PR and between ab5a31b and 865ae87.

📒 Files selected for processing (2)
  • components/src/dynamo/sglang/args.py
  • lib/llm/src/protocols/openai/completions/delta.rs

@rmccorm4
Contributor

rmccorm4 commented Apr 1, 2026

Hi @weireweire , a similar PR was merged last night: #7642

Is this one still needed? Do you want to make a simpler PR for the completion_token detail usage change?

@nvpohanh

nvpohanh commented Apr 2, 2026

Superseded by #7642

We can close this

@weireweire
Contributor Author

I think we can still merge this, as it's better to move all the compatibility logic into _compat.py

@weireweire weireweire force-pushed the fix/sglang-incremental-completions-usage branch from bdad4c2 to af37781 Compare April 2, 2026 02:22
@weireweire
Contributor Author

rebased, please review

@nvpohanh

nvpohanh commented Apr 2, 2026

@rmccorm4 could you review this? thanks

@rmccorm4
Contributor

rmccorm4 commented Apr 2, 2026

Hi @weireweire, please fix the failing checks

@rmccorm4
Contributor

rmccorm4 commented Apr 2, 2026

/ok to test 4c873b4

Weiliangl User added 4 commits April 3, 2026 02:56
Signed-off-by: Weiliangl User <weiliangl@login-node.hosted.internal>
@weireweire weireweire force-pushed the fix/sglang-incremental-completions-usage branch from 984b518 to e9239fc Compare April 3, 2026 02:56
@weireweire
Contributor Author

@rmccorm4 fixed

@rmccorm4
Contributor

rmccorm4 commented Apr 3, 2026

/ok to test e9239fc


Labels

backend::sglang Relates to the sglang backend external-contribution Pull request is from an external contributor fix frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/M


3 participants