fix(sglang): use incremental_streaming_output instead of deprecated stream_output #7642

Merged
rmccorm4 merged 2 commits into ai-dynamo:main from YAMY1234:fix/sglang-incremental-streaming-output
Apr 1, 2026

@YAMY1234 YAMY1234 commented Mar 25, 2026

Overview:

Use incremental_streaming_output instead of the deprecated stream_output.

Details:

sglang renamed stream_output to incremental_streaming_output in sglang/sglang#20614. After the rename, assigning server_args.stream_output = True silently became a no-op, so each streamed chunk carried the cumulative output_ids instead of a disjoint delta. Summing the chunk lengths therefore counted earlier tokens repeatedly, a triangular sum that inflated completion_tokens by roughly 10x.
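The inflation can be illustrated with a minimal sketch. The helper names (`stream_chunks`, `count_completion_tokens`) and the chunk size are hypothetical, chosen only for illustration; they are not part of sglang or Dynamo. The consumer sums chunk lengths on the assumption that every chunk is a disjoint delta, so cumulative chunks are counted triangularly:

```python
def stream_chunks(token_ids, chunk_size, cumulative):
    """Yield the output_ids of each streamed chunk.

    cumulative=True  -> every chunk repeats all tokens generated so far.
    cumulative=False -> every chunk holds only the newly generated tokens.
    """
    for end in range(chunk_size, len(token_ids) + 1, chunk_size):
        start = 0 if cumulative else end - chunk_size
        yield token_ids[start:end]


def count_completion_tokens(chunks):
    # A consumer that expects disjoint deltas simply sums chunk lengths.
    return sum(len(chunk) for chunk in chunks)


tokens = list(range(100))  # 100 generated tokens, streamed 5 at a time

delta_count = count_completion_tokens(
    stream_chunks(tokens, chunk_size=5, cumulative=False)
)
cumulative_count = count_completion_tokens(
    stream_chunks(tokens, chunk_size=5, cumulative=True)
)

print(delta_count)       # 100
print(cumulative_count)  # 1050 = 5 * (1 + 2 + ... + 20)
```

With m equal-sized chunks the over-count factor is (m + 1) / 2, so about 20 chunks per request would give roughly a 10x inflation. The actual chunking used in the benchmark is not stated in this PR.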

Before the fix:

============ Serving Benchmark Result ============
Successful requests:                     8         
Benchmark duration (s):                  22.07     
Total input tokens:                      8000      
Total generated tokens:                  84160     
Request throughput (req/s):              0.36      
Output token throughput (tok/s):         3813.62   
Total Token throughput (tok/s):          4176.14   
---------------Time to First Token----------------
Mean TTFT (ms):                          780.26    
Median TTFT (ms):                        775.64    
P99 TTFT (ms):                           817.16    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.97      
Median TPOT (ms):                        0.97      
P99 TPOT (ms):                           1.01      
---------------Inter-token Latency----------------
Mean ITL (ms):                           512.30    
Median ITL (ms):                         487.04    
P99 ITL (ms):                            1111.95   
----------------End-to-end Latency----------------
Mean E2EL (ms):                          11026.20  
Median E2EL (ms):                        11024.14  
P99 E2EL (ms):                           11437.12  
==================================================

After the fix:

============ Serving Benchmark Result ============
Successful requests:                     8         
Benchmark duration (s):                  20.50     
Total input tokens:                      7302      
Total generated tokens:                  7093      
Request throughput (req/s):              0.39      
Output token throughput (tok/s):         346.04    
Total Token throughput (tok/s):          702.28    
---------------Time to First Token----------------
Mean TTFT (ms):                          984.03    
Median TTFT (ms):                        771.13    
P99 TTFT (ms):                           1556.39   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.02     
Median TPOT (ms):                        9.80      
P99 TPOT (ms):                           10.84     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.00     
Median ITL (ms):                         9.75      
P99 ITL (ms):                            10.33     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          9839.03   
Median E2EL (ms):                        9963.85   
P99 E2EL (ms):                           10276.21  
==================================================

Summary by CodeRabbit

  • Bug Fixes
    • Updated streaming configuration for SGLang server compatibility.

@YAMY1234 YAMY1234 requested review from a team as code owners March 25, 2026 23:41
@github-actions
👋 Hi YAMY1234! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added backend::sglang, external-contribution, fix and removed fix labels Mar 25, 2026
@coderabbitai
coderabbitai bot commented Mar 25, 2026

Walkthrough

Updated SGLang ServerArgs streaming configuration in the parse_args function by replacing the stream_output flag with incremental_streaming_output and adjusting the corresponding comment. No control flow or other logic modifications.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| SGLang Configuration Update: components/src/dynamo/sglang/args.py | Replaced server_args.stream_output = True with server_args.incremental_streaming_output = True in the argument parsing logic, and adjusted the comment to reflect the renamed configuration parameter. |
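Why the old assignment could fail silently can be sketched as follows, assuming ServerArgs behaves like an ordinary Python dataclass without `__slots__` (an assumption; `FakeServerArgs` is a hypothetical stand-in, not the real class). Assigning to a name that is no longer a field creates an unused attribute instead of raising:

```python
from dataclasses import dataclass


@dataclass
class FakeServerArgs:
    # Hypothetical stand-in for ServerArgs after the rename:
    # only the new field name exists.
    incremental_streaming_output: bool = False


args = FakeServerArgs()

# No exception is raised: Python attaches a brand-new attribute to the
# instance, and nothing in the server ever reads it.
args.stream_output = True

# The flag that is actually consulted keeps its default value.
flag_still_default = args.incremental_streaming_output is False
print(flag_still_default)
```

Declaring the dataclass with `slots=True` (Python 3.10+) would turn such a stale assignment into an `AttributeError`, but whether the real class does so is outside the scope of this PR.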

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: using incremental_streaming_output instead of the deprecated stream_output in the sglang configuration. |
| Description check | ✅ Passed | The PR description includes an overview, a detailed explanation of the problem (deprecated attribute causing no-op behavior), the fix applied, and comprehensive before/after benchmark results demonstrating the impact. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
components/src/dynamo/sglang/args.py (1)

372-373: Optional follow-up: align stale handler docstrings with the renamed flag.

You updated the config flag here, but related docs/comments still mention stream_output=True (for example in components/src/dynamo/sglang/request_handlers/llm/decode_handler.py and components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py). Renaming those to incremental_streaming_output would reduce future confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/sglang/args.py` around lines 372 - 373, Update
docstrings/comments that still reference the old flag name `stream_output=True`
to the new name `incremental_streaming_output` so they match the renamed config
in args.py; specifically search for occurrences in request handler docstrings
(e.g., in the decode handler function/class in
request_handlers/llm/decode_handler.py and the worker handler in
request_handlers/multimodal/worker_handler.py) and replace the wording (and any
example parameter usage) to use `incremental_streaming_output` while preserving
the original explanatory text and examples.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4a6fd0a7-008b-4757-9b52-3d856f278181

📥 Commits

Reviewing files that changed from the base of the PR and between c7b2d2b and 785e5ac.

📒 Files selected for processing (1)
  • components/src/dynamo/sglang/args.py

@rmccorm4
Please sign your commit to pass the DCO check.

We can also add you to the Dynamo GitHub team so that this check passes automatically for you in the future. Please reach out on Slack for this.

…tream_output

sglang renamed `stream_output` to `incremental_streaming_output` in
sglang/sglang#20614. The old attribute assignment silently became a no-op,
causing cumulative output_ids to be sent instead of disjoint deltas.
This led to triangular-sum inflation of completion_tokens (~10x).

Signed-off-by: Yangmin Li <yangminl@nvidia.com>
@YAMY1234 YAMY1234 force-pushed the fix/sglang-incremental-streaming-output branch from 785e5ac to c5f81b5 Compare March 25, 2026 23:51
@YAMY1234
Added, thanks! @rmccorm4

@nvpohanh
nvpohanh commented Apr 1, 2026

@rmccorm4 When could we merge this? We urgently need this. Thanks!

@rmccorm4 rmccorm4 enabled auto-merge (squash) April 1, 2026 07:52
@rmccorm4 rmccorm4 left a comment

LGTM after this comment is addressed so we don't break in between versions: #7642 (comment)

@rmccorm4
rmccorm4 commented Apr 1, 2026

/ok to test c5f81b5

sglang renamed `stream_output` to `incremental_streaming_output` in
sglang/sglang#20614 (after v0.5.9). Use hasattr to detect which field
exists so the fix works on both old and new sglang versions.

Signed-off-by: Yangmin Li <yangminl@nvidia.com>
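A minimal sketch of the hasattr-based detection described in this commit message. `OldServerArgs`, `NewServerArgs`, and `enable_incremental_streaming` are hypothetical names used for illustration; the actual change in components/src/dynamo/sglang/args.py may be structured differently:

```python
from dataclasses import dataclass


@dataclass
class OldServerArgs:
    # Hypothetical stand-in for ServerArgs before the rename.
    stream_output: bool = False


@dataclass
class NewServerArgs:
    # Hypothetical stand-in for ServerArgs after the rename.
    incremental_streaming_output: bool = False


def enable_incremental_streaming(server_args):
    """Enable disjoint-delta streaming on whichever field exists.

    Returns the name of the field that was set, so callers can log it.
    """
    # Prefer the new name; fall back to the old one on older versions.
    for field_name in ("incremental_streaming_output", "stream_output"):
        if hasattr(server_args, field_name):
            setattr(server_args, field_name, True)
            return field_name
    # Fail loudly rather than silently doing nothing again.
    raise AttributeError("ServerArgs exposes no known streaming flag")


old_args, new_args = OldServerArgs(), NewServerArgs()
old_field = enable_incremental_streaming(old_args)
new_field = enable_incremental_streaming(new_args)
print(old_field, new_field)
```

Raising when neither field exists is a design choice of this sketch: it avoids repeating the silent no-op that caused the original bug if the flag is renamed again.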
@pull-request-size pull-request-size bot added size/S and removed size/XS labels Apr 1, 2026
@rmccorm4
rmccorm4 commented Apr 1, 2026

/ok to test 9fed398

@rmccorm4 rmccorm4 merged commit 8fe2082 into ai-dynamo:main Apr 1, 2026
77 checks passed
Labels

backend::sglang, external-contribution, fix, size/S

3 participants