
fix(sglang): stop forcing incremental_streaming_output to fix high-concurrency throughput regression#7910

Closed
ishandhanani wants to merge 1 commit into main from idhanani/fix-streaming-perf-regression

Conversation

@ishandhanani (Contributor) commented Apr 6, 2026

Summary

  • Stop forcing incremental_streaming_output=True (or stream_output=True) on SGLang's ServerArgs
  • Restore cumulative token slicing in decode handler and multimodal stream processor
  • Fixes ~2x throughput regression in disaggregated PD serving at high concurrency

Root Cause

Dynamo v0.9.0 introduced forced stream_output=True (commit 748fee6, PR #5510), which switches SGLang's tokenizer_manager into incremental streaming mode. In this mode, every streaming chunk carries only a delta and must be individually yielded -- the tokenizer_manager cannot coalesce intermediate chunks without losing tokens.
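As an illustration of the two modes (toy chunk contents, not SGLang's actual wire format):

```python
# Illustrative sketch only; chunk payloads are made up, not SGLang's wire format.

# Cumulative mode: each chunk carries the full token sequence so far,
# so a consumer that falls behind can coalesce by keeping the latest chunk.
cumulative = [[7], [7, 8], [7, 8, 9]]
assert cumulative[-1] == [7, 8, 9]  # coalescing loses nothing

# Incremental mode: each chunk carries only its delta, so every chunk
# must be yielded individually; dropping any one loses tokens.
incremental = [[7], [8], [9]]
assert [tok for delta in incremental for tok in delta] == [7, 8, 9]
```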

Under high concurrency (>=2048 concurrent requests), this creates backpressure in the tokenizer_manager's ZMQ path:

| Metric | v0.8.1 (cumulative, good) | v0.9.0+ (incremental, bad) |
| --- | --- | --- |
| Throughput (C=4096) | 132,140 tok/s | 69,767 tok/s (-47%) |
| TTFT (C=2048) | 1,301 ms | 10,215 ms (+7.9x) |
| Prefill #inflight-req | 0 | 17-18 |
| Decode #running-req | 77-166 | 36-44 |

This regression was first identified by @sechoi in commit 049daef ("Make sglang stream_output optional to resolve perf regression"), which was never merged from its feature branch.

Controlled experiments by @YAMY1234 confirmed that the same SGLang version (dev-0401) performs at full speed with Dynamo 0.8.1 (no forced incremental streaming) and at half speed with Dynamo 1.0.0 (forced incremental streaming). SGLang itself did not regress.

Fix

Restore v0.8.1 behavior:

  1. args.py: Remove forced incremental_streaming_output=True / stream_output=True. Leave SGLang in its default cumulative output mode.

  2. decode_handler.py: Restore cumulative slicing (output_ids[num_output_tokens_so_far:]) to extract disjoint deltas from cumulative output. This gives correct token statistics without the tokenizer_manager overhead of incremental mode.

  3. worker_handler.py: Same cumulative slicing fix for multimodal StreamProcessor.
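A minimal sketch of the slicing pattern described in items 2-3, with hypothetical chunk dicts standing in for SGLang's actual stream objects:

```python
# Hypothetical sketch of the cumulative-slicing pattern; the chunk shape
# ({"output_ids": [...]}) is illustrative, not the actual Dynamo handler API.

def slice_cumulative_stream(chunks):
    """Yield only the newly generated tokens from each cumulative chunk."""
    num_output_tokens_so_far = 0
    for chunk in chunks:
        output_ids = chunk["output_ids"]  # cumulative: all tokens so far
        new_tokens = output_ids[num_output_tokens_so_far:]
        num_output_tokens_so_far = len(output_ids)
        if new_tokens:
            yield new_tokens

# Example: three cumulative chunks produce disjoint deltas.
chunks = [
    {"output_ids": [11, 12]},
    {"output_ids": [11, 12, 13]},
    {"output_ids": [11, 12, 13, 14, 15]},
]
print(list(slice_cumulative_stream(chunks)))  # [[11, 12], [13], [14, 15]]
```

Because each delta is disjoint, summing their lengths gives correct per-request token counts even though SGLang keeps emitting the full cumulative sequence.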

Test plan

  • Verify token counts are correct (no triangular-sum inflation) with sglang_bench or sa-bench
  • Verify throughput at high concurrency (C>=2048) matches v0.8.1 baseline in disagg PD setup
  • Run existing CI (agg + disagg integration tests)
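For context on the first check: "triangular-sum inflation" is what happens if cumulative chunks are counted as though they were deltas. A toy illustration (assumed numbers):

```python
# Toy example of triangular-sum inflation; numbers are made up.
cumulative_chunks = [[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]]

# Counting every chunk as a delta double-counts earlier tokens:
wrong_count = sum(len(c) for c in cumulative_chunks)  # 1 + 2 + 3 + 4 = 10
# The true output length is just the final cumulative chunk:
right_count = len(cumulative_chunks[-1])              # 4

assert (wrong_count, right_count) == (10, 4)
```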

Ref: sgl-project/sglang#22095
Related: #7642, #7752

Summary by CodeRabbit

  • Bug Fixes
    • Updated token streaming to use cumulative mode by default instead of forcing streaming output.
    • Improved token processing to correctly yield only newly generated tokens in each response chunk, enhancing consistency across response handlers.

@ishandhanani ishandhanani requested review from a team as code owners April 6, 2026 17:54
@github-actions github-actions bot added fix backend::sglang Relates to the sglang backend multimodal labels Apr 6, 2026
@ishandhanani ishandhanani force-pushed the idhanani/fix-streaming-perf-regression branch from c631c92 to 901572c Compare April 6, 2026 17:54
coderabbitai bot commented Apr 6, 2026

Walkthrough

This pull request modifies SGLang integration in Dynamo by removing forced streaming mode initialization and updating stream token processing to treat output_ids as cumulative token sequences rather than per-chunk increments across handler implementations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Streaming Configuration**<br>`components/src/dynamo/sglang/args.py` | Removed logic that forcibly enabled SGLang streaming modes (`incremental_streaming_output` or `stream_output`); reverts to default cumulative mode behavior. |
| **Stream Token Processors**<br>`components/src/dynamo/sglang/request_handlers/llm/decode_handler.py`, `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | Updated `_process_token_stream` and `process_sglang_stream` to handle `output_ids` as cumulative sequences; added `num_output_tokens_so_far` tracking and slicing logic to extract only new tokens per chunk while preserving existing skip/finish-reason behavior. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: stopping forced incremental_streaming_output to fix a throughput regression. |
| Description check | ✅ Passed | The description is comprehensive, covering summary, root cause analysis, fix details, and test plan. It fully exceeds the basic template requirements with technical depth and context. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `components/src/dynamo/sglang/args.py`:
- Around lines 376-382: add a validation to the server-args parsing that rejects SGLang's incremental streaming mode. Check `getattr(server_args, "stream_output", False)` and, if it is true, raise a `ValueError` stating that incremental streaming (`--incremental-streaming-output`) is not supported because the handlers expect cumulative `output_ids`. Follow the existing guard pattern used for `schedule_low_priority_values_first`, placing the check where `server_args` is validated so that both the decode and multimodal handlers are protected.
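A hedged sketch of the guard CodeRabbit suggests; `validate_server_args` and its placement are hypothetical and not verified against the actual `args.py`:

```python
# Hypothetical guard, following the pattern described in the review comment.
# The function name and call site are assumptions, not the real args.py API.
from types import SimpleNamespace


def validate_server_args(server_args):
    # Reject SGLang incremental streaming: the decode and multimodal
    # handlers slice cumulative output_ids and would miscount deltas.
    if getattr(server_args, "stream_output", False):
        raise ValueError(
            "incremental streaming (--incremental-streaming-output) is not "
            "supported; handlers expect cumulative output_ids"
        )


# Default cumulative mode passes validation unchanged.
validate_server_args(SimpleNamespace(stream_output=False))
```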
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c3700f9a-5264-443c-b1d8-c5bf4672529b

📥 Commits

Reviewing files that changed from the base of the PR and between 0ba80f6 and 901572c.

📒 Files selected for processing (3)
  • components/src/dynamo/sglang/args.py
  • components/src/dynamo/sglang/request_handlers/llm/decode_handler.py
  • components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py

@ishandhanani ishandhanani force-pushed the idhanani/fix-streaming-perf-regression branch from 901572c to 1afe831 Compare April 6, 2026 18:32
@ishandhanani ishandhanani force-pushed the idhanani/fix-streaming-perf-regression branch from 1afe831 to e4e8eee Compare April 6, 2026 18:36
…ncurrency throughput regression

Dynamo v0.9.0+ forces stream_output=True (later incremental_streaming_output=True)
on SGLang's ServerArgs. This switches SGLang's tokenizer_manager into incremental
mode where every streaming chunk must be individually yielded -- no coalescing of
intermediate chunks is possible without data loss. Under high concurrency (>=2048),
this creates backpressure in the tokenizer_manager's ZMQ path, causing:
- ~2x throughput regression in disaggregated PD serving
- ~8x TTFT inflation at concurrency=2048
- Prefill #inflight-req stuck at 17-18 (vs 0 in good case)
- Decode #running-req 3-4x fewer (starved)

The fix restores v0.8.1 behavior:
- Do not set incremental_streaming_output or stream_output on ServerArgs
- Leave SGLang in its default cumulative output mode
- Slice new tokens from cumulative output_ids in the decode handler
  (output_ids[num_output_tokens_so_far:]) to yield correct disjoint deltas

This gives correct token statistics without the tokenizer_manager overhead.
Same fix applied to the multimodal StreamProcessor.

Ref: sgl-project/sglang#22095
@ishandhanani ishandhanani force-pushed the idhanani/fix-streaming-perf-regression branch from e4e8eee to 98ef70e Compare April 6, 2026 18:38
@ishandhanani (Contributor, Author) commented:

This one is not needed
