
fix(sglang): stop forcing incremental_streaming_output to fix high-concurrency throughput regression#7910

Closed
ishandhanani wants to merge 1 commit into main from idhanani/fix-streaming-perf-regression

Conversation

@ishandhanani (Contributor) commented Apr 6, 2026

Summary

  • Stop forcing incremental_streaming_output=True (or stream_output=True) on SGLang's ServerArgs
  • Restore cumulative token slicing in decode handler and multimodal stream processor
  • Fixes ~2x throughput regression in disaggregated PD serving at high concurrency

Root Cause

Dynamo v0.9.0 introduced forced stream_output=True (commit 748fee6, PR #5510), which switches SGLang's tokenizer_manager into incremental streaming mode. In this mode, every streaming chunk carries only a delta and must be individually yielded -- the tokenizer_manager cannot coalesce intermediate chunks without losing tokens.
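As an illustration of the two modes (toy chunk contents, not SGLang's actual wire format):

```python
# Illustrative sketch only; chunk payloads are made up, not SGLang's wire format.

# Cumulative mode: each chunk carries the full token sequence so far,
# so a consumer that falls behind can coalesce by keeping the latest chunk.
cumulative = [[7], [7, 8], [7, 8, 9]]
assert cumulative[-1] == [7, 8, 9]  # coalescing loses nothing

# Incremental mode: each chunk carries only its delta, so every chunk
# must be yielded individually; dropping any one loses tokens.
incremental = [[7], [8], [9]]
assert [tok for delta in incremental for tok in delta] == [7, 8, 9]
```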

Under high concurrency (>=2048 concurrent requests), this creates backpressure in the tokenizer_manager's ZMQ path:

| Metric | v0.8.1 (cumulative, good) | v0.9.0+ (incremental, bad) |
| --- | --- | --- |
| Throughput (C=4096) | 132,140 tok/s | 69,767 tok/s (-47%) |
| TTFT (C=2048) | 1,301 ms | 10,215 ms (+7.9x) |
| Prefill #inflight-req | 0 | 17-18 |
| Decode #running-req | 77-166 | 36-44 |

This regression was first identified by @sechoi in commit 049daef ("Make sglang stream_output optional to resolve perf regression"), which was never merged from its feature branch.

Controlled experiments by @YAMY1234 confirmed that the same SGLang version (dev-0401) performs at full speed with Dynamo 0.8.1 (no forced incremental streaming) and at half speed with Dynamo 1.0.0 (forced incremental streaming). SGLang itself did not regress.

Fix

Restore v0.8.1 behavior:

  1. args.py: Remove forced incremental_streaming_output=True / stream_output=True. Leave SGLang in its default cumulative output mode.

  2. decode_handler.py: Restore cumulative slicing (output_ids[num_output_tokens_so_far:]) to extract disjoint deltas from cumulative output. This gives correct token statistics without the tokenizer_manager overhead of incremental mode.

  3. worker_handler.py: Same cumulative slicing fix for multimodal StreamProcessor.
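A minimal sketch of the slicing pattern described in items 2-3, with hypothetical chunk dicts standing in for SGLang's actual stream objects:

```python
# Hypothetical sketch of the cumulative-slicing pattern; the chunk shape
# ({"output_ids": [...]}) is illustrative, not the actual Dynamo handler API.

def slice_cumulative_stream(chunks):
    """Yield only the newly generated tokens from each cumulative chunk."""
    num_output_tokens_so_far = 0
    for chunk in chunks:
        output_ids = chunk["output_ids"]  # cumulative: all tokens so far
        new_tokens = output_ids[num_output_tokens_so_far:]
        num_output_tokens_so_far = len(output_ids)
        if new_tokens:
            yield new_tokens

# Example: three cumulative chunks produce disjoint deltas.
chunks = [
    {"output_ids": [11, 12]},
    {"output_ids": [11, 12, 13]},
    {"output_ids": [11, 12, 13, 14, 15]},
]
print(list(slice_cumulative_stream(chunks)))  # [[11, 12], [13], [14, 15]]
```

Because each delta is disjoint, summing their lengths gives correct per-request token counts even though SGLang keeps emitting the full cumulative sequence.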

Test plan

  • Verify token counts are correct (no triangular-sum inflation) with sglang_bench or sa-bench
  • Verify throughput at high concurrency (C>=2048) matches v0.8.1 baseline in disagg PD setup
  • Run existing CI (agg + disagg integration tests)
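For context on the first check: "triangular-sum inflation" is what happens if cumulative chunks are counted as though they were deltas. A toy illustration (assumed numbers):

```python
# Toy example of triangular-sum inflation; numbers are made up.
cumulative_chunks = [[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]]

# Counting every chunk as a delta double-counts earlier tokens:
wrong_count = sum(len(c) for c in cumulative_chunks)  # 1 + 2 + 3 + 4 = 10
# The true output length is just the final cumulative chunk:
right_count = len(cumulative_chunks[-1])              # 4

assert (wrong_count, right_count) == (10, 4)
```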

Ref: sgl-project/sglang#22095
Related: #7642, #7752

Summary by CodeRabbit

  • Bug Fixes
    • Updated token streaming to use cumulative mode by default instead of forcing streaming output.
    • Improved token processing to correctly yield only newly generated tokens in each response chunk, enhancing consistency across response handlers.

@ishandhanani ishandhanani requested review from a team as code owners April 6, 2026 17:54
@github-actions github-actions bot added fix backend::sglang Relates to the sglang backend multimodal labels Apr 6, 2026
@ishandhanani ishandhanani force-pushed the idhanani/fix-streaming-perf-regression branch from c631c92 to 901572c Compare April 6, 2026 17:54
coderabbitai bot commented Apr 6, 2026

Walkthrough

This pull request modifies SGLang integration in Dynamo by removing forced streaming mode initialization and updating stream token processing to treat output_ids as cumulative token sequences rather than per-chunk increments across handler implementations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Streaming Configuration**<br>`components/src/dynamo/sglang/args.py` | Removed logic that forcibly enabled SGLang streaming modes (`incremental_streaming_output` or `stream_output`); reverts to default cumulative mode behavior. |
| **Stream Token Processors**<br>`components/src/dynamo/sglang/request_handlers/llm/decode_handler.py`, `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | Updated `_process_token_stream` and `process_sglang_stream` to handle `output_ids` as cumulative sequences; added `num_output_tokens_so_far` tracking and slicing logic to extract only new tokens per chunk while preserving existing skip/finish-reason behavior. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: stopping forced incremental_streaming_output to fix a throughput regression. |
| Description check | ✅ Passed | The description is comprehensive, covering summary, root cause analysis, fix details, and test plan. It fully exceeds the basic template requirements with technical depth and context. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `components/src/dynamo/sglang/args.py`:
- Around lines 376-382: add a validation to the server-args parsing that rejects SGLang's incremental streaming mode. Check `getattr(server_args, "stream_output", False)` and, if it is true, raise a `ValueError` stating that incremental streaming (`--incremental-streaming-output`) is not supported because the handlers expect cumulative `output_ids`. Follow the existing guard pattern used for `schedule_low_priority_values_first`, placing the check where `server_args` is validated so that both the decode and multimodal handlers are protected.
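A hedged sketch of the guard CodeRabbit suggests; `validate_server_args` and its placement are hypothetical and not verified against the actual `args.py`:

```python
# Hypothetical guard, following the pattern described in the review comment.
# The function name and call site are assumptions, not the real args.py API.
from types import SimpleNamespace


def validate_server_args(server_args):
    # Reject SGLang incremental streaming: the decode and multimodal
    # handlers slice cumulative output_ids and would miscount deltas.
    if getattr(server_args, "stream_output", False):
        raise ValueError(
            "incremental streaming (--incremental-streaming-output) is not "
            "supported; handlers expect cumulative output_ids"
        )


# Default cumulative mode passes validation unchanged.
validate_server_args(SimpleNamespace(stream_output=False))
```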
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c3700f9a-5264-443c-b1d8-c5bf4672529b

📥 Commits

Reviewing files that changed from the base of the PR and between 0ba80f6 and 901572c.

📒 Files selected for processing (3)
  • components/src/dynamo/sglang/args.py
  • components/src/dynamo/sglang/request_handlers/llm/decode_handler.py
  • components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py

@ishandhanani ishandhanani force-pushed the idhanani/fix-streaming-perf-regression branch from 901572c to 1afe831 Compare April 6, 2026 18:32
@ishandhanani ishandhanani force-pushed the idhanani/fix-streaming-perf-regression branch from 1afe831 to e4e8eee Compare April 6, 2026 18:36
…ncurrency throughput regression

Dynamo v0.9.0+ forces stream_output=True (later incremental_streaming_output=True)
on SGLang's ServerArgs. This switches SGLang's tokenizer_manager into incremental
mode where every streaming chunk must be individually yielded -- no coalescing of
intermediate chunks is possible without data loss. Under high concurrency (>=2048),
this creates backpressure in the tokenizer_manager's ZMQ path, causing:
- ~2x throughput regression in disaggregated PD serving
- ~8x TTFT inflation at concurrency=2048
- Prefill #inflight-req stuck at 17-18 (vs 0 in good case)
- Decode #running-req 3-4x fewer (starved)

The fix restores v0.8.1 behavior:
- Do not set incremental_streaming_output or stream_output on ServerArgs
- Leave SGLang in its default cumulative output mode
- Slice new tokens from cumulative output_ids in the decode handler
  (output_ids[num_output_tokens_so_far:]) to yield correct disjoint deltas

This gives correct token statistics without the tokenizer_manager overhead.
Same fix applied to the multimodal StreamProcessor.

Ref: sgl-project/sglang#22095
@ishandhanani ishandhanani force-pushed the idhanani/fix-streaming-perf-regression branch from e4e8eee to 98ef70e Compare April 6, 2026 18:38
@ishandhanani (Contributor, Author) commented:

This one is not needed
