fix(sglang): stop forcing incremental_streaming_output to fix high-concurrency throughput regression#7910
fix(sglang): stop forcing incremental_streaming_output to fix high-concurrency throughput regression#7910ishandhanani wants to merge 1 commit intomainfrom
Conversation
c631c92 to
901572c
Compare
WalkthroughThis pull request modifies SGLang integration in Dynamo by removing forced streaming mode initialization and updating stream token processing to treat Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/src/dynamo/sglang/args.py`:
- Around line 376-382: Add a validation in the server args parsing to reject
SGLang incremental streaming mode: check getattr(server_args, "stream_output",
False) and if true raise a ValueError with a message that incremental streaming
(--incremental-streaming-output) is not supported because handlers expect
cumulative output_ids; follow the existing guard pattern used for
schedule_low_priority_values_first to place this check near where server_args is
validated so decode and multimodal handlers are protected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c3700f9a-5264-443c-b1d8-c5bf4672529b
📒 Files selected for processing (3)
components/src/dynamo/sglang/args.pycomponents/src/dynamo/sglang/request_handlers/llm/decode_handler.pycomponents/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py
901572c to
1afe831
Compare
1afe831 to
e4e8eee
Compare
…ncurrency throughput regression Dynamo v0.9.0+ forces stream_output=True (later incremental_streaming_output=True) on SGLang's ServerArgs. This switches SGLang's tokenizer_manager into incremental mode where every streaming chunk must be individually yielded -- no coalescing of intermediate chunks is possible without data loss. Under high concurrency (>=2048), this creates backpressure in the tokenizer_manager's ZMQ path, causing: - ~2x throughput regression in disaggregated PD serving - ~8x TTFT inflation at concurrency=2048 - Prefill #inflight-req stuck at 17-18 (vs 0 in good case) - Decode #running-req 3-4x fewer (starved) The fix restores v0.8.1 behavior: - Do not set incremental_streaming_output or stream_output on ServerArgs - Leave SGLang in its default cumulative output mode - Slice new tokens from cumulative output_ids in the decode handler (output_ids[num_output_tokens_so_far:]) to yield correct disjoint deltas This gives correct token statistics without the tokenizer_manager overhead. Same fix applied to the multimodal StreamProcessor. Ref: sgl-project/sglang#22095
e4e8eee to
98ef70e
Compare
|
This one is not needed |
Summary
incremental_streaming_output=True(orstream_output=True) on SGLang'sServerArgsRoot Cause
Dynamo v0.9.0 introduced forced
stream_output=True(commit 748fee6, PR #5510), which switches SGLang's tokenizer_manager into incremental streaming mode. In this mode, every streaming chunk carries only a delta and must be individually yielded -- the tokenizer_manager cannot coalesce intermediate chunks without losing tokens.Under high concurrency (>=2048 concurrent requests), this creates backpressure in the tokenizer_manager's ZMQ path:
#inflight-req#running-reqThis was identified as a perf regression by @sechoi in commit 049daef ("Make sglang stream_output optional to resolve perf regression") which was never merged from its feature branch.
Controlled experiments by @YAMY1234 confirmed that the same SGLang version (dev-0401) performs at full speed with Dynamo 0.8.1 (no forced incremental streaming) and at half speed with Dynamo 1.0.0 (forced incremental streaming). SGLang itself did not regress.
Fix
Restore v0.8.1 behavior:
args.py: Remove forcedincremental_streaming_output=True/stream_output=True. Leave SGLang in its default cumulative output mode.decode_handler.py: Restore cumulative slicing (output_ids[num_output_tokens_so_far:]) to extract disjoint deltas from cumulative output. This gives correct token statistics without the tokenizer_manager overhead of incremental mode.worker_handler.py: Same cumulative slicing fix for multimodalStreamProcessor.Test plan
sglang_benchorsa-benchRef: sgl-project/sglang#22095
Related: #7642, #7752
Summary by CodeRabbit