
feat(sglang): enforce stream_output=True for optimal streaming performance #5510

Merged
MatejKosec merged 2 commits into main from user/mkosec/enforce_sglang_reponse_streaming_flag on Jan 20, 2026

@MatejKosec MatejKosec commented Jan 20, 2026

Summary

  • Enforce stream_output=True in SGLang ServerArgs for Dynamo
  • Update streaming handlers to pass through disjoint token segments directly (no more cumulative-to-delta conversion)
  • Applies to both LLM decode handler and multimodal worker handler

Description

With stream_output=True, SGLang sends only new tokens since the last output (disjoint segments) rather than all tokens generated so far (cumulative). This change:

  1. Forces stream_output=True in args.py after parsing ServerArgs
  2. Simplifies _process_token_stream in decode_handler - removes tracking/slicing logic
  3. Simplifies process_sglang_stream in multimodal worker_handler - same fix

This aligns Dynamo with SGLang's efficient streaming mode, reducing redundant data transfer.
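
To illustrate the difference between the two modes described above, here is a minimal sketch. The function names `deltas_from_cumulative` and `deltas_from_disjoint` are hypothetical and are not the actual Dynamo handlers; the token lists are made-up sample data. It only shows why the tracking/slicing logic becomes unnecessary once each chunk already contains just the new tokens.

```python
from typing import Iterable, Iterator, List


def deltas_from_cumulative(chunks: Iterable[List[int]]) -> Iterator[List[int]]:
    """Cumulative mode: every chunk repeats all tokens so far, so the
    consumer must remember how many it has seen and slice off the tail."""
    seen = 0
    for output_ids in chunks:
        new_tokens = output_ids[seen:]  # copy only the unseen suffix
        seen = len(output_ids)
        yield new_tokens


def deltas_from_disjoint(chunks: Iterable[List[int]]) -> Iterator[List[int]]:
    """Disjoint mode: every chunk already holds only the new tokens,
    so it can be passed through unchanged."""
    for output_ids in chunks:
        yield output_ids


# Hypothetical sample streams for the same four generated tokens.
cumulative_stream = [[1], [1, 2, 3], [1, 2, 3, 4]]
disjoint_stream = [[1], [2, 3], [4]]

from_cumulative = list(deltas_from_cumulative(cumulative_stream))
from_disjoint = list(deltas_from_disjoint(disjoint_stream))

# Total token IDs carried by each stream.
cumulative_transferred = sum(len(c) for c in cumulative_stream)
disjoint_transferred = sum(len(c) for c in disjoint_stream)
```

Both generators yield the same deltas, but the cumulative stream carries 8 token IDs for 4 generated tokens while the disjoint stream carries 4. That gap grows with sequence length, which is the redundant transfer this PR aims to remove.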

Summary by CodeRabbit

  • Bug Fixes
    • Improved token streaming to deliver disjoint segments instead of cumulative tokens, ensuring more accurate and granular token delivery during streaming operations.
    • Enabled stream output mode in server configuration for consistent streaming behavior across LLM and multimodal handlers.


…mance

Dynamo's streaming handlers now expect disjoint output_ids from SGLang
(only new tokens since last output) rather than cumulative tokens.

Changes:
- Force stream_output=True in args.py after parsing ServerArgs
- Update decode_handler to pass through disjoint token segments directly
- Update multimodal worker_handler with the same fix

This aligns Dynamo with SGLang's efficient streaming mode where only
delta tokens are transmitted, reducing redundant data transfer.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
@MatejKosec MatejKosec requested review from a team as code owners January 20, 2026 19:59
@github-actions github-actions bot added the feat, backend::sglang (Relates to the sglang backend), and multimodal labels Jan 20, 2026
coderabbitai bot commented Jan 20, 2026

Walkthrough

The pull request enforces stream output with disjoint token segments across Dynamo's SGLang integration. Stream output is now forcibly enabled in argument parsing, and both token stream handlers are refactored to forward token segments directly rather than computing them from running totals or offsets.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Stream Output Configuration: `components/src/dynamo/sglang/args.py` | Unconditionally enables `stream_output` on server args after parsing, logging a message if it was previously disabled. Ensures consistent streaming behavior. |
| Token Stream Processing: `components/src/dynamo/sglang/request_handlers/llm/decode_handler.py`, `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | Refactored `_process_token_stream` and `StreamProcessor.process_sglang_stream` to use disjoint token segments from `output_ids` instead of slicing from running offsets. Removes accumulated offset tracking and clarifies streaming semantics. Error handling preserved. |
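
As a rough sketch of the configuration change summarized above: the snippet below forces the flag on after parsing and logs when it overrides a disabled value. `StandInServerArgs` and `enforce_stream_output` are hypothetical names used only for illustration; the real change operates on SGLang's `ServerArgs` inside `args.py`, and its exact code and log wording are not reproduced here.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class StandInServerArgs:
    """Hypothetical stand-in for the parsed server arguments object."""
    stream_output: bool = False


def enforce_stream_output(server_args: StandInServerArgs) -> StandInServerArgs:
    """Force stream_output on, logging only when the value is overridden."""
    if not server_args.stream_output:
        # Assumed log message; the actual wording in the PR may differ.
        logger.info("stream_output was disabled; enabling it for streaming handlers")
    server_args.stream_output = True
    return server_args


previously_off = enforce_stream_output(StandInServerArgs(stream_output=False))
previously_on = enforce_stream_output(StandInServerArgs(stream_output=True))
```

Setting the flag unconditionally, rather than only validating it, means the downstream handlers can assume disjoint segments without checking the configuration themselves.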

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Streams now flow in segments bright,
No more offsets to recalculate each night,
Tokens hop along, disjoint and free,
SGLang streams as they should be! 🎉

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: enforcing stream_output=True in SGLang for optimal streaming, which is the primary objective across all three modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | The PR description follows the required template structure with all key sections present: Overview (Summary), Details, and related guidance. It clearly explains the changes, rationale, and includes a test plan. |


Signed-off-by: Matej Kosec <mkosec@nvidia.com>
@MatejKosec MatejKosec merged commit 748fee6 into main Jan 20, 2026
32 of 33 checks passed
@MatejKosec MatejKosec deleted the user/mkosec/enforce_sglang_reponse_streaming_flag branch January 20, 2026 22:19
davilu-nvidia pushed a commit that referenced this pull request Jan 24, 2026
…mance (#5510)

This ensures that SGLang returns only new tokens, which avoids the overhead of copying the entire token sequence on each iteration. These copies can become a bottleneck, particularly at long sequence lengths and high concurrency.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: davilu <davilu@nvidia.com>
soodoshll pushed a commit to soodoshll/dynamo that referenced this pull request Feb 12, 2026
…mance (ai-dynamo#5510)

This ensures that SGLang returns only new tokens, which avoids the overhead of copying the entire token sequence on each iteration. These copies can become a bottleneck, particularly at long sequence lengths and high concurrency.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>