feat(sglang): enforce stream_output=True for optimal streaming performance #5510
…mance

Dynamo's streaming handlers now expect disjoint output_ids from SGLang (only new tokens since last output) rather than cumulative tokens.

Changes:
- Force stream_output=True in args.py after parsing ServerArgs
- Update decode_handler to pass through disjoint token segments directly
- Update multimodal worker_handler with the same fix

This aligns Dynamo with SGLang's efficient streaming mode where only delta tokens are transmitted, reducing redundant data transfer.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Walkthrough

The pull request enforces stream output with disjoint token segments across Dynamo's SGLang integration. Stream output is now forcibly enabled in argument parsing, and both token stream handlers are refactored to forward token segments directly rather than computing them from running totals or offsets.
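As a rough illustration of "forcibly enabled in argument parsing", the sketch below overrides the flag after the arguments have been built. `FakeServerArgs` and `enforce_stream_output` are hypothetical names invented for this example; the dataclass is only a stand-in for SGLang's `ServerArgs`, and the actual code in `args.py` may look different.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class FakeServerArgs:
    """Hypothetical stand-in for SGLang's ServerArgs (only the fields used here)."""
    model_path: str
    stream_output: bool = False


def enforce_stream_output(server_args):
    """Force stream_output=True, logging when a different value is overridden."""
    if not server_args.stream_output:
        logger.warning(
            "Overriding stream_output=False: handlers expect disjoint token segments"
        )
        server_args.stream_output = True
    return server_args


# Assumed usage: apply the override right after the arguments are parsed.
args = enforce_stream_output(FakeServerArgs(model_path="dummy-model"))
```

Overriding after parsing, rather than changing a default, means a user-supplied value cannot silently put the engine back into cumulative mode.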
…mance (#5510)

This ensures that SGLang returns only new tokens, which avoids the overhead of copying the entire token sequence on every iteration. These copies can become a bottleneck, particularly at long sequence lengths and high concurrency.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: davilu <davilu@nvidia.com>

…mance (ai-dynamo#5510)

This ensures that SGLang returns only new tokens, which avoids the overhead of copying the entire token sequence on every iteration. These copies can become a bottleneck, particularly at long sequence lengths and high concurrency.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
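To give a feel for why the copies matter, here is a back-of-the-envelope count of token ids handed over per request. It assumes, purely for illustration, that the engine emits one output per generated token; `elements_sent` is a hypothetical helper, not part of Dynamo or SGLang.

```python
def elements_sent(num_tokens: int, cumulative: bool) -> int:
    """Count token ids handed over across a whole generation.

    Assumes one output per generated token (an illustrative simplification).
    """
    if cumulative:
        # Output k carries all k tokens so far: 1 + 2 + ... + n = n(n+1)/2.
        return sum(range(1, num_tokens + 1))
    # Disjoint mode: each token is handed over exactly once.
    return num_tokens


cumulative_total = elements_sent(1000, cumulative=True)   # 500500
disjoint_total = elements_sent(1000, cumulative=False)    # 1000
```

Under this assumption the cumulative mode grows quadratically with sequence length, while disjoint mode grows linearly, and the gap is multiplied by the number of concurrent requests.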
Summary
- Enforce `stream_output=True` in SGLang ServerArgs for Dynamo

Description

With `stream_output=True`, SGLang sends only new tokens since the last output (disjoint segments) rather than all tokens generated so far (cumulative). This change:
- Forces `stream_output=True` in `args.py` after parsing ServerArgs
- Updates `_process_token_stream` in decode_handler: removes tracking/slicing logic
- Updates `process_sglang_stream` in multimodal worker_handler: same fix

This aligns Dynamo with SGLang's efficient streaming mode, reducing redundant data transfer.
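The difference between the two handler styles can be sketched as follows. This is a minimal, self-contained illustration, not the actual Dynamo code: `fake_engine_stream`, `process_cumulative`, `process_disjoint`, and `collect` are hypothetical names, and the `{"output_ids": ...}` / `{"token_ids": ...}` dictionary shapes are assumptions made for the example.

```python
import asyncio


async def fake_engine_stream(tokens, chunk, cumulative):
    """Hypothetical stand-in for the engine's output stream."""
    for start in range(0, len(tokens), chunk):
        end = min(start + chunk, len(tokens))
        ids = tokens[:end] if cumulative else tokens[start:end]
        yield {"output_ids": list(ids)}


async def process_cumulative(stream):
    """Cumulative style: track how many tokens were seen and slice off the rest."""
    num_seen = 0
    async for res in stream:
        output_ids = res["output_ids"]
        new_ids = output_ids[num_seen:]
        num_seen = len(output_ids)
        yield {"token_ids": new_ids}


async def process_disjoint(stream):
    """Disjoint style: every output already holds only the new tokens."""
    async for res in stream:
        yield {"token_ids": res["output_ids"]}


async def collect(agen):
    """Concatenate all forwarded token ids into one list."""
    out = []
    async for item in agen:
        out.extend(item["token_ids"])
    return out


tokens = list(range(7))
old = asyncio.run(collect(process_cumulative(fake_engine_stream(tokens, 3, True))))
new = asyncio.run(collect(process_disjoint(fake_engine_stream(tokens, 3, False))))
```

Both paths forward the same tokens downstream; the disjoint handler simply needs no bookkeeping. Note that a disjoint handler fed a cumulative stream would emit duplicated tokens, which is presumably why the flag is enforced rather than left configurable.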