[Session] Add streaming mode with SessionAwareCache fast path #19171

Merged
hnyls2002 merged 20 commits into sgl-project:main from aurickq:streaming-session
Feb 28, 2026

Conversation

@aurickq
Contributor

@aurickq aurickq commented Feb 23, 2026

  • Add streaming option to open_session and plumb it through engine/session creation.

  • Introduce SessionAwareCache to keep KV ownership in per-session slots and bypass radix prefix matching on streaming turns after the first request.

  • Keep only the first streaming turn's prompt cacheable in radix and skip prompt/output cache insertion for later turns to avoid radix update overhead in latency-sensitive flows.

  • Restrict streaming sessions to append-only behavior (replace / drop_previous_output / non-zero offset are rejected).

  • Add session timeout support and periodic reap logic in scheduler (timeout on session + reap_timed_out_sessions).

  • Extend memory/runtime checks to account for session-held KV and req slots outside active batches.

  • Add tests for streaming-session correctness and latency behavior.

  • Follow-up PRs will address the open concerns raised in review, including: (1) compatibility guard/assert for streaming with speculative decoding under KV over-allocation, and (2) remaining cleanup/perf/maintainability refinements around the session cache wrapper path.
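
The fast path described in the bullets above can be pictured with a toy model. This is a minimal sketch, not the PR's implementation: `ToyPrefixCache`, `ToySessionAwareCache`, and their methods are hypothetical names, plain token lists stand in for KV indices, and generated output tokens are ignored for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrefixCache:
    """Hypothetical stand-in for a radix cache: stores whole token sequences."""

    entries: list = field(default_factory=list)
    match_calls: int = 0

    def match_prefix(self, tokens):
        # Return the length of the longest cached prefix of `tokens`.
        self.match_calls += 1
        best = 0
        for cached in self.entries:
            n = 0
            for a, b in zip(cached, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def insert(self, tokens):
        self.entries.append(list(tokens))


@dataclass
class ToySessionAwareCache:
    """Hypothetical wrapper that keeps per-session state outside the prefix cache."""

    inner: ToyPrefixCache
    slots: dict = field(default_factory=dict)  # session_id -> tokens held so far

    def begin_turn(self, session_id, new_tokens):
        """Return (full_tokens, reused_len) for this turn."""
        if session_id in self.slots:
            # Later turns: reuse the session slot, skip prefix matching,
            # and skip insertion into the inner cache.
            held = self.slots[session_id]
            full = held + list(new_tokens)
            reused = len(held)
        else:
            # First turn: take the normal prefix-matching path and make
            # the prompt cacheable.
            full = list(new_tokens)
            reused = self.inner.match_prefix(full)
            self.inner.insert(full)
        self.slots[session_id] = full
        return full, reused

    def close(self, session_id):
        # Dropping the slot models releasing the session-held KV.
        self.slots.pop(session_id, None)
```

In this toy, only the first call to `begin_turn` for a session touches the inner cache; every later turn costs a dictionary lookup and a list append, which is the property the streaming path relies on.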

@gemini-code-assist
Contributor

Summary of Changes

This pull request adds streaming sessions, which reduce latency for sequential requests within a session by simplifying KV cache management. After the first request, a streaming session largely bypasses the radix cache and inherits KV state directly from the previous request. The idle memory checks are also adjusted so that they do not interfere with latency-sensitive streaming workloads.

Highlights

  • Streaming Session Implementation: Introduced a streaming flag to open_session requests, enabling a low-latency path for sessions by optimizing KV cache operations.
  • Optimized KV Cache Management: For streaming sessions, only the very first request's prompt is cached. Subsequent requests skip prompt and output caching, and instead inherit KV states from the previous request without costly radix tree matching.
  • Idle Memory Check Suspension: Disabled the idle memory self-check mechanism when streaming sessions are active to prevent blocking latency-sensitive requests and to accommodate the unique KV memory holding pattern of streaming sessions.
  • Session Request Handling Logic: Modified session request creation to enforce append-only behavior for streaming sessions, disallowing operations like replace or offset that would complicate the low-latency design.
  • New Test Case: Added a new manual test case (test_streaming_session) to validate the functionality and performance characteristics of the new streaming session feature.
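
The append-only restriction could be enforced with a check along these lines. This is a sketch under assumptions: `SessionParams` and `validate_streaming_request` are hypothetical, and only the field names `replace`, `drop_previous_output`, and `offset` come from the PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionParams:
    """Hypothetical per-request session parameters."""

    replace: bool = False
    drop_previous_output: bool = False
    offset: Optional[int] = None


def validate_streaming_request(params: SessionParams, streaming: bool) -> Optional[str]:
    """Return an error message if the request is not append-only, else None."""
    if not streaming:
        return None  # regular sessions keep their existing behavior
    if params.replace:
        return "streaming sessions do not support replace"
    if params.drop_previous_output:
        return "streaming sessions do not support drop_previous_output"
    if params.offset:  # None and 0 are both accepted
        return "streaming sessions do not support a non-zero offset"
    return None
```

Rejecting these options up front means a later turn can only extend what the session already holds, which is what makes direct KV inheritance safe.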

Changelog

  • python/sglang/srt/managers/io_struct.py
    • Added an optional 'streaming' boolean field to the OpenSessionReqInput class to indicate if a session should be streaming.
  • python/sglang/srt/managers/schedule_batch.py
    • Imported the Session class for type hinting.
    • Updated the Req class's constructor to accept a Session object instead of a session ID, and added new flags skip_cache_unfinished and skip_cache_finished.
    • Modified init_next_round_input to bypass radix prefix matching for streaming sessions, allowing direct inheritance of KV states.
    • Introduced inherit_kv_states method in the Req class to facilitate efficient state transfer between sequential requests in a streaming session.
  • python/sglang/srt/managers/schedule_policy.py
    • Renamed _req_inc_lock_ref to _req_ensure_lock_ref and updated its logic to conditionally skip lock reference increments for streaming sessions.
    • Updated all call sites of the renamed method to _req_ensure_lock_ref.
  • python/sglang/srt/managers/scheduler.py
    • Added a condition to event_loop_normal to skip self_check_during_idle if any streaming sessions are active.
    • Modified open_session to pass the streaming flag to the Session constructor.
    • Enhanced close_session to properly handle KV cache release for streaming sessions, including detaching running requests.
  • python/sglang/srt/managers/scheduler_output_processor_mixin.py
    • Introduced maybe_release_kv_cache to manage KV cache release based on whether a request belongs to a streaming session.
    • Replaced direct release_kv_cache calls with maybe_release_kv_cache across various processing functions.
    • Added a check for skip_cache_unfinished before caching unfinished requests in the tree cache.
  • python/sglang/srt/managers/scheduler_runtime_checker_mixin.py
    • Modified self_check_during_idle to return early if any streaming sessions are present, preventing unnecessary checks.
  • python/sglang/srt/managers/session_controller.py
    • Added a streaming parameter to the Session class constructor and stored it as an attribute.
    • Updated create_req to enforce streaming session constraints, such as disallowing 'replace' or 'offset' operations.
    • Modified create_req to pass the Session object directly to the Req constructor.
    • Implemented logic in create_req for streaming sessions to set skip_cache_finished, inherit KV states, and set skip_cache_unfinished for subsequent requests.
  • python/sglang/srt/mem_cache/memory_pool.py
    • Adjusted an assertion in the alloc method to correctly count only chunked requests when checking for req_pool_idx reuse.
  • python/sglang/srt/mem_cache/swa_radix_cache.py
    • Modified cache_finished_req and cache_unfinished_req to use req.cache_protected_len for streaming sessions, reflecting the selective caching strategy.
  • test/manual/test_session_control.py
    • Added test_streaming_session to the list of manual tests to verify the new streaming session functionality.

Activity

  • The pull request introduces a new streaming flag in OpenSessionReqInput to enable a low-latency session path.
  • The Req object now directly holds a reference to its Session and includes flags to control cache insertion (skip_cache_unfinished, skip_cache_finished).
  • Logic was added to Req.init_next_round_input to allow streaming sessions to inherit KV states directly, bypassing radix cache matching for subsequent requests.
  • A new inherit_kv_states method was added to the Req class to facilitate efficient state transfer.
  • The Session class constructor was updated to accept and store the streaming flag.
  • The Session.create_req method was enhanced to enforce append-only behavior for streaming sessions and to manage KV state inheritance and caching flags.
  • The scheduler's self_check_during_idle mechanism is now conditionally skipped when streaming sessions are active to avoid latency impacts.
  • The Scheduler's open_session and close_session methods were updated to correctly initialize and tear down streaming sessions, including proper KV cache lock management.
  • The scheduler_output_processor_mixin was refactored with a maybe_release_kv_cache method to handle conditional KV cache release based on session type.
  • The radix cache logic in swa_radix_cache.py was adjusted to account for the specific caching behavior of streaming sessions, using cache_protected_len.
  • A new manual test, test_streaming_session, was added to validate the functionality of streaming sessions, including cache behavior and memory management.
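
The conditional release described for maybe_release_kv_cache can be illustrated as follows. This is a sketch only: `ToySession`, `ToyReq`, `toy_maybe_release_kv_cache`, and the `release` callback are hypothetical, and the real method's signature is not shown in this thread.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToySession:
    streaming: bool = False


@dataclass
class ToyReq:
    rid: str
    session: Optional[ToySession] = None


def toy_maybe_release_kv_cache(req: ToyReq, release: Callable[[ToyReq], None]) -> bool:
    """Release KV for `req` unless a streaming session keeps owning it.

    Returns True if the KV was released.
    """
    if req.session is not None and req.session.streaming:
        # The next turn inherits this KV, so it stays allocated until the
        # session is closed or times out.
        return False
    release(req)
    return True
```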

if req.session and req.session.streaming:
    # For streaming sessions, only the prompt of the first request is cached, so we need to use
    # cache_protected_len instead of prefix_indices length.
    old_prefix_len = req.cache_protected_len
Contributor Author

@aurickq aurickq Feb 23, 2026

@ispobock not sure why it doesn't use old_prefix_len = req.cache_protected_len in all cases? seems simpler but not sure if i am missing some corner case

Collaborator

Actually, we should use cache_protected_len in all cases. The change in #13714 just wasn't applied to the SWA radix cache.

Contributor Author

I think I am not familiar enough with this code path to do it myself :) so I will leave it to someone else

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The pull request introduces streaming sessions to optimize low-latency append-only workloads by bypassing radix cache operations. The core logic involves inheriting KV states between requests in a session and skipping cache insertion/matching for subsequent requests. The implementation is generally sound and correctly manages the handoff of KV cache memory. However, a critical issue was identified in the idle check logic where returning early skips the sleep call, which will cause 100% CPU usage when streaming sessions are active. Additionally, the session iteration in the idle check could be optimized for performance in scenarios with many open sessions.
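
The CPU concern raised in this review can be reproduced in isolation. The sketch below is hypothetical (`idle_check_buggy`, `idle_check_fixed`, `count_sleeps`, and the injected `sleep` callback are not from the PR) and only illustrates why an early return placed before a sleep turns an idle loop into a busy loop.

```python
from typing import Callable


def idle_check_buggy(has_streaming: bool, sleep: Callable[[float], None]) -> None:
    if has_streaming:
        return  # the early return also skips the sleep below
    # ... memory checks would run here ...
    sleep(0.001)


def idle_check_fixed(has_streaming: bool, sleep: Callable[[float], None]) -> None:
    if not has_streaming:
        pass  # ... memory checks would run here ...
    sleep(0.001)  # always yield the CPU, whether or not the checks ran


def count_sleeps(check, has_streaming: bool, iterations: int = 100) -> int:
    """Run `check` in a loop and count how many times it asked to sleep."""
    calls = []
    for _ in range(iterations):
        check(has_streaming, calls.append)
    return len(calls)
```

With a streaming session present, the buggy variant never sleeps across 100 idle iterations, while the fixed variant sleeps on every one.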

Comment thread python/sglang/srt/managers/scheduler_runtime_checker_mixin.py Outdated
        self.tree_cache.sanity_check()

    def self_check_during_idle(self: Scheduler):
        if any(s.streaming for s in self.sessions.values()):
Contributor

medium

Iterating over all sessions using any(s.streaming for s in self.sessions.values()) on every idle check can become a performance bottleneck if there are a large number of open sessions. Consider maintaining a counter of active streaming sessions in the Scheduler class to make this check O(1).

Contributor Author

should be fine
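
For reference, the counter suggested in this thread could look roughly like the following. It is a sketch with hypothetical names (`SessionRegistry` and its methods), not code from the PR, where the linear scan was judged acceptable.

```python
class SessionRegistry:
    """Hypothetical registry that answers "any streaming session?" in O(1)."""

    def __init__(self):
        self._streaming_by_id = {}  # session_id -> streaming flag
        self._num_streaming = 0

    def open(self, session_id: str, streaming: bool = False) -> None:
        if session_id in self._streaming_by_id:
            raise ValueError(f"session {session_id!r} already exists")
        self._streaming_by_id[session_id] = streaming
        self._num_streaming += int(streaming)

    def close(self, session_id: str) -> None:
        # Closing an unknown session is a no-op, so the counter cannot go negative.
        streaming = self._streaming_by_id.pop(session_id, None)
        if streaming:
            self._num_streaming -= 1

    def has_streaming(self) -> bool:
        return self._num_streaming > 0
```

The trade-off is extra bookkeeping on open and close in exchange for a constant-time check on every idle iteration.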

@ispobock
Collaborator

/tag-and-rerun-ci

@hnyls2002 hnyls2002 removed the run-ci label Feb 26, 2026
@hnyls2002
Collaborator

@aurickq
Added a latency test in CI to prevent performance regression:

  SUMMARY  (4 sessions x 300 turns)
Mode                 Avg (all)    Avg (last 10)    Speedup (all)    Speedup (last 10)
-----------------  -----------  ---------------  ---------------  -------------------
no_session              98.5ms          153.7ms            1.00x                1.00x
regular_session         97.4ms          150.7ms            1.01x                1.02x
streaming_session       70.4ms           72.9ms            1.40x                2.11x
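
The speedup columns appear to be the no_session latency divided by each mode's latency. That reading is an assumption, but it reproduces the figures in the table above:

```python
# Latencies in ms, copied from the table above:
# (average over all turns, average over the last 10 turns).
latencies = {
    "no_session": (98.5, 153.7),
    "regular_session": (97.4, 150.7),
    "streaming_session": (70.4, 72.9),
}

# Assumed definition: speedup = baseline latency / mode latency.
base_all, base_last = latencies["no_session"]
speedups = {
    mode: (round(base_all / avg_all, 2), round(base_last / avg_last, 2))
    for mode, (avg_all, avg_last) in latencies.items()
}

for mode, (s_all, s_last) in speedups.items():
    print(f"{mode:<18} {s_all:.2f}x {s_last:.2f}x")
```

The gap between the "all" and "last 10" columns reflects that non-streaming latency grows over the 300 turns (98.5ms on average versus 153.7ms at the end), while streaming latency stays nearly flat (70.4ms versus 72.9ms).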

@hnyls2002
Collaborator

/tag-and-rerun-ci

@hnyls2002
Collaborator

No regression with new session wrapper abstraction:

  SUMMARY  (4 sessions x 300 turns)
Mode                 Avg (all)    Avg (last 10)    Speedup (all)    Speedup (last 10)
-----------------  -----------  ---------------  ---------------  -------------------
no_session              97.4ms          150.4ms            1.00x                1.00x
regular_session         96.2ms          150.0ms            1.01x                1.00x
streaming_session       62.9ms           61.0ms            1.55x                2.47x

----------------------------------------------------------------------
Ran 3 tests in 98.622s

Comment thread python/sglang/srt/mem_cache/common.py
@hnyls2002
Collaborator

@hnyls2002 hnyls2002 changed the title from "Implement streaming sessions" to "[Session] Add streaming mode with SessionAwareCache fast path" Feb 28, 2026
@hnyls2002 hnyls2002 merged commit c6cb0c9 into sgl-project:main Feb 28, 2026
387 of 425 checks passed
@aurickq
Contributor Author

aurickq commented Feb 28, 2026

thank you for taking this to the finish line @hnyls2002

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
@llc-kc
Contributor

llc-kc commented Mar 11, 2026

Hi @aurickq, may I ask why this PR is only for streaming requests? Will this be generalized to all Responses API scenarios?
Thanks.

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
@hnyls2002 hnyls2002 mentioned this pull request Apr 28, 2026
5 participants