# [Session] Add streaming mode with SessionAwareCache fast path #19171

hnyls2002 merged 20 commits into sgl-project:main

## Conversation
**Summary of Changes**

Hello @aurickq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant performance enhancement by implementing "streaming sessions." The feature reduces latency for sequential requests within a session by streamlining KV cache management: after the first request, subsequent requests in a streaming session largely bypass the radix cache and instead inherit KV state directly. The change also adapts the system's idle memory checks so they do not interfere with the latency-sensitive nature of streaming workloads.
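The KV-inheritance fast path described in this summary can be modeled in a few lines. This is an illustrative toy, not sglang's actual code; every name in it (`ToySession`, `match_prefix`, `kv_slot`) is made up for the sketch:

```python
# Toy model of the fast path: after the first turn in a streaming session,
# later turns inherit the session's KV slot directly and skip radix prefix
# matching. Names are illustrative, not sglang's API.

class ToySession:
    def __init__(self, streaming: bool):
        self.streaming = streaming
        self.kv_slot = None  # KV indices currently owned by this session

def match_prefix(session, prompt_tokens, radix_match):
    """Return (prefix_len, source) for a new turn in the session."""
    if session.streaming and session.kv_slot is not None:
        # Fast path: inherit the whole previous KV state, no radix walk.
        return len(session.kv_slot), "session"
    # Slow path (first turn / regular session): radix prefix match.
    return radix_match(prompt_tokens), "radix"

sess = ToySession(streaming=True)
# First turn: nothing to inherit yet, so it falls through to the radix cache.
assert match_prefix(sess, [1, 2, 3], lambda t: 0) == (0, "radix")

sess.kv_slot = [10, 11, 12]  # KV handed off after the first turn completes
# Later turns: inherit the session slot directly.
assert match_prefix(sess, [1, 2, 3, 4], lambda t: 0) == (3, "session")
```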
Activity
```python
if req.session and req.session.streaming:
    # For streaming sessions, only the prompt of the first request is cached, so we need to use
    # cache_protected_len instead of prefix_indices length.
    old_prefix_len = req.cache_protected_len
```
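A numeric illustration of the distinction in the comment above, under the PR's stated rule that only the first streaming turn's prompt is radix-cached (the lengths here are invented for the example):

```python
# In a streaming session, prefix_indices covers the whole inherited KV state,
# but only the first turn's prompt is actually held in the radix cache.
first_prompt_len = 100    # cached in radix -> protected
later_appended_len = 250  # inherited via the session slot, NOT in radix

prefix_indices_len = first_prompt_len + later_appended_len  # full KV prefix
cache_protected_len = first_prompt_len                      # radix-locked part

# Using len(prefix_indices) as the "old prefix" would over-count the amount
# of KV the radix cache is protecting by the appended portion:
assert prefix_indices_len - cache_protected_len == later_appended_len
```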
@ispobock not sure why it doesn't use `old_prefix_len = req.cache_protected_len` in all cases? Seems simpler, but I'm not sure if I am missing some corner case.
Actually, we should use `cache_protected_len` in all cases. It's just that the change from #13714 was never applied to the SWA radix cache.
I think I am not familiar enough with this code path to do it myself :) so I will leave it to someone else
Code Review
The pull request introduces streaming sessions to optimize low-latency append-only workloads by bypassing radix cache operations. The core logic involves inheriting KV states between requests in a session and skipping cache insertion/matching for subsequent requests. The implementation is generally sound and correctly manages the handoff of KV cache memory. However, a critical issue was identified in the idle check logic where returning early skips the sleep call, which will cause 100% CPU usage when streaming sessions are active. Additionally, the session iteration in the idle check could be optimized for performance in scenarios with many open sessions.
```python
self.tree_cache.sanity_check()


def self_check_during_idle(self: Scheduler):
    if any(s.streaming for s in self.sessions.values()):
```
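The busy-loop concern flagged in the review (an early `return` that skips the idle-loop sleep, pinning a CPU core) can be avoided by making the sleep unconditional before deciding whether to run the checks. A minimal sketch, with toy names standing in for the real scheduler:

```python
import time
from types import SimpleNamespace

class ToyScheduler:
    """Toy stand-in for the scheduler; attribute names are assumptions."""

    def __init__(self, sessions, idle_sleep=0.001):
        self.sessions = sessions      # session_id -> session object
        self.idle_sleep = idle_sleep  # hypothetical knob
        self.checked = False

    def sanity_check(self):
        self.checked = True

    def self_check_during_idle(self):
        # Sleep unconditionally so the idle loop never busy-spins, even
        # when streaming sessions cause the checks below to be skipped.
        time.sleep(self.idle_sleep)
        if any(s.streaming for s in self.sessions.values()):
            # Streaming sessions hold KV outside active batches, so the
            # strict idle-memory invariants do not hold; skip the checks.
            return
        self.sanity_check()

s1 = ToyScheduler({"a": SimpleNamespace(streaming=True)})
s1.self_check_during_idle()
assert s1.checked is False  # checks skipped, but the loop still yielded

s2 = ToyScheduler({"a": SimpleNamespace(streaming=False)})
s2.self_check_during_idle()
assert s2.checked is True
```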
/tag-and-rerun-ci
@aurickq SUMMARY (4 sessions x 300 turns)

```
Mode                Avg (all)   Avg (last 10)   Speedup (all)   Speedup (last 10)
-----------------   ---------   -------------   -------------   -----------------
no_session          98.5ms      153.7ms         1.00x           1.00x
regular_session     97.4ms      150.7ms         1.01x           1.02x
streaming_session   70.4ms      72.9ms          1.40x           2.11x
```
/tag-and-rerun-ci
No regression with the new session wrapper abstraction: SUMMARY (4 sessions x 300 turns)

```
Mode                Avg (all)   Avg (last 10)   Speedup (all)   Speedup (last 10)
-----------------   ---------   -------------   -------------   -----------------
no_session          97.4ms      150.4ms         1.00x           1.00x
regular_session     96.2ms      150.0ms         1.01x           1.00x
streaming_session   62.9ms      61.0ms          1.55x           2.47x
----------------------------------------------------------------------
Ran 3 tests in 98.622s
```
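For readers checking the table: each speedup column is just the `no_session` latency divided by the mode's latency for the same column, which can be verified directly from the reported averages:

```python
# Recompute the streaming_session speedups from the benchmark averages above.
no_session_all, streaming_all = 97.4, 62.9          # ms, Avg (all)
no_session_last10, streaming_last10 = 150.4, 61.0   # ms, Avg (last 10)

assert round(no_session_all / streaming_all, 2) == 1.55
assert round(no_session_last10 / streaming_last10, 2) == 2.47
```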
Thank you for taking this to the finish line @hnyls2002!
…l-project#19171) Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Hi @aurickq, may I ask why this PR is only for streaming requests? Will this be generalized to all Responses API scenarios?
- Add `streaming` option to `open_session` and plumb it through engine/session creation.
- Introduce `SessionAwareCache` to keep KV ownership in per-session slots and bypass radix prefix matching on streaming turns after the first request.
- Keep only the first streaming turn's prompt cacheable in radix, and skip prompt/output cache insertion for later turns to avoid radix update overhead in latency-sensitive flows.
- Restrict streaming sessions to append-only behavior (`replace`, `drop_previous_output`, and non-zero `offset` are rejected).
- Add session timeout support and periodic reap logic in the scheduler (`timeout` on session + `reap_timed_out_sessions`).
- Extend memory/runtime checks to account for session-held KV and req slots outside active batches.
- Add tests for streaming-session correctness and latency behavior.

Follow-up PRs will address the open concerns raised in review, including: (1) a compatibility guard/assert for `streaming` with speculative decoding under KV over-allocation, and (2) remaining cleanup/perf/maintainability refinements around the session cache wrapper path.
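The timeout/reap behavior listed above (`timeout` on session plus `reap_timed_out_sessions`) can be sketched as a simple sweep over open sessions. This is an assumed shape, not the PR's implementation; the field names (`timeout`, `last_active`) are hypothetical:

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToySession:
    """Illustrative session record; field names are assumptions."""
    timeout: Optional[float]  # seconds of allowed idleness, None = never reap
    last_active: float        # monotonic timestamp of the last request

def reap_timed_out_sessions(sessions, now=None):
    """Drop sessions idle longer than their timeout; return reaped ids."""
    now = time.monotonic() if now is None else now
    expired = [sid for sid, s in sessions.items()
               if s.timeout is not None and now - s.last_active > s.timeout]
    for sid in expired:
        # In the real scheduler this would also release the session-held
        # KV and req slots back to the allocator.
        del sessions[sid]
    return expired

sessions = {
    "fresh":      ToySession(timeout=30.0, last_active=100.0),
    "stale":      ToySession(timeout=30.0, last_active=10.0),
    "no_timeout": ToySession(timeout=None, last_active=0.0),
}
assert reap_timed_out_sessions(sessions, now=100.0) == ["stale"]
assert set(sessions) == {"fresh", "no_timeout"}
```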