# [Session] Add streaming mode with SessionAwareCache fast path #19171

hnyls2002 merged 20 commits into sgl-project:main

## Conversation
**Summary of Changes**

Hello @aurickq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant performance enhancement by implementing "streaming sessions." The feature reduces latency for sequential requests within a session by streamlining KV cache management: after the first request, subsequent requests in a streaming session largely bypass the radix cache and instead inherit KV state directly. The change also adapts the system's idle memory checks so they do not interfere with the latency-sensitive nature of streaming workloads.
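The KV-inheritance fast path described in this summary can be modeled in a few lines. This is an illustrative toy, not sglang's actual code; every name in it (`ToySession`, `match_prefix`, `kv_slot`) is made up for the sketch:

```python
# Toy model of the fast path: after the first turn in a streaming session,
# later turns inherit the session's KV slot directly and skip radix prefix
# matching. Names are illustrative, not sglang's API.

class ToySession:
    def __init__(self, streaming: bool):
        self.streaming = streaming
        self.kv_slot = None  # KV indices currently owned by this session

def match_prefix(session, prompt_tokens, radix_match):
    """Return (prefix_len, source) for a new turn in the session."""
    if session.streaming and session.kv_slot is not None:
        # Fast path: inherit the whole previous KV state, no radix walk.
        return len(session.kv_slot), "session"
    # Slow path (first turn / regular session): radix prefix match.
    return radix_match(prompt_tokens), "radix"

sess = ToySession(streaming=True)
# First turn: nothing to inherit yet, so it falls through to the radix cache.
assert match_prefix(sess, [1, 2, 3], lambda t: 0) == (0, "radix")

sess.kv_slot = [10, 11, 12]  # KV handed off after the first turn completes
# Later turns: inherit the session slot directly.
assert match_prefix(sess, [1, 2, 3, 4], lambda t: 0) == (3, "session")
```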
Activity
```python
if req.session and req.session.streaming:
    # For streaming sessions, only the prompt of the first request is cached, so we need to use
    # cache_protected_len instead of prefix_indices length.
    old_prefix_len = req.cache_protected_len
```
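A numeric illustration of the distinction in the comment above, under the PR's stated rule that only the first streaming turn's prompt is radix-cached (the lengths here are invented for the example):

```python
# In a streaming session, prefix_indices covers the whole inherited KV state,
# but only the first turn's prompt is actually held in the radix cache.
first_prompt_len = 100    # cached in radix -> protected
later_appended_len = 250  # inherited via the session slot, NOT in radix

prefix_indices_len = first_prompt_len + later_appended_len  # full KV prefix
cache_protected_len = first_prompt_len                      # radix-locked part

# Using len(prefix_indices) as the "old prefix" would over-count the amount
# of KV the radix cache is protecting by the appended portion:
assert prefix_indices_len - cache_protected_len == later_appended_len
```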
@ispobock not sure why it doesn't use `old_prefix_len = req.cache_protected_len` in all cases? Seems simpler, but I'm not sure if I am missing some corner case.
Actually, we should use `cache_protected_len` in all cases. It's just that the change from #13714 was never applied to the SWA radix cache.
I think I am not familiar enough with this code path to do it myself :) so I will leave it to someone else
Code Review
The pull request introduces streaming sessions to optimize low-latency append-only workloads by bypassing radix cache operations. The core logic involves inheriting KV states between requests in a session and skipping cache insertion/matching for subsequent requests. The implementation is generally sound and correctly manages the handoff of KV cache memory. However, a critical issue was identified in the idle check logic where returning early skips the sleep call, which will cause 100% CPU usage when streaming sessions are active. Additionally, the session iteration in the idle check could be optimized for performance in scenarios with many open sessions.
```python
self.tree_cache.sanity_check()


def self_check_during_idle(self: Scheduler):
    if any(s.streaming for s in self.sessions.values()):
```
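The busy-loop concern flagged in the review (an early `return` that skips the idle-loop sleep, pinning a CPU core) can be avoided by making the sleep unconditional before deciding whether to run the checks. A minimal sketch, with toy names standing in for the real scheduler:

```python
import time
from types import SimpleNamespace

class ToyScheduler:
    """Toy stand-in for the scheduler; attribute names are assumptions."""

    def __init__(self, sessions, idle_sleep=0.001):
        self.sessions = sessions      # session_id -> session object
        self.idle_sleep = idle_sleep  # hypothetical knob
        self.checked = False

    def sanity_check(self):
        self.checked = True

    def self_check_during_idle(self):
        # Sleep unconditionally so the idle loop never busy-spins, even
        # when streaming sessions cause the checks below to be skipped.
        time.sleep(self.idle_sleep)
        if any(s.streaming for s in self.sessions.values()):
            # Streaming sessions hold KV outside active batches, so the
            # strict idle-memory invariants do not hold; skip the checks.
            return
        self.sanity_check()

s1 = ToyScheduler({"a": SimpleNamespace(streaming=True)})
s1.self_check_during_idle()
assert s1.checked is False  # checks skipped, but the loop still yielded

s2 = ToyScheduler({"a": SimpleNamespace(streaming=False)})
s2.self_check_during_idle()
assert s2.checked is True
```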
/tag-and-rerun-ci
@aurickq SUMMARY (4 sessions x 300 turns)

```
Mode                Avg (all)   Avg (last 10)   Speedup (all)   Speedup (last 10)
-----------------   ---------   -------------   -------------   -----------------
no_session          98.5ms      153.7ms         1.00x           1.00x
regular_session     97.4ms      150.7ms         1.01x           1.02x
streaming_session   70.4ms      72.9ms          1.40x           2.11x
```
/tag-and-rerun-ci
No regression with the new session wrapper abstraction: SUMMARY (4 sessions x 300 turns)

```
Mode                Avg (all)   Avg (last 10)   Speedup (all)   Speedup (last 10)
-----------------   ---------   -------------   -------------   -----------------
no_session          97.4ms      150.4ms         1.00x           1.00x
regular_session     96.2ms      150.0ms         1.01x           1.00x
streaming_session   62.9ms      61.0ms          1.55x           2.47x
----------------------------------------------------------------------
Ran 3 tests in 98.622s
```
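For readers checking the table: each speedup column is just the `no_session` latency divided by the mode's latency for the same column, which can be verified directly from the reported averages:

```python
# Recompute the streaming_session speedups from the benchmark averages above.
no_session_all, streaming_all = 97.4, 62.9          # ms, Avg (all)
no_session_last10, streaming_last10 = 150.4, 61.0   # ms, Avg (last 10)

assert round(no_session_all / streaming_all, 2) == 1.55
assert round(no_session_last10 / streaming_last10, 2) == 2.47
```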
Thank you for taking this to the finish line @hnyls2002!
…l-project#19171) Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Hi @aurickq, may I ask why this PR is only for streaming requests? Will this be generalized to all Responses API scenarios?
- Add `streaming` option to `open_session` and plumb it through engine/session creation.
- Introduce `SessionAwareCache` to keep KV ownership in per-session slots and bypass radix prefix matching on streaming turns after the first request.
- Keep only the first streaming turn's prompt cacheable in radix, and skip prompt/output cache insertion for later turns to avoid radix update overhead in latency-sensitive flows.
- Restrict streaming sessions to append-only behavior (`replace`, `drop_previous_output`, and non-zero `offset` are rejected).
- Add session timeout support and periodic reap logic in the scheduler (`timeout` on session + `reap_timed_out_sessions`).
- Extend memory/runtime checks to account for session-held KV and req slots outside active batches.
- Add tests for streaming-session correctness and latency behavior.

Follow-up PRs will address the open concerns raised in review, including: (1) a compatibility guard/assert for `streaming` with speculative decoding under KV over-allocation, and (2) remaining cleanup/perf/maintainability refinements around the session cache wrapper path.
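The timeout/reap behavior listed above (`timeout` on session plus `reap_timed_out_sessions`) can be sketched as a simple sweep over open sessions. This is an assumed shape, not the PR's implementation; the field names (`timeout`, `last_active`) are hypothetical:

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToySession:
    """Illustrative session record; field names are assumptions."""
    timeout: Optional[float]  # seconds of allowed idleness, None = never reap
    last_active: float        # monotonic timestamp of the last request

def reap_timed_out_sessions(sessions, now=None):
    """Drop sessions idle longer than their timeout; return reaped ids."""
    now = time.monotonic() if now is None else now
    expired = [sid for sid, s in sessions.items()
               if s.timeout is not None and now - s.last_active > s.timeout]
    for sid in expired:
        # In the real scheduler this would also release the session-held
        # KV and req slots back to the allocator.
        del sessions[sid]
    return expired

sessions = {
    "fresh":      ToySession(timeout=30.0, last_active=100.0),
    "stale":      ToySession(timeout=30.0, last_active=10.0),
    "no_timeout": ToySession(timeout=None, last_active=0.0),
}
assert reap_timed_out_sessions(sessions, now=100.0) == ["stale"]
assert set(sessions) == {"fresh", "no_timeout"}
```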