[Core][KV] Retain prefix-cache across hybrid SWA+Full via is_pinned blocks#40676
jhaotingc wants to merge 4 commits into
Code Review
This pull request implements a prefix-cache pinning mechanism to enhance cache retention, managed through new environment variables. It introduces a pinned tier for free blocks that are only demoted to the regular free queue under memory pressure. The review feedback suggests refactoring the pinned_free_deque to use the existing FreeKVCacheBlockQueue infrastructure. This change would allow for O(1) block removal during touch operations and more efficient batch processing in the demote_n and free_blocks methods, avoiding potential performance bottlenecks and stale entries associated with the current deque implementation.
# Oldest-first deque of blocks at ref_cnt=0 AND is_pinned=True.
# These blocks are prefix-cache retention candidates. They are
# NOT drained by get_new_blocks directly; demote_n() under
# pressure flips is_pinned=False and moves them to free_block_queue.
from collections import deque as _deque

self.pinned_free_deque: _deque = _deque()
Using a collections.deque for pinned_free_deque introduces a potential memory leak and performance bottleneck. Since touch (line 414) skips removing blocks from this deque to avoid O(N) complexity, the deque can accumulate a large number of stale entries (blocks that have been re-activated or even re-freed into the regular queue). In a long-running server, this deque could grow significantly, and demote_n would have to iterate through many stale entries.
Instead, you should leverage the existing O(1) doubly linked list infrastructure. By using another instance of FreeKVCacheBlockQueue, you can achieve O(1) removal in touch without stale entries, and O(1) batch operations in demote_n and free_blocks, all while reusing the prev_free_block and next_free_block pointers already present in KVCacheBlock (since a block is either in the regular free queue, the pinned queue, or active).
- # Oldest-first deque of blocks at ref_cnt=0 AND is_pinned=True.
- # These blocks are prefix-cache retention candidates. They are
- # NOT drained by get_new_blocks directly; demote_n() under
- # pressure flips is_pinned=False and moves them to free_block_queue.
- from collections import deque as _deque
- self.pinned_free_deque: _deque = _deque()
+ # Oldest-first queue of blocks at ref_cnt=0 AND is_pinned=True.
+ # These blocks are prefix-cache retention candidates. They are
+ # NOT drained by get_new_blocks directly; demote_n() under
+ # pressure flips is_pinned=False and moves them to free_block_queue.
+ # Reuses the same linked-list pointers as free_block_queue.
+ from vllm.v1.core.kv_cache_utils import FreeKVCacheBlockQueue
+ self.pinned_block_queue = FreeKVCacheBlockQueue([])
Done. __init__ now constructs an empty FreeKVCacheBlockQueue for the pinned tier. The invariant that a block lives in exactly one of free_block_queue (is_pinned=False) or pinned_block_queue (is_pinned=True) is already enforced by the is_pinned routing in free_blocks(), so the shared prev_free_block/next_free_block pointers on KVCacheBlock are safe.
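To make the O(1) claim concrete, here is a minimal sketch of a doubly linked free queue in which two tiers share one pair of pointers per block. `Block` and `LinkedBlockQueue` are hypothetical stand-ins written for illustration, not vLLM's `KVCacheBlock` or `FreeKVCacheBlockQueue`; only the pointer names `prev_free_block` / `next_free_block` and the one-queue-at-a-time invariant come from the discussion above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Block:
    """Hypothetical stand-in for a KV cache block (illustration only)."""
    block_id: int
    ref_cnt: int = 0
    is_pinned: bool = False
    prev_free_block: Optional["Block"] = None
    next_free_block: Optional["Block"] = None


class LinkedBlockQueue:
    """Doubly linked FIFO with sentinel head/tail nodes; remove() is O(1)."""

    def __init__(self) -> None:
        self.head = Block(-1)
        self.tail = Block(-2)
        self.head.next_free_block = self.tail
        self.tail.prev_free_block = self.head
        self.num_blocks = 0

    def append(self, block: Block) -> None:
        # A block may sit in only one list, because it has one pointer pair.
        assert block.prev_free_block is None and block.next_free_block is None
        last = self.tail.prev_free_block
        last.next_free_block = block
        block.prev_free_block = last
        block.next_free_block = self.tail
        self.tail.prev_free_block = block
        self.num_blocks += 1

    def remove(self, block: Block) -> None:
        # Unlink by rewiring the two neighbours; no scan over the queue.
        if block.prev_free_block is None or block.next_free_block is None:
            raise RuntimeError(f"block {block.block_id} is not in a queue")
        block.prev_free_block.next_free_block = block.next_free_block
        block.next_free_block.prev_free_block = block.prev_free_block
        block.prev_free_block = block.next_free_block = None
        self.num_blocks -= 1

    def ids(self) -> list[int]:
        out, node = [], self.head.next_free_block
        while node is not self.tail:
            out.append(node.block_id)
            node = node.next_free_block
        return out


free_q, pinned_q = LinkedBlockQueue(), LinkedBlockQueue()
blocks = [Block(i) for i in range(4)]
for b in blocks:
    b.is_pinned = b.block_id % 2 == 1          # odd ids go to the pinned tier
    (pinned_q if b.is_pinned else free_q).append(b)

# What a touch() on a cache hit would do for a pinned block: O(1) unlink.
pinned_q.remove(blocks[1])
```

Because unlinking leaves no entry behind, there is nothing stale for a later demotion pass to skip over, which is the point the reviewer raises against `collections.deque`.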
+ # ref_cnt=0 means this block is in some free tier (regular
+ # queue if is_pinned=False, pinned_free_deque if is_pinned=True).
  if block.ref_cnt == 0 and not block.is_null:
-     self.free_block_queue.remove(block)
+     if block.is_pinned:
+         # Stale entries are common after demote_n; avoid O(n)
+         # removal by leaving the stale entry in place; demote_n
+         # will skip it on the next pop.
+         pass
+     else:
+         self.free_block_queue.remove(block)
With the suggested refactor to use FreeKVCacheBlockQueue for pinned blocks, touch can now perform O(1) removal for both regular and pinned tiers. This eliminates the need for stale entries and keeps the data structures clean. Additionally, it is safer to explicitly reset is_pinned to False when a block is activated via touch to ensure its state is consistent with its active status.
# ref_cnt=0 means this block is in some free tier (regular
# queue if is_pinned=False, pinned_block_queue if is_pinned=True).
if block.ref_cnt == 0 and not block.is_null:
    if block.is_pinned:
        self.pinned_block_queue.remove(block)
        block.is_pinned = False
    else:
        self.free_block_queue.remove(block)
Done. touch now calls self.pinned_block_queue.remove(block) (O(1) via prev/next pointers) when the block is in the pinned tier, eliminating stale-entry accumulation.
… review)

Replace collections.deque with a second FreeKVCacheBlockQueue instance (pinned_block_queue) for the ref_cnt=0 && is_pinned=True tier. This addresses Copilot review comments on PR vllm-project#40676:

- touch() now does O(1) remove() from either queue via the block prev/next pointers; no more stale-entry accumulation in the pinned deque.
- demote_n() uses batched popleft_n + append_n instead of a per-block loop that updated tail pointers on every iteration.
- free_blocks() batches both tiers with append_n for consistency.

Invariant: a block is in exactly one of free_block_queue (is_pinned=False) or pinned_block_queue (is_pinned=True), never both -- the prev/next pointers on KVCacheBlock only support one linked list at a time. This is already guaranteed by the is_pinned routing in free_blocks().

Semantics unchanged: touch() leaves is_pinned untouched so a later free_blocks() can re-route to the pinned tier when still a retention candidate. Pins survive cache-hit-then-release cycles.

Validated on Gemma-4-31B-it TP=4 H200 48-prefix sweep (28k ISL):

- Warmup (cold) TTFT avg: 1364 ms
- Sweep (warm) TTFT avg: 305 ms (4.47x faster, p99 12.85x faster)
- Full prefix-cache hit confirmed on 2nd pass; no hangs at pool limit.

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
|
@claude review |
|
@claude review |
# Cannot allocate new blocks
return None
num_free_blocks = self.block_pool.get_num_free_blocks()
if num_blocks_to_allocate > num_free_blocks:
Don't we want to guard here with and envs.VLLM_PIN_PREFIX_BLOCKS as well?
Fixed. Thank you!
# Route ref_cnt==0 blocks to the correct tier; batch both.
regular_free: list[KVCacheBlock] = []
pinned_free: list[KVCacheBlock] = []
for block in blocks_list:
Not sure which is better (or if it actually impacts performance), but have you tried profiling doing two list comprehensions, instead of a loop and appending to lists?
Changed to list comprehensions. Thank you!
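For readers following the thread, the comprehension-based routing might look roughly like the sketch below. `Block` and `route_freed_blocks` are hypothetical names for illustration and the real `free_blocks()` in `block_pool.py` may differ; the only behaviour taken from the PR is that blocks reaching `ref_cnt == 0` are split by `is_pinned` and the null block is skipped.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Block:
    """Hypothetical stand-in for a KV cache block (illustration only)."""
    block_id: int
    ref_cnt: int = 1
    is_pinned: bool = False
    is_null: bool = False


def route_freed_blocks(blocks: Iterable[Block]) -> tuple[list[Block], list[Block]]:
    """Drop one reference per block, then split fully released blocks by tier."""
    blocks_list = list(blocks)
    for block in blocks_list:
        if not block.is_null:          # the null block is never ref-counted here
            block.ref_cnt -= 1
    released = [b for b in blocks_list if b.ref_cnt == 0 and not b.is_null]
    # Two filtered comprehensions instead of one loop with two appends.
    regular_free = [b for b in released if not b.is_pinned]
    pinned_free = [b for b in released if b.is_pinned]
    return regular_free, pinned_free


demo = [
    Block(0),                              # released, regular tier
    Block(1, is_pinned=True),              # released, pinned tier
    Block(2, ref_cnt=2),                   # still referenced by another request
    Block(3, ref_cnt=2, is_pinned=True),   # still referenced, stays out of both
]
regular, pinned = route_freed_blocks(demo)
```

Each resulting list can then be appended to its queue in one batch, which is what the commit message means by batching both tiers.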
for b in to_pin:
    b.is_pinned = True
Can do this inside the loop above, before (or after) the to_pin.append call
from vllm.v1.kv_cache_interface import SlidingWindowSpec

if isinstance(self.kv_cache_spec, SlidingWindowSpec):
    pin_blocks = envs.VLLM_PIN_SWA_TOKENS // self.block_size
Have you looked at how this feature affects Mamba-hybrid models? I'm wondering if we could generalize this, in a way such that adding fully-fledged Mamba support won't require changes in this file, for example
Mamba already has three cache modes: 'none', 'align', and 'full'.
In 'full' mode, it keeps the last Mamba state of every chunk, so with an 8k chunk size and a 64k ISL it keeps all of the 8k-boundary states.
In 'align' mode, it keeps only some of the chunk states (say the chunk size is 8k; for a 64k ISL, it may keep an arbitrary subset of the 8k-boundary states).
In 'none' mode, it keeps only the very last state.
In other words, sliding-window pinning "frees up" the out-of-window (OOW) blocks earlier than the last window, but Mamba already keeps only chunk-edge states: caching is already limited to chunk edges and no intermediate Mamba states are stored. So I think this is not generalizable.
|
This pull request has merge conflicts that must be resolved before it can be |
ec6bd2f to 1f5a2c0
… paths

After rebasing the prefix-cache pinning series (vllm-project#40676) onto upstream/main, two newly-added code paths needed wiring before the pin/demote mechanism could actually function under pressure, and a round of reviewer feedback was applied.

Rebase fixes (engine deadlocked without these):

- kv_cache_manager.py: the upstream-added `full_sequence_must_fit` admission gate in allocate_slots returns None without giving the pinned tier a chance to release. Add a VLLM_PIN_PREFIX_BLOCKS-guarded demote_n call inside that branch so the existing pressure-recovery logic engages before admission is refused.
- single_type_kv_cache_manager.py: the upstream-added SlidingWindowManager._cache_block_mask elides older SWA-segment blocks from the prefix-cache hash map ('they get dropped anyway, never serve a hit'). That defeats VLLM_PIN_SWA_TOKENS, whose entire purpose is to keep those blocks alive for future hits. Short-circuit the mask to None when either pin flag is set.

PR review (@roikoren755):

- kv_cache_manager.py: guard the lower allocate_slots demote site with envs.VLLM_PIN_PREFIX_BLOCKS so it is a no-op for users who do not opt in.
- block_pool.py: refactor the free_blocks routing from a single loop with two appends into two filtered list comprehensions for readability.
- single_type_kv_cache_manager.py: move 'block.is_pinned = True' inline with the SWA pin-loop append instead of a second pass over to_pin afterwards.
- single_type_kv_cache_manager.py: TODO comment noting that the SWA drop-and-pin hook should ideally live on the SingleTypeKVCacheManager base (or a per-spec capability interface) so future Mamba-hybrid support does not need to edit this file.

Operator UX (answering 'will pinning help my workload?'):

- kv_cache_manager.py: at engine init, when VLLM_PIN_PREFIX_BLOCKS is set, log a one-line startup hint with the active pin env vars, the pool capacity in blocks/tokens, and a rule-of-thumb estimate of how many ~25k-token prefixes fit. Pinning delivers a win when the unique-prefix working set fits in ~80% of the pool; beyond that demote_n thrashes and hit rate collapses.

Validated on 4xH200 with gemma-4-31B-IT, TRITON_ATTN, 30 prefixes conc=1: TTFT 1970 ms -> 499 ms (3.95x), KV usage 0% -> 64% post-warmup, sweep hit rate 0.22% -> 83.7%.

pre-commit run -a clean.

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
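The rule-of-thumb estimate mentioned in the commit message can be sketched as below. This is an illustrative reconstruction, not the code that was added to `kv_cache_manager.py`: the function name, its parameters, and the example pool size are assumptions; only the ~25k-token prefix size and the ~80% usable fraction come from the message.

```python
def estimate_prefix_capacity(
    num_blocks: int,
    block_size: int,
    prefix_tokens: int = 25_000,   # "~25k-token prefixes" from the commit message
    usable_percent: int = 80,      # "fits in ~80% of the pool"
) -> int:
    """Rough count of unique prefixes that fit before demotion starts to thrash."""
    pool_tokens = num_blocks * block_size
    usable_tokens = pool_tokens * usable_percent // 100   # integer math, no float drift
    return usable_tokens // prefix_tokens


# Hypothetical pool: 10,000 blocks of 16 tokens = 160,000 tokens.
# 80% of that is 128,000 tokens, which holds five 25k-token prefixes.
capacity = estimate_prefix_capacity(num_blocks=10_000, block_size=16)
```

The estimate ignores the per-layer-type footprint of hybrid models, so it is only a coarse upper bound of the kind a startup hint would print.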
|
Hi @jhaotingc, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, Tip Is
|
|
4a0573e to e6c18d1
Add opt-in pinning to preserve prefix-cache blocks through FIFO free-queue recycling in hybrid models like Gemma4. Without this, the last-window SWA blocks and full-attention blocks from completed requests are returned to the FIFO free queue and get recycled, evicting their hashes long before they would otherwise expire. This limits practical prefix-cache retention to ~20 requests even though the pool has room for ~170.

Mechanism via ref_cnt manipulation:

- VLLM_PIN_PREFIX_BLOCKS=1: allocated blocks start at ref_cnt=2. At SWA-DROP for out-of-window blocks, decrement ref_cnt by 2 so they fully release and rejoin the free queue. At end-of-request free, decrement by 1, leaving blocks at ref_cnt=1, pinned with hash intact and not in free queue but still reachable via cached_block_hash_to_block lookup.
- VLLM_PIN_SWA_TOKENS=N: at each SWA-DROP, pin the most-recent N // block_size blocks being dropped ref_cnt 2 to 1 while fully freeing older blocks. This preserves chunk-boundary positions inside the shared prefix range, enabling SWA 64-contig cache-hit scan to succeed on future matching requests.
- VLLM_PIN_MIN_DROP_SIZE=16: skip pinning when a SWA-DROP releases fewer than this many blocks. Decode-step drops carry unique-tail hashes with no prefix-match value; unconditional pinning bloats the pinned set until the pool is exhausted and new requests stall.

Net effect for 60 prefix x 25k token workload on TP=4 bf16:

- Per-prefix steady-state footprint: ~1,100 blocks Full plus SWA last-window
- Pool of 189,245 blocks fits 60 prefixes comfortably
- SWA prefix-cache hit rate: ~90% on cached prefixes, up from ~0%

Files:

- envs.py: declare and parse VLLM_PIN_PREFIX_BLOCKS, VLLM_PIN_SWA_TOKENS, VLLM_PIN_MIN_DROP_SIZE
- block_pool.py: conditional ref_cnt=2 init; ref_cnt_delta param on free_blocks
- single_type_kv_cache_manager.py: per-block pin-vs-free split in remove_skipped_blocks gated on drop-size threshold

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
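The ref_cnt arithmetic in this commit message is easy to trace by hand. The toy model below is only a sketch of that bookkeeping, with hypothetical helper names (`allocate`, `touch`, `free`) and no real queues or hashes; it also traces the case where a second request shares a pinned block, which the follow-up commit describes as a loophole of this scheme.

```python
from dataclasses import dataclass

PIN_INIT = 2   # "allocated blocks start at ref_cnt=2" (VLLM_PIN_PREFIX_BLOCKS=1)


@dataclass
class Block:
    """Toy block that tracks only the reference count (illustration only)."""
    ref_cnt: int = 0


def allocate(block: Block) -> None:
    block.ref_cnt = PIN_INIT


def touch(block: Block) -> None:
    block.ref_cnt += 1            # a cache hit adds one live user


def free(block: Block, delta: int = 1) -> None:
    block.ref_cnt -= delta


# End-of-request free: decrement by 1, block stays at ref_cnt=1 ("pinned").
kept = Block()
allocate(kept)
free(kept, delta=1)
lifecycle_pinned = kept.ref_cnt

# SWA-DROP of an out-of-window block: decrement by 2, block is fully released.
dropped = Block()
allocate(dropped)
free(dropped, delta=2)
lifecycle_dropped = dropped.ref_cnt

# Shared block: R1 left it pinned at 1, R2 cache-hits it, then R2's SWA-DROP
# applies delta=2 and removes R2's reference and R1's pin in one step.
shared_trace = [kept.ref_cnt]
touch(kept)
shared_trace.append(kept.ref_cnt)
free(kept, delta=2)
shared_trace.append(kept.ref_cnt)
```

The trace shows why `ref_cnt == 1` is ambiguous in this design: it can mean either one live user or a pin with no users.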
…demotion

Replaces the ref_cnt=2 pinning hack (from prior commit c302f5f) with an explicit is_pinned field on KVCacheBlock, and folds in pressure-based release of pinned blocks so the scheduler never stalls when pins exceed pool capacity.

Motivation
----------
The ref_cnt=2 approach overloaded ref_cnt=1 to mean either live user OR pinned prefix-cache block. That ambiguity created four loopholes:

1. Shared block + SWA-DROP delta=2: when R2 cache-hits a pinned block from R1 (ref_cnt 1 to 2 via touch) and later SWA-DROPs it via the to_free path, delta=2 undoes both R2 touch and R1 pin at once.
2. Auto-track captured non-pin ref_cnt=1 transitions.
3. Unpin then cache hit then SWA-DROP created negative ref_cnt.
4. Pin status was lost across cache-hit-then-SWA-DROP cycles.

Loophole 3 was the hang source: negative-ref_cnt blocks satisfy neither the ref_cnt==0 nor the ref_cnt==1 branch and leak permanently. The pool shrinks with each affected block until no admission can succeed.

Redesign
--------
- KVCacheBlock gains is_pinned: bool (default False).
- ref_cnt is strictly the live-user count. All deltas are 1.
- BlockPool has pinned_free_deque for (ref_cnt=0, is_pinned=True) blocks.
- get_new_blocks pops from free_block_queue only; pinned_free_deque is drained only via demote_n under pressure.
- free_blocks routes ref_cnt-zero blocks to free_block_queue or pinned_free_deque based on is_pinned. All deltas are 1.
- touch updates ref_cnt but leaves is_pinned unchanged, so pins survive cache-hit-then-release cycles.
- SWA-DROP sets is_pinned=True on to_pin candidates before calling free_blocks; to_free blocks keep their prior is_pinned value.
- kv_cache_manager.free marks all non-null remaining blocks as is_pinned=True before releasing them, protecting the Full-attention prefix and the SWA last-window.

Pressure-based release
----------------------
BlockPool.demote_n(n) flips is_pinned=False on the oldest pinned entries and moves them to free_block_queue. Hashes survive until _maybe_evict_cached_block fires on physical reuse, so demoted blocks remain cache-hit candidates until recycled.

demote_n is invoked from two admission gates in kv_cache_manager so the scheduler cannot stall:

- can_fit_full_sequence: fires when the scheduler reserves the full ISL and would reject the request before allocate_slots is called.
- allocate_slots first admission check (capped budget) and second check (actual demand): both hook demote_n before returning None.

Files
-----
- kv_cache_utils.py: is_pinned field on KVCacheBlock.
- block_pool.py: pinned_free_deque, demote_n, ref_cnt=1 alloc init, free_blocks routes by is_pinned (delta=1 always; null-block skipped to keep strict ref_cnt >= 0 invariant for real blocks), touch preserves is_pinned across pinned-tier stale entries.
- single_type_kv_cache_manager.py: SWA-DROP flags is_pinned before free_blocks; to_pin and to_free both use delta=1.
- kv_cache_manager.py: end-of-request free marks non-null blocks as is_pinned; pressure hooks in can_fit_full_sequence and allocate_slots (both the admission-budget and actual-demand checks).

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
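The pressure-based release described in this commit message can be illustrated with a small two-tier pool. This is a sketch under stated assumptions: `TinyBlockPool` and the free-standing `allocate_slots` are hypothetical, `OrderedDict` stands in for the real queue type, and hash eviction is not modelled. What it does take from the message is that allocation draws only from the regular tier and that `demote_n` moves the oldest pinned blocks over before admission is refused.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """Toy block (illustration only)."""
    block_id: int
    ref_cnt: int = 0
    is_pinned: bool = False


class TinyBlockPool:
    def __init__(self, num_blocks: int) -> None:
        # Insertion order doubles as oldest-first order in both tiers.
        self.free: "OrderedDict[int, Block]" = OrderedDict(
            (i, Block(i)) for i in range(num_blocks))
        self.pinned: "OrderedDict[int, Block]" = OrderedDict()

    def get_new_blocks(self, n: int) -> list[Block]:
        if n > len(self.free):
            raise ValueError("not enough blocks in the regular free tier")
        out = [self.free.popitem(last=False)[1] for _ in range(n)]
        for b in out:
            b.ref_cnt = 1          # ref_cnt is strictly the live-user count
        return out

    def free_blocks(self, blocks: list[Block]) -> None:
        for b in blocks:
            b.ref_cnt -= 1
            if b.ref_cnt == 0:     # route by is_pinned once fully released
                (self.pinned if b.is_pinned else self.free)[b.block_id] = b

    def demote_n(self, n: int) -> int:
        """Move up to n of the oldest pinned blocks to the regular tier."""
        n = min(n, len(self.pinned))
        for _ in range(n):
            _, b = self.pinned.popitem(last=False)
            b.is_pinned = False
            self.free[b.block_id] = b
        return n


def allocate_slots(pool: TinyBlockPool, n: int) -> Optional[list[Block]]:
    shortfall = n - len(pool.free)
    if shortfall > 0:
        pool.demote_n(shortfall)   # best-effort release before refusing
    if n > len(pool.free):
        return None                # still cannot admit the request
    return pool.get_new_blocks(n)


pool = TinyBlockPool(4)
first = allocate_slots(pool, 4)
for b in first:
    b.is_pinned = True             # end-of-request: mark as retention candidates
pool.free_blocks(first)            # all four land in the pinned tier

second = allocate_slots(pool, 3)   # pressure: demotes the three oldest pins
pinned_after_second = list(pool.pinned)
third = allocate_slots(pool, 2)    # only one pin left to demote, so refused
```

Note that the failed third request still demotes the last pinned block, which matches the best-effort wording: demotion is not rolled back when admission fails.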
- Pinned tier backed by FreeKVCacheBlockQueue; oldest entries released via demote_n wired into the admission gates.
- SWA pin logic lives in SlidingWindowManager; base remove_skipped_blocks stays pinning-agnostic. Dead can_fit_full_sequence removed.
- free_blocks fast-paths when pinning is off; lint/format fixes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Add VLLM_PIN_SWA_TOKENS (bool, off by default). When enabled, each SWA drop pins the current sliding-window blocks into a separate tier instead of freeing them, so the contiguous anchor a future request needs to hit the SWA prefix cache stays resident and is evicted last. Pinned blocks are demoted best-effort, oldest-first, under allocation pressure. Improves prefix-cache reuse for shared-prefix traffic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
e6c18d1 to 9ff6c1a
|
This pull request has merge conflicts that must be resolved before it can be |
|
Closing as a duplicate of #43447 |
Purpose
Hybrid SWA + full-attention models (e.g. Gemma-3/4) get near-0% cross-request prefix-cache reuse once the prefix working set grows: as the sliding window advances, SWA layers drop out-of-window blocks and free them, the freed blocks rejoin the FIFO free queue, and they are recycled before the next request can reuse the shared prefix — even when the KV cache still has spare capacity.
This PR adds opt-in SWA prefix-cache pinning behind a single boolean knob, VLLM_PIN_SWA_TOKENS (default false). When enabled, each SWA window-drop pins the current sliding-window blocks (one window per chunk) instead of freeing them, so the contiguous anchor a future request needs to hit the SWA prefix cache stays resident and is evicted last.
Implementation: an is_pinned flag plus a second pinned_block_queue tier in BlockPool; SlidingWindowManager.remove_skipped_blocks owns the pin policy while the base manager stays pinning-agnostic; pinned blocks remain registered in the prefix-cache hash map so they stay hittable; and BlockPool.demote_n releases the oldest pinned blocks (best-effort) under allocation pressure so the scheduler never stalls. Full-attention layers are unchanged.
Defaults:
- VLLM_PIN_SWA_TOKENS: false
- VLLM_PIN_MIN_DROP_SIZE: 16
Test Plan
- tests/v1/core/test_prefix_caching.py (SWA block release, admission gating, full-sequence admission).
- … VLLM_PIN_SWA_TOKENS differing. KV cache = 1.47M tokens.
- VLLM_PIN_MIN_DROP_SIZE ablation (16 vs 0) at 30 prefixes.
- pre-commit run --all-files.
Test Result
Unit tests: 61 passed.
Prefix-working-set scaling, TTFT avg and output throughput (— = not run):
At 15 prefixes the working set fits the cache, nothing is evicted, and ON == OFF within noise (TTFT +0.6 ms, throughput identical): pinning adds no measurable overhead when it is not needed.
Upstream main loses SWA reuse as early as 20 prefixes and re-prefills the full ~28k prefix per request (TTFT 403 → 1990 ms, 73 → 57 tok/s), while pinning (ON) keeps TTFT ~440–458 ms and ~73 tok/s through 60 prefixes (1.70M tokens, above the 1.47M cache). At 30 prefixes that is −78% TTFT and +28% throughput. Decode is unaffected throughout (ITL 12.66 ms in every run).
With pinning, the amount of prefix cache that can be retained grows by roughly max-num-batched-tokens / window_size; in this case (8k / 1k), about 8x more prefix cache can be stored on a server.
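The headline figures quoted above can be re-derived with a few lines of arithmetic. The snippet is only a sanity check on the reported numbers; the exact values 8192 and 1024 for "8k" and "1k", the 28,000-token prefix, and the use of 440 ms as the ON TTFT at 30 prefixes are assumptions read off the text, not measured here.

```python
# Assumed exact values behind the rounded figures in the description.
max_num_batched_tokens = 8192     # "8k"
window_size = 1024                # "1k"
prefix_tokens = 28_000            # "~28k prefix"
cache_tokens = 1_470_000          # "KV cache = 1.47M tokens"

retention_ratio = max_num_batched_tokens // window_size

# 60 prefixes exceed the cache, as the description states (about 1.7M tokens).
working_set_60 = 60 * prefix_tokens

# 30 prefixes: upstream main (OFF) vs pinning (ON).
ttft_off_ms, ttft_on_ms = 1990.0, 440.0
tput_off, tput_on = 57.0, 73.0
ttft_change_pct = (ttft_on_ms - ttft_off_ms) / ttft_off_ms * 100
tput_change_pct = (tput_on - tput_off) / tput_off * 100

print(f"retention ratio: {retention_ratio}x")
print(f"TTFT change: {ttft_change_pct:.1f}%  throughput change: {tput_change_pct:+.1f}%")
```

Under these assumptions the percentages round to the reported −78% TTFT and +28% throughput.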
VLLM_PIN_MIN_DROP_SIZE ablation (ON, 30 prefixes): 16 vs 0 is perf-neutral, with TTFT 440.0 vs 442.3 ms and throughput 72.8 vs 72.9 tok/s (within noise). The filter only matters under real pressure, where =0 pins unique decode-tail blocks and adds demotion churn; 16 is the safe default.
Accuracy is unchanged (pinning only changes which KV blocks are reused, not the computation): GSM8K 5-shot identical within noise on vs off (0.7127 / 0.7043 vs 0.7157 / 0.7043, flexible / strict), SCBench RepoQA Pass@1 73.0% on both.