
[Core][KV] Retain prefix-cache across hybrid SWA+Full via is_pinned blocks#40676

Closed
jhaotingc wants to merge 4 commits into
vllm-project:mainfrom
jhaotingc:jhaotingc/gemma-4-swa-flush-OOW-first

Conversation

@jhaotingc

@jhaotingc jhaotingc commented Apr 23, 2026

Contributor

Purpose

Hybrid SWA + full-attention models (e.g. Gemma-3/4) get near-0% cross-request prefix-cache reuse once the prefix working set grows: as the sliding window advances, SWA layers drop out-of-window blocks and free them, the freed blocks rejoin the FIFO free queue, and they are recycled before the next request can reuse the shared prefix — even when the KV cache still has spare capacity.


This PR adds opt-in SWA prefix-cache pinning behind a single boolean knob VLLM_PIN_SWA_TOKENS (default false). When enabled, each SWA window-drop PINS the current sliding-window blocks (one window per chunk) instead of freeing them, so the contiguous anchor a future request needs to hit the SWA prefix cache stays resident and is evicted last.

Implementation: an is_pinned flag plus a second pinned_block_queue tier in BlockPool; SlidingWindowManager.remove_skipped_blocks owns the pin policy while the base manager stays pinning-agnostic; pinned blocks remain registered in the prefix-cache hash map so they stay hittable; and BlockPool.demote_n releases the oldest pinned blocks (best-effort) under allocation pressure so the scheduler never stalls. Full-attention layers are unchanged.
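The two-tier idea described above can be sketched as follows. This is a toy model, not the vLLM code: `ToyBlock`, `ToyBlockPool`, and the `allocate` helper are hypothetical names, and an `OrderedDict` stands in for `FreeKVCacheBlockQueue` purely for illustration. It shows routing by `is_pinned` on free and oldest-first, best-effort demotion under allocation pressure.

```python
from __future__ import annotations

from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class ToyBlock:
    block_id: int
    ref_cnt: int = 0
    is_pinned: bool = False


class ToyBlockPool:
    """Toy pool with a regular free tier and a pinned tier (illustrative only)."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [ToyBlock(i) for i in range(num_blocks)]
        # OrderedDict keeps oldest-first order and supports O(1) removal by key.
        self.free_tier: OrderedDict[int, ToyBlock] = OrderedDict(
            (b.block_id, b) for b in self.blocks
        )
        self.pinned_tier: OrderedDict[int, ToyBlock] = OrderedDict()

    def get_new_blocks(self, n: int) -> list[ToyBlock]:
        # Allocation draws from the regular free tier only.
        if n > len(self.free_tier):
            raise RuntimeError("not enough free blocks")
        out = []
        for _ in range(n):
            _, block = self.free_tier.popitem(last=False)
            block.ref_cnt = 1
            out.append(block)
        return out

    def free_blocks(self, blocks: list[ToyBlock]) -> None:
        # Drop one reference; blocks reaching ref_cnt == 0 are routed by is_pinned.
        for block in blocks:
            block.ref_cnt -= 1
            if block.ref_cnt == 0:
                tier = self.pinned_tier if block.is_pinned else self.free_tier
                tier[block.block_id] = block

    def demote_n(self, n: int) -> int:
        # Best-effort: move up to n of the oldest pinned blocks to the free tier.
        moved = 0
        while moved < n and self.pinned_tier:
            _, block = self.pinned_tier.popitem(last=False)
            block.is_pinned = False
            self.free_tier[block.block_id] = block
            moved += 1
        return moved

    def allocate(self, n: int) -> list[ToyBlock] | None:
        # Stand-in for an admission gate: demote pinned blocks before refusing.
        shortfall = n - len(self.free_tier)
        if shortfall > 0:
            self.demote_n(shortfall)
        if n > len(self.free_tier):
            return None
        return self.get_new_blocks(n)
```

In this sketch a request that needs more blocks than the free tier holds first drains the oldest pinned blocks, and only returns `None` if the two tiers together cannot satisfy it.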

| Env var | Default | Purpose |
| --- | --- | --- |
| VLLM_PIN_SWA_TOKENS | false | On/off switch for SWA prefix-cache pinning. When enabled, each SWA window-drop pins the current sliding-window blocks (one window per chunk) instead of freeing them. |
| VLLM_PIN_MIN_DROP_SIZE | 16 | Minimum drop size (in blocks) required to pin; filters out small decode-step drops. |
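A minimal sketch of how the two knobs could interact on a single window-drop. The function name `split_drop` and its parameters are hypothetical, and the choice to pin the most recent window of the dropped blocks follows the wording of an earlier commit message in this PR rather than the final code, so treat it as an assumption.

```python
def split_drop(dropped, window_blocks, min_drop_size=16, pin_enabled=True):
    """Split one sliding-window drop into (to_pin, to_free).

    `dropped` is assumed to be ordered oldest-first. Drops smaller than
    `min_drop_size` (typically decode-step drops) are freed entirely.
    """
    if not pin_enabled or window_blocks <= 0 or len(dropped) < min_drop_size:
        return [], list(dropped)
    # Assumption: keep the most recent window of the drop, free the rest.
    to_pin = list(dropped[-window_blocks:])
    to_free = list(dropped[:-window_blocks])
    return to_pin, to_free
```

For example, with a block size of 16 tokens (an assumed value), an 8k-token chunk drops 512 blocks and a 1k-token window corresponds to 64 blocks, so 64 blocks would be pinned and 448 freed, while a 1-block decode-step drop would be freed outright.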

Test Plan

  • Unit tests: tests/v1/core/test_prefix_caching.py (SWA block release, admission gating, full-sequence admission).
  • A/B serving on Gemma-4-31B-IT, TP4, H200, conc=1, OSL=400, prefix caching on, sweeping the number of distinct ~28k-token prefixes (the working-set size) with only VLLM_PIN_SWA_TOKENS differing. KV cache = 1.47M tokens.
  • VLLM_PIN_MIN_DROP_SIZE ablation (16 vs 0) at 30 prefixes.
  • Accuracy: GSM8K 5-shot and SCBench RepoQA, pinning on vs off.
  • Lint: full pre-commit run --all-files.

Test Result

Unit tests: 61 passed.

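Prefix-working-set scaling: average TTFT and output throughput (main was not run at 50 and 60 prefixes):

| Prefixes (total input) | TTFT (ms), main | TTFT (ms), PR | tok/s, main | tok/s, PR |
| --- | --- | --- | --- | --- |
| 15 (0.43M, fits cache) | 403 | 404 | 73.3 | 73.3 |
| 30 (0.85M) | 1992 | 440 | 56.8 | 72.8 |
| 50 (1.42M) | not run | 458 | not run | 74.2 |
| 60 (1.70M) | not run | 452 | not run | 72.7 |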

At 15 prefixes the working set fits the cache, nothing is evicted, and ON == OFF within noise (TTFT +0.6 ms, throughput identical) — pinning adds no measurable overhead when it is not needed.
Upstream main loses SWA reuse as early as 20 prefixes and re-prefills the full ~28k prefix per request (TTFT 403 → 1990 ms, 73 → 57 tok/s), while pinning (ON) keeps TTFT ~440–458 ms and ~73 tok/s through 60 prefixes (1.70M tokens, above the 1.47M cache). At 30 prefixes that is −78% TTFT and +28% throughput. Decode is unaffected throughout (ITL 12.66 ms in every run).
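The quoted percentages at 30 prefixes can be rechecked from the table values. This is only arithmetic on the reported numbers; the helper name `pct_change` is made up for the example.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values at 30 prefixes, taken from the results table (main vs. PR).
ttft_main_ms, ttft_pr_ms = 1992, 440
tps_main, tps_pr = 56.8, 72.8

ttft_delta = pct_change(ttft_main_ms, ttft_pr_ms)
tps_delta = pct_change(tps_main, tps_pr)
print(f"TTFT: {ttft_delta:+.0f}%  throughput: {tps_delta:+.0f}%")
# prints "TTFT: -78%  throughput: +28%"
```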

Because only one window per chunk is pinned rather than the whole chunk, the SWA prefix cache can hold roughly max-num-batched-tokens / window_size times as many prefixes. In this setup (8k / 1k) that is about 8x more prefix cache per server.
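The ratio above is simple enough to write down. The sketch assumes "8k" and "1k" mean 8192 and 1024 tokens, and the function name is hypothetical.

```python
def swa_capacity_gain(max_num_batched_tokens: int, window_size: int) -> float:
    """Ratio of chunk size to sliding-window size (rule of thumb from the PR text)."""
    if max_num_batched_tokens <= 0 or window_size <= 0:
        raise ValueError("both sizes must be positive")
    return max_num_batched_tokens / window_size


# Assumed values: 8k-token chunks, 1k-token sliding window.
gain = swa_capacity_gain(8 * 1024, 1 * 1024)
print(f"about {gain:.0f}x")
```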

VLLM_PIN_MIN_DROP_SIZE ablation (ON, 30 prefixes): 16 vs 0 is perf-neutral — TTFT 440.0 vs 442.3 ms, throughput 72.8 vs 72.9 tok/s (within noise). The filter only matters under real pressure, where =0 pins unique decode-tail blocks and adds demotion churn; 16 is the safe default.

Accuracy is unchanged (pinning only changes which KV blocks are reused, not the computation): GSM8K 5-shot identical within noise on vs off (0.7127 / 0.7043 vs 0.7157 / 0.7043, flexible / strict), SCBench RepoQA Pass@1 73.0% on both.



@mergify mergify Bot added the v1 label Apr 23, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements a prefix-cache pinning mechanism to enhance cache retention, managed through new environment variables. It introduces a pinned tier for free blocks that are only demoted to the regular free queue under memory pressure. The review feedback suggests refactoring the pinned_free_deque to use the existing FreeKVCacheBlockQueue infrastructure. This change would allow for O(1) block removal during touch operations and more efficient batch processing in the demote_n and free_blocks methods, avoiding potential performance bottlenecks and stale entries associated with the current deque implementation.

Comment thread vllm/v1/core/block_pool.py Outdated
Comment on lines +184 to +190
# Oldest-first deque of blocks at ref_cnt=0 AND is_pinned=True.
# These blocks are prefix-cache retention candidates. They are
# NOT drained by get_new_blocks directly; demote_n() under
# pressure flips is_pinned=False and moves them to free_block_queue.
from collections import deque as _deque

self.pinned_free_deque: _deque = _deque()

Contributor

high

Using a collections.deque for pinned_free_deque introduces a potential memory leak and performance bottleneck. Since touch (line 414) skips removing blocks from this deque to avoid O(N) complexity, the deque can accumulate a large number of stale entries (blocks that have been re-activated or even re-freed into the regular queue). In a long-running server, this deque could grow significantly, and demote_n would have to iterate through many stale entries.

Instead, you should leverage the existing O(1) doubly linked list infrastructure. By using another instance of FreeKVCacheBlockQueue, you can achieve O(1) removal in touch without stale entries, and O(1) batch operations in demote_n and free_blocks, all while reusing the prev_free_block and next_free_block pointers already present in KVCacheBlock (since a block is either in the regular free queue, the pinned queue, or active).

Suggested change
- # Oldest-first deque of blocks at ref_cnt=0 AND is_pinned=True.
- # These blocks are prefix-cache retention candidates. They are
- # NOT drained by get_new_blocks directly; demote_n() under
- # pressure flips is_pinned=False and moves them to free_block_queue.
- from collections import deque as _deque
- self.pinned_free_deque: _deque = _deque()
+ # Oldest-first queue of blocks at ref_cnt=0 AND is_pinned=True.
+ # These blocks are prefix-cache retention candidates. They are
+ # NOT drained by get_new_blocks directly; demote_n() under
+ # pressure flips is_pinned=False and moves them to free_block_queue.
+ # Reuses the same linked-list pointers as free_block_queue.
+ from vllm.v1.core.kv_cache_utils import FreeKVCacheBlockQueue
+ self.pinned_block_queue = FreeKVCacheBlockQueue([])

Contributor Author

Done. __init__ now constructs an empty FreeKVCacheBlockQueue for the pinned tier. The invariant that a block lives in exactly one of free_block_queue (is_pinned=False) or pinned_block_queue (is_pinned=True) is already enforced by the is_pinned routing in free_blocks(), so the shared prev_free_block/next_free_block pointers on KVCacheBlock are safe.
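The invariant discussed in this thread, one pair of prev/next pointers per block and therefore membership in at most one queue, can be illustrated with a small intrusive doubly linked list. `Node` and `IntrusiveQueue` are hypothetical stand-ins, not vLLM's `KVCacheBlock` or `FreeKVCacheBlockQueue`, and the sketch does not reproduce their actual interfaces.

```python
class Node:
    """Toy block carrying its own list pointers (intrusive list node)."""

    def __init__(self, block_id: int) -> None:
        self.block_id = block_id
        self.prev = None
        self.next = None


class IntrusiveQueue:
    """Doubly linked FIFO with sentinel head/tail and O(1) removal."""

    def __init__(self) -> None:
        self.head = Node(-1)
        self.tail = Node(-1)
        self.head.next = self.tail
        self.tail.prev = self.head
        self.size = 0

    def append(self, node: Node) -> None:
        # A node has one pointer pair, so it may sit in only one queue at a time.
        if node.prev is not None or node.next is not None:
            raise RuntimeError("node is already linked into a queue")
        last = self.tail.prev
        last.next = node
        node.prev = last
        node.next = self.tail
        self.tail.prev = node
        self.size += 1

    def remove(self, node: Node) -> None:
        # O(1): unlink via the node's own pointers, no scan needed.
        # The caller must pass a node that belongs to this queue.
        if node.prev is None or node.next is None:
            raise RuntimeError("node is not linked into a queue")
        node.prev.next = node.next
        node.next.prev = node.prev
        node.prev = None
        node.next = None
        self.size -= 1

    def popleft(self) -> Node:
        if self.size == 0:
            raise IndexError("pop from an empty queue")
        node = self.head.next
        self.remove(node)
        return node

    def ids(self) -> list:
        out = []
        cur = self.head.next
        while cur is not self.tail:
            out.append(cur.block_id)
            cur = cur.next
        return out
```

Unlike a `collections.deque`, removing an arbitrary element here never leaves a stale entry behind, which is the property the review comment was after.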

Comment thread vllm/v1/core/block_pool.py Outdated
Comment on lines +411 to +420
+ # ref_cnt=0 means this block is in some free tier (regular
+ # queue if is_pinned=False, pinned_free_deque if is_pinned=True).
  if block.ref_cnt == 0 and not block.is_null:
-     self.free_block_queue.remove(block)
+     if block.is_pinned:
+         # Stale entries are common after demote_n; avoid O(n)
+         # removal by leaving the stale entry in place — demote_n
+         # will skip it on the next pop.
+         pass
+     else:
+         self.free_block_queue.remove(block)

Contributor

high

With the suggested refactor to use FreeKVCacheBlockQueue for pinned blocks, touch can now perform O(1) removal for both regular and pinned tiers. This eliminates the need for stale entries and keeps the data structures clean. Additionally, it is safer to explicitly reset is_pinned to False when a block is activated via touch to ensure its state is consistent with its active status.

            # ref_cnt=0 means this block is in some free tier (regular
            # queue if is_pinned=False, pinned_block_queue if is_pinned=True).
            if block.ref_cnt == 0 and not block.is_null:
                if block.is_pinned:
                    self.pinned_block_queue.remove(block)
                    block.is_pinned = False
                else:
                    self.free_block_queue.remove(block)

Contributor Author

Done. touch now calls self.pinned_block_queue.remove(block) (O(1) via prev/next pointers) when the block is in the pinned tier, eliminating stale-entry accumulation.

Comment thread vllm/v1/core/block_pool.py Outdated
Comment thread vllm/v1/core/block_pool.py Outdated
jhaotingc added a commit to jhaotingc/vllm that referenced this pull request Apr 24, 2026
… review)

Replace collections.deque with a second FreeKVCacheBlockQueue instance
(pinned_block_queue) for the ref_cnt=0 && is_pinned=True tier. This
addresses Copilot review comments on PR vllm-project#40676:

- touch() now does O(1) remove() from either queue via the block prev/next
  pointers; no more stale-entry accumulation in the pinned deque.
- demote_n() uses batched popleft_n + append_n instead of a per-block loop
  that updated tail pointers on every iteration.
- free_blocks() batches both tiers with append_n for consistency.

Invariant: a block is in exactly one of free_block_queue (is_pinned=False)
or pinned_block_queue (is_pinned=True), never both -- the prev/next
pointers on KVCacheBlock only support one linked list at a time. This is
already guaranteed by the is_pinned routing in free_blocks().

Semantics unchanged: touch() leaves is_pinned untouched so a later
free_blocks() can re-route to the pinned tier when still a retention
candidate. Pins survive cache-hit-then-release cycles.

Validated on Gemma-4-31B-it TP=4 H200 48-prefix sweep (28k ISL):
- Warmup (cold) TTFT avg: 1364 ms
- Sweep (warm)  TTFT avg: 305 ms (4.47x faster, p99 12.85x faster)
- Full prefix-cache hit confirmed on 2nd pass; no hangs at pool limit.

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
@jhaotingc

Contributor Author

@claude review

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@jhaotingc

Contributor Author

@claude review

Comment thread vllm/v1/core/kv_cache_manager.py Outdated
# Cannot allocate new blocks
return None
num_free_blocks = self.block_pool.get_num_free_blocks()
if num_blocks_to_allocate > num_free_blocks:

Contributor

Don't we want to guard here with and envs.VLLM_PIN_PREFIX_BLOCKS as well?

Contributor Author

Fixed. Thank you!

Comment thread vllm/v1/core/block_pool.py Outdated
# Route ref_cnt==0 blocks to the correct tier; batch both.
regular_free: list[KVCacheBlock] = []
pinned_free: list[KVCacheBlock] = []
for block in blocks_list:

Contributor

Not sure which is better (or if it actually impacts performance), but have you tried profiling doing two list comprehensions, instead of a loop and appending to lists?

Contributor Author

Changed to list comprehensions. Thank you!
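The two routing styles discussed in this thread can be compared with a small stand-alone script. `Blk`, `route_loop`, and `route_comprehensions` are hypothetical names and the block mix is synthetic; the sketch only checks that both styles produce the same partition and shows one way to time them, without implying which is faster in vLLM.

```python
import timeit
from dataclasses import dataclass


@dataclass
class Blk:
    block_id: int
    ref_cnt: int
    is_pinned: bool


def route_loop(blocks):
    """Single pass, appending each ref_cnt == 0 block to its tier."""
    regular, pinned = [], []
    for b in blocks:
        if b.ref_cnt == 0:
            (pinned if b.is_pinned else regular).append(b)
    return regular, pinned


def route_comprehensions(blocks):
    """Two filtered list comprehensions over the same input."""
    regular = [b for b in blocks if b.ref_cnt == 0 and not b.is_pinned]
    pinned = [b for b in blocks if b.ref_cnt == 0 and b.is_pinned]
    return regular, pinned


# Synthetic mix: two thirds of the blocks are free, half of all blocks pinned.
blocks = [Blk(i, ref_cnt=0 if i % 3 else 1, is_pinned=(i % 2 == 0)) for i in range(1000)]

if __name__ == "__main__":
    for fn in (route_loop, route_comprehensions):
        seconds = timeit.timeit(lambda: fn(blocks), number=200)
        print(f"{fn.__name__}: {seconds:.4f} s for 200 runs")
```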

Comment on lines +429 to +430
for b in to_pin:
    b.is_pinned = True

Contributor

Can do this inside the loop above, before (or after) the to_pin.append call

Contributor Author

Fixed.

Comment on lines +395 to +398
from vllm.v1.kv_cache_interface import SlidingWindowSpec

if isinstance(self.kv_cache_spec, SlidingWindowSpec):
    pin_blocks = envs.VLLM_PIN_SWA_TOKENS // self.block_size

Contributor

Have you looked at how this feature affects Mamba-hybrid models? I'm wondering if we could generalize this, in a way such that adding fully-fledged Mamba support won't require changes in this file, for example

Contributor Author

Mamba already has three cache modes: 'none', 'align', and 'full'.
In 'full' mode it keeps the last Mamba state of every chunk, so with an 8k chunk size and a 64k ISL it keeps the state at every 8k boundary.
In 'align' mode it keeps only some of the chunks (with an 8k chunk size and a 64k ISL, it may keep an arbitrary subset of the 8k-boundary states).
In 'none' mode it keeps only the very last state.

In other words, sliding-window pinning "frees up" the OOW windows earlier than the last windows, but Mamba already keeps only the chunk-edge states: caching is already limited to chunk edges and no intermediate Mamba states are stored. So I don't think this is generalizable.

@mergify

mergify Bot commented May 23, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jhaotingc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026
@jhaotingc jhaotingc force-pushed the jhaotingc/gemma-4-swa-flush-OOW-first branch from ec6bd2f to 1f5a2c0 Compare May 28, 2026 01:53
jhaotingc added a commit to jhaotingc/vllm that referenced this pull request May 28, 2026
jhaotingc added a commit to jhaotingc/vllm that referenced this pull request May 28, 2026
… paths

After rebasing the prefix-cache pinning series (vllm-project#40676) onto
upstream/main, two newly-added code paths needed wiring before the
pin/demote mechanism could actually function under pressure, and a
round of reviewer feedback applied.

Rebase fixes (engine deadlocked without these):
- kv_cache_manager.py: the upstream-added `full_sequence_must_fit`
  admission gate in allocate_slots returns None without giving the
  pinned tier a chance to release. Add a VLLM_PIN_PREFIX_BLOCKS-guarded
  demote_n call inside that branch so the existing pressure-recovery
  logic engages before admission is refused.
- single_type_kv_cache_manager.py: the upstream-added
  SlidingWindowManager._cache_block_mask elides older SWA-segment
  blocks from the prefix-cache hash map ('they get dropped anyway,
  never serve a hit'). That defeats VLLM_PIN_SWA_TOKENS, whose entire
  purpose is to keep those blocks alive for future hits. Short-circuit
  the mask to None when either pin flag is set.

PR review (@roikoren755):
- kv_cache_manager.py: guard the lower allocate_slots demote site
  with envs.VLLM_PIN_PREFIX_BLOCKS so it is a no-op for users who do
  not opt in.
- block_pool.py: refactor the free_blocks routing from a single loop
  with two appends into two filtered list comprehensions for
  readability.
- single_type_kv_cache_manager.py: move 'block.is_pinned = True'
  inline with the SWA pin-loop append instead of a second pass over
  to_pin afterwards.
- single_type_kv_cache_manager.py: TODO comment noting that the SWA
  drop-and-pin hook should ideally live on the SingleTypeKVCacheManager
  base (or a per-spec capability interface) so future Mamba-hybrid
  support does not need to edit this file.

Operator UX (answering 'will pinning help my workload?'):
- kv_cache_manager.py: at engine init, when VLLM_PIN_PREFIX_BLOCKS is
  set, log a one-line startup hint with the active pin env vars, the
  pool capacity in blocks/tokens, and a rule-of-thumb estimate of how
  many ~25k-token prefixes fit. Pinning delivers a win when the
  unique-prefix working set fits in ~80% of the pool; beyond that
  demote_n thrashes and hit rate collapses.

Validated on 4xH200 with gemma-4-31B-IT, TRITON_ATTN, 30 prefixes
conc=1: TTFT 1970 ms -> 499 ms (3.95x), KV usage 0% -> 64% post-warmup,
sweep hit rate 0.22% -> 83.7%. pre-commit run -a clean.

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
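The startup hint described in the commit message above could look roughly like this. It is a sketch under stated assumptions: the function names, the message wording, and the plain tokens-divided-by-prefix-length estimate are all hypothetical, and the real hint in the PR may account for the hybrid SWA/full layout differently. The 80% threshold and the ~25k-token prefix length come from the commit message; the block size of 16 is assumed.

```python
def estimate_prefix_fit(pool_blocks: int, block_size: int,
                        prefix_tokens: int = 25_000,
                        usable_percent: int = 80) -> int:
    """Rule-of-thumb count of prefixes that fit in the usable share of the pool."""
    pool_tokens = pool_blocks * block_size
    # Integer arithmetic avoids float rounding surprises.
    usable_tokens = pool_tokens * usable_percent // 100
    return usable_tokens // prefix_tokens


def format_pin_hint(pool_blocks: int, block_size: int) -> str:
    """One-line hint an operator could read at engine start (wording is made up)."""
    fit = estimate_prefix_fit(pool_blocks, block_size)
    return (
        f"pinning enabled: pool={pool_blocks} blocks "
        f"({pool_blocks * block_size} tokens); "
        f"~{fit} prefixes of ~25k tokens fit in 80% of the pool"
    )


# Example: a 1.47M-token pool split into 16-token blocks (assumed block size).
print(format_pin_hint(91_875, 16))
```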
@mergify mergify Bot removed the needs-rebase label May 28, 2026
@mergify

mergify Bot commented May 28, 2026

Contributor

Hi @jhaotingc, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10


@jhaotingc jhaotingc force-pushed the jhaotingc/gemma-4-swa-flush-OOW-first branch 2 times, most recently from 4a0573e to e6c18d1 Compare May 29, 2026 22:45
jhaotingc and others added 4 commits May 29, 2026 20:20
Add opt-in pinning to preserve prefix-cache blocks through FIFO free-queue
recycling in hybrid models like Gemma4. Without this, the last-window SWA
blocks and full-attention blocks from completed requests are returned to the
FIFO free queue and get recycled, evicting their hashes long before they
would otherwise expire. This limits practical prefix-cache retention to ~20
requests even though the pool has room for ~170.

Mechanism via ref_cnt manipulation:

- VLLM_PIN_PREFIX_BLOCKS=1: allocated blocks start at ref_cnt=2. At SWA-DROP
  for out-of-window blocks, decrement ref_cnt by 2 so they fully release and
  rejoin the free queue. At end-of-request free, decrement by 1, leaving
  blocks at ref_cnt=1, pinned with hash intact and not in free queue but
  still reachable via cached_block_hash_to_block lookup.

- VLLM_PIN_SWA_TOKENS=N: at each SWA-DROP, pin the most-recent
  N // block_size blocks being dropped ref_cnt 2 to 1 while fully freeing
  older blocks. This preserves chunk-boundary positions inside the shared
  prefix range, enabling SWA 64-contig cache-hit scan to succeed on future
  matching requests.

- VLLM_PIN_MIN_DROP_SIZE=16: skip pinning when a SWA-DROP releases fewer
  than this many blocks. Decode-step drops carry unique-tail hashes with no
  prefix-match value; unconditional pinning bloats the pinned set until the
  pool is exhausted and new requests stall.

Net effect for 60 prefix x 25k token workload on TP=4 bf16:
- Per-prefix steady-state footprint: ~1,100 blocks Full plus SWA last-window
- Pool of 189,245 blocks fits 60 prefixes comfortably
- SWA prefix-cache hit rate: ~90% on cached prefixes, up from ~0%

Files:
- envs.py: declare and parse VLLM_PIN_PREFIX_BLOCKS, VLLM_PIN_SWA_TOKENS,
  VLLM_PIN_MIN_DROP_SIZE
- block_pool.py: conditional ref_cnt=2 init; ref_cnt_delta param on
  free_blocks
- single_type_kv_cache_manager.py: per-block pin-vs-free split in
  remove_skipped_blocks gated on drop-size threshold

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
…demotion

Replaces the ref_cnt=2 pinning hack (from prior commit c302f5f) with an
explicit is_pinned field on KVCacheBlock, and folds in pressure-based
release of pinned blocks so the scheduler never stalls when pins exceed
pool capacity.

Motivation
----------
The ref_cnt=2 approach overloaded ref_cnt=1 to mean either live user OR
pinned prefix-cache block. That ambiguity created four loopholes:

1. Shared block + SWA-DROP delta=2: when R2 cache-hits a pinned block
   from R1 (ref_cnt 1 to 2 via touch) and later SWA-DROPs it via the
   to_free path, delta=2 undoes both R2 touch and R1 pin at once.
2. Auto-track captured non-pin ref_cnt=1 transitions.
3. Unpin then cache hit then SWA-DROP created negative ref_cnt.
4. Pin status was lost across cache-hit-then-SWA-DROP cycles.

Loophole 3 was the hang source: negative-ref_cnt blocks satisfy neither
the ref_cnt==0 nor the ref_cnt==1 branch and leak permanently. The pool
shrinks with each affected block until no admission can succeed.

Redesign
--------
- KVCacheBlock gains is_pinned: bool (default False).
- ref_cnt is strictly the live-user count. All deltas are 1.
- BlockPool has pinned_free_deque for (ref_cnt=0, is_pinned=True) blocks.
- get_new_blocks pops from free_block_queue only; pinned_free_deque is
  drained only via demote_n under pressure.
- free_blocks routes ref_cnt-zero blocks to free_block_queue or
  pinned_free_deque based on is_pinned. All deltas are 1.
- touch updates ref_cnt but leaves is_pinned unchanged, so pins survive
  cache-hit-then-release cycles.
- SWA-DROP sets is_pinned=True on to_pin candidates before calling
  free_blocks; to_free blocks keep their prior is_pinned value.
- kv_cache_manager.free marks all non-null remaining blocks as
  is_pinned=True before releasing them, protecting the Full-attention
  prefix and the SWA last-window.

Pressure-based release
----------------------
BlockPool.demote_n(n) flips is_pinned=False on the oldest pinned entries
and moves them to free_block_queue. Hashes survive until
_maybe_evict_cached_block fires on physical reuse, so demoted blocks
remain cache-hit candidates until recycled.

demote_n is invoked from two admission gates in kv_cache_manager so the
scheduler cannot stall:
- can_fit_full_sequence: fires when the scheduler reserves the full ISL
  and would reject the request before allocate_slots is called.
- allocate_slots first admission check (capped budget) and second check
  (actual demand): both hook demote_n before returning None.

Files
-----
- kv_cache_utils.py: is_pinned field on KVCacheBlock.
- block_pool.py: pinned_free_deque, demote_n, ref_cnt=1 alloc init,
  free_blocks routes by is_pinned (delta=1 always; null-block skipped to
  keep strict ref_cnt >= 0 invariant for real blocks), touch preserves
  is_pinned across pinned-tier stale entries.
- single_type_kv_cache_manager.py: SWA-DROP flags is_pinned before
  free_blocks; to_pin and to_free both use delta=1.
- kv_cache_manager.py: end-of-request free marks non-null blocks as
  is_pinned; pressure hooks in can_fit_full_sequence and
  allocate_slots (both the admission-budget and actual-demand checks).

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
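Loopholes 1 and 3 from the commit message above can be replayed with plain counters. This is a toy trace, not the removed code: the function names are hypothetical, and the event order for loophole 3 is one plausible reading of "unpin then cache hit then SWA-DROP".

```python
def old_scheme_loophole_1() -> int:
    """ref_cnt=2 scheme: a shared block loses R1's pin when R2 drops it."""
    ref_cnt = 2   # allocated for R1; the extra reference encodes the pin
    ref_cnt -= 1  # R1 end-of-request free: ref_cnt=1 means "pinned"
    ref_cnt += 1  # R2 cache-hits the block via touch
    ref_cnt -= 2  # R2 SWA-DROP through the to_free path (delta=2)
    return ref_cnt  # 0: R2's touch and R1's pin were undone together


def old_scheme_loophole_3() -> int:
    """ref_cnt=2 scheme: assumed sequence that ends with a negative count."""
    ref_cnt = 1   # pinned block, no live users
    ref_cnt -= 1  # unpinned under pressure
    ref_cnt += 1  # a later request cache-hits it via touch
    ref_cnt -= 2  # SWA-DROP with delta=2
    return ref_cnt  # -1: matches neither the ref_cnt==0 nor the ref_cnt==1 branch


def new_scheme() -> tuple:
    """is_pinned scheme: ref_cnt counts live users only, every delta is 1."""
    ref_cnt, is_pinned = 1, False  # allocated for R1
    is_pinned = True               # marked as a retention candidate
    ref_cnt -= 1                   # R1 frees: (0, True), goes to the pinned tier
    ref_cnt += 1                   # R2 cache hit via touch, is_pinned unchanged
    ref_cnt -= 1                   # R2 drops the block with delta=1
    return ref_cnt, is_pinned      # (0, True): pin survives the hit-then-release


print(old_scheme_loophole_1(), old_scheme_loophole_3(), new_scheme())
```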
- Pinned tier backed by FreeKVCacheBlockQueue; oldest entries released via
  demote_n wired into the admission gates.
- SWA pin logic lives in SlidingWindowManager; base remove_skipped_blocks
  stays pinning-agnostic. Dead can_fit_full_sequence removed.
- free_blocks fast-paths when pinning is off; lint/format fixes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Add VLLM_PIN_SWA_TOKENS (bool, off by default). When enabled, each SWA
drop pins the current sliding-window blocks into a separate tier instead
of freeing them, so the contiguous anchor a future request needs to hit
the SWA prefix cache stays resident and is evicted last. Pinned blocks
are demoted best-effort, oldest-first, under allocation pressure.
Improves prefix-cache reuse for shared-prefix traffic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
@jhaotingc jhaotingc force-pushed the jhaotingc/gemma-4-swa-flush-OOW-first branch from e6c18d1 to 9ff6c1a Compare May 30, 2026 05:41
@mergify

mergify Bot commented Jun 3, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jhaotingc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 3, 2026
@jhaotingc

Contributor Author

Closing as a duplicate of #43447.

@jhaotingc jhaotingc closed this Jun 4, 2026