[Core][DSV4] Skip caching SWA blocks that can never serve a prefix-cache hit #42258
Code Review
This pull request optimizes prefix caching for hybrid models by introducing a block masking mechanism, ensuring that only blocks capable of serving future cache hits (such as the tail window in Sliding Window Attention) are stored. Key changes include adding a block_mask to the BlockPool, implementing LCM-aligned caching in the KVCacheCoordinator, and updating cache managers to support sparse hit semantics. Review feedback highlighted potential assertion failures and memory leaks when handling shared physical blocks in models like DeepSeek-V4. Additionally, a bug was identified where token_ids in BlockStored events are not correctly filtered when blocks are masked, which could impact distributed caching systems.
    if block_mask is not None and not block_mask[i - num_cached_blocks]:
        continue
While extra_keys_list is correctly filtered by the block_mask, the token_ids slice passed to the BlockStored event (line 319) is not filtered. This results in a mismatch between the number of hashes and the number of tokens in the event when blocks are masked out (e.g., in SWA tail-window caching). This will likely break distributed caching consumers or offloading mechanisms that rely on these events. Please ensure token_ids is filtered to only include tokens for the blocks actually being cached.
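One way to keep the token count in step with the number of hashes is sketched below. This is a minimal illustration, not vLLM code: `filter_token_ids_by_block_mask` is a hypothetical helper, and it assumes `token_ids` covers exactly the newly full blocks that `block_mask` describes, one mask entry per block. It only restores the length invariant between hashes and tokens; whether event consumers can handle non-contiguous token runs is a separate question.

```python
from typing import Optional, Sequence


def filter_token_ids_by_block_mask(
    token_ids: Sequence[int],
    block_size: int,
    block_mask: Optional[Sequence[bool]],
) -> list[int]:
    """Keep only the tokens that belong to blocks the mask allows caching.

    Hypothetical helper for illustration; names and signature are assumptions.
    """
    if block_mask is None:
        # No mask: every block is cached, so every token is reported.
        return list(token_ids)

    kept: list[int] = []
    for j, keep in enumerate(block_mask):
        if keep:
            start = j * block_size
            kept.extend(token_ids[start:start + block_size])
    return kept
```

With three blocks of four tokens and only the middle block cached, the helper returns just that block's four tokens, so the event would carry one hash and `block_size` tokens.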
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
PR vllm-project#42258 introduced SlidingWindowManager._cache_block_mask() to skip caching SWA blocks that can never serve a prefix-cache hit. When Eagle/MTP speculative decoding is active, the mask is too aggressive: it skips blocks that Eagle's modified lookup actually needs, resulting in a 0% prefix-cache hit rate.

Eagle changes the SWA hit logic in two ways:
1. sliding_window_contiguous_blocks += 1 (needs one extra block)
2. post_pop_blocks = i (instead of i+1), shifting alignment

Fix: detect SWA managers inside eagle attention groups at coordinator init time and disable the cache block mask for them.

Signed-off-by: Alex Bilichenko <abilichenko@gmail.com>
(cherry picked from commit b90c495)
Signed-off-by: jasl <jasl9187@hotmail.com>
…che hit (vllm-project#42258) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
…che hit (vllm-project#42258) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
…che hit (vllm-project#42258) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Purpose
DeepSeek-V4 pairs full-attention layers with SWA layers that use different block sizes and window sizes. The full-attention layers have a block size of 256, while the SWA layers and compressors have block sizes of 64, 4, or 8.

Within each LCM-aligned segment, only the last tail = ceil((sliding_window - 1) / block_size) blocks are reachable by SWA's right-to-left scan.

This PR drops the unreachable blocks at cache time via a new alignment_tokens kwarg threaded through cache_blocks -> SlidingWindowManager._cache_block_mask -> BlockPool.cache_full_blocks. Non-hybrid coordinators pass alignment_tokens=None and hit a fast path identical to the existing behavior.

Test Plan
New tests in tests/v1/core/test_prefix_caching.py:

test_hybrid_cache_blocks_swa_tail_window_only: full-attn block_size=32, SWA block_size=8, sliding_window=8 (lcm=32, tail=1, per_segment=4). After caching 8 SWA hash-blocks, asserts that only hashes 3 and 7 are in the prefix-cache hash map; 0–2 and 4–6 are not.

test_hybrid_cache_blocks_clamped_to_lcm: full-attn block_size=32, SWA block_size=16, sliding_window=32. After caching 7 SWA hash-blocks (112 tokens), asserts that hashes 0–5 are cached and hash 6 (past the last lcm boundary) is not.

Test Result
Passed. Claude was used to generate the test cases, but I have reviewed them.
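For readers who want to check the numbers in the two tests by hand, the tail-window rule from the Purpose section can be sketched as follows. This is a standalone illustration under stated assumptions, not the code in SlidingWindowManager._cache_block_mask: the function name `swa_cache_block_mask` is hypothetical, `alignment_tokens` is assumed to equal the lcm of the block sizes, and blocks past the last complete aligned segment are assumed to be left uncached.

```python
from math import ceil
from typing import Optional


def swa_cache_block_mask(
    num_blocks: int,
    block_size: int,
    sliding_window: int,
    alignment_tokens: Optional[int],
) -> list[bool]:
    """Return True for each SWA block index that would be cached.

    Hypothetical sketch; the real vLLM implementation may differ.
    """
    if alignment_tokens is None:
        # Non-hybrid fast path: cache every full block, as before.
        return [True] * num_blocks

    per_segment = alignment_tokens // block_size
    # Blocks the right-to-left scan can reach, capped at one segment.
    tail = min(ceil((sliding_window - 1) / block_size), per_segment)
    # Only blocks inside complete aligned segments are candidates.
    aligned_blocks = (num_blocks // per_segment) * per_segment

    mask = []
    for i in range(num_blocks):
        in_tail = (i % per_segment) >= per_segment - tail
        mask.append(i < aligned_blocks and in_tail)
    return mask


# Parameters taken from the two tests described in the Test Plan.
tail_only = swa_cache_block_mask(8, block_size=8, sliding_window=8, alignment_tokens=32)
clamped = swa_cache_block_mask(7, block_size=16, sliding_window=32, alignment_tokens=32)
print([i for i, keep in enumerate(tail_only) if keep])  # [3, 7]
print([i for i, keep in enumerate(clamped) if keep])    # [0, 1, 2, 3, 4, 5]
```

Under these assumptions the sketch agrees with both test expectations: with tail=1 and per_segment=4 only the last block of each segment survives, and with tail=2 and per_segment=2 every block up to the last lcm boundary survives.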