[Core][DSV4] Skip caching SWA blocks that can never serve a prefix-cache hit by ivanium · Pull Request #42258 · vllm-project/vllm

ivanium · 2026-05-11T01:15:26Z

Purpose

DeepSeek-V4 pairs full-attention layers with sliding-window attention (SWA) layers that use different block sizes and window sizes. The full-attention layers have a block size of 256, while the SWA layers and compressors have block sizes of 64, 4, or 8.

Within each 256-aligned segment, only the last tail blocks, where tail = ceil((sliding_window - 1) / block_size), are reachable by SWA's right-to-left scan.
Hash-blocks past the last 256-aligned boundary are unreachable because hits are always aligned to the lcm of the block sizes (256 here).
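
The reachability rule above can be sketched as a per-block boolean mask. This is a minimal illustration under stated assumptions, not the PR's implementation: the function name swa_cache_mask and its parameters are hypothetical, and the sketch assumes alignment_tokens is the lcm of the block sizes and a multiple of block_size.

```python
import math


def swa_cache_mask(
    num_blocks: int,
    block_size: int,
    sliding_window: int,
    alignment_tokens: int,
) -> list[bool]:
    """Return True for each SWA hash-block that could serve a prefix-cache hit.

    Hypothetical helper for illustration only; names are not taken from vLLM.
    """
    # Number of SWA blocks in one aligned segment (assumes exact divisibility).
    per_segment = alignment_tokens // block_size
    # Trailing blocks per segment reachable by the right-to-left scan.
    tail = min(per_segment, math.ceil((sliding_window - 1) / block_size))
    # Blocks beyond the last aligned boundary can never be hit.
    aligned_blocks = (num_blocks * block_size // alignment_tokens) * per_segment

    mask = []
    for i in range(num_blocks):
        if i >= aligned_blocks:
            mask.append(False)
        else:
            position_in_segment = i % per_segment
            mask.append(position_in_segment >= per_segment - tail)
    return mask


# Parameters mirroring the two test configurations described in this PR.
print(swa_cache_mask(8, block_size=8, sliding_window=8, alignment_tokens=32))
print(swa_cache_mask(7, block_size=16, sliding_window=32, alignment_tokens=32))
```

With the first configuration the sketch keeps only blocks 3 and 7. With the second, tail equals per_segment, so every block up to the last aligned boundary is kept and only block 6 is dropped.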

This PR drops the unreachable blocks at cache time via a new alignment_tokens kwarg threaded through cache_blocks -> SlidingWindowManager._cache_block_mask -> BlockPool.cache_full_blocks. Non-hybrid coordinators pass alignment_tokens=None and take a fast path identical to the existing behavior.

Test Plan

New tests in tests/v1/core/test_prefix_caching.py:

test_hybrid_cache_blocks_swa_tail_window_only — full-attn block_size=32, SWA block_size=8, sliding_window=8 (lcm=32, tail=1, per_segment=4). After caching 8 SWA hash-blocks, asserts only hashes 3 and 7 are in the prefix-cache hash map; 0–2 and 4–6 are not.
test_hybrid_cache_blocks_clamped_to_lcm — full-attn block_size=32, SWA block_size=16, sliding_window=32. After caching 7 SWA hash-blocks (112 tokens), asserts hashes 0–5 are cached and hash 6 (past the last lcm boundary) is not.

Test Result

Passed. Claude was used to generate the test cases, but I have reviewed them.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request optimizes prefix caching for hybrid models by introducing a block masking mechanism, ensuring that only blocks capable of serving future cache hits (such as the tail window in Sliding Window Attention) are stored. Key changes include adding a block_mask to the BlockPool, implementing LCM-aligned caching in the KVCacheCoordinator, and updating cache managers to support sparse hit semantics. Review feedback highlighted potential assertion failures and memory leaks when handling shared physical blocks in models like DeepSeek-V4. Additionally, a bug was identified where token_ids in BlockStored events are not correctly filtered when blocks are masked, which could impact distributed caching systems.

gemini-code-assist · 2026-05-11T01:20:57Z

+                if block_mask is not None and not block_mask[i - num_cached_blocks]:
+                    continue


While extra_keys_list is correctly filtered by the block_mask, the token_ids slice passed to the BlockStored event (line 319) is not filtered. This results in a mismatch between the number of hashes and the number of tokens in the event when blocks are masked out (e.g., in SWA tail-window caching). This will likely break distributed caching consumers or offloading mechanisms that rely on these events. Please ensure token_ids is filtered to only include tokens for the blocks actually being cached.
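
The mismatch described in this comment can be illustrated with a small sketch that keeps only the tokens belonging to unmasked blocks, so the token count matches the number of hashes actually cached. The helper filter_tokens_by_mask is hypothetical and is not part of vLLM; the real BlockStored event and cache_full_blocks code may organize this differently.

```python
from collections.abc import Sequence


def filter_tokens_by_mask(
    token_ids: Sequence[int],
    block_size: int,
    block_mask: Sequence[bool],
) -> list[int]:
    """Keep only the tokens of blocks whose mask entry is True.

    Hypothetical helper: assumes token_ids starts at the first block covered
    by block_mask and holds block_size tokens per block.
    """
    kept: list[int] = []
    for offset, keep in enumerate(block_mask):
        if not keep:
            continue  # Block was masked out, so its tokens are skipped too.
        begin = offset * block_size
        kept.extend(token_ids[begin : begin + block_size])
    return kept


tokens = list(range(16))  # Four blocks of four tokens each.
mask = [False, True, False, True]
print(filter_tokens_by_mask(tokens, block_size=4, block_mask=mask))
```

Here two blocks survive the mask, and the filtered list holds exactly two blocks' worth of tokens, which is the invariant an event consumer would presumably rely on.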

njhill

Thanks @ivanium!

Looks like we should rebase now

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

PR vllm-project#42258 introduced SlidingWindowManager._cache_block_mask() to skip caching SWA blocks that can never serve a prefix-cache hit. When Eagle/MTP speculative decoding is active the mask is too aggressive: it skips blocks that eagle's modified lookup actually needs, resulting in 0% prefix cache hit rate.

Eagle changes the SWA hit logic in two ways:
1. sliding_window_contiguous_blocks += 1 (needs one extra block)
2. post_pop_blocks = i (instead of i+1), shifting alignment

Fix: detect SWA managers inside eagle attention groups at coordinator init time and disable the cache block mask for them.

Signed-off-by: Alex Bilichenko <abilichenko@gmail.com>
(cherry picked from commit b90c495)
Signed-off-by: jasl <jasl9187@hotmail.com>


ivanium requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat and ywang96 as code owners May 11, 2026 01:15

claude Bot reviewed May 11, 2026

mergify Bot added the v1 label May 11, 2026

gemini-code-assist Bot reviewed May 11, 2026

ivanium mentioned this pull request May 11, 2026

[Feat][KVConnector] Support DSV4 in SimpleCPUOffloadBackend #42296

Merged

4 tasks

WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 13, 2026

njhill approved these changes May 14, 2026

feat: mask out SWA blocks that cannot get prefix cache hit

c106593

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

ivanium force-pushed the fix/dsv4-mask-blocks branch from 289c58c to c106593 Compare May 14, 2026 22:09

jeejeelee merged commit 4b364f8 into vllm-project:main May 15, 2026
56 checks passed

alexbi29 mentioned this pull request May 15, 2026

[Bugfix] Fix SWA cache block mask breaking prefix caching with Eagle/MTP #42784

Closed

4 tasks

thc1006 mentioned this pull request May 17, 2026

[Bug]: Based on Qwen3.5-35B-A3B, why does enabling MTP speculative decoding actually reduce the prefix cache hit rate? #38182

Open

1 task

omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026

[Core][DSV4] Skip caching SWA blocks that can never serve a prefix-ca…

d011ebf

…che hit (vllm-project#42258) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026

[Core][DSV4] Skip caching SWA blocks that can never serve a prefix-ca…

708146f

…che hit (vllm-project#42258) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[Core][DSV4] Skip caching SWA blocks that can never serve a prefix-ca…

14cd8e8

…che hit (vllm-project#42258) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

orozery mentioned this pull request May 20, 2026

[kv_offload]: Add DSv4 support #43142

Merged

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[Core][DSV4] Skip caching SWA blocks that can never serve a prefix-ca…

58fd8ad

…che hit (vllm-project#42258) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

vanshilshah97 mentioned this pull request May 20, 2026

v1/engine: emit prefix-cache KV-events at hash_block_size granularity for hybrid Mamba+Attention models #43258

Open

h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026

[Core][DSV4] Skip caching SWA blocks that can never serve a prefix-ca…

aca1427

…che hit (vllm-project#42258) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

ivanium mentioned this pull request May 30, 2026

[Bugfix] Cache the EAGLE/MTP lookahead block in the SWA prefix-cache mask #44082

Merged

andakai pushed a commit to andakai/vllm that referenced this pull request Jun 4, 2026

[Core][DSV4] Skip caching SWA blocks that can never serve a prefix-ca…

0e0da21

…che hit (vllm-project#42258) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
