[Bugfix] Cache the EAGLE/MTP lookahead block in the SWA prefix-cache mask #44082
block_pool: The block pool.
kv_cache_spec: The kv cache spec.
use_eagle: Whether to use eagle.
drop_eagle: Whether to drop the last matched block for EAGLE/MTP.
I find the variable name drop_eagle less intuitive than use_eagle plus a comment explaining that EAGLE requires dropping the last matched block. Is there a distinction between the two names?
Yes. I keep use_eagle as a member variable of SingleTypeKVCacheManager, since it is an instance property. In contrast, drop_eagle is specific to this class method and can be set to False even for a manager with EAGLE layers. For example, this happens during the convergence loop, where the drop has already been applied.
Got it. Thanks for the explanation.
nit: Maybe it would be better to rename it to something like drop_eagle_block and add a comment saying it can be False even for a KV cache manager with EAGLE layers.
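To make the distinction discussed in this thread concrete, here is a minimal toy sketch. It is not vLLM code: ToyManager, converge, and the drop_every_pass switch are hypothetical names invented for illustration, and the loop only loosely imitates a hybrid fixed-point lookup. It assumes the EAGLE drop should be applied once per manager, after which later passes call the lookup with drop_eagle=False even though use_eagle is still True on the instance.

```python
from dataclasses import dataclass


@dataclass
class ToyManager:
    """Toy stand-in for a KV cache manager (hypothetical, not vLLM's class)."""
    cached_blocks: int  # length of the contiguous cached prefix, in blocks
    use_eagle: bool     # instance property: this group has EAGLE/MTP layers

    def find_longest_cache_hit(self, max_blocks: int, drop_eagle: bool) -> int:
        hit = min(self.cached_blocks, max_blocks)
        # drop_eagle is a per-call decision, independent of self.use_eagle.
        if drop_eagle and hit > 0:
            hit -= 1
        return hit


def converge(managers, max_blocks, drop_every_pass=False):
    """Shrink the hit length until every manager agrees on it."""
    hit = max_blocks
    already_dropped = set()
    while True:
        prev = hit
        for idx, mgr in enumerate(managers):
            # Drop only on the first visit unless drop_every_pass is set.
            drop = mgr.use_eagle and (drop_every_pass or idx not in already_dropped)
            hit = mgr.find_longest_cache_hit(hit, drop_eagle=drop)
            already_dropped.add(idx)
        if hit == prev:
            return hit


managers = [ToyManager(cached_blocks=10, use_eagle=True),
            ToyManager(cached_blocks=6, use_eagle=False)]
print(converge(managers, max_blocks=8))                        # 6
print(converge(managers, max_blocks=8, drop_every_pass=True))  # 0
```

In this toy, tying the drop to use_eagle on every pass erodes the hit one block per iteration until nothing is left, which is why a separate per-call flag is useful.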
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
…mask (vllm-project#44082) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…mask (vllm-project#44082) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
…mask (vllm-project#44082) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: JisoLya <523420504@qq.com>
Summary
PR #42258 added SlidingWindowManager._cache_block_mask() to skip caching SWA blocks that can never serve a prefix-cache hit. When EAGLE/MTP speculative decoding is active and the cache-hit alignment (the LCM of per-group block sizes) is larger than the SWA window, that mask is too aggressive: EAGLE's lookup needs tail + 1 contiguous cached blocks, and the extra +1 block lives at the first position past each aligned segment boundary, which is exactly a position the mask skipped. The result is that EAGLE + SWA finds no prefix-cache hit.

This PR fixes the mask so the one extra lookahead block EAGLE requires is cached, while keeping the rest of the #42258 optimization intact.
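The mismatch can be worked through with small numbers. The sketch below is a toy model, not the vLLM implementation: the function names are hypothetical, block indices are 0-based, and it assumes the pre-fix mask keeps only the tail blocks immediately before an aligned boundary.

```python
def old_mask_blocks(boundary: int, tail: int) -> set[int]:
    """Blocks the pre-fix mask is assumed to keep before one boundary."""
    return set(range(max(0, boundary - tail), boundary))


def eagle_needed_blocks(boundary: int, tail: int) -> set[int]:
    """Blocks an EAGLE lookup needs: tail + 1 contiguous blocks, the last
    one being the first block past the boundary."""
    return set(range(max(0, boundary - tail), boundary + 1))


align_blocks, tail = 4, 2  # alignment of 4 blocks, window tail of 2 blocks
missing = eagle_needed_blocks(align_blocks, tail) - old_mask_blocks(align_blocks, tail)
print(sorted(missing))  # [4]
```

Block 4, the first block past the boundary, is needed by the lookup but never cached, so under these assumptions the contiguous run is always broken.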
Note
MooncakeStoreConnector also needs a similar fix. Will be addressed in a follow-up PR.
SimpleCPUOffloadingConnector is not affected because it reuses KVCacheCoordinator logic directly.
What changed
- SlidingWindowManager: factor the "contiguous blocks needed for a hit" calculation into _contiguous_blocks_for_hit(window, block_size, use_eagle), shared by both the cache-hit lookup and the cache mask so they stay in sync.
- reachable_block_mask (formerly _cache_block_mask): mark a block reachable iff it falls in the need-wide run ending at an aligned boundary's right edge, applying the EAGLE shift=1 when EAGLE is active. This keeps the EAGLE lookahead block (one past the boundary) eligible to be cached.
- HybridKVCacheCoordinator.cache_blocks: when a manager is an EAGLE group, extend num_tokens_to_cache by one block past the aligned boundary so the lookahead block is actually written into the prefix-cache hash map.
- SpecGroup NamedTuple: carries the per-spec-group use_eagle bit explicitly, which is propagated to each SingleTypeKVCacheManager (grouped SWA siblings share one cache-hit lookup, so the EAGLE drop is decided per spec group). This replaces the ad-hoc eagle_attn_group_indices set.
- find_longest_cache_hit: the use_eagle parameter is renamed to drop_eagle to disambiguate "this group uses EAGLE" (a manager attribute) from "drop the last matched block on this lookup pass" (a per-pass decision in the hybrid fixed-point loop).

Why this is not duplicating PR #42784
PR #42784 fixes the same underlying bug, but does so by disabling the SWA cache mask entirely whenever EAGLE is active. That caches every SWA block and so gives up the memory-saving benefit that #42258 was added to provide.
This PR instead preserves the #42258 optimization and caches only the single additional lookahead block that EAGLE actually needs per aligned segment. The masking logic and the cache-hit lookup are driven from one shared helper, so they cannot drift apart. It also covers the grouped-SWA-siblings case (multiple SWA groups sharing one spec, one of which is an EAGLE/MTP group) and the block_size != alignment_tokens (Gemma-style different-page-size) path.

I am happy to consolidate with #42784 if maintainers prefer one approach.
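The "one shared helper" idea and the memory trade-off against disabling the mask can be sketched as follows. This is an illustrative toy under stated assumptions, not the code in this PR: the helper formula (ceil((window - 1) / block_size), plus one for EAGLE), the function names, and the list-of-booleans mask are all assumptions made for the example.

```python
from math import ceil


def contiguous_blocks_for_hit(window: int, block_size: int, use_eagle: bool) -> int:
    """Hypothetical shared helper; the formula is an assumption."""
    need = ceil((window - 1) / block_size)
    return need + 1 if use_eagle else need  # EAGLE: one extra lookahead block


def reachable_mask(num_blocks, align_blocks, window, block_size, use_eagle):
    """Mark blocks inside the need-wide run ending at boundary + shift."""
    need = contiguous_blocks_for_hit(window, block_size, use_eagle)
    shift = 1 if use_eagle else 0
    mask = [False] * num_blocks
    for boundary in range(align_blocks, num_blocks + 1, align_blocks):
        end = boundary + shift
        for i in range(max(0, end - need), min(end, num_blocks)):
            mask[i] = True
    return mask


def longest_hit(cached, align_blocks, window, block_size, use_eagle):
    """Largest aligned boundary (in blocks) whose need-wide run is cached."""
    need = contiguous_blocks_for_hit(window, block_size, use_eagle)
    shift = 1 if use_eagle else 0
    best = 0
    for boundary in range(align_blocks, len(cached) + 1, align_blocks):
        end = boundary + shift
        if end > len(cached):
            break  # the lookahead block does not exist yet
        if all(cached[i] for i in range(max(0, end - need), end)):
            best = boundary
    return best


args = dict(align_blocks=4, window=32, block_size=16)
plain = reachable_mask(9, use_eagle=False, **args)
eagle = reachable_mask(9, use_eagle=True, **args)
print(sum(plain), sum(eagle), len(eagle))            # 4 6 9
print(longest_hit(plain, use_eagle=True, **args))    # 0
print(longest_hit(eagle, use_eagle=True, **args))    # 8
```

Because both the mask and the lookup read need and shift from the same helper, a change to one cannot silently invalidate the other. In this toy, the EAGLE-aware mask caches 6 of 9 blocks, compared with all 9 if the mask were disabled.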
Test plan
.venv/bin/python -m pytest \
  tests/v1/core/test_prefix_caching.py \
  tests/v1/core/test_single_type_kv_cache_manager.py -q
# 72 passed

New regression tests added:
- test_eagle_swa_alignment_caches_extra_block: EAGLE + SWA with sliding_window <= alignment finds a non-zero cache hit.
- test_eagle_swa_boundary_caches_post_boundary_block: the first block past an alignment boundary (the EAGLE lookahead block) is cached.
- test_eagle_grouped_swa_siblings_use_same_cache_mask: grouped SWA siblings cache the lookahead block together.

Lint:
pre-commit run --files <changed files>: ruff check, ruff format, and mypy all pass.

Notes
This change was developed with AI assistance (Claude Code). The submitter has reviewed every changed line and run the tests above.