[Core] Disable HMA for eagle/MTP with sliding window models #39110
Closed
Bortlesboat wants to merge 1 commit into vllm-project:main
Conversation
Contributor
Code Review
This pull request updates the configuration logic in vllm/config/vllm.py to disable the hybrid KV cache manager when EAGLE speculative decoding is used in conjunction with a model that has a sliding window. This ensures that all drafting layers remain within the same KV cache group, which is a requirement for EAGLE/MTP drafter validation. I have no feedback to provide as there were no review comments.
Models like Step-3.5-Flash use interleaved sliding window and full attention layers. When HMA is enabled, these get split into separate KV cache groups, which causes validate_same_kv_cache_group() in the eagle/MTP drafter to fail with:

AssertionError: All drafting layers should belong to the same kv cache group

This adds a guard to auto-disable HMA when speculative decoding (eagle/MTP) is combined with a sliding window model, similar to the existing guard for chunked local attention + eagle.

Fixes vllm-project#38498

Signed-off-by: Bortlesboat <bortstheboat@gmail.com>
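For context, the failing check can be sketched roughly as follows. This is a minimal stand-in, not the actual vllm code: the real drafter validates against KV cache group specs, while this sketch uses a plain layer-name-to-group-id mapping to show why interleaved layers trip the assertion.

```python
# Hypothetical sketch of the drafter-side validation that fails when HMA
# splits sliding-window and full-attention layers into different KV cache
# groups. The mapping below is a stand-in for vllm's group structures.

def validate_same_kv_cache_group(layer_to_group: dict[str, int],
                                 drafting_layers: list[str]) -> None:
    # Collect the set of distinct groups the drafting layers belong to.
    groups = {layer_to_group[name] for name in drafting_layers}
    assert len(groups) == 1, (
        "All drafting layers should belong to the same kv cache group")

# With HMA enabled on an interleaved model, sliding-window and
# full-attention layers land in different groups and the assertion fires.
layer_to_group = {"layer.0": 0, "layer.1": 1}  # sliding vs. full attention
try:
    validate_same_kv_cache_group(layer_to_group, ["layer.0", "layer.1"])
except AssertionError as e:
    print(e)
```

With HMA disabled, every layer maps to the same group and the check passes, which is exactly the state the guard in this PR restores.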
Force-pushed from e4528cc to 5580d9a
Contributor
Author
Closing to resubmit with a clean branch rebased onto current main. The original branch diverged significantly due to intermediate merges. The fix is identical: only vllm/config/vllm.py, 12 lines.
Step-3.5-Flash uses interleaved sliding window and full attention layers. When the hybrid KV cache manager (HMA) is enabled, it splits these into separate KV cache groups. The eagle/MTP drafter then fails during validate_same_kv_cache_group().

There's already a guard for chunked local attention + eagle at line 1233, but no equivalent for sliding window + eagle. This adds one: when speculative decoding is used with a model that has sliding_window set, HMA is auto-disabled so all layers stay in one KV cache group.

The reporter confirmed that manually setting need_disable_hybrid_kv_cache_manager = True resolves the crash.

Fixes #38498
Signed-off-by: Bortlesboat bortlesboat@users.noreply.github.com
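The guard described above can be illustrated with a small self-contained sketch. The dataclasses and function below are hypothetical stand-ins, not the actual vllm config classes; the real 12-line change lives in vllm/config/vllm.py and follows the pattern of the existing chunked-local-attention guard.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the relevant vllm config objects.
@dataclass
class ModelConfig:
    sliding_window: Optional[int] = None  # set for sliding-window models

@dataclass
class SpeculativeConfig:
    method: str = "eagle"  # eagle or MTP drafter

@dataclass
class SchedulerConfig:
    disable_hybrid_kv_cache_manager: bool = False

def maybe_disable_hma(model: ModelConfig,
                      spec: Optional[SpeculativeConfig],
                      sched: SchedulerConfig) -> None:
    """Auto-disable the hybrid KV cache manager when an eagle/MTP drafter
    is combined with a sliding-window model, so every drafting layer
    stays in a single KV cache group."""
    if spec is not None and model.sliding_window is not None:
        sched.disable_hybrid_kv_cache_manager = True

sched = SchedulerConfig()
maybe_disable_hma(ModelConfig(sliding_window=4096), SpeculativeConfig(), sched)
print(sched.disable_hybrid_kv_cache_manager)  # True
```

The guard is deliberately one-directional: it only flips the flag on, so a user who has already disabled HMA for other reasons sees no change in behavior.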