[Core] Disable HMA for eagle/MTP with sliding window models#39110

Closed
Bortlesboat wants to merge 1 commit intovllm-project:mainfrom
Bortlesboat:fix-hma-spec-decode-step3p5

Conversation

@Bortlesboat
Contributor

Step-3.5-Flash uses interleaved sliding window and full attention layers. When the hybrid KV cache manager (HMA) is enabled, it splits these into separate KV cache groups. The eagle/MTP drafter then fails during validate_same_kv_cache_group():

AssertionError: All drafting layers should belong to the same kv cache group

There is already a guard for chunked local attention + eagle at line 1233, but no equivalent for sliding window + eagle. This PR adds one: when speculative decoding is used with a model that has sliding_window set, HMA is auto-disabled so all layers stay in a single KV cache group.

The reporter confirmed that manually setting need_disable_hybrid_kv_cache_manager = True resolves the crash.

Fixes #38498

Signed-off-by: Bortlesboat bortlesboat@users.noreply.github.com
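For illustration, here is a minimal sketch of the kind of guard described above. The helper name and the simplified config classes are assumptions for the example; they mirror vLLM's config conventions but are not copied from the actual vllm/config/vllm.py diff.

```python
# Hypothetical sketch of the guard described in this PR. The class and
# attribute names follow vLLM's config conventions, but the exact
# surrounding code is an assumption, not the PR's literal diff.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    # sliding_window is set for models with sliding-window attention layers
    sliding_window: Optional[int] = None


@dataclass
class SchedulerConfig:
    disable_hybrid_kv_cache_manager: bool = False


def maybe_disable_hma(model_config: ModelConfig,
                      scheduler_config: SchedulerConfig,
                      speculative_config: Optional[object]) -> None:
    """Keep all layers in a single KV cache group when eagle/MTP
    drafting is combined with a sliding-window model, so the drafter's
    validate_same_kv_cache_group() check cannot fail."""
    if speculative_config is not None and model_config.sliding_window is not None:
        scheduler_config.disable_hybrid_kv_cache_manager = True
```

With HMA disabled, sliding-window and full-attention layers are managed as one group, trading some KV cache memory efficiency for compatibility with the eagle/MTP drafter.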


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the configuration logic in vllm/config/vllm.py to disable the hybrid KV cache manager when EAGLE speculative decoding is used in conjunction with a model that has a sliding window. This ensures that all drafting layers remain within the same KV cache group, which is a requirement for EAGLE/MTP drafter validation. I have no feedback to provide as there were no review comments.

Models like Step-3.5-Flash use interleaved sliding window and full
attention layers. When HMA is enabled, these get split into separate
KV cache groups, which causes validate_same_kv_cache_group() in the
eagle/MTP drafter to fail with:

  AssertionError: All drafting layers should belong to the same kv
  cache group

This adds a guard to auto-disable HMA when speculative decoding
(eagle/MTP) is combined with a sliding window model, similar to the
existing guard for chunked local attention + eagle.

Fixes vllm-project#38498

Signed-off-by: Bortlesboat <bortstheboat@gmail.com>
@Bortlesboat Bortlesboat force-pushed the fix-hma-spec-decode-step3p5 branch from e4528cc to 5580d9a Compare April 6, 2026 21:30
@Bortlesboat
Contributor Author

Closing to resubmit with a clean branch rebased onto current main; the original branch diverged significantly due to intermediate merges. The fix is identical: only vllm/config/vllm.py, 12 lines changed.



Successfully merging this pull request may close these issues.

[Bug][ROCm]: Step3.5 Flash MTP init error

1 participant