[Core] Disable HMA for eagle/MTP with sliding window models #39110
Closed
Bortlesboat wants to merge 1 commit into vllm-project:main
Conversation
Contributor
Code Review
This pull request updates the configuration logic in vllm/config/vllm.py to disable the hybrid KV cache manager when EAGLE speculative decoding is used in conjunction with a model that has a sliding window. This ensures that all drafting layers remain within the same KV cache group, which is a requirement for EAGLE/MTP drafter validation. I have no feedback to provide as there were no review comments.
Models like Step-3.5-Flash use interleaved sliding window and full attention layers. When HMA is enabled, these get split into separate KV cache groups, which causes validate_same_kv_cache_group() in the eagle/MTP drafter to fail with:

AssertionError: All drafting layers should belong to the same kv cache group

This adds a guard to auto-disable HMA when speculative decoding (eagle/MTP) is combined with a sliding window model, similar to the existing guard for chunked local attention + eagle.

Fixes vllm-project#38498

Signed-off-by: Bortlesboat <bortstheboat@gmail.com>
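For context, the failing check can be sketched roughly as follows. This is a minimal stand-in, not the actual vllm code: the real drafter validates against KV cache group specs, while this sketch uses a plain layer-name-to-group-id mapping to show why interleaved layers trip the assertion.

```python
# Hypothetical sketch of the drafter-side validation that fails when HMA
# splits sliding-window and full-attention layers into different KV cache
# groups. The mapping below is a stand-in for vllm's group structures.

def validate_same_kv_cache_group(layer_to_group: dict[str, int],
                                 drafting_layers: list[str]) -> None:
    # Collect the set of distinct groups the drafting layers belong to.
    groups = {layer_to_group[name] for name in drafting_layers}
    assert len(groups) == 1, (
        "All drafting layers should belong to the same kv cache group")

# With HMA enabled on an interleaved model, sliding-window and
# full-attention layers land in different groups and the assertion fires.
layer_to_group = {"layer.0": 0, "layer.1": 1}  # sliding vs. full attention
try:
    validate_same_kv_cache_group(layer_to_group, ["layer.0", "layer.1"])
except AssertionError as e:
    print(e)
```

With HMA disabled, every layer maps to the same group and the check passes, which is exactly the state the guard in this PR restores.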
Force-pushed from e4528cc to 5580d9a
Contributor
Author
Closing to resubmit with a clean branch rebased onto current main. The original branch diverged significantly due to intermediate merges. The fix is identical: only vllm/config/vllm.py, 12 lines.
Step-3.5-Flash uses interleaved sliding window and full attention layers. When the hybrid KV cache manager (HMA) is enabled, it splits these into separate KV cache groups. The eagle/MTP drafter then fails during validate_same_kv_cache_group().

There's already a guard for chunked local attention + eagle at line 1233, but no equivalent for sliding window + eagle. This adds one: when speculative decoding is used with a model that has sliding_window set, HMA is auto-disabled so all layers stay in one KV cache group.

The reporter confirmed that manually setting need_disable_hybrid_kv_cache_manager = True resolves the crash.

Fixes #38498
Signed-off-by: Bortlesboat bortlesboat@users.noreply.github.com
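The guard described above can be illustrated with a small self-contained sketch. The dataclasses and function below are hypothetical stand-ins, not the actual vllm config classes; the real 12-line change lives in vllm/config/vllm.py and follows the pattern of the existing chunked-local-attention guard.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the relevant vllm config objects.
@dataclass
class ModelConfig:
    sliding_window: Optional[int] = None  # set for sliding-window models

@dataclass
class SpeculativeConfig:
    method: str = "eagle"  # eagle or MTP drafter

@dataclass
class SchedulerConfig:
    disable_hybrid_kv_cache_manager: bool = False

def maybe_disable_hma(model: ModelConfig,
                      spec: Optional[SpeculativeConfig],
                      sched: SchedulerConfig) -> None:
    """Auto-disable the hybrid KV cache manager when an eagle/MTP drafter
    is combined with a sliding-window model, so every drafting layer
    stays in a single KV cache group."""
    if spec is not None and model.sliding_window is not None:
        sched.disable_hybrid_kv_cache_manager = True

sched = SchedulerConfig()
maybe_disable_hma(ModelConfig(sliding_window=4096), SpeculativeConfig(), sched)
print(sched.disable_hybrid_kv_cache_manager)  # True
```

The guard is deliberately one-directional: it only flips the flag on, so a user who has already disabled HMA for other reasons sees no change in behavior.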