[Model Runner V2] fix draft attention metadata generation #37364
Merged
WoosukKwon merged 1 commit into vllm-project:main on Mar 20, 2026
Conversation
Code Review
This pull request refactors the initialization of attention metadata for the draft model in speculative decoding. The responsibility for creating attention groups for the draft model is moved into the EagleSpeculator, which now filters for its own attention layers. This is a good change that improves encapsulation. However, I've identified a critical bug in the logic for identifying the draft model's attention layers, which would likely cause speculative decoding to fail. My review includes a specific code suggestion to correct this.
Force-pushed from fe42d33 to 06cb19e
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
Force-pushed from 06cb19e to d57cd5f
WoosukKwon
approved these changes
Mar 20, 2026
WoosukKwon (Collaborator) left a comment:
LGTM Thanks! I think it'd be nice to verify this with Qwen 3.5 once it is supported with MRV2.
chooper26 pushed a commit to intellistream/vllm-hust that referenced this pull request on Mar 21, 2026:
…ct#37364) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
Purpose
With hybrid models + spec decoding, attention metadata for the draft model's decode-drafting phase is currently built for all KV cache groups, even though the draft model typically belongs to only one of them. This is wasteful.
This PR fixes this by skipping the attention groups that the draft model does not belong to, and using only the remaining groups to generate the attention metadata for decode-drafting. This mirrors what Model Runner V1 does.
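The idea can be sketched roughly as follows. This is an illustrative sketch only: the `KVCacheGroup` dataclass, the `draft_attn_groups` helper, and the layer names are assumptions for demonstration, not vLLM's actual API. It shows the filtering the PR describes: keep only the KV cache groups (and layers) that the draft model owns before building attention metadata.

```python
from dataclasses import dataclass


@dataclass
class KVCacheGroup:
    # e.g. "sliding_window" vs. "full_attention" in a hybrid model
    attn_type: str
    layer_names: list


def draft_attn_groups(kv_cache_groups, draft_layer_names):
    """Keep only the groups (and layers) the draft model belongs to."""
    draft_set = set(draft_layer_names)
    groups = []
    for group in kv_cache_groups:
        owned = [n for n in group.layer_names if n in draft_set]
        if owned:  # skip groups the draft model has no layers in
            groups.append(KVCacheGroup(group.attn_type, owned))
    return groups


# Hypothetical hybrid target model with two KV cache groups; the
# Eagle-style draft model only has a full-attention layer.
target_groups = [
    KVCacheGroup("sliding_window", ["model.layers.0.attn", "model.layers.2.attn"]),
    KVCacheGroup("full_attention", ["model.layers.1.attn", "draft.layers.0.attn"]),
]
drafts = draft_attn_groups(target_groups, ["draft.layers.0.attn"])
# Only the full-attention group survives, containing just the draft layer,
# so decode-drafting metadata is built for that one group instead of all.
```

The sliding-window group is dropped entirely rather than carried along with empty metadata, which is the saving the PR is after.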
Benchmarks
I used the following commands to benchmark and to verify that there is no drop in acceptance rate. I benchmarked with openai/gpt-oss-20b, a hybrid model that alternates between sliding-window and full attention. Results for MRV2 main, MRV2 with my changes, and MRV1 are very similar.
Server
Client