
[Model Runner V2] fix draft attention metadata generation#37364

Merged
WoosukKwon merged 1 commit into vllm-project:main from TheEpicDolphin:gdelfin/mrv2-spec-decode-draft-attn-groups
Mar 20, 2026

Conversation


@TheEpicDolphin TheEpicDolphin commented Mar 18, 2026

Purpose

With hybrid models and speculative decoding, attention metadata for the draft model's decode-drafting phase is currently built for all KV cache groups, even though the draft model typically belongs to only one of them. This is wasteful.

This PR fixes this by skipping the attention groups that the draft model does not belong to, and generating the decode-drafting attention metadata only from the draft model's own groups. This is similar to what Model Runner V1 does.
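The filtering idea above can be sketched as follows. This is a minimal, hypothetical illustration — `KVCacheGroup`, `build_draft_attn_metadata`, and the layer names are stand-ins, not vLLM's actual API:

```python
from dataclasses import dataclass

@dataclass
class KVCacheGroup:
    """Hypothetical stand-in for a KV cache group holding a set of layers."""
    id: int
    layer_names: list

def build_draft_attn_metadata(kv_cache_groups, draft_layer_names, build_metadata):
    """Build decode-drafting metadata only for groups containing draft layers."""
    metadata = {}
    for group in kv_cache_groups:
        # Skip groups the draft model does not belong to; previously,
        # metadata was built for every group regardless.
        if not draft_layer_names & set(group.layer_names):
            continue
        metadata[group.id] = build_metadata(group)
    return metadata
```

For a hybrid target model (e.g. alternating sliding-window and full-attention groups), only the single group containing the draft model's layers gets metadata built per drafting step.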

Benchmarks

I used the following commands to benchmark and verify that there is no drop in acceptance rate. I benchmarked using openai/gpt-oss-20b, which is a hybrid model (alternates between sliding window and full attention). MRV2 main, MRV2 with my changes, and MRV1 results are very similar.

Server

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve openai/gpt-oss-20b --no-enable-prefix-caching --tensor-parallel-size=1 --data-parallel-size=1 --speculative-config '{"method": "eagle3", "model": "RedHatAI/gpt-oss-20b-speculator.eagle3", "num_speculative_tokens": 3}'

Client

vllm bench serve --model openai/gpt-oss-20b --tokenizer openai/gpt-oss-20b --host 0.0.0.0 --dataset-name hf --dataset-path philschmid/mt-bench --ignore-eos --request-rate inf --max-concurrency 16 --temperature 0.0

| Metric | MRV1 | MRV2 (main) | MRV2 (#37364) |
|---|---|---|---|
| **Throughput** | | | |
| Duration (s) | 59.19 | 57.50 | 57.26 |
| Request throughput (req/s) | 16.89 | 17.39 | 17.46 |
| Output token throughput (tok/s) | 4324.70 | 4452.34 | 4470.87 |
| Peak output throughput (tok/s) | 1776.00 | 1824.00 | 1826.00 |
| Total token throughput (tok/s) | 6624.22 | 6819.73 | 6848.12 |
| **Time to First Token** | | | |
| Mean TTFT (ms) | 36.94 | 36.36 | 34.56 |
| Median TTFT (ms) | 29.29 | 28.06 | 28.28 |
| P99 TTFT (ms) | 464.28 | 493.51 | 380.98 |
| **Time per Output Token** | | | |
| Mean TPOT (ms) | 3.54 | 3.44 | 3.44 |
| Median TPOT (ms) | 3.56 | 3.44 | 3.45 |
| P99 TPOT (ms) | 4.47 | 4.31 | 4.30 |
| **Inter-token Latency** | | | |
| Mean ITL (ms) | 9.00 | 8.77 | 8.76 |
| Median ITL (ms) | 8.63 | 8.42 | 8.41 |
| P99 ITL (ms) | 13.00 | 14.49 | 13.50 |
| **Speculative Decoding** | | | |
| Acceptance rate (%) | 51.67 | 51.92 | 52.03 |
| Acceptance length | 2.55 | 2.56 | 2.56 |
| Drafts | 100,405 | 100,131 | 100,020 |
| Draft tokens | 301,215 | 300,393 | 300,060 |
| Accepted tokens | 155,645 | 155,954 | 156,112 |
| Position 0 acceptance (%) | 71.34 | 71.60 | 71.67 |
| Position 1 acceptance (%) | 50.68 | 50.91 | 51.14 |
| Position 2 acceptance (%) | 32.99 | 33.24 | 33.27 |

@mergify mergify bot added the v1 label Mar 18, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the initialization of attention metadata for the draft model in speculative decoding. The responsibility for creating attention groups for the draft model is moved into the EagleSpeculator, which now filters for its own attention layers. This is a good change that improves encapsulation. However, I've identified a critical bug in the logic for identifying the draft model's attention layers, which would likely cause speculative decoding to fail. My review includes a specific code suggestion to correct this.
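The layer filtering the review describes — the speculator selecting only its own attention layers — can be sketched roughly as below. The function name, the module-name prefix, and the idea of matching by prefix are illustrative assumptions, not the PR's actual implementation:

```python
def draft_attn_layers(all_attn_layer_names, draft_prefix="draft_model."):
    """Return only the attention layer names owned by the draft model.

    Assumes draft-model layers are distinguishable by a module-name
    prefix (hypothetical; the real code may identify them differently).
    """
    return {name for name in all_attn_layer_names if name.startswith(draft_prefix)}
```

The bug the review flags would be the kind of mistake where this filter matches the wrong set of layers (e.g. an inverted or overly broad condition), leaving the draft model with no attention group to build metadata from.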

@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-draft-attn-groups branch from fe42d33 to 06cb19e Compare March 18, 2026 02:02
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-draft-attn-groups branch from 06cb19e to d57cd5f Compare March 18, 2026 19:00
@TheEpicDolphin TheEpicDolphin marked this pull request as ready for review March 18, 2026 23:22

@WoosukKwon WoosukKwon left a comment


LGTM, thanks! I think it'd be nice to verify this with Qwen 3.5 once it is supported with MRV2.

@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 20, 2026
@WoosukKwon WoosukKwon merged commit 3947451 into vllm-project:main Mar 20, 2026
62 of 63 checks passed
chooper26 pushed a commit to intellistream/vllm-hust that referenced this pull request Mar 21, 2026
