
[Model Runner V2] fix draft attention metadata generation#37364

Merged
WoosukKwon merged 1 commit into vllm-project:main from TheEpicDolphin:gdelfin/mrv2-spec-decode-draft-attn-groups
Mar 20, 2026

Conversation


@TheEpicDolphin TheEpicDolphin commented Mar 18, 2026

Purpose

With hybrid models and speculative decoding, attention metadata for the draft model's decode-drafting phase is currently built for all KV cache groups, even though the draft model typically belongs to only one of them. This is wasteful.

This PR fixes this by skipping the attention groups that the draft model does not belong to, and generating the decode-drafting attention metadata only from the draft model's own groups. This is similar to what Model Runner V1 does.
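The filtering idea above can be sketched as follows. This is a minimal, hypothetical illustration — `KVCacheGroup`, `build_draft_attn_metadata`, and the layer names are stand-ins, not vLLM's actual API:

```python
from dataclasses import dataclass

@dataclass
class KVCacheGroup:
    """Hypothetical stand-in for a KV cache group holding a set of layers."""
    id: int
    layer_names: list

def build_draft_attn_metadata(kv_cache_groups, draft_layer_names, build_metadata):
    """Build decode-drafting metadata only for groups containing draft layers."""
    metadata = {}
    for group in kv_cache_groups:
        # Skip groups the draft model does not belong to; previously,
        # metadata was built for every group regardless.
        if not draft_layer_names & set(group.layer_names):
            continue
        metadata[group.id] = build_metadata(group)
    return metadata
```

For a hybrid target model (e.g. alternating sliding-window and full-attention groups), only the single group containing the draft model's layers gets metadata built per drafting step.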

Benchmarks

I used the following commands to benchmark and verify that there is no drop in acceptance rate. I benchmarked using openai/gpt-oss-20b, which is a hybrid model (alternates between sliding window and full attention). MRV2 main, MRV2 with my changes, and MRV1 results are very similar.

Server

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve openai/gpt-oss-20b --no-enable-prefix-caching --tensor-parallel-size=1 --data-parallel-size=1 --speculative-config '{"method": "eagle3", "model": "RedHatAI/gpt-oss-20b-speculator.eagle3", "num_speculative_tokens": 3}'

Client

vllm bench serve --model openai/gpt-oss-20b --tokenizer openai/gpt-oss-20b --host 0.0.0.0 --dataset-name hf --dataset-path philschmid/mt-bench --ignore-eos --request-rate inf --max-concurrency 16 --temperature 0.0

| Metric | MRV1 | MRV2 (main) | MRV2 (#37364) |
|---|---|---|---|
| **Throughput** | | | |
| Duration (s) | 59.19 | 57.50 | 57.26 |
| Request throughput (req/s) | 16.89 | 17.39 | 17.46 |
| Output token throughput (tok/s) | 4324.70 | 4452.34 | 4470.87 |
| Peak output throughput (tok/s) | 1776.00 | 1824.00 | 1826.00 |
| Total token throughput (tok/s) | 6624.22 | 6819.73 | 6848.12 |
| **Time to First Token** | | | |
| Mean TTFT (ms) | 36.94 | 36.36 | 34.56 |
| Median TTFT (ms) | 29.29 | 28.06 | 28.28 |
| P99 TTFT (ms) | 464.28 | 493.51 | 380.98 |
| **Time per Output Token** | | | |
| Mean TPOT (ms) | 3.54 | 3.44 | 3.44 |
| Median TPOT (ms) | 3.56 | 3.44 | 3.45 |
| P99 TPOT (ms) | 4.47 | 4.31 | 4.30 |
| **Inter-token Latency** | | | |
| Mean ITL (ms) | 9.00 | 8.77 | 8.76 |
| Median ITL (ms) | 8.63 | 8.42 | 8.41 |
| P99 ITL (ms) | 13.00 | 14.49 | 13.50 |
| **Speculative Decoding** | | | |
| Acceptance rate (%) | 51.67 | 51.92 | 52.03 |
| Acceptance length | 2.55 | 2.56 | 2.56 |
| Drafts | 100,405 | 100,131 | 100,020 |
| Draft tokens | 301,215 | 300,393 | 300,060 |
| Accepted tokens | 155,645 | 155,954 | 156,112 |
| Position 0 acceptance (%) | 71.34 | 71.60 | 71.67 |
| Position 1 acceptance (%) | 50.68 | 50.91 | 51.14 |
| Position 2 acceptance (%) | 32.99 | 33.24 | 33.27 |

@mergify mergify bot added the v1 label Mar 18, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the initialization of attention metadata for the draft model in speculative decoding. The responsibility for creating attention groups for the draft model is moved into the EagleSpeculator, which now filters for its own attention layers. This is a good change that improves encapsulation. However, I've identified a critical bug in the logic for identifying the draft model's attention layers, which would likely cause speculative decoding to fail. My review includes a specific code suggestion to correct this.
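The layer filtering the review describes — the speculator selecting only its own attention layers — can be sketched roughly as below. The function name, the module-name prefix, and the idea of matching by prefix are illustrative assumptions, not the PR's actual implementation:

```python
def draft_attn_layers(all_attn_layer_names, draft_prefix="draft_model."):
    """Return only the attention layer names owned by the draft model.

    Assumes draft-model layers are distinguishable by a module-name
    prefix (hypothetical; the real code may identify them differently).
    """
    return {name for name in all_attn_layer_names if name.startswith(draft_prefix)}
```

The bug the review flags would be the kind of mistake where this filter matches the wrong set of layers (e.g. an inverted or overly broad condition), leaving the draft model with no attention group to build metadata from.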

@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-draft-attn-groups branch from fe42d33 to 06cb19e Compare March 18, 2026 02:02
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-draft-attn-groups branch from 06cb19e to d57cd5f Compare March 18, 2026 19:00
@TheEpicDolphin TheEpicDolphin marked this pull request as ready for review March 18, 2026 23:22

@WoosukKwon WoosukKwon left a comment


LGTM, thanks! I think it'd be nice to verify this with Qwen 3.5 once it is supported with MRV2.

@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 20, 2026
@WoosukKwon WoosukKwon merged commit 3947451 into vllm-project:main Mar 20, 2026
62 of 63 checks passed
chooper26 pushed a commit to intellistream/vllm-hust that referenced this pull request Mar 21, 2026
