
[Model Runner V2] Rebuild attention metadata before eagle decode full…#38311

Merged
WoosukKwon merged 1 commit intovllm-project:mainfrom
TheEpicDolphin:gdelfin/mrv2-fix-eagle-decode-full-cudagraph
Mar 27, 2026

Conversation

@TheEpicDolphin (Collaborator) commented Mar 27, 2026

Purpose

This PR fixes the low-quality draft tokens produced at positions > 0 (after the draft prefill), which result from not rebuilding the attention metadata during FULL cudagraph execution. Rebuilding is necessary to refresh the state of the attention metadata builders/backends so that they do not carry stale values from previous runs. Doing so improves the acceptance rates of draft tokens at positions > 0, as shown in the HTML visualization linked below.
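The failure mode can be sketched with a toy model (all names here are illustrative, not vLLM's actual classes): a metadata builder holds internal buffers, and if a draft decode step reuses metadata built for an earlier batch state instead of rebuilding it, that step sees stale sequence lengths.

```python
class MetadataBuilder:
    """Toy stand-in for an attention metadata builder with internal state."""

    def __init__(self):
        self.seq_lens = []

    def build(self, seq_lens):
        # Snapshot the current per-request sequence lengths into the
        # builder's buffer and hand out metadata derived from them.
        self.seq_lens = list(seq_lens)
        return {"seq_lens": list(self.seq_lens)}


def run_draft_steps(builder, seq_lens, num_steps, rebuild=True):
    """Simulate draft decode steps; each step appends one token per request."""
    metadata_per_step = []
    meta = builder.build(seq_lens)  # initial build (draft prefill)
    for _ in range(num_steps):
        if rebuild:
            # Refresh the builder so metadata reflects the tokens appended
            # by the previous draft step, not values captured earlier.
            meta = builder.build(seq_lens)
        metadata_per_step.append(meta["seq_lens"])
        seq_lens = [n + 1 for n in seq_lens]  # one new draft token each
    return metadata_per_step


# With rebuilding, each step sees up-to-date lengths; without it, every
# step past the first operates on the stale initial snapshot.
fresh = run_draft_steps(MetadataBuilder(), [5, 7], 3, rebuild=True)
stale = run_draft_steps(MetadataBuilder(), [5, 7], 3, rebuild=False)
```

In this simplified model, `fresh` is `[[5, 7], [6, 8], [7, 9]]` while `stale` is `[[5, 7], [5, 7], [5, 7]]`, which mirrors why draft tokens at positions > 0 degrade when the metadata is not rebuilt.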

Benchmarks

I ran a performance benchmark sweep over two model/drafter pairs: meta-llama/Meta-Llama-3-8B-Instruct with yuhuili/EAGLE-LLaMA3-Instruct-8B, and Qwen/Qwen3-8B with RedHatAI/Qwen3-8B-speculator.eagle3, each with three speculative tokens and temperature = 0. The sweep covered TP = 1, 2; output length = 1, 1024; and concurrency = 1, 8, 64. The results are compiled into the HTML visualization here.

NOTE: "ol" means output length, and "c" stands for concurrency in the HTML visualization

The most notable deltas are the spec decoding acceptance rate improvements from this PR compared to main (click the "Spec Decode" filter in the HTML visualization to see what I mean). This is because we were not rebuilding the attention metadata for the draft decoding steps during FULL cudagraph, leaving stale values in the attention metadata builder state/buffers for some attention backends. As a result, the quality of draft tokens at positions > 0 was worse. This is fixed now, and the improvement is visible primarily in draft tokens 1 and 2, as expected.
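As a rough illustration of how per-position acceptance rates (the kind of metric shown under the "Spec Decode" filter) can be derived, here is a simplified sketch; the helper below is hypothetical and not vLLM's implementation. It assumes each decode step proposes `num_spec_tokens` drafts and records how many were accepted (rejection stops at the first mismatch, so `n` accepted means positions 0..n-1 were accepted).

```python
def per_position_acceptance(num_accepted_per_step, num_spec_tokens):
    """Aggregate acceptance rate per draft position.

    num_accepted_per_step[i] = number of draft tokens accepted in step i,
    an integer in [0, num_spec_tokens].
    """
    accepted = [0] * num_spec_tokens
    proposed = [0] * num_spec_tokens
    for n in num_accepted_per_step:
        for pos in range(num_spec_tokens):
            proposed[pos] += 1       # every position is proposed each step
            if pos < n:
                accepted[pos] += 1   # accepted iff the run reached this far
    return [a / p for a, p in zip(accepted, proposed)]


# Four decode steps with 3 spec tokens each: acceptance naturally decays
# with position, which is why a fix to positions > 0 shows up most clearly
# in draft tokens 1 and 2.
rates = per_position_acceptance([3, 1, 2, 0], num_spec_tokens=3)
```

Here `rates` comes out to `[0.75, 0.5, 0.25]`: position 0 was accepted in three of four steps, position 1 in two, position 2 in one.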

… cudagraph

Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
@mergify mergify bot added the v1 label Mar 27, 2026
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request refactors the EagleSpeculator by introducing helper methods for CUDA graph dispatching and attention metadata construction, while ensuring metadata is rebuilt during draft steps to maintain state. A critical issue was found in build_draft_attn_metadata where the use of .clamp on request indices disrupts the cumulative sum required for query_start_loc_cpu, which may lead to incorrect attention behavior in padded batches.
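The reviewer's concern can be illustrated with a simplified, hypothetical example (not the actual `build_draft_attn_metadata` code): `query_start_loc` must be a well-formed prefix sum of per-request query lengths, and clamping padded request indices onto the last real request makes that request's length count again for every padded slot, corrupting the offsets.

```python
def query_start_loc(query_lens):
    """Prefix-sum layout: request i's tokens live in [loc[i], loc[i+1])."""
    loc = [0]
    for n in query_lens:
        loc.append(loc[-1] + n)
    return loc


num_reqs, padded_reqs = 2, 4
query_lens = [3, 2]

# Correct handling: padded slots contribute zero-length queries, so the
# prefix sum stays flat past the real batch.
padded_lens = query_lens + [0] * (padded_reqs - num_reqs)
good = query_start_loc(padded_lens)

# Broken handling: clamping pad indices to the last real request (index
# num_reqs - 1) duplicates its query length, so the offsets after the real
# requests keep growing instead of staying flat.
clamped_idx = [min(i, num_reqs - 1) for i in range(padded_reqs)]
bad = query_start_loc([query_lens[i] for i in clamped_idx])
```

In this toy case `good` is `[0, 3, 5, 5, 5]` while `bad` is `[0, 3, 5, 7, 9]`; an attention kernel reading the latter would attribute phantom query tokens to the padded slots.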

@TheEpicDolphin TheEpicDolphin marked this pull request as ready for review March 27, 2026 02:55
@claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 27, 2026
@WoosukKwon (Collaborator) left a comment

LGTM. Thanks for the fix!

@WoosukKwon WoosukKwon merged commit 384e4d5 into vllm-project:main Mar 27, 2026
64 checks passed
@TheEpicDolphin TheEpicDolphin deleted the gdelfin/mrv2-fix-eagle-decode-full-cudagraph branch March 27, 2026 21:33
SandishKumarHN pushed a commit to SandishKumarHN/vllm that referenced this pull request Mar 27, 2026
…hold to 0.91

The LM Eval Large Models (H200) CI job was failing because the
NVIDIA-Nemotron-3-Super-120B-A12B-BF16 model scored slightly below the
0.93 accuracy threshold on GSM8K.

The model uses MTP speculative decoding with 5 speculative tokens. Recent
changes to the Model Runner V2 spec decode path (PRs vllm-project#38045 and vllm-project#38311)
adjusted rejection sampling behavior and rebuilt attention metadata before
eagle decode, which can marginally affect the acceptance rate and therefore
the final accuracy score.

Lower the threshold from 0.93 to 0.91 to reflect the current achievable
accuracy with the updated spec decode implementation. The model still
demonstrates strong GSM8K performance above 91%.

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
SandishKumarHN added a commit to SandishKumarHN/vllm that referenced this pull request Mar 27, 2026
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
Elm8116 pushed a commit to Elm8116/vllm that referenced this pull request Mar 30, 2026
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
benenzhu pushed a commit to benenzhu/vllm that referenced this pull request Mar 31, 2026
neweyes pushed a commit to neweyes/vllm that referenced this pull request Mar 31, 2026
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
bhargav-patel-29 pushed a commit to Bharatgen-Tech/vllm that referenced this pull request Apr 1, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1


2 participants