[Model Runner V2] Rebuild attention metadata before eagle decode full… #38311
Merged
WoosukKwon merged 1 commit into vllm-project:main from Mar 27, 2026
Conversation
… cudagraph Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
Contributor
Code Review
This pull request refactors the EagleSpeculator by introducing helper methods for CUDA graph dispatching and attention metadata construction, while ensuring metadata is rebuilt during draft steps to maintain state. A critical issue was found in build_draft_attn_metadata where the use of .clamp on request indices disrupts the cumulative sum required for query_start_loc_cpu, which may lead to incorrect attention behavior in padded batches.
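The reviewer's concern above can be illustrated with a small stand-in: `query_start_loc_cpu` is the exclusive cumulative sum of per-request query lengths, and clamping padded request indices into the valid range before counting silently attributes padding tokens to a real request. This is a hypothetical plain-Python repro of that failure mode, not the actual `build_draft_attn_metadata` code.

```python
from itertools import accumulate

NUM_REQS = 2
PAD = -1
# 3 real tokens for request 0, 2 for request 1, and 2 padding slots.
req_indices = [0, 0, 0, 1, 1, PAD, PAD]

def query_start_loc(indices, num_reqs):
    # Exclusive cumulative sum of per-request query lengths.
    lens = [0] * num_reqs
    for i in indices:
        lens[i] += 1
    return [0] + list(accumulate(lens))

# Correct: drop padding slots before counting per-request query lengths.
ok = query_start_loc([i for i in req_indices if i >= 0], NUM_REQS)

# Buggy: clamping folds the padding tokens into request 0, inflating its
# query length and shifting every later start offset.
bad = query_start_loc([max(i, 0) for i in req_indices], NUM_REQS)

print(ok)   # [0, 3, 5]
print(bad)  # [0, 5, 7]
```

In the real code the same counts come from tensor ops on CPU tensors, but the arithmetic issue is identical: once padding is clamped into a valid request index, every downstream start offset in the padded batch is wrong.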
WoosukKwon
approved these changes
Mar 27, 2026
Collaborator
WoosukKwon
left a comment
LGTM. Thanks for the fix!
SandishKumarHN
pushed a commit
to SandishKumarHN/vllm
that referenced
this pull request
Mar 27, 2026
…hold to 0.91

The LM Eval Large Models (H200) CI job was failing because the NVIDIA-Nemotron-3-Super-120B-A12B-BF16 model scored slightly below the 0.93 accuracy threshold on GSM8K. The model uses MTP speculative decoding with 5 speculative tokens.

Recent changes to the Model Runner V2 spec decode path (PRs vllm-project#38045 and vllm-project#38311) adjusted rejection sampling behavior and rebuilt attention metadata before eagle decode, which can marginally affect the acceptance rate and therefore the final accuracy score.

Lower the threshold from 0.93 to 0.91 to reflect the current achievable accuracy with the updated spec decode implementation. The model still demonstrates strong GSM8K performance above 91%.

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
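The threshold change in the commit above amounts to a one-line gate in the eval harness. This is only a hedged sketch of that kind of CI check; the helper name and structure are illustrative, not vLLM's actual CI code.

```python
GSM8K_THRESHOLD = 0.91  # lowered from 0.93 per the commit message above

def check_accuracy(measured: float, threshold: float = GSM8K_THRESHOLD) -> bool:
    # The CI job passes when the measured GSM8K score meets the threshold.
    return measured >= threshold

# A score of 0.92 passes under the new threshold but would have failed
# under the old 0.93 gate.
print(check_accuracy(0.92))                  # True
print(check_accuracy(0.92, threshold=0.93))  # False
```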
nithinvc
pushed a commit
to nithinvc/vllm
that referenced
this pull request
Mar 27, 2026
vllm-project#38311) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu
pushed a commit
to JiantaoXu/vllm
that referenced
this pull request
Mar 28, 2026
vllm-project#38311) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
Elm8116
pushed a commit
to Elm8116/vllm
that referenced
this pull request
Mar 30, 2026
vllm-project#38311) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai> Signed-off-by: Elham Harirpoush <elham.harirpoush@arm.com>
vrdn-23
pushed a commit
to vrdn-23/vllm
that referenced
this pull request
Mar 30, 2026
vllm-project#38311) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
benenzhu
pushed a commit
to benenzhu/vllm
that referenced
this pull request
Mar 31, 2026
vllm-project#38311) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai> Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com>
neweyes
pushed a commit
to neweyes/vllm
that referenced
this pull request
Mar 31, 2026
vllm-project#38311) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai> Signed-off-by: neweyes <328719365@qq.com>
EricccYang
pushed a commit
to EricccYang/vllm
that referenced
this pull request
Apr 1, 2026
vllm-project#38311) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai> Signed-off-by: EricccYang <yangyang4991@gmail.com>
bhargav-patel-29
pushed a commit
to Bharatgen-Tech/vllm
that referenced
this pull request
Apr 1, 2026
vllm-project#38311) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai> Signed-off-by: bhargav-patel-29 <bhargav.patel@tihiitb.org>
Purpose
This PR addresses the low-quality draft tokens produced at positions > 0 (i.e., after the draft prefill), which result from not rebuilding the attention metadata during FULL cudagraph. Rebuilding is necessary to update the state of the attention metadata builders/backends so that they don't carry stale values from previous runs. Doing so improves the acceptance rates of draft tokens at positions > 0, as shown in the HTML link below.
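The fix described above can be sketched as follows. All names here are illustrative, not the actual vLLM APIs: the point is that even though a FULL cudagraph replays the same captured kernels, the metadata builders hold host-side state (e.g. sequence lengths) that must be rebuilt for every draft decode step rather than reused from the previous run.

```python
class DraftAttnState:
    """Hypothetical stand-in for an attention metadata builder's host state."""

    def __init__(self, seq_lens):
        self.seq_lens = list(seq_lens)

    def build_metadata(self):
        # Rebuild from the current state so no stale lengths from a
        # previous run leak into this draft step.
        return {"seq_lens": list(self.seq_lens)}

    def advance(self):
        # Each draft decode step grows every sequence by one token.
        self.seq_lens = [s + 1 for s in self.seq_lens]

state = DraftAttnState(seq_lens=[7, 12])
metadata_per_step = []
for _ in range(3):  # three speculative tokens, as in the benchmark sweep
    metadata_per_step.append(state.build_metadata())  # rebuilt every step
    state.advance()

print([m["seq_lens"] for m in metadata_per_step])
# [[7, 12], [8, 13], [9, 14]]
```

Without the rebuild, every draft step after the first would attend with the lengths from step 0 (or from a previous request entirely), which is exactly the staleness that degraded draft tokens at positions > 0.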
Benchmarks
I ran a performance benchmark sweep for meta-llama/Meta-Llama-3-8B-Instruct / yuhuili/EAGLE-LLaMA3-Instruct-8B and Qwen/Qwen3-8B / RedHatAI/Qwen3-8B-speculator.eagle3, each with three speculative tokens and temperature = 0. I swept across TP = 1, 2; output length = 1, 1024; and concurrency = 1, 8, 64. The results have been compiled into the HTML visualization here.
NOTE: in the HTML visualization, "ol" means output length and "c" stands for concurrency.
The most notable deltas are the spec decoding acceptance rate improvements from this PR compared to main (click the "Spec Decode" filter in the HTML to see what I mean). This is because we weren't rebuilding the attention metadata for the draft decoding steps during FULL cudagraph, leading to stale values in the attention metadata builder state/buffers for some attention backends. As a result, the quality of draft tokens at positions > 0 was worse. This is fixed now, and you can observe the improvement primarily in draft tokens 1 and 2, as expected.
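For readers unfamiliar with how per-position acceptance rates like those in the HTML sweep are read: each draft position has its own acceptance rate, and staleness hurts later positions most. The counts below are made up for illustration, not taken from the benchmark.

```python
# Per-position draft token statistics (hypothetical numbers).
proposed = [1000, 1000, 1000]  # drafts proposed at positions 0, 1, 2
accepted = [820, 610, 480]     # drafts accepted at each position

# Acceptance rate per draft position; positions 1 and 2 are where the
# metadata rebuild shows the largest improvement, per the PR description.
rates = [a / p for a, p in zip(accepted, proposed)]
print([round(r, 2) for r in rates])  # [0.82, 0.61, 0.48]
```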