
Andy/spec probs #35398

Closed
andylolu2 wants to merge 4 commits into vllm-project:main from andylolu2:andy/spec-probs

Conversation

@andylolu2
Contributor

@andylolu2 andylolu2 commented Feb 26, 2026

Purpose

Support drafter probabilities in ModelRunnerV2.

Test Plan

TBD

Test Result

TBD

Signed-off-by: Andy Lo <andy@mistral.ai>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for using draft model probabilities in speculative decoding, a significant enhancement. The changes involve refactoring the sampling logic and updating data structures like InputBatch and RequestState to handle draft logits. While the overall direction is good, I've identified two critical bugs related to incorrect tensor shapes and arguments being passed to kernel functions. These issues will lead to incorrect behavior and must be addressed.

return draft_tokens.view(-1, 1)
return Speculation(
draft_tokens=draft_tokens.view(-1, 1),
draft_logits=draft_logits.view(-1, 1),

critical

The draft_logits tensor is being reshaped incorrectly. draft_logits has a shape of [num_reqs, vocab_size]. Using .view(-1, 1) results in a tensor of shape [num_reqs * vocab_size, 1], which is incorrect.

The Speculation dataclass expects draft_logits to have a shape of [num_reqs, num_speculative_steps, vocab_size]. For num_speculative_steps=1, the correct shape is [num_reqs, 1, vocab_size]. You should use .unsqueeze(1) to add a dimension.

Suggested change
draft_logits=draft_logits.view(-1, 1),
draft_logits=draft_logits.unsqueeze(1),
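To make the shape mismatch concrete, here is a standalone PyTorch sketch with made-up sizes (illustrative only, not vLLM code):

```python
import torch

# Hypothetical sizes for illustration; the real values depend on the batch.
num_reqs, vocab_size = 4, 32000
draft_logits = torch.randn(num_reqs, vocab_size)

# .view(-1, 1) flattens across the vocab dimension, producing the wrong shape.
wrong = draft_logits.view(-1, 1)
assert wrong.shape == (num_reqs * vocab_size, 1)

# .unsqueeze(1) inserts the num_speculative_steps axis (here 1), giving the
# [num_reqs, num_speculative_steps, vocab_size] layout the review describes.
right = draft_logits.unsqueeze(1)
assert right.shape == (num_reqs, 1, vocab_size)
```

Note that `.view(-1, 1)` is correct for `draft_tokens` (shape `[num_reqs]`), which is why the bug is easy to introduce by analogy.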

Signed-off-by: Andy Lo <andy@mistral.ai>
@andylolu2
Contributor Author

/gemini review I think there's a bug when cudagraph is enabled. Do you see it?

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for using draft model probabilities in speculative decoding, which is a significant enhancement. The changes involve propagating draft logits through the system, refactoring the sampler for probabilistic rejection sampling, and updating the speculator accordingly. A new correctness test for sampling recovered and bonus tokens is a valuable addition. While the overall direction is good, I've identified a critical issue with the log-probability calculation in the speculative decoding path. The logic for computing logprobs for accepted tokens appears to be incorrect due to tensor shape mismatches and not accounting for different token origins (accepted draft vs. recovered), which will likely lead to incorrect output.

Comment on lines +166 to +172
max_num_logprobs = self.sampling_states.max_num_logprobs(idx_mapping_np)
if max_num_logprobs != NO_LOGPROBS:
expanded_logits = logits.shape[0] != idx_mapping_np.shape[0]
cu_num_logits_list = cu_num_logits_np.tolist() if expanded_logits else None
logprobs_tensors = compute_topk_logprobs(
processed_logits, max_num_logprobs, sampled, cu_num_logits_list
)

critical

The log-probability calculation for speculative decoding appears to be incorrect. The compute_topk_logprobs function is called with a sampled tensor of shape [num_reqs, num_speculative_steps + 1] and processed_logits of shape [num_draft_tokens + num_reqs, vocab_size]. The sampled tensor is 2D and padded, while compute_topk_logprobs likely expects a 1D tensor of token IDs that aligns with the provided logits.

Furthermore, the sampled tokens are a mix of accepted draft tokens and "recovered" tokens. The logprobs for these two types of tokens come from different distributions (p_target vs. renormalize(max(0, p_target - p_draft))). The current implementation does not seem to distinguish between them when computing logprobs, passing only processed_logits (derived from p_target for proposed tokens). This will lead to incorrect logprob values.
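To make the two distributions concrete, the following is a standalone toy sketch of single-token rejection sampling (names and sizes are hypothetical; this is not vLLM's kernel):

```python
import torch

# Toy distributions over a small vocabulary.
torch.manual_seed(0)
vocab_size = 8
p_target = torch.softmax(torch.randn(vocab_size), dim=-1)
p_draft = torch.softmax(torch.randn(vocab_size), dim=-1)

# A proposed draft token t is accepted with probability
# min(1, p_target[t] / p_draft[t]); accepted tokens are scored under p_target.
t = torch.multinomial(p_draft, 1).item()
accept_prob = torch.clamp(p_target[t] / p_draft[t], max=1.0)

# On rejection, the "recovered" token is drawn from
# renormalize(max(0, p_target - p_draft)), a different distribution,
# so its logprob cannot come from the target logits alone.
recovered = torch.clamp(p_target - p_draft, min=0.0)
recovered = recovered / recovered.sum()
assert torch.isclose(recovered.sum(), torch.tensor(1.0))
```

This illustrates why a single `processed_logits` tensor is insufficient for logprobs once recovered tokens enter the mix.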

Signed-off-by: Andy Lo <andy@mistral.ai>
@TheEpicDolphin
Collaborator

Hi @andylolu2, I was tasked with working on probabilistic rejection sampling for MRV2 in #35461, and stumbled upon your PR here. Looks like we took similar approaches. I have some benchmark results on my PR showing some acceptance rate improvements. I'm happy to compare implementations and figure out the best path forward for us to unlock this feature. Perhaps we can combine the best parts of both!

cc: @WoosukKwon

@mergify

mergify bot commented Mar 3, 2026

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @andylolu2.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 3, 2026
@andylolu2
Contributor Author

andylolu2 commented Mar 3, 2026

Hi @andylolu2, I was tasked with working on probabilistic rejection sampling for MRV2 in #35461, and stumbled upon your PR here. Looks like we took similar approaches. I have some benchmark results on my PR showing some acceptance rate improvements. I'm happy to compare implementations and figure out the best path forward for us to unlock this feature. Perhaps we can combine the best parts of both!

cc: @WoosukKwon

Oooo amazing, I am a bit short on bandwidth at the moment, so please feel free to take over!

I have some takeaways from implementing this; let me comment on your PR.
