
[Core] Support logprobs with spec decode + async scheduling #29223

Merged
njhill merged 1 commit into vllm-project:main from njhill:as-sd-logprobs
Nov 25, 2025

Conversation

@njhill (Member) commented Nov 22, 2025

No description provided.

@mergify mergify bot added the v1 label Nov 22, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request adds support for logprobs with speculative decoding and asynchronous scheduling. The changes involve refactoring how cumulative token counts are calculated and passed to correctly process logprobs in these scenarios. The modifications in vllm/v1/sample/rejection_sampler.py and vllm/v1/worker/gpu_model_runner.py seem correct and well-structured. New tests are added to cover these cases. My main concern is the significant increase in tolerance in tests/v1/sample/test_logprobs.py for comparing logprobs, which might hide numerical precision issues. Please see the specific comment for details.

@njhill njhill requested a review from benchislett November 22, 2025 05:13
mergify bot commented Nov 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 22, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
```python
cu_num_tokens = None
if return_cu_num_tokens:
    cu_num_tokens = [0] + valid_mask.sum(axis=1).cumsum().tolist()
if len(discard_req_indices) > 0:
```
A Collaborator commented:

Why is this done after computing cu_num_tokens?

@njhill (Member, Author) replied Nov 25, 2025:

Because cu_num_tokens is used to index into the logprobs tensors that don't take the discarded indices into account, so doing it beforehand results in incorrect output.

Originally it was done beforehand, which was a bug fixed by #29216.
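The ordering question above can be illustrated with a minimal sketch: the cumulative counts must be derived from the valid mask over all requests, matching the layout of the logprobs tensors, before any discarded request rows are handled. Shapes and names here are illustrative, not vLLM's exact internals.

```python
# Minimal sketch of the ordering discussed above, using an illustrative
# 3-request batch. cu_num_tokens must be computed BEFORE discarded request
# indices are handled, because it indexes into logprobs tensors that still
# include those rows.
import numpy as np

valid_mask = np.array([
    [True, True, False],   # request 0: 2 accepted tokens
    [True, False, False],  # request 1: 1 accepted token (to be discarded)
    [True, True, True],    # request 2: 3 accepted tokens
])

# Cumulative token counts over ALL requests, matching the logprobs layout.
cu_num_tokens = [0] + valid_mask.sum(axis=1).cumsum().tolist()
print(cu_num_tokens)  # [0, 2, 3, 6]
```

Dropping request 1 first would instead yield `[0, 2, 5]`, misaligning every slice taken from the logprobs tensors after that row.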

@benchislett (Collaborator) left a comment:

Looks good overall

@njhill njhill merged commit 4e57c65 into vllm-project:main Nov 25, 2025
49 checks passed
@njhill njhill deleted the as-sd-logprobs branch November 25, 2025 20:55
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Dec 16, 2025
### What this PR does / why we need it?

Currently, we use `AscendRejectionSampler`, which extends `RejectionSampler`, in spec decoding. `AscendRejectionSampler` overrides `RejectionSampler.forward` only to replace the `rejection_sample` function. This prevents much of `RejectionSampler`'s code from being reused, for example:
- vllm-project/vllm#19482
- vllm-project/vllm#26060
- vllm-project/vllm#29223

#### Proposed Change:
- Delete `AscendRejectionSampler` and use `RejectionSampler` directly in the model runner.
- Patch `RejectionSampler.expand_batch_to_tokens` and `RejectionSampler.rejection_sample`; a better approach may be to make them custom ops.
- Modify `NPUModelRunner` following vllm-project/vllm#26060
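The "patch instead of subclass" idea in the bullets above can be sketched as follows: replace only the hardware-specific function on the upstream class rather than overriding `forward` wholesale, so the rest of the upstream logic stays shared. The class and function bodies below are stand-ins, not vllm-ascend's actual code.

```python
# Hedged sketch: patching a single method on an upstream sampler class
# instead of subclassing and overriding forward(). All names and bodies
# are illustrative placeholders.

class RejectionSampler:
    def rejection_sample(self, logits):
        return "cpu/gpu path"

    def forward(self, logits):
        # Shared logic (logprobs handling, etc.) stays reusable upstream.
        return f"forward -> {self.rejection_sample(logits)}"

def npu_rejection_sample(self, logits):
    return "npu path"

# Patch only the function that needs a hardware-specific implementation;
# forward() and everything else continue to come from upstream.
RejectionSampler.rejection_sample = npu_rejection_sample

print(RejectionSampler().forward(None))  # forward -> npu path
```

The trade-off noted in the bullet stands: patching is fragile against upstream signature changes, which is why promoting these functions to custom ops may be cleaner.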

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- [x] test logits processor for spec decoding
- [x] test logprobs for spec decoding
- [x] test logprobs for spec decoding + async scheduling (tested with #4893)


- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
chenaoxuan pushed a commit to chenaoxuan/vllm-ascend that referenced this pull request Dec 20, 2025
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…ject#29223)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026

Labels: `ready`, `v1`

Projects: None yet

Development

Successfully merging this pull request may close these issues.

2 participants