
[perf] v1/spec_decode: skip softmax for all-greedy rejection sampling #32852

Merged
benchislett merged 6 commits into vllm-project:main from caozuoba:perf/rejection-sampler-greedy on Jan 31, 2026

Conversation

@caozuoba (Contributor) commented on Jan 22, 2026

Purpose

This PR avoids computing a full-vocabulary softmax in the v1 speculative decoding rejection sampler when the entire batch is greedy (sampling_metadata.all_greedy).

For all-greedy decoding, the rejection sampler only needs argmax(target_logits); a dense softmax is unnecessary work. Since argmax(softmax(logits)) == argmax(logits), this change is behavior-preserving for the greedy path while reducing compute/memory overhead.
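
For concreteness, a minimal PyTorch sketch (illustrative only, not the PR's diff; the shapes are made up) of the invariance the greedy path relies on: softmax is strictly increasing per row, so it cannot change which index is largest.

```python
import torch

torch.manual_seed(0)
# Illustrative shape: 8 draft positions over a 32k vocabulary.
target_logits = torch.randn(8, 32_000)

greedy_from_logits = target_logits.argmax(dim=-1)
greedy_from_probs = torch.softmax(target_logits, dim=-1).argmax(dim=-1)

# softmax is strictly increasing elementwise per row, so the argmax index
# is identical whether we normalize or not.
assert torch.equal(greedy_from_logits, greedy_from_probs)
```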

Test Result

Correctness (pytest)

Command

pytest -q tests/v1/sample/test_rejection_sampler.py

Result

.....................................                                                                                                                   [100%]
====================================================================== warnings summary =======================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
37 passed, 2 warnings in 30.04s

Performance

Compared to main, on NVIDIA H800, this PR improves Output token throughput (tok/s) by ~3.14%, reduces Mean TPOT (ms) by ~7.04%, and reduces Mean E2EL (ms) by ~3.67%.

main
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  11.14
Total input tokens:                      128000
Total generated tokens:                  100000
Request throughput (req/s):              89.74
Output token throughput (tok/s):         8974.16
Peak output token throughput (tok/s):    7331.00
Peak concurrent requests:                1000.00
Total token throughput (tok/s):          20461.07
---------------Time to First Token----------------
Mean TTFT (ms):                          4673.80
Median TTFT (ms):                        2721.85
P99 TTFT (ms):                           8599.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.22
Median TPOT (ms):                        37.92
P99 TPOT (ms):                           67.62
---------------Inter-token Latency----------------
Mean ITL (ms):                           79.31
Median ITL (ms):                         72.16
P99 ITL (ms):                            122.20
----------------End-to-end Latency----------------
Mean E2EL (ms):                          8557.05
Median E2EL (ms):                        8733.72
P99 E2EL (ms):                           10952.13
==================================================
PR
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  10.80
Total input tokens:                      128000
Total generated tokens:                  100000
Request throughput (req/s):              92.56
Output token throughput (tok/s):         9255.84
Peak output token throughput (tok/s):    8709.00
Peak concurrent requests:                1000.00
Total token throughput (tok/s):          21103.32
---------------Time to First Token----------------
Mean TTFT (ms):                          4633.34
Median TTFT (ms):                        2941.03
P99 TTFT (ms):                           8419.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.46
Median TPOT (ms):                        34.74
P99 TPOT (ms):                           65.33
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.20
Median ITL (ms):                         69.39
P99 ITL (ms):                            109.64
----------------End-to-end Latency----------------
Mean E2EL (ms):                          8243.21
Median E2EL (ms):                        8503.44
P99 E2EL (ms):                           10646.14
==================================================

@mergify bot added the v1 label on Jan 22, 2026
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a valuable performance optimization by skipping the full-vocabulary softmax calculation during greedy decoding in the rejection sampler. The change correctly leverages the mathematical equivalence of argmax(softmax(logits)) and argmax(logits), leading to improved output token throughput and reduced latency as demonstrated by the provided benchmark results. The logic appears sound and the change is well-justified for performance gains.

Comment thread on vllm/v1/sample/rejection_sampler.py (outdated):

    # NOTE: For all-greedy decoding, the rejection sampler only needs
    # argmax(target_logits), so computing a full-vocab softmax is wasted.
    if sampling_metadata.all_greedy:
        target_probs = target_logits
Severity: high

The variable target_probs is assigned target_logits when sampling_metadata.all_greedy is true. While the current usage in rejection_sample correctly handles this dual meaning (using it for argmax in greedy mode and as probabilities in random mode), the name target_probs can be misleading as it typically implies a probability distribution (values summing to 1). This could lead to confusion for future developers and potentially introduce bugs if the variable is used in contexts where actual probabilities are strictly expected, without explicitly checking the all_greedy flag. Consider using a more generic name for this variable, such as target_sampling_input, to accurately reflect its conditional content.
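
A minimal sketch of the kind of rename the bot is suggesting (hypothetical helper and name; the PR ultimately resolved the ambiguity differently, by passing logits into rejection_sample):

```python
import torch

def select_sampling_input(target_logits: torch.Tensor, all_greedy: bool) -> torch.Tensor:
    # Hypothetical rename: "sampling input" covers both cases, avoiding the
    # implication that the returned tensor is always a normalized distribution.
    if all_greedy:
        return target_logits  # raw logits; only argmax is taken downstream
    return torch.softmax(target_logits, dim=-1, dtype=torch.float32)
```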

@mgoin requested a review from benchislett on January 22, 2026
@mgoin (Member) left a comment

This seems reasonable to me but I'd like @benchislett or @WoosukKwon to sign off

@mgoin requested a review from WoosukKwon on January 22, 2026
@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jan 22, 2026
@benchislett (Collaborator) commented

Only concern would be if those probs are used downstream in anything besides sampling. Otherwise looks good

@caozuoba (Contributor, Author) commented

> Only concern would be if those probs are used downstream in anything besides sampling. Otherwise looks good

Regarding this concern: in the all_greedy case we only use argmax and return early in rejection_sample, so the values are never treated as normalized probabilities. Using logits there is behavior-preserving (argmax(softmax(x)) == argmax(x)). Non-greedy paths still use softmax as before. Thanks for your review.
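
To make the early-return argument concrete, here is a toy, self-contained version of the greedy accept rule (hypothetical helper, not vLLM's actual code): a draft token is accepted iff it equals the target argmax, so the tensor's values are never read as probabilities.

```python
import torch

def greedy_accept_sketch(
    draft_token_ids: torch.Tensor,  # [num_tokens], int64
    target_logits: torch.Tensor,    # [num_tokens, vocab_size]
) -> torch.Tensor:
    """Toy all-greedy path: logits are only compared via argmax."""
    target_argmax = target_logits.argmax(dim=-1)
    accepted = draft_token_ids == target_argmax
    # Emit the draft token where it matches, otherwise the target's argmax.
    return torch.where(accepted, draft_token_ids, target_argmax)
```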

@jeejeelee (Collaborator) commented

@benchislett Could you please take another look?

@caozuoba (Contributor, Author) commented

@mgoin @benchislett Hi, could you please let me know if this PR is ready to be merged? If you’d like me to run any additional tests on my side, please tell me which ones and I’ll do that. Thanks for your time and review.

@caozuoba (Contributor, Author) commented

Could someone please help move this PR forward? Thanks. @mgoin @benchislett @WoosukKwon @njhill

@njhill (Member) left a comment

Thanks @caozuoba.

I think it would be clearer to have rejection_sample take the target logits rather than probs, and move the softmax inside there.

Please also sign-off your commits for the DCO.

@caozuoba (Contributor, Author) commented

> Thanks @caozuoba.
>
> I think it would be clearer to have rejection_sample take the target logits rather than probs, and move the softmax inside there.
>
> Please also sign-off your commits for the DCO.

@njhill Thanks for the feedback! Agree this would be clearer. I’ll update rejection_sample to take the target logits and move the softmax inside the function.
I’ll also sign off my commits for the DCO and push an updated version shortly.
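
A rough sketch of the suggested restructuring (hypothetical signature and names, not the merged diff): the caller hands raw logits to rejection_sample, which decides internally whether a softmax is needed.

```python
import torch

def rejection_sample_sketch(
    target_logits: torch.Tensor,  # [num_tokens, vocab_size], raw logits
    all_greedy: bool,
) -> torch.Tensor:
    if all_greedy:
        # Greedy fast path: argmax on logits, softmax skipped entirely.
        return target_logits.argmax(dim=-1)
    # Random-sampling path: normalize here, inside the function.
    target_probs = torch.softmax(target_logits, dim=-1, dtype=torch.float32)
    return torch.multinomial(target_probs, num_samples=1).squeeze(-1)
```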

@caozuoba force-pushed the perf/rejection-sampler-greedy branch from 4785ff8 to bf858cb on January 30, 2026
@caozuoba (Contributor, Author) commented

@njhill Thanks! I’ve updated the code to have rejection_sample take target logits and moved the softmax inside. I also rebased and added DCO sign-offs to all commits (force-pushed to the same PR branch). Could you please take another look when you have a chance?

@benchislett (Collaborator) left a comment

LGTM

@benchislett enabled auto-merge (squash) on January 30, 2026
@benchislett merged commit 8980001 into vllm-project:main on Jan 31, 2026
40 checks passed
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…vllm-project#32852)

Signed-off-by: hdj <1293066020@qq.com>
Signed-off-by: Pai <416932041@qq.com>
whx-sjtu pushed a commit to vllm-project/vllm-ascend that referenced this pull request on Feb 11, 2026

### What this PR does / why we need it?
This PR updates `target_probs` to `target_logits` in
`rejection_sample`, catching up with
vllm-project/vllm#32852. Otherwise, sampling
with temperature incurs accuracy problems, where tokens can be
accepted or rejected unreasonably.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
By CI.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@1339784

Signed-off-by: Zetong Li <slippersss@126.com>
Several vllm-ascend forks (chenchuw886, banxiaduhuo, ZRJ026, maoxx241, LCAIZJ, yangzhe-2026) subsequently picked the same commit (vllm-project#6685) and referenced this pull request.

Labels: ready (ONLY add when PR is ready to merge/full CI is needed), v1

5 participants