Avoid eager recovery sampling in speculative rejection #41258

Open
masterFoad wants to merge 3 commits into vllm-project:main from masterFoad:lazy-speculative-rejection

@masterFoad masterFoad commented Apr 29, 2026

Purpose

This PR avoids eager recovery sampling in v1 speculative rejection sampling.

Previously, the production random-sampling path sampled a recovered token for every draft position before acceptance was known. The updated path accepts draft tokens first and computes a recovered token only at the first rejection.

The latest patch also replaces production recovery's per-request full-vocab exponential race with one per-token inverse-CDF threshold. This keeps the recovery distribution equivalent to sampling from (target_prob - draft_prob)^+, while avoiding batch_size * vocab_size random recovery noise on accepted requests.
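A minimal sketch of why the two recovery samplers target the same distribution, written in plain Python rather than Triton. The function names are hypothetical and are not taken from rejection_sampler.py; the exponential race is written as an arg-min over arrival times, which is one standard way to express it and is assumed here to match the eager helper only in distribution.

```python
import bisect
import itertools
import random


def residual_mass(target_prob, draft_prob):
    """Unnormalized recovery mass (target_prob - draft_prob)^+ per vocab entry."""
    return [max(t - d, 0.0) for t, d in zip(target_prob, draft_prob)]


def recover_exponential_race(mass, rng):
    """Eager-style draw: one Exp(1) noise value per vocabulary entry.

    Entry i "arrives" at time noise_i / mass_i; the earliest arrival wins,
    which selects i with probability mass_i / sum(mass).
    """
    best_idx, best_time = -1, float("inf")
    for i, m in enumerate(mass):
        if m <= 0.0:
            continue  # zero-mass entries can never win the race
        arrival = rng.expovariate(1.0) / m
        if arrival < best_time:
            best_idx, best_time = i, arrival
    return best_idx


def recover_inverse_cdf(mass, u):
    """Lazy-style draw: a single uniform u in [0, 1) for the whole vocabulary."""
    cdf = list(itertools.accumulate(mass))
    threshold = u * cdf[-1]  # scale by total mass instead of normalizing
    idx = bisect.bisect_right(cdf, threshold)  # first entry with cdf > threshold
    return min(idx, len(mass) - 1)  # guard against floating-point round-off


target_prob = [0.1, 0.3, 0.1, 0.4, 0.1]
draft_prob = [0.3, 0.1, 0.3, 0.1, 0.2]
mass = residual_mass(target_prob, draft_prob)  # roughly [0, 0.2, 0, 0.3, 0]
print(recover_inverse_cdf(mass, 0.1), recover_inverse_cdf(mass, 0.9))
```

The race needs one random value per vocabulary entry, while the inverse-CDF draw needs one value per token position, which is where the reduction in recovery noise comes from.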

Implementation

  • Keep the greedy path unchanged.
  • Generate acceptance uniforms as before.
  • Generate one recovery uniform per draft token for random requests.
  • In rejection_random_sample_kernel, scan the vocabulary only if a request rejects.
  • Use inverse-CDF over the unnormalized recovery mass.
  • Keep sample_recovered_tokens as an eager compatibility helper for existing tests.
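The control flow in the list above can be sketched for a single request as follows. This is an illustrative pure-Python model, not the Triton kernel: the function names, the vocab_scans counter, and the omission of the bonus token are all assumptions made for the example.

```python
import bisect
import itertools


def inverse_cdf_recover(target_row, draft_row, u):
    """Scan the vocabulary once and pick a token from (target - draft)^+."""
    mass = [max(t - d, 0.0) for t, d in zip(target_row, draft_row)]
    cdf = list(itertools.accumulate(mass))
    idx = bisect.bisect_right(cdf, u * cdf[-1])
    return min(idx, len(mass) - 1)


def lazy_rejection_sample(draft_tokens, draft_probs, target_probs,
                          accept_uniforms, recovery_uniforms):
    """Accept draft tokens in order; recover only at the first rejection."""
    output, vocab_scans = [], 0
    for pos, token in enumerate(draft_tokens):
        p_draft = draft_probs[pos][token]
        p_target = target_probs[pos][token]
        # Accept with probability min(1, p_target / p_draft).
        if p_draft > 0.0 and p_target / p_draft >= accept_uniforms[pos]:
            output.append(token)
            continue
        # First rejection: this is the only place the vocabulary is scanned.
        vocab_scans += 1
        output.append(inverse_cdf_recover(
            target_probs[pos], draft_probs[pos], recovery_uniforms[pos]))
        break
    return output, vocab_scans


draft_probs = [[0.5, 0.3, 0.2], [0.7, 0.2, 0.1]]
target_probs = [[0.6, 0.2, 0.2], [0.1, 0.5, 0.4]]
print(lazy_rejection_sample([0, 0], draft_probs, target_probs,
                            [0.9, 0.5], [0.0, 0.25]))
```

A fully accepted request finishes with vocab_scans equal to 0, while the recovery uniforms are still generated up front, one per draft position.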

Correctness note: seeded requests remain deterministic within this implementation, but recovered token IDs are not guaranteed to be bit-identical to the previous exponential-race implementation because the RNG stream changed. The distribution is the intended contract and is covered by the A100 distribution check below.

Duplicate-work check

This updates the existing PR #41258. I searched open PRs for speculative-rejection and recovery keywords.

Related but not duplicate:

Test plan and results

Local checks:

git diff --check
.venv/bin/python -m py_compile vllm/v1/sample/rejection_sampler.py tests/v1/sample/test_rejection_sampler.py
uvx ruff check vllm/v1/sample/rejection_sampler.py tests/v1/sample/test_rejection_sampler.py

Result: passed.

OpenShift GPU smoke on A100-SXM4-80GB:

job: vllm-final-smoke-132730
smoke_deterministic passed
smoke_synthetic passed
ALL FULL REJECTION KERNEL SMOKE CHECKS PASSED

OpenShift distribution check on A100-SXM4-80GB:

job: vllm-cdf-dist-114215
n 100 target_dist 0.0820042938 ref_dist 0.0141119945
n 1000 target_dist 0.0281292591 ref_dist 0.0120544052
n 10000 target_dist 0.0086425478 ref_dist 0.0120583880
n 100000 target_dist 0.0028381334 ref_dist 0.0120294106
improvement target 28.8937 reference 1.1731 ratio 24.6297
DISTRIBUTION CHECK PASSED
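For readers who want to reproduce the shape of this check locally, here is a rough sketch of one way to measure convergence of inverse-CDF recovery toward the normalized residual distribution. The distance metric used by the OpenShift job is not stated above, so total variation distance is an assumption, and the probabilities are made-up toy values.

```python
import bisect
import itertools
import random


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def empirical_recovery_dist(mass, n, rng):
    """Draw n recovered tokens by inverse CDF and return their frequencies."""
    cdf = list(itertools.accumulate(mass))
    counts = [0] * len(mass)
    for _ in range(n):
        idx = bisect.bisect_right(cdf, rng.random() * cdf[-1])
        counts[min(idx, len(mass) - 1)] += 1
    return [c / n for c in counts]


# Toy distributions (hypothetical values, not from the benchmark).
target_prob = [0.05, 0.35, 0.30, 0.10, 0.20]
draft_prob = [0.25, 0.10, 0.20, 0.30, 0.15]
mass = [max(t - d, 0.0) for t, d in zip(target_prob, draft_prob)]
expected = [m / sum(mass) for m in mass]

rng = random.Random(0)
for n in (100, 1_000, 10_000, 100_000):
    tv = total_variation(empirical_recovery_dist(mass, n, rng), expected)
    print(f"n {n} tv_distance {tv:.6f}")
```

With a correct sampler the distance should shrink roughly like 1/sqrt(n), which matches the trend of the target_dist column above.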

OpenShift end-to-end benchmark on A100-SXM4-80GB:

job: vllm-cdf-bench-113122
model: facebook/opt-125m
speculative method: ngram
num speculative tokens: 16
num prompts: 64
max output tokens: 96
warmups: 2
repetitions: 5

Results vs eager recovery baseline:

Scenario: standard
median req/s      : normal=300.488 lazy=328.090 delta=+9.19%
median gen tok/s  : normal=28846.813 lazy=31496.624 delta=+9.19%
acceptance len    : normal=14.155 lazy=14.298
draft tokens      : normal=34720 lazy=34080
first-run outputs : DIFFER

Scenario: synthetic_high_accept
median req/s      : normal=213.698 lazy=224.172 delta=+4.90%
median gen tok/s  : normal=20514.979 lazy=21520.482 delta=+4.90%
acceptance len    : normal=12.667 lazy=12.667
draft tokens      : normal=28120 lazy=28120
first-run outputs : DIFFER

The output difference is expected from the inverse-CDF RNG stream change. The distribution check above is the semantic validation.
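The reported deltas are relative changes of the lazy median against the eager ("normal") median. A small sketch of that arithmetic, using the medians listed above (the helper name is illustrative):

```python
def relative_delta(baseline, candidate):
    """Percentage change of candidate relative to baseline."""
    return (candidate / baseline - 1.0) * 100.0


standard = relative_delta(300.488, 328.090)
high_accept = relative_delta(213.698, 224.172)
print(f"standard req/s delta={standard:+.2f}%")               # +9.19%
print(f"synthetic_high_accept req/s delta={high_accept:+.2f}%")  # +4.90%
```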

Additional OpenShift performance evidence

OpenShift A100 mechanism check:

job: vllm-bench-lay-spec-mechanism-20260524-203123
GPU: NVIDIA A100-SXM4-80GB
shape: batch_size=64, max_spec_len=16, vocab_size=50272, trials=200
contract: pass

Mechanism result:

full-vocab recovery random values: eager=3,217,408 lazy=1,024 delta=3142.0x fewer
recovery buffer bytes             : eager=12,869,632 lazy=8,192 delta=1571.0x fewer
all-accept median mechanism time  : eager=1.284 ms lazy=0.140 ms delta=-89.10%

Tradeoff probe:

first-reject median mechanism time: eager artifact=1.284 ms lazy artifact=2.845 ms delta=+121.56%

Interpretation: this supports the intended mechanism claim, not a universal latency claim. The PR removes eager full-vocab recovered-token materialization from accepted speculative paths. It still generates one scalar recovery uniform per draft token, and recovery scanning happens only after rejection.
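The mechanism counts follow directly from the artifact shape. The sketch below reproduces them; the element widths (4 bytes for the eager buffer, 8 bytes for the lazy uniforms) are inferred from the reported byte counts and should be read as an assumption rather than a statement about the kernel's dtypes.

```python
batch_size, max_spec_len, vocab_size = 64, 16, 50272

# Eager recovery: one random value per (request, vocabulary entry).
eager_values = batch_size * vocab_size    # 3,217,408
# Lazy recovery: one scalar uniform per (request, draft position).
lazy_values = batch_size * max_spec_len   # 1,024

# Assumed element widths, inferred from the reported buffer sizes.
eager_bytes = eager_values * 4            # 12,869,632
lazy_bytes = lazy_values * 8              # 8,192

print(f"values: {eager_values / lazy_values:.1f}x fewer")  # 3142.0x
print(f"bytes : {eager_bytes / lazy_bytes:.1f}x fewer")    # 1571.0x

# All-accept timing delta from the reported medians (ms).
all_accept_delta = (0.140 / 1.284 - 1.0) * 100.0
print(f"all-accept delta: {all_accept_delta:+.2f}%")       # -89.10%
```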

OpenShift A100 serving latency check:

job: vllm-bench-lazy-spec-latency
model: facebook/opt-125m
speculative method: ngram
num speculative tokens: 16
num prompts: 128
fixed output tokens: 96
max concurrency: 64
warmups: 8 streaming requests

Results vs eager recovery baseline:

Scenario: standard
request throughput : +33.93%
output throughput  : +33.93%
TTFT p95 / p99     : -48.44% / -48.15%
TPOT p95 / p99     : -47.83% / -27.02%
ITL p95 / p99      : -27.83% / -85.73%
E2EL p95 / p99     : -35.06% / -33.00%
notes              : median TTFT was +3.49%, median ITL was +105.50%

Scenario: synthetic_high_accept
request throughput : +4.14%
output throughput  : +4.14%
E2EL p99           : -10.54%
notes              : TTFT p95/p99 were +6.11%/+8.71%, ITL p95/p99 were +64.13%/+46.01%

Metric definitions:

  • TTFT: request start to first streamed text chunk.
  • ITL: time between subsequent streamed text chunks, so it includes streaming and client scheduling effects.
  • TPOT: (end-to-end latency - TTFT) / (expected_output_tokens - 1). This benchmark uses fixed 96-token requests with ignore_eos=True.
  • E2EL: request start to stream completion.
  • Spec decode counters: Prometheus /metrics deltas around each run.
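As a sketch of how the per-request metrics above can be derived from client-side timestamps: the function and field names are hypothetical, and stream completion is assumed to coincide with the last received chunk, which may differ from how the benchmark client records it.

```python
def latency_metrics(request_start, chunk_times, expected_output_tokens):
    """Compute TTFT, ITL, TPOT and E2EL (seconds) for one streamed request.

    chunk_times holds the arrival time of each streamed text chunk, in order.
    """
    ttft = chunk_times[0] - request_start
    # Gaps between consecutive chunks; includes streaming and client effects.
    itl = [later - earlier
           for earlier, later in zip(chunk_times, chunk_times[1:])]
    # Assumption: the stream completes when the last chunk arrives.
    e2el = chunk_times[-1] - request_start
    tpot = (e2el - ttft) / (expected_output_tokens - 1)
    return {"ttft": ttft, "itl": itl, "tpot": tpot, "e2el": e2el}


metrics = latency_metrics(
    request_start=0.0,
    chunk_times=[0.5, 0.7, 1.0, 1.5],
    expected_output_tokens=96,
)
print(metrics["ttft"], metrics["e2el"], metrics["tpot"])
```

Because speculative decoding can emit several tokens per chunk, ITL is measured per chunk while TPOT is normalized by the fixed token count, so the two can move in opposite directions.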

I also tested a local early-exit idea for the recovery CDF scan. OpenShift serving evidence was mixed, so I did not include or push it.

AI assistance

This PR was AI-assisted. I reviewed the changed lines and the validation results, and I can explain the algorithm, tradeoffs, and benchmark limitations.

Effective PR description checklist
  • Purpose is described.
  • Test commands are listed.
  • Test results are included.
  • AI assistance is disclosed.
  • Duplicate-work check is included.

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added the v1 label Apr 29, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the rejection sampling logic to implement lazy token recovery within the Triton kernel, removing the separate pre-sampling stage. Key changes include updating the rejection_sample workflow and modifying data types for output buffers and random probabilities. Feedback identifies critical bugs in the refactored apply_sampling_constraints function, specifically the removal of temperature scaling and incorrect parameter expansion. It is also recommended to use int32 for output tokens and float64 for uniform probabilities to maintain consistency and numerical stability.

Comment thread vllm/v1/sample/rejection_sampler.py Outdated
Comment thread vllm/v1/sample/rejection_sampler.py Outdated
Comment thread vllm/v1/sample/rejection_sampler.py Outdated
Comment thread vllm/v1/sample/rejection_sampler.py
@masterFoad masterFoad force-pushed the lazy-speculative-rejection branch from 4fb2df6 to 9a4a5a9 on April 29, 2026 16:17
@masterFoad
Author

In addition, the following results are from a benchmark run on my laptop (WSL on Windows):

Performance Gains (RTX 4060 Laptop GPU, 8GB VRAM):

  • Kernel Speedup: ~8.0x (0.27ms vs 2.15ms) on a targeted Triton kernel benchmark (8 positions, reject at pos 0).
  • End-to-End Speculative Decoding Throughput: +14.7% (15.6 tokens/s vs 13.6 tokens/s) using Qwen2.5-1.5B with Qwen2.5-0.5B drafter.

@masterFoad
Author

masterFoad commented May 22, 2026

Update: I force-pushed a cleaned-up single-commit version of this PR.

Changes since the earlier review:

  • Restored the thinking-budget logits-processing path so the PR stays scoped to speculative rejection sampling.
  • Kept output_token_ids allocated before greedy/random branching, including the all-random path.
  • Preserved float64 uniform probabilities and int32 output token IDs.
  • Generate uniform probabilities for synthetic mode even when all requests are greedy.
  • Kept sample_recovered_tokens as an eager compatibility/test helper, while the production rejection path now performs recovery lazily inside rejection_random_sample_kernel.

Local validation after cleanup:

  • py_compile passed.
  • ruff-check passed.
  • ruff-format passed.
  • typos passed.
  • pytest --collect-only tests/v1/sample/test_rejection_sampler.py passed locally.

Remaining validation needed:

  • I am still working on tests and on an updated, verified benchmark.

The core change is now narrower: eager recovered-token sampling is removed from the production rejection path, but the eager helper remains for test/compatibility coverage.

Compute recovered tokens only when rejection occurs instead of eagerly scanning the vocabulary for every speculative position. Preserve the eager recovery helper for compatibility tests and restore existing sampling edge-case behavior.

Constraint: vLLM DCO requires a Signed-off-by trailer matching the human contributor identity.

Rejected: keeping merge-heavy PR history | unsigned merge commits keep DCO failing and obscure the actual change.

Confidence: medium

Scope-risk: moderate

Directive: validate CUDA/Triton rejection sampler tests and benchmarks before treating the kernel change as merge-ready.

Tested: py_compile; ruff-check; ruff-format; typos; pytest collect-only for tests/v1/sample/test_rejection_sampler.py on Apple Silicon via uv.

Not-tested: CUDA/Triton runtime tests; GPU microbenchmark; end-to-end throughput benchmark after the local fixes.
Signed-off-by: Foad Abo Dahood <foad.abo.dahood@ibm.com>
@masterFoad masterFoad force-pushed the lazy-speculative-rejection branch from ca9ce98 to f312c12 on May 22, 2026 18:13
masterFoad and others added 2 commits May 23, 2026 06:02
Switch production lazy recovery from eager per-vocab exponential race state to per-token inverse-CDF thresholds, while keeping the eager helper's exponential-race path for compatibility coverage. This improves the real PR result in the target high-acceptance path without claiming bit-identical sample streams against the previous RNG algorithm.

Constraint: vLLM PR policy requires human review, duplicate-work note, test evidence, and AI-assistance disclosure.

Rejected: Preserve exact exponential-race RNG stream | it kept the high-acceptance benchmark near flat and continued to allocate full-vocab recovery noise.

Confidence: medium

Scope-risk: moderate

Directive: Treat this as distribution-equivalent, not seeded-output-identical, unless upstream requires preserving the old recovery RNG stream.

Tested: git diff --check; py_compile rejection sampler and tests; uvx ruff check changed files; OpenShift A100 full rejection-kernel smoke vllm-final-smoke-132730; OpenShift A100 distribution check vllm-cdf-dist-114215; OpenShift A100 benchmark vllm-cdf-bench-113122.

Not-tested: full vLLM CI suite.

Co-authored-by: OpenAI Codex

Signed-off-by: foad abo dahood <foad.abo.dahood@ibm.com>
@masterFoad masterFoad changed the title from "Lazy recovery evaluation for speculative rejection sampling" to "Avoid eager recovery sampling in speculative rejection" May 23, 2026
@masterFoad
Author

Updated this PR with the latest patch and validation.

Summary:

  • Production lazy recovery now uses one inverse-CDF recovery uniform per draft token instead of full-vocab exponential recovery noise.
  • This keeps the recovery distribution equivalent, but does not preserve bit-identical recovered token IDs versus the old exponential-race RNG stream.
  • Added the benchmark and validation details to the PR body.

A100 OpenShift results:

  • Full rejection-kernel smoke: vllm-final-smoke-132730, passed.
  • Distribution check: vllm-cdf-dist-114215, passed with target/reference improvement ratio 24.63x.
  • End-to-end benchmark: vllm-cdf-bench-113122.
    • Standard ngram: +9.19% median req/s and gen tok/s.
    • Synthetic high acceptance: +4.90% median req/s and gen tok/s.

AI assistance was used. I reviewed the changed lines and the validation results.

@masterFoad
Author

masterFoad commented May 24, 2026

Added an OpenShift-only performance evidence update to the PR body.

Summary:

  • A100 mechanism check passed. It shows the intended accepted-path optimization directly: eager recovery uses 3,217,408 full-vocab recovery random values per iteration in the artifact shape, while lazy uses 1,024 scalar recovery uniforms. That is 3142.0x fewer values and 1571.0x fewer recovery-buffer bytes.
  • A100 all-accept mechanism timing was eager 1.284 ms vs lazy 0.140 ms, delta -89.10%.
  • A100 first-reject tradeoff probe was slower for lazy in the artifact, eager 1.284 ms vs lazy 2.845 ms, so I documented this as a tradeoff instead of hiding it.
  • A100 serving latency still shows workload-dependent behavior: standard ngram has +33.93% request/output throughput and strong p95/p99 tail wins, while synthetic high-accept has +4.14% throughput and E2EL p99 improvement but mixed TTFT/ITL tails.
  • I also tried an early-exit recovery scan idea. OpenShift serving evidence was mixed, so I did not push it or include it in this PR.

@mergify
Contributor

mergify Bot commented Jun 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @masterFoad.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 5, 2026