[Spec Decode] Unified Parallel Drafting #32887
benchislett merged 24 commits into vllm-project:main from
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Documentation preview: https://vllm--32887.org.readthedocs.build/en/32887/ |
Code Review
This pull request introduces a unified parallel drafting mechanism for speculative decoding, combining logic for EAGLE and other draft models. The changes are extensive, primarily refactoring the speculative decoding logic into a base proposer class and adding a new, complex Triton kernel for preparing inputs. While the overall refactoring appears sound, I've identified a potential critical issue in the new Triton kernel where a safeguard against out-of-bounds memory access is not being used, which could lead to memory corruption.
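For context on the masking concern: Triton kernels conventionally guard every `tl.load`/`tl.store` with a bounds mask so out-of-range lanes never touch memory. A minimal pure-Python sketch of that pattern (illustrative only, not the kernel in this PR; all names are hypothetical):

```python
def masked_gather(src, indices, bound, fill=0):
    # Emulates Triton's masked load, tl.load(ptr + idx, mask=idx < bound,
    # other=fill): lanes whose index falls outside the valid region get
    # the fill value instead of reading out-of-bounds memory.
    return [src[i] if 0 <= i < bound else fill for i in indices]

src = [10, 20, 30, 40]
# Index 7 is past the valid region; the mask replaces it with 0.
print(masked_gather(src, [0, 2, 7], bound=len(src)))  # [10, 30, 0]
```

Dropping the mask argument in the real kernel would be the memory-corruption hazard the review flags: the gather would dereference whatever lies past the buffer.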
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Hi @benchislett, thanks a lot for your great work! I tested the PARD integration in vLLM and compared its performance with the PARD repo example. Under the same configuration, the acceptance length is well aligned between the two runs (3.56 vs 3.50). Below are the full benchmark results and the exact script I used for vLLM testing.
Results
vLLM test script
# Start server
k=8
target=unsloth/Meta-Llama-3.1-8B-Instruct
draft=amd/PARD-Llama-3.2-1B
vllm serve $target \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--no-enable-prefix-caching \
--port 8811 \
--speculative-config '{"model": "'"$draft"'", "method": "draft_model", "num_speculative_tokens": '"$k"', "parallel_drafting": true}'
# Benchmark
MAX_CONCURRENCY=1
NUM_PROMPTS=80
vllm bench serve --port 8811 \
--temperature 0 \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path philschmid/mt-bench \
--num-prompts ${NUM_PROMPTS} \
--max-concurrency ${MAX_CONCURRENCY}
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
For future readers, here is a link to AWS's P-EAGLE arxiv paper:
When using MTP speculative decoding, the compile range is extended by (multiplier * max_num_seqs), but the assertion in _dummy_run only checked against max_num_batched_tokens, causing warmup to fail. This was introduced in commit af3162d (Unified Parallel Drafting #32887) which added MTP/Eagle to the compile range extension logic. Fix: Use the maximum compile range split point as the upper bound when available, instead of max_num_batched_tokens. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
When using MTP speculative decoding, the compile range is extended by (multiplier * max_num_seqs), but the assertion in _dummy_run only checked against max_num_batched_tokens, causing warmup to fail. This was introduced in commit af3162d (Unified Parallel Drafting #32887) which added MTP/Eagle to the compile range extension logic. Fix: Extend the assertion bound for speculative decoding configs only, mirroring the compile range extension logic in _set_compile_ranges. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
When using parallel speculative decoding, the compile range is extended by (num_speculative_tokens * max_num_seqs), but the assertion in _dummy_run only checked against max_num_batched_tokens, causing warmup to fail. This was introduced in commit af3162d (Unified Parallel Drafting #32887) which added MTP/Eagle to the compile range extension logic. Fix: Extend the assertion bound for parallel speculative decoding only. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
When using speculative decoding (MTP/Eagle/draft model), the compile range is extended by (multiplier * max_num_seqs), but the assertions in _dummy_run only checked against max_num_batched_tokens, causing warmup to fail. This was introduced in commit af3162d (Unified Parallel Drafting #32887) which added MTP/Eagle to the compile range extension logic. Fix: Extend both assertion bounds in _dummy_run for speculative decoding configs, mirroring the compile range extension logic in _set_compile_ranges. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
When using parallel drafting (MTP/Eagle with parallel_drafting=True), the compile range is extended by (num_speculative_tokens * max_num_seqs) to accommodate drafter batches, but the assertions in _dummy_run only checked against max_num_batched_tokens, causing warmup to fail. This was introduced in commit af3162d (Unified Parallel Drafting #32887) which added the compile range extension for parallel drafting. Fix: Extend both assertion bounds in _dummy_run for parallel drafting only, matching the compile range extension logic in _set_compile_ranges. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Fix an assertion error during model warmup when using MTP speculative decoding with parallel drafting. The issue occurred because the compile range was extended for the drafter, but the warmup sizes included this extended range, causing the target model's _dummy_run to be called with sizes exceeding max_num_batched_tokens. Root cause: PR #32887 added compile range extension for speculative decoding to warm up the drafter, but this caused the target model's _dummy_run assertion to fail. Fix approach: Instead of extending the compile range (which affects the target model), we now: 1. Keep the target model's compile range at max_num_batched_tokens 2. Warm up the drafter separately with its extended size in gpu_worker.py This properly separates the warmup concerns - the target model never sees batches larger than max_num_batched_tokens (the scheduler ensures this), while the drafter is warmed up with its extended batch sizes. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Fix an assertion error during model warmup when using MTP speculative decoding with parallel drafting. The issue occurred because the compile range is extended for speculative decoding to accommodate drafter batches, but the assertion in _dummy_run wasn't updated to match. Root cause: PR #32887 added compile range extension in _set_compile_ranges for speculative decoding. This causes warmup sizes to exceed max_num_batched_tokens, triggering the assertion in _dummy_run. Fix: Extend the assertion bound in _dummy_run to match the extended compile range when parallel drafting is enabled. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
):
    max_num_queries_for_spec = (
        1
        + (2 if speculative_config.parallel_drafting else 1)
Why is it 1+1+ speculative_config.num_speculative_tokens without parallel_drafting here? This looks like a big issue.
You are misunderstanding: it's 1 + 1 * num_spec_tokens, so it will be the same.
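For clarity, here is the query-count expression under discussion, assuming the truncated snippet continues with `* speculative_config.num_speculative_tokens` (an assumption; the surrounding diff lines are cut off):

```python
def max_num_queries_for_spec(num_speculative_tokens, parallel_drafting):
    # Assumed reading of the snippet: one base query, plus a
    # per-speculative-token multiplier of 2 with parallel drafting
    # and 1 without.
    return 1 + (2 if parallel_drafting else 1) * num_speculative_tokens

print(max_num_queries_for_spec(4, parallel_drafting=False))  # 1 + 1*4 = 5
print(max_num_queries_for_spec(4, parallel_drafting=True))   # 1 + 2*4 = 9
```

Read this way, the operator is a multiplication, not a second addition, which is the misreading the reply points out.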
Purpose
This PR implements a single input-preparation kernel for draft model support, and parallel drafting both with and without hidden states from the target model. As such, we now have support for AMD's PARD, which proposes parallel drafting for fine-tuned external draft models, and AWS's P-EAGLE, which implements parallel prediction for EAGLE3. Both of these are benchmarked as part of this PR effort.
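The core idea being unified can be shown with a toy sketch (not this PR's implementation; all names and the toy "models" are hypothetical): an autoregressive drafter runs one forward pass per draft token, while a PARD-style parallel drafter appends mask tokens and fills every draft position in a single pass.

```python
def autoregressive_draft(step, prefix, k):
    # Baseline drafter: k sequential forward passes, each consuming
    # the token produced by the previous one.
    tokens = list(prefix)
    drafted = []
    for _ in range(k):
        nxt = step(tokens)
        drafted.append(nxt)
        tokens.append(nxt)
    return drafted

def parallel_draft(step_all, prefix, k, mask_token=-1):
    # PARD-style drafter: append k mask tokens and predict all k draft
    # positions in one forward pass.
    padded = list(prefix) + [mask_token] * k
    return step_all(padded, k)

# Toy "models": the next token is always last-real-token + 1, so both
# drafting modes agree on this input.
def toy_step(tokens):
    return tokens[-1] + 1

def toy_step_all(padded, k):
    last_real = padded[-k - 1]
    return [last_real + i + 1 for i in range(k)]

print(autoregressive_draft(toy_step, [5], 3))  # [6, 7, 8]
print(parallel_draft(toy_step_all, [5], 3))    # [6, 7, 8]
```

The practical difference is latency: the parallel path trades k drafter forward passes for one, at the cost of fine-tuning the draft model to predict through mask tokens.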
Testing
E2E tests for parallel drafting and unit tests for the input preparation logic are all passing locally. Confirmed that E2E tests for draft models and EAGLE3 are also still passing locally.
Benchmarks
Benchmarks were conducted for AWS's 2-layer P-EAGLE on GPT-OSS 120B, with acceptance lengths (AL) calculated by averaging over each of the MT-Bench categories at 2048 max output tokens. The baseline is NVIDIA's EAGLE3 short-context. I also compare AMD's PARD Llama 3.2 1B for Llama 3.3 70B NVFP4, with the autoregressive drafter as a baseline. All benchmarks were run on 1xB200.
Best config for GPT-OSS at BS=1 is P-EAGLE with K=7, with ~560 output TPS, a speedup of 1.52x over baseline and 1.12x over EAGLE3 best config. At BS=8, P-EAGLE is optimal with K=3, a speedup of 1.34x over baseline and 1.07x over best EAGLE3.
Best config for Llama 3.3 70B-NVFP4 at BS=1 is PARD Llama-1B with K=11, with ~254 output TPS, a speedup of 3.10x over baseline and 1.61x over vanilla draft-model. At BS=8, PARD is optimal with K=7, a speedup of 2.87x over baseline and 1.61x over vanilla draft-model.
All data, with best-at-concurrency bolded for each model.