[Spec Decode] Unified Parallel Drafting #32887
benchislett merged 24 commits into vllm-project:main from
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Documentation preview: https://vllm--32887.org.readthedocs.build/en/32887/ |
Code Review
This pull request introduces a unified parallel drafting mechanism for speculative decoding, combining logic for EAGLE and other draft models. The changes are extensive, primarily refactoring the speculative decoding logic into a base proposer class and adding a new, complex Triton kernel for preparing inputs. While the overall refactoring appears sound, I've identified a potential critical issue in the new Triton kernel where a safeguard against out-of-bounds memory access is not being used, which could lead to memory corruption.
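For context on the masking concern: Triton kernels conventionally guard every `tl.load`/`tl.store` with a bounds mask so out-of-range lanes never touch memory. A minimal pure-Python sketch of that pattern (illustrative only, not the kernel in this PR; all names are hypothetical):

```python
def masked_gather(src, indices, bound, fill=0):
    # Emulates Triton's masked load, tl.load(ptr + idx, mask=idx < bound,
    # other=fill): lanes whose index falls outside the valid region get
    # the fill value instead of reading out-of-bounds memory.
    return [src[i] if 0 <= i < bound else fill for i in indices]

src = [10, 20, 30, 40]
# Index 7 is past the valid region; the mask replaces it with 0.
print(masked_gather(src, [0, 2, 7], bound=len(src)))  # [10, 30, 0]
```

Dropping the mask argument in the real kernel would be the memory-corruption hazard the review flags: the gather would dereference whatever lies past the buffer.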
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Hi @benchislett, thanks a lot for your great work! I tested the PARD integration in vLLM and compared its performance with the PARD repo example. Under the same configuration, the acceptance length is well aligned between the two runs (3.56 vs 3.50). Below are the full benchmark results and the exact script I used for vLLM testing.
Results
vLLM test script
# Start server
k=8
target=unsloth/Meta-Llama-3.1-8B-Instruct
draft=amd/PARD-Llama-3.2-1B
vllm serve $target \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--no-enable-prefix-caching \
--port 8811 \
--speculative-config '{"model": "'"$draft"'", "method": "draft_model", "num_speculative_tokens": '"$k"', "parallel_drafting": true}'
# Benchmark
MAX_CONCURRENCY=1
NUM_PROMPTS=80
vllm bench serve --port 8811 \
--temperature 0 \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path philschmid/mt-bench \
--num-prompts ${NUM_PROMPTS} \
--max-concurrency ${MAX_CONCURRENCY}
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
For future readers, here is a link to AWS's P-EAGLE arxiv paper:
When using MTP speculative decoding, the compile range is extended by (multiplier * max_num_seqs), but the assertion in _dummy_run only checked against max_num_batched_tokens, causing warmup to fail. This was introduced in commit af3162d (Unified Parallel Drafting #32887) which added MTP/Eagle to the compile range extension logic. Fix: Use the maximum compile range split point as the upper bound when available, instead of max_num_batched_tokens. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
When using MTP speculative decoding, the compile range is extended by (multiplier * max_num_seqs), but the assertion in _dummy_run only checked against max_num_batched_tokens, causing warmup to fail. This was introduced in commit af3162d (Unified Parallel Drafting #32887) which added MTP/Eagle to the compile range extension logic. Fix: Extend the assertion bound for speculative decoding configs only, mirroring the compile range extension logic in _set_compile_ranges. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
When using parallel speculative decoding, the compile range is extended by (num_speculative_tokens * max_num_seqs), but the assertion in _dummy_run only checked against max_num_batched_tokens, causing warmup to fail. This was introduced in commit af3162d (Unified Parallel Drafting #32887) which added MTP/Eagle to the compile range extension logic. Fix: Extend the assertion bound for parallel speculative decoding only. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
When using speculative decoding (MTP/Eagle/draft model), the compile range is extended by (multiplier * max_num_seqs), but the assertions in _dummy_run only checked against max_num_batched_tokens, causing warmup to fail. This was introduced in commit af3162d (Unified Parallel Drafting #32887) which added MTP/Eagle to the compile range extension logic. Fix: Extend both assertion bounds in _dummy_run for speculative decoding configs, mirroring the compile range extension logic in _set_compile_ranges. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
When using parallel drafting (MTP/Eagle with parallel_drafting=True), the compile range is extended by (num_speculative_tokens * max_num_seqs) to accommodate drafter batches, but the assertions in _dummy_run only checked against max_num_batched_tokens, causing warmup to fail. This was introduced in commit af3162d (Unified Parallel Drafting #32887) which added the compile range extension for parallel drafting. Fix: Extend both assertion bounds in _dummy_run for parallel drafting only, matching the compile range extension logic in _set_compile_ranges. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Fix an assertion error during model warmup when using MTP speculative decoding with parallel drafting. The issue occurred because the compile range was extended for the drafter, but the warmup sizes included this extended range, causing the target model's _dummy_run to be called with sizes exceeding max_num_batched_tokens. Root cause: PR #32887 added compile range extension for speculative decoding to warm up the drafter, but this caused the target model's _dummy_run assertion to fail. Fix approach: Instead of extending the compile range (which affects the target model), we now: 1. Keep the target model's compile range at max_num_batched_tokens 2. Warm up the drafter separately with its extended size in gpu_worker.py This properly separates the warmup concerns - the target model never sees batches larger than max_num_batched_tokens (the scheduler ensures this), while the drafter is warmed up with its extended batch sizes. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Fix an assertion error during model warmup when using MTP speculative decoding with parallel drafting. The issue occurred because the compile range is extended for speculative decoding to accommodate drafter batches, but the assertion in _dummy_run wasn't updated to match. Root cause: PR #32887 added compile range extension in _set_compile_ranges for speculative decoding. This causes warmup sizes to exceed max_num_batched_tokens, triggering the assertion in _dummy_run. Fix: Extend the assertion bound in _dummy_run to match the extended compile range when parallel drafting is enabled. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
):
    max_num_queries_for_spec = (
        1
        + (2 if speculative_config.parallel_drafting else 1)
Why is it 1+1+ speculative_config.num_speculative_tokens without parallel_drafting here? This looks like a big issue.
You are misunderstanding: it's 1 + 1 * num_spec_tokens, so it will be the same.
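For clarity, here is the query-count expression under discussion, assuming the truncated snippet continues with `* speculative_config.num_speculative_tokens` (an assumption; the surrounding diff lines are cut off):

```python
def max_num_queries_for_spec(num_speculative_tokens, parallel_drafting):
    # Assumed reading of the snippet: one base query, plus a
    # per-speculative-token multiplier of 2 with parallel drafting
    # and 1 without.
    return 1 + (2 if parallel_drafting else 1) * num_speculative_tokens

print(max_num_queries_for_spec(4, parallel_drafting=False))  # 1 + 1*4 = 5
print(max_num_queries_for_spec(4, parallel_drafting=True))   # 1 + 2*4 = 9
```

Read this way, the operator is a multiplication, not a second addition, which is the misreading the reply points out.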
Purpose
This PR implements a single input-preparation kernel for draft model support, and parallel drafting both with and without hidden states from the target model. As such, we now have support for AMD's PARD, which proposes parallel drafting for fine-tuned external draft models, and AWS's P-EAGLE, which implements parallel prediction for EAGLE3. Both of these are benchmarked as part of this PR effort.
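The core idea being unified can be shown with a toy sketch (not this PR's implementation; all names and the toy "models" are hypothetical): an autoregressive drafter runs one forward pass per draft token, while a PARD-style parallel drafter appends mask tokens and fills every draft position in a single pass.

```python
def autoregressive_draft(step, prefix, k):
    # Baseline drafter: k sequential forward passes, each consuming
    # the token produced by the previous one.
    tokens = list(prefix)
    drafted = []
    for _ in range(k):
        nxt = step(tokens)
        drafted.append(nxt)
        tokens.append(nxt)
    return drafted

def parallel_draft(step_all, prefix, k, mask_token=-1):
    # PARD-style drafter: append k mask tokens and predict all k draft
    # positions in one forward pass.
    padded = list(prefix) + [mask_token] * k
    return step_all(padded, k)

# Toy "models": the next token is always last-real-token + 1, so both
# drafting modes agree on this input.
def toy_step(tokens):
    return tokens[-1] + 1

def toy_step_all(padded, k):
    last_real = padded[-k - 1]
    return [last_real + i + 1 for i in range(k)]

print(autoregressive_draft(toy_step, [5], 3))  # [6, 7, 8]
print(parallel_draft(toy_step_all, [5], 3))    # [6, 7, 8]
```

The practical difference is latency: the parallel path trades k drafter forward passes for one, at the cost of fine-tuning the draft model to predict through mask tokens.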
Testing
E2E tests for parallel drafting and unit tests for the input preparation logic are all passing locally. Confirmed that E2E tests for draft models and EAGLE3 are also still passing locally.
Benchmarks
Benchmarks were conducted for AWS's 2-layer P-EAGLE on GPT-OSS 120B, with acceptance lengths (AL) calculated by averaging over each of the MT-Bench categories at 2048 max output tokens. The baseline is NVIDIA's EAGLE3 short-context. I also compare AMD's PARD Llama 3.2 1B for Llama 3.3 70B NVFP4, with the autoregressive drafter as a baseline. All benchmarks were run on 1xB200.
Best config for GPT-OSS at BS=1 is P-EAGLE with K=7, with ~560 output TPS, a speedup of 1.52x over baseline and 1.12x over EAGLE3 best config. At BS=8, P-EAGLE is optimal with K=3, a speedup of 1.34x over baseline and 1.07x over best EAGLE3.
Best config for Llama 3.3 70B-NVFP4 at BS=1 is PARD Llama-1B with K=11, with ~254 output TPS, a speedup of 3.10x over baseline and 1.61x over vanilla draft-model. At BS=8, PARD is optimal with K=7, a speedup of 2.87x over baseline and 1.61x over vanilla draft-model.
All data, with best-at-concurrency bolded for each model.