feat: add parallel token drafting for EAGLE #32628
hai-meh-cs wants to merge 5 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--32628.org.readthedocs.build/en/32628/
Code Review
This pull request introduces Parallel Token Decoding (PTD) for EAGLE speculative decoding, which is a significant performance enhancement. The implementation is well-structured, with a new PtdEagleProposer and a Triton kernel to handle the parallel draft token generation. The changes are consistently integrated across the configuration, model files, and worker. The addition of a new e2e test for PTD correctness is also a good practice, even if it's currently skipped.
My main feedback is regarding the handling of sequences that exceed max_model_len during drafting. The current implementation wraps around the position embeddings, which could lead to poor draft quality and reduced performance. I've left a specific comment with a suggestion on how to address this.
vllm/v1/spec_decode/ptd_eagle.py
Outdated
```python
out_pos = tl.where(out_pos >= max_model_len, 0, out_pos)
tl.store(out_positions_ptr + out_idx, out_pos)
...
if is_verified:
    slot = tl.load(original_slot_mapping_ptr + in_start + local_idx)
else:
    last_pos = tl.load(target_positions_ptr + in_start + last_idx)
    draft_pos = last_pos + (local_idx - last_idx)
    draft_pos = tl.where(draft_pos >= max_model_len, 0, draft_pos)
```
The position wrapping logic tl.where(out_pos >= max_model_len, 0, out_pos) for draft tokens that exceed max_model_len could lead to poor draft quality and consequently low acceptance rates for long sequences. When a sequence's length plus the number of speculative tokens (K) exceeds max_model_len, the positions for draft tokens are reset to 0. This provides incorrect positional information to the model, especially for models using positional embeddings like RoPE.
While this prevents out-of-bounds errors, it results in generating low-quality draft tokens that are unlikely to be accepted. This is computationally wasteful.
A better approach would be to cap the number of draft tokens for each sequence to ensure the total length does not exceed max_model_len. This could be done within the propose method before preparing inputs for the kernel.
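A minimal sketch of that capping idea, in pure Python (the function and variable names here are illustrative, not part of the PR; `seq_lens` stands in for the per-request lengths the `propose` method already tracks):

```python
def cap_draft_lens(seq_lens, num_speculative_tokens, max_model_len):
    """Limit each request's draft count so that sequence length plus
    drafts never exceeds max_model_len, instead of wrapping the draft
    positions back to 0."""
    return [
        max(0, min(num_speculative_tokens, max_model_len - seq_len))
        for seq_len in seq_lens
    ]

# Requests near the context limit get fewer (or zero) draft slots.
print(cap_draft_lens([100, 4094, 4096], 4, 4096))  # [4, 2, 0]
```

Requests far from the limit keep the full `K`; requests at the limit simply draft nothing, which avoids wasting verification compute on positions the model cannot encode.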
haven't confirmed this case, but can see how it would play out. when sequence length approaches max_model_len, draft positions that exceed the limit get wrapped, which would result in low-quality drafts that get rejected. correctness should be preserved but compute would be wasted. will address in a follow-up by capping draft count near max_model_len
We might have some unit tests for this from previous PRs. Worth taking a look to see if this would be covered by our existing test cases
Hi @hai-meh-cs, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
vllm/v1/spec_decode/ptd_eagle.py
Outdated
```python
    in_qsl_cpu[1:batch_size + 1] - in_qsl_cpu[:batch_size]
)
out_qsl_cpu = torch.zeros(batch_size + 1, dtype=torch.int32)
out_qsl_cpu[1:] = torch.cumsum(accepted_lengths_cpu + draft_len, dim=0)
CPU/GPU mismatch in query_start_loc calculation
High Severity
The GPU out_qsl is computed using last_token_indices to determine accepted lengths, but out_qsl_cpu ignores last_token_indices and instead uses in_qsl_cpu[1:batch_size+1] - in_qsl_cpu[:batch_size] (the original query lengths). When last_token_indices is explicitly provided (in padded batch mode with rejected tokens), these tensors will have inconsistent values. This mismatch in query_start_loc vs query_start_loc_cpu can cause incorrect attention metadata construction, particularly in backends like FlashInfer that use both CPU and GPU tensors.
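The invariant being flagged can be illustrated with plain-Python stand-ins for the CPU/GPU tensors (hypothetical names, not the PR's code): both copies of the output `query_start_loc` should be cumulative sums over the same per-request `accepted + draft_len` counts, so deriving the CPU copy from the original query lengths diverges whenever `last_token_indices` shortens a request.

```python
from itertools import accumulate

def out_query_start_loc(per_req_lengths, draft_len):
    """Offsets of each request's output tokens: cumulative sum of
    (tokens kept + draft tokens), with a leading 0."""
    return [0] + list(accumulate(n + draft_len for n in per_req_lengths))

orig_query_lens = [4, 4]   # padded batch: 4 input tokens per request
accepted_lengths = [4, 2]  # request 1 had 2 tokens rejected
draft_len = 3

gpu_qsl = out_query_start_loc(accepted_lengths, draft_len)
cpu_qsl = out_query_start_loc(orig_query_lens, draft_len)  # buggy variant
print(gpu_qsl)  # [0, 7, 12]
print(cpu_qsl)  # [0, 7, 14] -- diverges once tokens are rejected
```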
thanks cursor bot for flagging! will take a look at the cpu/gpu query_start_loc calculation to verify whether there's an actual mismatch and what the impact would be
vllm/v1/spec_decode/ptd_eagle.py
Outdated
```python
block_offset = draft_pos % block_size
block_id = tl.load(block_table_ptr + req_idx * max_blocks + block_num)
slot = block_id * block_size + block_offset
tl.store(out_slot_mapping_ptr + out_idx, slot)
```
Missing PADDING_SLOT_ID for exceeded max_model_len positions
Medium Severity
When draft positions exceed max_model_len, the Triton kernel clamps draft_pos to 0 and computes a slot based on block 0, instead of using PADDING_SLOT_ID = -1. The base EagleProposer uses PADDING_SLOT_ID to prevent KV cache corruption for out-of-bounds positions. Without this, draft tokens with positions beyond max_model_len could incorrectly update the KV cache at block 0.
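A hedged sketch of the suggested fix, mirroring the kernel's slot computation in plain Python (the flat `block_table` list and the parameter names are stand-ins for the kernel arguments):

```python
PADDING_SLOT_ID = -1

def draft_slot(draft_pos, req_idx, block_table, block_size,
               max_blocks, max_model_len):
    """Map a draft position to a KV-cache slot, emitting the padding
    slot for overflow positions instead of aliasing block 0."""
    if draft_pos >= max_model_len:
        return PADDING_SLOT_ID  # skip the KV-cache write entirely
    block_num = draft_pos // block_size
    block_id = block_table[req_idx * max_blocks + block_num]
    return block_id * block_size + draft_pos % block_size

block_table = [7, 9]  # one request mapped to two physical blocks
print(draft_slot(17, 0, block_table, 16, 2, 32))  # 9*16 + 1 = 145
print(draft_slot(33, 0, block_table, 16, 2, 32))  # -1 (overflow)
```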
related to the max_model_len edge case above. will look into using PADDING_SLOT_ID for positions that exceed max_model_len to avoid unnecessary KV cache writes
vllm/v1/spec_decode/ptd_eagle.py
Outdated
```python
total_out = (
    common_attn_metadata.num_actual_tokens + batch_size * draft_len
)
```
Incorrect total_out when rejected tokens present
High Severity
The total_out calculation uses num_actual_tokens (all input tokens including rejected ones), but out_qsl is computed using accepted_lengths (only accepted tokens). When tokens are rejected, total_out > out_qsl[-1]. The kernel grid uses total_out, so for positions beyond out_qsl[-1], the request lookup loop defaults to request 0, computing garbage slot mappings. These garbage values are included in slot_mapping and num_actual_tokens, potentially corrupting the KV cache during attention.
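The divergence can be sketched as follows (illustrative Python, not the PR's code; the point is that the kernel grid size should come from the same accepted-length bookkeeping that produces `out_qsl`):

```python
from itertools import accumulate

def total_out_tokens(accepted_lengths, draft_len):
    """Grid size consistent with out_qsl: the last cumulative offset
    over (accepted + draft) tokens per request."""
    out_qsl = [0] + list(accumulate(a + draft_len for a in accepted_lengths))
    return out_qsl[-1]

num_actual_tokens = 8       # includes 2 rejected tokens
accepted_lengths = [4, 2]
draft_len = 3

consistent = total_out_tokens(accepted_lengths, draft_len)     # 12
buggy = num_actual_tokens + len(accepted_lengths) * draft_len  # 14
print(consistent, buggy)
```

With the buggy grid size, the two extra positions (12 and 13) fall past `out_qsl[-1]`, which is where the request-lookup loop falls through to request 0 and produces garbage slots.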
will investigate. need to trace through the rejected tokens path to verify whether total_out and output_query_start_loc can actually diverge
```python
self.K = self.num_speculative_tokens
self.slot_buffer = torch.zeros(
    self.max_num_tokens, dtype=torch.int64, device=device
)
```
Buffer overflow when K > 2 in PTD
High Severity
PtdEagleProposer inherits buffer sizes from the parent class where max_num_tokens = max_num_batched_tokens + max_batch_size. However, PTD processes total_out = num_actual_tokens + batch_size * (K-1) positions in a single pass. When K > 2, total_out can exceed max_num_tokens. The Triton kernel writes to input_ids, positions, hidden_states, and slot_buffer up to position total_out - 1, causing out-of-bounds memory writes that could corrupt memory or crash the system.
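The sizing argument can be written out numerically (a sketch with hypothetical helper names; the formulas and the 8213-token example mirror the ones quoted in the later bugfix commit):

```python
def sequential_buffer_tokens(max_num_batched_tokens, max_batch_size):
    # Sequential EAGLE: at most one extra token per request per step.
    return max_num_batched_tokens + max_batch_size

def parallel_buffer_tokens(max_num_batched_tokens, max_batch_size, k):
    # PTD: up to k mask tokens appended per request in a single pass.
    return max_num_batched_tokens + max_batch_size * k

# With an 8192-token batch, 7 requests, and K=3, the parallel pass can
# touch 8213 positions, past the inherited 8199-token allocation.
print(sequential_buffer_tokens(8192, 7))   # 8199
print(parallel_buffer_tokens(8192, 7, 3))  # 8213
```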
don't believe this is correct. we've tested with K = 3 ... 8 without buffer issues. the parent class buffer sizing should account for speculative tokens. will double check the allocation logic to confirm
I think that this does need to account for the new tokens, since we can add up to num_spec_tokens * num_reqs new elements into the batch. I fixed this here on the other branch:
```diff
@@ -104,13 +105,16 @@ def main(args):
     else:
         prompts = get_custom_mm_prompts(args.num_prompts)

-    if args.method == "eagle" or args.method == "eagle3":
+    if args.method in ("eagle", "eagle3", "eagle-ptd", "eagle3-ptd"):
```
It might be simpler to add something like "parallel_draft" to the speculative config instead of duplicating all the methods that support this technique, since it's largely independent of the architecture used to draft. We will want to support PARD (parallel drafting for external draft models) at some point, which also shares most of this implementation.
done. replaced the method names with a `parallel_draft: bool` config flag
tests/v1/e2e/test_spec_decode.py
Outdated
```python
(
    "eagle3-ptd",
    "openai/gpt-oss-120b",
    "PATH_TO_PTD_MODEL",  # Replace with actual PTD model path
Are you planning to release a model alongside this PR? If not, could you upload a dummy checkpoint or find a compatible EAGLE3 checkpoint to use as a placeholder so that the tests are able to run?
We are trying to release a checkpoint that we are currently re-training with limited datasets. We will upload a dummy checkpoint for now.
this is very exciting!
Does your PTD model have an equivalent number of params as the "vanilla EAGLE3" baseline? Or does it have a bunch of extra layers?
Add PTD proposer that generates K draft tokens in a single forward pass using mask tokens, enabling more efficient speculative decoding.
- Add PtdEagleProposer with Triton kernel for input preparation
- Support eagle-ptd and eagle3-ptd speculative methods
- Add test and offline inference example for PTD
- Replace eagle-ptd/eagle3-ptd methods with parallel_draft bool flag
- Reuse parent class load_model() in PtdEagleProposer
- Load mask_hidden via normal weight loading in model
- Improve naming and add Triton kernel documentation
vllm/v1/spec_decode/ptd_eagle.py
Outdated
```python
if self.method == "eagle3" and self.eagle3_use_aux_hidden_state:
    expected_aux_size = self.hidden_size * 3
    if self.mask_hidden.shape[-1] == expected_aux_size:
        self.mask_hidden = self.model.combine_hidden_states(self.mask_hidden)
```
This does not seem to be necessary. It seems much more effective to do this projection when preparing the model, or to omit it entirely during training.
I've created a new PR, https://github.com/hai-meh-cs/vllm/pull/1, which targets a new bugfix branch to be merged into your feature branch.
That bug has been fixed on the draft-model side and likely just needs to be propagated into this branch as well.
It's a slightly different issue, as parallel drafting needs more than one extra token. As a result, I pulled in the bug fix you mentioned and iterated on it. The new logic is here: https://github.com/hai-meh-cs/vllm/pull/1/files
[Bugfix][ptd_eagle] Fix buffer overflow in PTD EAGLE speculative decoding

Parallel draft methods (PTD EAGLE) generate K draft tokens in a single forward pass using mask tokens, which requires larger buffers than sequential drafting. The inherited buffer allocation formula was insufficient, causing crashes under load.

Bug manifestation:
- Sequential EAGLE: needs max_num_batched_tokens + max_num_seqs tokens
- Parallel draft: needs max_num_batched_tokens + max_num_seqs * num_speculative_tokens tokens
- Error: "AssertionError: Shape: 8213 out of considered ranges: [(1, 8192)]"

This fix addresses these critical issues:
1. Buffer Allocation (ptd_eagle.py):
   - Corrects the max_num_tokens formula for the parallel draft generation pattern
   - Reallocates all buffers (input_ids, positions, hidden_states, slot_buffer)
   - Adds ~6MB memory overhead (negligible for the 3-4x speedup)
2. Compilation Ranges (vllm.py):
   - Extends compile_ranges_split_points when parallel_draft=True
   - Ensures CUDA graph compilation handles the expanded token counts
   - Adds informative logging for parallel draft detection

The bug was caught during benchmarking with 100 prompts (1600 input, 600 output tokens), where the batch reached 8192 tokens + 7 requests * 3 masks = 8213 tokens, exceeding the compilation range of 8192.

Tested-by: Load testing with max batch size configurations
Signed-off-by: Li Zhang <lzhanga@amazon.com>

Simplify updates to eagle files
Signed-off-by: Li Zhang <lzhanga@amazon.com>

Minor format updates
- Use PADDING_SLOT_ID (-1) for overflow draft positions to avoid KV cache writes
- Pre-project mask_hidden through fc layer for eagle3 architecture
- Add comprehensive unit tests for PTD kernel and proposer

Signed-off-by: Jaime Campos Salas <jaime.campos.salas@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Closing now that #32887 is merged. |
Purpose
PTD (Parallel Token Decoding) generates K draft tokens in a single forward pass instead of K sequential passes, reducing draft overhead for EAGLE speculative decoding.
```shell
vllm serve openai/gpt-oss-120b \
    --speculative-config '{"model": "<ptd-draft>", "method": "eagle3-ptd", "num_speculative_tokens": 4}' \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --no-enable-prefix-caching \
    --async-scheduling
```

Performance
TPOT: Constant vs Linear Growth
Throughput Comparison (C=1)
Speedup vs Vanilla EAGLE3
Status
Output throughput improvement at comparable acceptance rates (within ±2%):
TPOT remains constant at ~2.7-2.9ms regardless of K, while vanilla increases linearly from 3.15ms (K=3) to 4.33ms (K=8).
TTFT is comparable: PTD 33-38ms vs Vanilla 34-44ms.
Best Throughput Comparison
Vanilla peaks at K=3; higher K incurs sequential overhead. PTD EAGLE3 benefits from K=4-6 due to parallel drafting.
Speedup by K Value
TPOT Analysis
PTD EAGLE3 TPOT is constant; vanilla grows linearly with K.
Implications for K Selection
Selecting K involves balancing two factors: higher K can yield more accepted tokens per verification, but also increases draft cost when acceptance is low. With vanilla EAGLE, generating K tokens requires K sequential forward passes, so the optimal K depends on expected acceptance rates.
PTD generates all K tokens in a single forward pass. For the K values tested (3-8), the additional tokens processed per forward have minimal impact on draft latency, as shown in the TPOT data above. This sublinear scaling reduces the cost of higher K settings relative to vanilla EAGLE.
This has a practical benefit: K selection becomes less sensitive to workload characteristics. Acceptance rates vary by prompt type. The benchmark data shows HumanEval averaging ~50% acceptance while MT-Bench averages ~35%. With PTD, choosing a higher K captures more tokens on high-acceptance prompts without proportionally increasing cost on lower-acceptance ones.
The benchmark results reflect this: vanilla EAGLE performs best at K=3, while PTD peaks at K=4-6 depending on workload.
PTD EAGLE3 Results (MT-Bench)
Per-Position Acceptance
PTD EAGLE3 Results (HumanEval)
Per-Position Acceptance
Vanilla EAGLE3 Results (MT-Bench)
Per-Position Acceptance
Vanilla EAGLE3 Results (HumanEval)
Per-Position Acceptance
Benchmark Configuration
Hardware: P5e (H200), TP=1
Draft Models:
nvidia/gpt-oss-120b-Eagle3-short-context

Server:

```shell
vllm serve openai/gpt-oss-120b \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --speculative-config '{"model": "<draft>", "method": "<method>", "num_speculative_tokens": <K>}' \
    --no-enable-prefix-caching \
    --async-scheduling
```

Benchmark (MT-Bench):
Benchmark (HumanEval):
Changes
- vllm/v1/spec_decode/ptd_eagle.py
- vllm/config/speculative.py: add eagle-ptd, eagle3-ptd to EagleModelTypes
- vllm/v1/worker/gpu_model_runner.py
- vllm/model_executor/models/llama_eagle3.py: load mask_hidden during weight loading
- vllm/transformers_utils/configs/eagle.py
- examples/offline_inference/spec_decode.py
- tests/v1/e2e/test_spec_decode.py: add test_ptd_correctness

Test Plan