[V1][Spec Decode] Add Dynamic SD #32374
ekagra-ranjan wants to merge 45 commits into vllm-project:main from
Conversation
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Documentation preview: https://vllm--32374.org.readthedocs.build/en/32374/

This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces Dynamic Speculative Decoding (DSD), a significant performance enhancement for vLLM. The implementation involves profiling to gather runtime statistics and then using those to dynamically adjust the number of speculative tokens. The changes are extensive, adding new scripts for configuration generation, profiling, and a manager for DSD logic. While the overall approach is sound, I've identified several critical issues, including potential server crashes due to division by zero, command injection vulnerabilities in the profiling scripts, and other high-severity bugs that could lead to incorrect behavior or system instability. These issues should be addressed to ensure the feature is robust and secure.
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @ekagra-ranjan, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Thank you @LucasWilkinson @benchislett @hmellor for having a look! I have addressed the comments so far and updated the PR with async scheduling and padded drafter compatibility. The trend is similar to that previously reported with the sync scheduler. I have added a summary of the async scheduler changes in the PR description.
@ekagra-ranjan do we have a solution for CUDA graphs? Are they being employed for all values of K? If so, how do you explain the slight discrepancy at low concurrencies when using dynamic SD?
I did an early pass, just didn't have time to respond yet, but one thing I am wondering is: it might be a lot cleaner to have the scheduler determine the number of draft tokens to generate (i.e. own
Implement dynamic speculative decoding where the scheduler computes the optimal number of draft tokens (K) based on batch size and acceptance rates. This inverts the original PR's paradigm where the model runner decided K.

Key changes:
- DynamicSpeculativeConfig: Config holding profiled ITL stats per (BS, K)
- DynamicSpeculativeDecodingManager: Computes optimal K using goodput = AL/ITL
- SchedulerOutput.num_spec_tokens_to_schedule: Scheduler tells model runner how many tokens to speculate
- EagleProposer.propose() accepts a num_speculative_tokens parameter
- Stats tracking in manager for online acceptance rate updates

This approach is cleaner because:
- Scheduler already knows batch size and tracks acceptance stats
- No round-trip needed (scheduler doesn't wait for ModelRunnerOutput)
- Placeholder accounting is simpler: scheduler knows K at schedule time

Based on PR vllm-project#32374 by ekagra-ranjan.

Co-authored-by: Ekagra Ranjan <ekagra.ranjan@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Vibe coded this here: #32374 (comment) to demonstrate the proposal
@LucasWilkinson +1, this seems like the right high-level design to me. I don't see any downsides. @ekagra-ranjan what do you think? Do you foresee any challenges?
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Hi @ekagra-ranjan, the pre-commit checks have failed. Please run the pre-commit steps above, then commit the changes and push to your branch.
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Hi @ekagra-ranjan, the pre-commit checks have failed. Please run the pre-commit steps above, then commit the changes and push to your branch.
The DSD manager is now owned by the scheduler and the PR has been updated to follow the high-level design suggested by @LucasWilkinson. We now also don't pad to max K and operate with the optimal K when dealing with draft_ids in the model runner. The gap between Dynamic Eagle and vanilla has narrowed at BS 256 compared to the previous case.

Summary of the changes:
- padded drafter
- async scheduling
Re this: @benchislett FCG is very likely not working with changing values of K. I don't have a solution for this yet, but I'll read the vLLM code base around FCG in more detail next. @LucasWilkinson said that his previous work on FCG can help here.

This pull request has merge conflicts that must be resolved before it can be merged.


Why is Dynamic SD needed?
SD methods need to verify K draft tokens for each sequence during decoding. As the batch size (BS) increases, the effective batch size becomes BS * K, which increases the compute requirement during verification. For example, at BS = 128 and K = 4, the target model verifies at an effective batch size of 512. Once BS * K crosses a critical batch size, SD starts to hurt TPOT. DSD helps by tuning K down to an optimal value so that we continue to reap the benefits of SD.
Use cases
What this PR does
Addresses #4565
V0 had Milestone 0; V1 didn't have any form of Dynamic SD.
This PR implements something between Milestones 2 and 3 of Dynamic SD (DSD), where we dynamically determine the proposed length for speculative decoding using runtime information, such as batch size and position-level acceptance rate, in conjunction with profiled parameters, like the token acceptance rate (for cold start) and the comparative costs of running the draft versus the target model. This approach allows us to adjust the proposed length in real time, optimizing performance based on current system conditions.
Before inference happens, the approach profiles on a representative dataset (similar to how the optimal K is selected for SD without DSD, by iterating on a representative dataset):
During inference runtime, the optimal K is found using:

- the acceptance rate (AR) measured so far, once `warmup_steps` steps have elapsed. Till `warmup_steps`, it uses the AR from the offline profiling on a representative dataset.

This balances the cold-start problem and allows the system to adapt to the running requests. There are many ways to extend this strategy, like resetting the AR after some steps, but those are left for future work. The purpose of this PR is to have at least something working in vLLM.
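A minimal sketch of the warmup logic above, assuming hypothetical names (`ARTracker` and its fields are illustrative; only `warmup_steps` and the offline-profiled AR come from the PR description):

```python
# Illustrative sketch, not the PR's actual code: which acceptance rate (AR)
# the DSD manager could use at a given step.
class ARTracker:
    def __init__(self, offline_ar: float, warmup_steps: int):
        self.offline_ar = offline_ar      # AR from offline profiling (cold start)
        self.warmup_steps = warmup_steps  # steps to wait before trusting runtime AR
        self.steps = 0
        self.num_accepted = 0
        self.num_drafted = 0

    def update(self, accepted: int, drafted: int) -> None:
        self.steps += 1
        self.num_accepted += accepted
        self.num_drafted += drafted

    def current_ar(self) -> float:
        # Till warmup_steps, fall back to the AR from offline profiling.
        if self.steps < self.warmup_steps or self.num_drafted == 0:
            return self.offline_ar
        return self.num_accepted / self.num_drafted
```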
The PR computes goodput similarly to TurboSpec. However, the formula has been changed to make it simpler and easier to extend to future models. For a given BS and K:

goodput = AL / ITL

where AL (accepted length) is a function of K and ITL is a function of K and BS.

TurboSpec, on the other hand, profiles the draft and target models separately and builds a regression model, as a function of model config, KV cache size, and batch size, to find goodput. This PR follows a simplified approach where the ITL (inter-token latency) of the SD model, i.e., target + draft, is directly measured across batch sizes, which encapsulates the model config. This makes the setup easier to adapt when the model architecture changes (like SWA) or a new change comes into the picture in the future that would make the equation more complicated. The setup profiles a given set of batch sizes (BS) and numbers of draft tokens (K) and linearly interpolates between neighboring values for each BS and K between the min and max values of BS and K. While simple, it works effectively as shown in the results.
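A minimal sketch of the goodput-based K selection under the stated formula. For brevity it interpolates the ITL over BS only and evaluates the profiled K values directly; all names (`itl_table`, `interp_itl`, ...) are illustrative, not the PR's actual API:

```python
import bisect

# Illustrative sketch, not the PR's actual code.

def expected_accepted_length(k: int, ar: float) -> float:
    # Expected tokens per verification step with per-position acceptance
    # rate `ar`: the bonus token plus ar^1 + ... + ar^k.
    return sum(ar**i for i in range(k + 1))

def interp_itl(itl_table: dict[tuple[int, int], float],
               profiled_bs: list[int], bs: int, k: int) -> float:
    # Linearly interpolate ITL between the two profiled batch sizes
    # bracketing `bs`, after clamping `bs` to the profiled range.
    bs = min(max(bs, profiled_bs[0]), profiled_bs[-1])
    i = bisect.bisect_left(profiled_bs, bs)
    if profiled_bs[i] == bs:
        return itl_table[(bs, k)]
    lo, hi = profiled_bs[i - 1], profiled_bs[i]
    frac = (bs - lo) / (hi - lo)
    return (1 - frac) * itl_table[(lo, k)] + frac * itl_table[(hi, k)]

def optimal_k(itl_table: dict[tuple[int, int], float],
              profiled_bs: list[int], profiled_ks: list[int],
              bs: int, ar: float) -> int:
    # goodput = AL / ITL; pick the K that maximizes it at this batch size.
    return max(profiled_ks,
               key=lambda k: expected_accepted_length(k, ar)
               / interp_itl(itl_table, profiled_bs, bs, k))
```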
Results

Offline profiled on MTBench and tested on MTBench

(results table)
As we can see,
Offline profiled on MTBench and tested on InstructCoder

(results table)
Here, "Dynamic EAGLE" is not using runtime AL at all. As we can see adding runtime AL to goodput calculation after sometime give some minor improvement here so for this dataset MTBench numbers are well transferrable to InstrucrCoder but the runtime AL connection would help in adapting more to current workload.
Cmds
Generate DSD Config
Example of `dynamic_speculative_config.json` generated
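The generated file itself isn't reproduced here; purely as a hypothetical illustration (field names are guesses, not the PR's schema), a config of this kind could hold the profiled ITL grid plus the offline AR:

```python
# Hypothetical shape of the generated config; every key below is an
# assumption for illustration only.
example_config = {
    "profiled_batch_sizes": [1, 8, 32, 128, 256],
    "profiled_num_spec_tokens": [1, 2, 3, 5],
    # ITL in ms per (batch_size, K), measured for target + draft together.
    "itl_ms": {"1,1": 6.2, "1,5": 7.9, "256,1": 11.4, "256,5": 30.1},
    "offline_acceptance_rate": 0.78,
}
```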
Benchmark
We chose 20 * MAX_CONCURRENCY as the number of prompts so that each setting has at least 20 batches. Without this, since MTBench only has 80 samples, MAX_CONCURRENCY=1 would have 80 batches while MAX_CONCURRENCY=128 would have only 1 batch.
File changes:

- `vllm/v1/spec_decode/dynamic/generate_config.py` is the master file which schedules different scripts and produces the config used by DSD at runtime. It has different stages:
  - `vllm/v1/spec_decode/offline.py` is used for it. This is `offline_inference/spec_decode.py` but moved to `vllm/` so that it can be imported here. This offline script is also used in CI tests, so it is an important file.
  - `vllm bench sweep`
- `DynamicSpeculativeConfig` in `vllm/config/speculative.py` holds the config values during DSD profiling. It also has the path to the config values.
- `vllm/v1/spec_decode/dynamic/manager.py` is the Dynamic SD Manager, which reads the ITL from the `DynamicSpeculativeConfig` generated above, computes the optimal K for each BS by interpolating across the K and BS values from profiling, and then provides it to the SD method during proposal (see the sketch after this list).
- `vllm/v1/worker/gpu_model_runner.py` initializes the DSD Manager and provides the optimal K for the given BS during inference to the respective SD method.
- `spec_decoding_stats_all` in the scheduler collects the stats and is used in `dynamic/manager.py` to compute the AR; the updated values are used after a certain number of `warmup_steps`.
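To make the flow concrete, a hypothetical sketch of how the pieces above could fit together each step (method names like `get_optimal_k` are illustrative, not the PR's API):

```python
# Hypothetical glue code; all names are illustrative.
def propose_with_dynamic_k(dsd_manager, drafter, batch_size, **proposer_kwargs):
    # The DSD manager interpolates the profiled ITL grid and maximizes AL/ITL.
    k = dsd_manager.get_optimal_k(batch_size)
    # The proposer is told how many tokens to speculate this step.
    return drafter.propose(num_speculative_tokens=k, **proposer_kwargs)
```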
After async scheduling and padded drafter compatibility

Similar to the synchronous scheduling
File changes for async and padded drafter
Old approach
### `vllm/v1/core/sched/async_scheduler.py`

**Problem**: With async scheduling, when dynamic SD changes the optimal K (e.g., from 5 to 3), there's a pipeline latency issue: the scheduler has already committed accounting (`num_computed_tokens`, `num_output_placeholders`) for the in-flight batch using the old K.

**Solution**:

- `_pending_optimal_k: int | None`: stores the optimal K from model output, deferred until the next `schedule()` call.
- `_in_flight_decode_req_k: dict[str, int]`: maps `req_id` -> committed spec token count for decode requests in the most recently dispatched batch. Used to know exactly which requests need accounting correction and by how much.

New method `_apply_pending_dynamic_sd_update()`: called at the start of `schedule()`, it applies the deferred K update (see the sketch at the end of this section):

- Resizes `_spec_token_placeholders` to the new K length (controls how many spec positions the scheduler reserves for future batches, reducing KV block waste).
- For each request in `_in_flight_decode_req_k`, computes `diff = committed_k - optimal_k`. If `diff > 0` (K decreased), subtracts `diff` from `request.num_output_placeholders` and `request.num_computed_tokens`. If `diff <= 0` (K increased), just updates `request.spec_token_ids` for the next scheduling step (can't retroactively add tokens to an in-flight batch).

Override `schedule()`: calls `_apply_pending_dynamic_sd_update()` then delegates to `super().schedule()`.

Modified `_update_after_schedule()`: resets and populates `_in_flight_decode_req_k` with `req_id -> cur_num_spec_tokens` for each non-prefill decode request that was just committed with spec tokens > 0.

### `vllm/v1/worker/gpu_model_runner.py`

**Problem**: the model runner still processes (and rejects) zero-padded speculative tokens beyond the optimal K, wasting compute. The `SchedulerOutput` seen by the model runner still contains the old (larger) K from when the batch was scheduled.
**Solution**:

- New method `_trim_spec_tokens_for_dynamic_sd(scheduler_output)`: trims `scheduled_spec_decode_tokens` in place to match `self._optimal_num_speculative_tokens` for each request where `scheduled_k > optimal_k`.
- Modified `_update_states()`: inserted a call to `_trim_spec_tokens_for_dynamic_sd(scheduler_output)` before the ngram_gpu handling block, conditioned on `_optimal_num_speculative_tokens is not None` and `use_async_scheduling` and `scheduled_spec_tokens`. This ordering ensures `original_num_spec_per_req` (saved for ngram_gpu's `prev_num_draft_len` restoration) is based on the dynamically trimmed K rather than the over-allocated K.
- Modified `take_draft_token_ids()`: when dynamic SD reduced K below `num_spec_tokens`, truncates each request's draft token list to K entries (the GPU tensor is zero-padded to `num_spec_tokens` for scatter indexing, but the scheduler should only see real draft tokens).
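A minimal sketch of the deferred-K accounting correction described for `_apply_pending_dynamic_sd_update()` above, rewritten as a free function with a stand-in `Request` type; names and types are assumptions, not the PR's exact code:

```python
from dataclasses import dataclass

@dataclass
class Request:  # stand-in for vLLM's request object
    num_output_placeholders: int
    num_computed_tokens: int

def apply_pending_dynamic_sd_update(
    optimal_k: int,
    spec_token_placeholders: list[int],
    in_flight_decode_req_k: dict[str, int],
    requests: dict[str, Request],
) -> list[int]:
    # Shrink the placeholder list so future batches reserve only
    # `optimal_k` spec positions (reduces KV block waste).
    spec_token_placeholders = spec_token_placeholders[:optimal_k]
    for req_id, committed_k in in_flight_decode_req_k.items():
        diff = committed_k - optimal_k
        if diff > 0:
            # K decreased: roll back accounting for spec positions the
            # in-flight batch committed but that will not be verified.
            req = requests[req_id]
            req.num_output_placeholders -= diff
            req.num_computed_tokens -= diff
        # K increased (diff <= 0): nothing to roll back; the next
        # schedule() call sees the longer placeholder list instead.
    return spec_token_placeholders
```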
New approach

padded drafter

- The drafter records `prev_num_spec_tokens` during `_copy_draft_token_ids_to_cpu()` so that the model runner at step N+1 can correctly index the `draft_token_ids`, where `prev_num_spec_tokens` (which changes) is used as the stride instead of `num_spec_tokens` (which is fixed).

async scheduling

- `scheduler.py` at step N sets the `num_spec_tokens_to_schedule` to send to the model runner at step N.
- `async_scheduler.py` updates the spec token placeholder in `_update_after_schedule()` at step N so that the scheduler at step N+1 can account for the new K spec tokens to send for verification to the engine. The `_spec_token_placeholders` gets saved into `request.spec_token_ids` in `_update_after_schedule()` of the async scheduler at step N, which is then used to create `scheduled_spec_decode_tokens`, which gets consumed as `draft_len` in `_prepare_input_ids()`.

So `prev_num_spec_tokens` decides how many draft token ids were drafted at step N, and `draft_len` decides how many of them will be verified at step N+1. `draft_len <= prev_num_spec_tokens`, since `draft_len` comes from the token budget available in this forward pass.
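A minimal sketch of that strided indexing, with assumed names and a flat zero-padded buffer (one request per stride); not the PR's actual code:

```python
import torch

def gather_drafts(draft_token_ids: torch.Tensor,  # flat, zero-padded buffer
                  prev_num_spec_tokens: int,  # stride: drafts written at step N
                  req_index: int,
                  draft_len: int) -> torch.Tensor:
    # Only draft_len of the prev_num_spec_tokens drafts are verified at N+1.
    assert draft_len <= prev_num_spec_tokens
    start = req_index * prev_num_spec_tokens
    return draft_token_ids[start:start + draft_len]
```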
PENDING (some of these can be done in future PRs):

- `profiling_client.py` and `profiling_server.py`