[Bug] Refactor max_num_batched_tokens to account for drafting #34898
Conversation
…cases Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
The pull request refactors the handling of max_num_batched_tokens for speculative decoding by extending the limit in VllmConfig to account for drafting slots and then compensating with a subtraction in the V1 scheduler. This ensures that model runner buffers and CUDA graph ranges are correctly sized for the maximum possible batch size during verification. However, there are several critical issues: the extension logic in SpeculativeConfig incorrectly ignores the extra slots needed for serial drafting (which still requires a multi-token verification pass in V1), and the in-place modification of scheduler_config has unintended side effects on the Mamba cache alignment check, the V0 scheduler, and proposer-specific buffer allocations.
vllm/config/vllm.py
Outdated
```python
self.scheduler_config.max_num_batched_tokens += (
    self.speculative_config.max_num_new_slots_for_drafting
    * self.scheduler_config.max_num_seqs
)
```
Modifying scheduler_config.max_num_batched_tokens in place changes the meaning of this field from 'scheduling token budget' to 'maximum buffer capacity'. This has several problematic side effects:
- V0 Scheduler Inconsistency: The V0 scheduler (e.g., in `vllm/core/scheduler.py`) uses this field directly for scheduling decisions and lacks the compensating subtraction logic added to the V1 scheduler. This will result in an unintended increase in the scheduling budget for V0 when speculative decoding is enabled.
- Mamba Cache Alignment Check: The check at line 1116 (`block_size <= max_num_batched_tokens`) now validates against the extended buffer capacity instead of the actual scheduling budget. This could allow configurations where the scheduler is unable to schedule a full block, breaking the alignment requirement for Mamba models.
- Double Extension in Proposers: The V1 proposer base class (`SpecDecodeBaseProposer` in `vllm/v1/spec_decode/eagle.py`, line 105) performs its own extension logic based on the config value. Since the config value is now already extended, the proposer's internal buffers will be double-extended, wasting GPU memory.
Consider storing the original scheduling budget separately or ensuring that all components are updated to distinguish between the scheduling limit and the buffer capacity.
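One way to realize this suggestion is to keep the scheduling budget and the buffer capacity as two distinct quantities rather than overloading one field. A minimal sketch (the class and field names here are illustrative, not the actual vLLM config types):

```python
from dataclasses import dataclass


@dataclass
class BufferSizing:
    # Original per-step token budget the scheduler enforces; never mutated.
    scheduling_budget: int
    # Extra slots the drafter may append before the verification pass.
    drafting_slots: int

    @property
    def buffer_capacity(self) -> int:
        # Model-runner buffers and CUDA graph ranges are sized from this,
        # while scheduling decisions keep reading scheduling_budget.
        return self.scheduling_budget + self.drafting_slots
```

With this split, checks like the Mamba block-size alignment can validate against `scheduling_budget`, while buffer allocation uses `buffer_capacity`, so neither consumer sees the wrong number.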
- I don't think we need to consider V0 here. Especially since this is only for speculative decoding. Is V0 still a concern?
- This is what we're trying to fix. Everything in the model runner should assume the larger buffer size, for safety.
- Fixed.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Do you think it would make sense to just clone the `vllm_config` in `WorkerBase.__init__` and only call `_extend_max_num_batched_tokens_for_drafting` on that config instead of modifying the scheduler? Maybe that's too susceptible to consistency bugs.
I guess another option would be to add a `max_num_tokens_per_forward_pass` property to `VllmConfig` that is the extended range. This would help create a separation of concerns between `scheduler_config` and the model executor: the model executor cares about `max_num_tokens_per_forward_pass`, while the scheduler cares about `scheduler_config`. We'd have to audit all `scheduler_config.max_num_batched_tokens` usages though to see if they should be `max_num_tokens_per_forward_pass` :/
You can merge from main now; CI should be fixed.
I'm trying to test the changes from this branch, but when I do: it seems to get stuck forever running a bunch of compilation behind the scenes, whereas on main this does not happen. Update: it seems to work after merging in the newest commits from main.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Modified the implementation to fix the bug and bring it more in line with Lucas' suggestions. PTAL
…roject#34898) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
…roject#34898) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Purpose
Alternative bugfix to #34671. Solves a crash of speculative decoding on main.
To solve the consistency issue, we directly modify `max_num_batched_tokens` when initializing the `VllmConfig`, and then decrease it specifically in the scheduler so that the scheduling behaviour is unchanged.
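The extend-then-compensate pattern described above can be sketched as follows. All names here are illustrative stand-ins for the real vLLM config and scheduler classes, and the extension formula mirrors the snippet discussed in the review:

```python
from dataclasses import dataclass


@dataclass
class SchedConfig:
    max_num_batched_tokens: int
    max_num_seqs: int


@dataclass
class SpecConfig:
    max_num_new_slots_for_drafting: int


def extend_for_drafting(sched: SchedConfig, spec: SpecConfig) -> None:
    # Config-init step: extend the limit in place so every buffer sized
    # from this field covers the worst-case verification batch.
    sched.max_num_batched_tokens += (
        spec.max_num_new_slots_for_drafting * sched.max_num_seqs
    )


def scheduling_budget(sched: SchedConfig, spec: SpecConfig) -> int:
    # Scheduler step: subtract the drafting slots back out, so the
    # effective scheduling behaviour is unchanged by the extension.
    return sched.max_num_batched_tokens - (
        spec.max_num_new_slots_for_drafting * sched.max_num_seqs
    )
```

The invariant being relied on is that the scheduler's subtraction exactly cancels the config's addition, which is why both sides must compute the drafting-slot term identically.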
Testing
Tested with Qwen3-Next MTP and GSM8k repeated twice with various concurrencies. All pass with 85% accuracy, matching the non-spec baseline.