[BugFix] Refactor add max_num_tokens_per_forward_pass to account for drafting #35038
LucasWilkinson wants to merge 1 commit into vllm-project:main from
Conversation
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from a92fde0 to 1ab5574
Code Review
This pull request introduces a valuable refactoring by adding max_num_tokens_per_forward_pass to VllmConfig. This centralizes the logic for calculating the maximum number of tokens in a forward pass, especially when accounting for speculative decoding. The changes correctly replace the scattered and sometimes inconsistent calculations throughout the codebase with this new, unified configuration value. This not only fixes a bug related to buffer allocation for speculative decoding but also significantly improves code clarity and maintainability. The implementation appears solid and the changes are applied consistently across all relevant files. Overall, this is an excellent improvement.
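To make the centralization concrete, here is a minimal sketch of what a derived config value like this might look like. The class layout, field names other than `max_num_tokens_per_forward_pass`, and the exact inflation formula are illustrative assumptions, not the actual vLLM implementation:

```python
# Hypothetical sketch: a config-level property that inflates the scheduler
# token budget by the number of draft tokens, so buffer allocations can
# reference one unified value. Names and structure are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeculativeConfig:
    # Draft tokens proposed per scheduled token (illustrative field name).
    num_speculative_tokens: int


@dataclass
class VllmConfig:
    # Scheduler budget: max tokens scheduled per iteration.
    max_num_batched_tokens: int
    speculative_config: Optional[SpeculativeConfig] = None

    @property
    def max_num_tokens_per_forward_pass(self) -> int:
        # Without drafting, a forward pass sees at most the scheduler budget.
        if self.speculative_config is None:
            return self.max_num_batched_tokens
        # With drafting, each scheduled token may be accompanied by up to
        # num_speculative_tokens draft tokens in the same forward pass,
        # so buffers must be sized for the inflated count.
        k = self.speculative_config.num_speculative_tokens
        return self.max_num_batched_tokens * (1 + k)


cfg = VllmConfig(
    max_num_batched_tokens=2048,
    speculative_config=SpeculativeConfig(num_speculative_tokens=2),
)
print(cfg.max_num_tokens_per_forward_pass)  # 6144
```

Sizing buffers from this single property, rather than recomputing `max_num_batched_tokens * (1 + k)` at each call site, is what removes the scattered and inconsistent calculations the review describes.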
For this PR, it is probably best if you cherry-pick d52d0c3. @LucasWilkinson There is an assertion failure which I think this commit fixes. This is what https://buildkite.com/vllm/amd-ci/builds/5181/steps/canvas?sid=019c82e1-0b82-4099-8df2-ba5c192cbedb&tab=output#019c82e1-0ec2-4173-8f53-6cfd858358e0/L1530 captured.
benchislett left a comment
While this maintains a desirable UX on the frontend, I think it will be a huge pain to maintain and/or effectively rename max_num_batched_tokens across all of vLLM. Since many buffers are sized based on that, it seems like it would need to be a near-total replacement of what max_num_batched_tokens means in vLLM.
To me, the more maintainable solution is still to modify the scheduler to issue fewer tokens per iteration. Open to hearing more thoughts on the matter.
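The alternative described above can be sketched as follows. Instead of inflating buffer sizes, the scheduler's per-iteration budget is shrunk so the drafted total stays within the existing `max_num_batched_tokens`-sized buffers. The function name and formula are illustrative assumptions, not code from the PR:

```python
# Hypothetical sketch of the scheduler-side alternative: keep buffers sized
# by max_num_batched_tokens and issue fewer tokens per iteration instead.
def scheduler_token_budget(max_num_batched_tokens: int,
                           num_speculative_tokens: int) -> int:
    # Each scheduled token may expand into (1 + k) tokens once draft tokens
    # are appended, so divide the budget down to guarantee the forward pass
    # never exceeds the original buffer allocation.
    return max_num_batched_tokens // (1 + num_speculative_tokens)


print(scheduler_token_budget(2048, 2))  # 682
```

The trade-off between the two approaches is where the complexity lives: this version keeps `max_num_batched_tokens` semantics unchanged across vLLM but reduces scheduler throughput per iteration, while the PR's approach preserves throughput at the cost of touching every buffer-sizing call site.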
Purpose
Alternative bugfix to #34671. Solves a speculative-decoding crash on main.
To solve the consistency issue, we add a max_num_tokens_per_forward_pass to the VllmConfig that accounts for drafting.
Testing
Tested with Qwen3-Next MTP and GSM8k repeated twice with various concurrencies. All pass with 85% accuracy, matching the non-spec baseline.