
[BugFix] Refactor add max_num_tokens_per_forward_pass to account for drafting#35038

Open
LucasWilkinson wants to merge 1 commit into vllm-project:main from neuralmagic:lwilkinson/max-num-tokens-fwd-pass

Conversation

LucasWilkinson (Collaborator) commented Feb 22, 2026

Purpose

Alternative bugfix to #34671. Fixes a crash in speculative decoding on main.

To solve the consistency issue, we add a max_num_tokens_per_forward_pass field to the VllmConfig that accounts for drafting.
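As a rough illustration of the idea (a minimal sketch; the class shapes, field names, and the sizing formula here are assumptions for illustration, not the PR's actual implementation), the forward-pass token bound can be centralized as a single config property instead of being recomputed at each call site:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeculativeConfig:
    # Number of draft tokens proposed per sequence per step.
    num_speculative_tokens: int


@dataclass
class VllmConfig:
    # Scheduler's per-step token budget.
    max_num_batched_tokens: int
    speculative_config: Optional[SpeculativeConfig] = None

    @property
    def max_num_tokens_per_forward_pass(self) -> int:
        # Without drafting, a forward pass sees at most the scheduler's
        # token budget. With drafting, each scheduled decode token may be
        # accompanied by up to num_speculative_tokens draft tokens in the
        # same pass (worst-case multiplier; illustrative assumption).
        if self.speculative_config is None:
            return self.max_num_batched_tokens
        k = self.speculative_config.num_speculative_tokens
        return self.max_num_batched_tokens * (1 + k)
```

Call sites that allocate per-pass buffers would then read this one property rather than each deriving their own, sometimes inconsistent, bound.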

Testing

Tested with Qwen3-Next MTP and GSM8k, repeated twice with various concurrencies. All runs pass with 85% accuracy, matching the non-spec baseline.

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct  \
  --tokenizer-mode auto  --gpu-memory-utilization 0.8 \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 5}' \
  --tensor-parallel-size 2 --port 8042

dosubot bot commented Feb 22, 2026

Related Documentation

Checked 0 published document(s) in 1 knowledge base(s). No updates required.


Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@mergify mergify bot added the nvidia, rocm (Related to AMD ROCm), and speculative-decoding labels Feb 22, 2026
mergify bot commented Feb 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the v1 label Feb 22, 2026
@mergify mergify bot added the needs-rebase label Feb 22, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 22, 2026
@mergify mergify bot added the bug Something isn't working label Feb 22, 2026
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/max-num-tokens-fwd-pass branch from a92fde0 to 1ab5574 Compare February 22, 2026 01:04
@mergify mergify bot removed the needs-rebase label Feb 22, 2026
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a valuable refactoring by adding max_num_tokens_per_forward_pass to VllmConfig. This centralizes the logic for calculating the maximum number of tokens in a forward pass, especially when accounting for speculative decoding. The changes correctly replace the scattered and sometimes inconsistent calculations throughout the codebase with this new, unified configuration value. This not only fixes a bug related to buffer allocation for speculative decoding but also significantly improves code clarity and maintainability. The implementation appears solid and the changes are applied consistently across all relevant files. Overall, this is an excellent improvement.

AndreasKaratzas (Collaborator) commented Feb 22, 2026

For this PR, probably best if you cherry pick: d52d0c3

@LucasWilkinson There is an assertion failure that I think this commit fixes. It is what https://buildkite.com/vllm/amd-ci/builds/5181/steps/canvas?sid=019c82e1-0b82-4099-8df2-ba5c192cbedb&tab=output#019c82e1-0ec2-4173-8f53-6cfd858358e0/L1530 captured.

benchislett (Collaborator) left a comment


While this maintains a desirable UX on the frontend, I think it will be a huge pain to maintain and will effectively rename max_num_batched_tokens across all of vLLM. Since many buffers are sized based on that value, it seems like it would need to be a near-total replacement of what max_num_batched_tokens means in vLLM.

To me, the more maintainable solution is still to modify the scheduler to issue fewer tokens per iteration. Open to hearing more thoughts on the matter.
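The trade-off in this review can be sketched numerically. All numbers and the worst-case multiplier below are illustrative assumptions, not values taken from the PR; the sketch only contrasts the two candidate fixes:

```python
# Hypothetical numbers for illustration only.
max_num_batched_tokens = 8192   # scheduler's per-step token budget
num_speculative_tokens = 5      # draft tokens per sequence per step

# Assumed worst case: every scheduled token is a decode token that
# brings num_speculative_tokens draft tokens into the same forward
# pass, so buffers sized to max_num_batched_tokens can under-allocate.
max_num_tokens_per_forward_pass = (
    max_num_batched_tokens * (1 + num_speculative_tokens)
)

# Fix A (this PR's direction): size forward-pass buffers to the
# larger bound and leave the scheduler budget untouched.
buffer_len_a = max_num_tokens_per_forward_pass

# Fix B (the reviewer's suggestion): keep buffers at
# max_num_batched_tokens and have the scheduler issue fewer tokens
# per iteration so the pass still fits after drafting.
scheduler_budget_b = max_num_batched_tokens // (1 + num_speculative_tokens)
```

Fix A keeps the scheduler's throughput but touches every buffer sized by max_num_batched_tokens; Fix B localizes the change to the scheduler at the cost of a smaller per-step budget.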

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Feb 22, 2026
mergify bot commented Feb 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 23, 2026