[BugFix] Refactor add max_num_tokens_per_forward_pass to account for drafting #35038
LucasWilkinson wants to merge 1 commit into vllm-project:main from
Conversation
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from a92fde0 to 1ab5574
Code Review
This pull request introduces a valuable refactoring by adding max_num_tokens_per_forward_pass to VllmConfig. This centralizes the logic for calculating the maximum number of tokens in a forward pass, especially when accounting for speculative decoding. The changes correctly replace the scattered and sometimes inconsistent calculations throughout the codebase with this new, unified configuration value. This not only fixes a bug related to buffer allocation for speculative decoding but also significantly improves code clarity and maintainability. The implementation appears solid and the changes are applied consistently across all relevant files. Overall, this is an excellent improvement.
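To make the centralization concrete, here is a minimal sketch of what a derived config value like this might look like. The class layout, field names other than `max_num_tokens_per_forward_pass`, and the exact inflation formula are illustrative assumptions, not the actual vLLM implementation:

```python
# Hypothetical sketch: a config-level property that inflates the scheduler
# token budget by the number of draft tokens, so buffer allocations can
# reference one unified value. Names and structure are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeculativeConfig:
    # Draft tokens proposed per scheduled token (illustrative field name).
    num_speculative_tokens: int


@dataclass
class VllmConfig:
    # Scheduler budget: max tokens scheduled per iteration.
    max_num_batched_tokens: int
    speculative_config: Optional[SpeculativeConfig] = None

    @property
    def max_num_tokens_per_forward_pass(self) -> int:
        # Without drafting, a forward pass sees at most the scheduler budget.
        if self.speculative_config is None:
            return self.max_num_batched_tokens
        # With drafting, each scheduled token may be accompanied by up to
        # num_speculative_tokens draft tokens in the same forward pass,
        # so buffers must be sized for the inflated count.
        k = self.speculative_config.num_speculative_tokens
        return self.max_num_batched_tokens * (1 + k)


cfg = VllmConfig(
    max_num_batched_tokens=2048,
    speculative_config=SpeculativeConfig(num_speculative_tokens=2),
)
print(cfg.max_num_tokens_per_forward_pass)  # 6144
```

Sizing buffers from this single property, rather than recomputing `max_num_batched_tokens * (1 + k)` at each call site, is what removes the scattered and inconsistent calculations the review describes.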
For this PR, it is probably best if you cherry-pick d52d0c3. @LucasWilkinson There is an assertion failure which I think this commit fixes. This is what https://buildkite.com/vllm/amd-ci/builds/5181/steps/canvas?sid=019c82e1-0b82-4099-8df2-ba5c192cbedb&tab=output#019c82e1-0ec2-4173-8f53-6cfd858358e0/L1530 captured.
benchislett left a comment
While this maintains a desirable UX on the frontend, I think it will be a huge pain to maintain and/or effectively rename max_num_batched_tokens across all of vLLM. Since many buffers are sized based on that, it seems like it would need to be a near-total replacement of what max_num_batched_tokens means in vLLM.
To me, the more maintainable solution is still to modify the scheduler to issue fewer tokens per iteration. Open to hearing more thoughts on the matter.
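The alternative described above can be sketched as follows. Instead of inflating buffer sizes, the scheduler's per-iteration budget is shrunk so the drafted total stays within the existing `max_num_batched_tokens`-sized buffers. The function name and formula are illustrative assumptions, not code from the PR:

```python
# Hypothetical sketch of the scheduler-side alternative: keep buffers sized
# by max_num_batched_tokens and issue fewer tokens per iteration instead.
def scheduler_token_budget(max_num_batched_tokens: int,
                           num_speculative_tokens: int) -> int:
    # Each scheduled token may expand into (1 + k) tokens once draft tokens
    # are appended, so divide the budget down to guarantee the forward pass
    # never exceeds the original buffer allocation.
    return max_num_batched_tokens // (1 + num_speculative_tokens)


print(scheduler_token_budget(2048, 2))  # 682
```

The trade-off between the two approaches is where the complexity lives: this version keeps `max_num_batched_tokens` semantics unchanged across vLLM but reduces scheduler throughput per iteration, while the PR's approach preserves throughput at the cost of touching every buffer-sizing call site.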
Purpose
Alternative bugfix to #34671. Solves a speculative-decoding crash on main.
To solve the consistency issue, we add a max_num_tokens_per_forward_pass to the VllmConfig that accounts for drafting.
Testing
Tested with Qwen3-Next MTP and GSM8k repeated twice with various concurrencies. All pass with 85% accuracy, matching the non-spec baseline.