[Bug] Refactor max_num_batched_tokens to account for drafting #34898
Conversation
…cases Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
The pull request refactors the handling of max_num_batched_tokens for speculative decoding by extending the limit in VllmConfig to account for drafting slots and then compensating with a subtraction in the V1 scheduler. This ensures that model runner buffers and CUDA graph ranges are correctly sized for the maximum possible batch size during verification. However, there are several critical issues: the extension logic in SpeculativeConfig incorrectly ignores the extra slots needed for serial drafting (which still requires a multi-token verification pass in V1), and the in-place modification of scheduler_config has unintended side effects on the Mamba cache alignment check, the V0 scheduler, and proposer-specific buffer allocations.
vllm/config/vllm.py
Outdated
```python
self.scheduler_config.max_num_batched_tokens += (
    self.speculative_config.max_num_new_slots_for_drafting
    * self.scheduler_config.max_num_seqs
)
```
Modifying scheduler_config.max_num_batched_tokens in place changes the meaning of this field from 'scheduling token budget' to 'maximum buffer capacity'. This has several problematic side effects:
- V0 Scheduler Inconsistency: The V0 scheduler (e.g., in `vllm/core/scheduler.py`) uses this field directly for scheduling decisions and lacks the compensating subtraction logic added to the V1 scheduler. This will result in an unintended increase in the scheduling budget for V0 when speculative decoding is enabled.
- Mamba Cache Alignment Check: The check at line 1116 (`block_size <= max_num_batched_tokens`) now validates against the extended buffer capacity instead of the actual scheduling budget. This could allow configurations where the scheduler is unable to schedule a full block, breaking the alignment requirement for Mamba models.
- Double Extension in Proposers: The V1 proposer base class (`SpecDecodeBaseProposer` in `vllm/v1/spec_decode/eagle.py`, line 105) performs its own extension logic based on the config value. Since the config value is now already extended, the proposer's internal buffers will be double-extended, wasting GPU memory.
Consider storing the original scheduling budget separately or ensuring that all components are updated to distinguish between the scheduling limit and the buffer capacity.
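One way to realize this suggestion is to keep the scheduling budget and the buffer capacity as two distinct quantities rather than overloading one field. A minimal sketch (the class and field names here are illustrative, not the actual vLLM config types):

```python
from dataclasses import dataclass


@dataclass
class BufferSizing:
    # Original per-step token budget the scheduler enforces; never mutated.
    scheduling_budget: int
    # Extra slots the drafter may append before the verification pass.
    drafting_slots: int

    @property
    def buffer_capacity(self) -> int:
        # Model-runner buffers and CUDA graph ranges are sized from this,
        # while scheduling decisions keep reading scheduling_budget.
        return self.scheduling_budget + self.drafting_slots
```

With this split, checks like the Mamba block-size alignment can validate against `scheduling_budget`, while buffer allocation uses `buffer_capacity`, so neither consumer sees the wrong number.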
- I don't think we need to consider V0 here. Especially since this is only for speculative decoding. Is V0 still a concern?
- This is what we're trying to fix. Everything in the model runner should assume the larger buffer size, for safety.
- Fixed.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Do you think it would make sense to just clone the `vllm_config` in `WorkerBase.__init__` and only call `_extend_max_num_batched_tokens_for_drafting` on that config instead of modifying the scheduler? Maybe that's too susceptible to consistency bugs.
I guess another option would be to add a `max_num_tokens_per_forward_pass` property to `VllmConfig` that is the extended range. This would help create a separation of concerns between `scheduler_config` and the model executor: the model executor cares about `max_num_tokens_per_forward_pass`, while the scheduler cares about `scheduler_config`. We'd have to audit all `scheduler_config.max_num_batched_tokens` usages though to see if they should be `max_num_tokens_per_forward_pass` :/
You can merge from main now; CI should be fixed.
I'm trying to test the changes from this branch, but when I do: it seems to get stuck forever running a bunch of compilation behind the scenes, whereas on main this does not happen. Update: it seems to work after merging in the newest commits from main.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Modified the implementation to fix the bug and bring it more in line with Lucas' suggestions. PTAL
…roject#34898) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
…roject#34898) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Purpose
Alternative bugfix to #34671. Solves a crash of speculative decoding on main.
To solve the consistency issue, we directly modify `max_num_batched_tokens` when initializing the `VllmConfig`, and then decrease it specifically in the scheduler so that the scheduling behaviour is unchanged.
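The extend-then-compensate pattern described above can be sketched as follows. All names here are illustrative stand-ins for the real vLLM config and scheduler classes, and the extension formula mirrors the snippet discussed in the review:

```python
from dataclasses import dataclass


@dataclass
class SchedConfig:
    max_num_batched_tokens: int
    max_num_seqs: int


@dataclass
class SpecConfig:
    max_num_new_slots_for_drafting: int


def extend_for_drafting(sched: SchedConfig, spec: SpecConfig) -> None:
    # Config-init step: extend the limit in place so every buffer sized
    # from this field covers the worst-case verification batch.
    sched.max_num_batched_tokens += (
        spec.max_num_new_slots_for_drafting * sched.max_num_seqs
    )


def scheduling_budget(sched: SchedConfig, spec: SpecConfig) -> int:
    # Scheduler step: subtract the drafting slots back out, so the
    # effective scheduling behaviour is unchanged by the extension.
    return sched.max_num_batched_tokens - (
        spec.max_num_new_slots_for_drafting * sched.max_num_seqs
    )
```

The invariant being relied on is that the scheduler's subtraction exactly cancels the config's addition, which is why both sides must compute the drafting-slot term identically.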
Testing
Tested with Qwen3-Next MTP and GSM8k repeated twice with various concurrencies. All pass with 85% accuracy, matching the non-spec baseline.