[Bugfix] Fix assertion error in _dummy_run for MTP speculative decoding#34474
LucasWilkinson wants to merge 2 commits into main from
Conversation
Code Review
This pull request aims to fix an assertion error in _dummy_run for MTP speculative decoding by adjusting the maximum token limit. While the intent is correct, the implementation introduces a potential regression. My review includes a critical fix to prevent lowering the assertion limit incorrectly, ensuring that existing functionality remains intact while addressing the speculative decoding issue.
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
if self.compilation_config.compile_ranges_split_points:
    max_dummy_run_tokens = max(
        self.compilation_config.compile_ranges_split_points
    )
```
This logic incorrectly replaces max_dummy_run_tokens instead of taking the maximum of the current value and the new one. If max(self.compilation_config.compile_ranges_split_points) is smaller than self.scheduler_config.max_num_batched_tokens, this change would lower the assertion limit, potentially causing new assertion failures for valid token counts.
For example, tests/compile/test_compile_ranges.py sets max_num_batched_tokens=8192 and compile_ranges_split_points=[8, 32]. With this change, max_dummy_run_tokens would become 32. Since _dummy_run is called with compile_sizes including 64 and 128, this would lead to an AssertionError.
To fix this, you should take the maximum of the existing max_dummy_run_tokens and the value from compile_ranges_split_points.
```python
if compile_points := self.compilation_config.compile_ranges_split_points:
    max_dummy_run_tokens = max(max_dummy_run_tokens, max(compile_points))
```
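To make the regression concrete, here is a minimal standalone sketch using the numbers from `tests/compile/test_compile_ranges.py` cited above. Plain variables stand in for `self.scheduler_config` and `self.compilation_config`; this is an illustration of the review point, not the actual runner code.

```python
# Values taken from tests/compile/test_compile_ranges.py as described above.
max_num_batched_tokens = 8192
compile_ranges_split_points = [8, 32]

# Buggy version from the diff: replaces the limit outright.
buggy_limit = max(compile_ranges_split_points)

# Suggested fix: never lower the existing limit.
max_dummy_run_tokens = max_num_batched_tokens
if compile_points := compile_ranges_split_points:
    max_dummy_run_tokens = max(max_dummy_run_tokens, max(compile_points))

print(buggy_limit)           # 32 -> compile_sizes 64 and 128 would assert
print(max_dummy_run_tokens)  # 8192 -> all valid token counts still pass
```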
Fix an assertion error during model warmup when using MTP speculative decoding with parallel drafting. The issue occurred because the compile range is extended for speculative decoding to accommodate drafter batches, but the assertion in _dummy_run wasn't updated to match.

Root cause: PR #32887 added compile range extension in _set_compile_ranges for speculative decoding. This causes warmup sizes to exceed max_num_batched_tokens, triggering the assertion in _dummy_run.

Fix: Extend the assertion bound in _dummy_run to match the extended compile range when parallel drafting is enabled.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
```python
    else 1
)
max_dummy_run_tokens += multiplier * self.scheduler_config.max_num_seqs

assert num_tokens <= max_dummy_run_tokens
```
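The extended bound in the diff can be sketched end-to-end with standalone code. The values of `max_num_seqs` and `num_speculative_tokens` below are hypothetical, chosen only to illustrate the formula; they are not taken from the PR.

```python
# Hypothetical values for illustration; only the formula mirrors the diff.
max_num_batched_tokens = 8192
max_num_seqs = 256
num_speculative_tokens = 1

max_dummy_run_tokens = max_num_batched_tokens
# With parallel drafting, drafter batches can add up to
# multiplier * max_num_seqs tokens beyond max_num_batched_tokens.
multiplier = num_speculative_tokens if num_speculative_tokens > 0 else 1
max_dummy_run_tokens += multiplier * max_num_seqs

num_tokens = 8300  # a warmup size larger than max_num_batched_tokens
assert num_tokens <= max_dummy_run_tokens  # passes with the extended bound
print(max_dummy_run_tokens)  # 8448
```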
How about adding some assertion messages?
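One possible shape for such a message (the wording and the trigger values below are suggestions for illustration, not from the PR):

```python
# Hypothetical values that trip the assertion, to show the message text.
num_tokens = 4096
max_dummy_run_tokens = 2048

try:
    assert num_tokens <= max_dummy_run_tokens, (
        f"num_tokens ({num_tokens}) exceeds max_dummy_run_tokens "
        f"({max_dummy_run_tokens}); check compile ranges and batch limits"
    )
except AssertionError as e:
    print(e)
```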
luccafong
left a comment
looks good, thanks for the fix.
This pull request has merge conflicts that must be resolved before it can be merged.
Summary
Fix an assertion error during model warmup when using MTP speculative decoding with parallel drafting.
Problem: When running with MTP speculative decoding, the server crashes during warmup:
Root cause: PR #32887 (Unified Parallel Drafting) added compile range extension in _set_compile_ranges for speculative decoding to accommodate drafter batches. This causes warmup sizes to exceed max_num_batched_tokens, triggering the assertion in _dummy_run.

Fix: Extend the assertion bound in _dummy_run to match the extended compile range when parallel drafting is enabled.

Separate Issue: FP8 sparse_attn_indexer Error on Blackwell GPUs
During testing, a separate inference error was observed on Blackwell (B200) GPUs that is unrelated to this fix:
This error occurs in
vllm/model_executor/layers/sparse_attn_indexer.py->fp8_paged_mqa_logitswhen running DeepSeek-V3.2-NVFP4 inference. The stack trace shows the issue is in the FP8 MLA indexer operations, not in the warmup/assertion code. This appears to be a compatibility issue with DeepGEMM/sparse attention on SM100 architecture and should be tracked separately.Test Plan
vllm serve nvidia/DeepSeek-V3.2-NVFP4 --tensor-parallel-size 4 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'