[Bugfix] Fix assertion error in _dummy_run for MTP speculative decoding #34474

Open
LucasWilkinson wants to merge 2 commits into main from fix-mtp-dummy-run-assertion

Conversation

@LucasWilkinson LucasWilkinson (Collaborator) commented Feb 12, 2026

Summary

Fix an assertion error during model warmup when using MTP speculative decoding with parallel drafting.

Problem: When running with MTP speculative decoding, the server crashes during warmup:

AssertionError at vllm/v1/worker/gpu_model_runner.py:4677
    assert num_tokens <= self.scheduler_config.max_num_batched_tokens

Root cause: PR #32887 (Unified Parallel Drafting) added a compile-range extension in _set_compile_ranges so that speculative decoding can accommodate drafter batches. This extension can push warmup sizes past max_num_batched_tokens, triggering the assertion in _dummy_run.

Fix: Extend the assertion bound in _dummy_run to match the extended compile range when parallel drafting is enabled.
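For concreteness, a minimal standalone sketch of the intended bound computation (the helper name `max_dummy_run_tokens` and its signature are illustrative, not the actual vLLM code; the config field names follow the snippets quoted in this PR):

```python
def max_dummy_run_tokens(max_num_batched_tokens, compile_ranges_split_points=None):
    """Upper bound on num_tokens that _dummy_run may be asked to warm up.

    Starts from the scheduler's max_num_batched_tokens and only ever
    extends the bound when compile-range split points (added for parallel
    drafting) exceed it.
    """
    bound = max_num_batched_tokens
    if compile_ranges_split_points:
        # Take the max of both so the bound is never lowered.
        bound = max(bound, max(compile_ranges_split_points))
    return bound
```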

Separate Issue: FP8 sparse_attn_indexer Error on Blackwell GPUs

During testing, a separate inference error was observed on Blackwell (B200) GPUs that is unrelated to this fix:

RuntimeError: set_stride is not allowed on a Tensor created from .data or .detach()

This error occurs in vllm/model_executor/layers/sparse_attn_indexer.py -> fp8_paged_mqa_logits when running DeepSeek-V3.2-NVFP4 inference. The stack trace shows the issue is in the FP8 MLA indexer operations, not in the warmup/assertion code. This appears to be a compatibility issue with DeepGEMM/sparse attention on SM100 architecture and should be tracked separately.

Test Plan

  • Tested server boot with MTP speculative decoding on DeepSeek-V3.2-NVFP4
  • Server boots successfully without assertion error
  • Command: vllm serve nvidia/DeepSeek-V3.2-NVFP4 --tensor-parallel-size 4 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

@mergify mergify bot added v1 bug Something isn't working labels Feb 12, 2026
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to fix an assertion error in _dummy_run for MTP speculative decoding by adjusting the maximum token limit. While the intent is correct, the implementation introduces a potential regression. My review includes a critical fix to prevent lowering the assertion limit incorrectly, ensuring that existing functionality remains intact while addressing the speculative decoding issue.

Comment on lines +4681 to +4684
if self.compilation_config.compile_ranges_split_points:
    max_dummy_run_tokens = max(
        self.compilation_config.compile_ranges_split_points
    )
critical

This logic incorrectly replaces max_dummy_run_tokens instead of taking the maximum of the current value and the new one. If max(self.compilation_config.compile_ranges_split_points) is smaller than self.scheduler_config.max_num_batched_tokens, this change would lower the assertion limit, potentially causing new assertion failures for valid token counts.

For example, tests/compile/test_compile_ranges.py sets max_num_batched_tokens=8192 and compile_ranges_split_points=[8, 32]. With this change, max_dummy_run_tokens would become 32. Since _dummy_run is called with compile_sizes including 64 and 128, this would lead to an AssertionError.

To fix this, you should take the maximum of the existing max_dummy_run_tokens and the value from compile_ranges_split_points.

if compile_points := self.compilation_config.compile_ranges_split_points:
    max_dummy_run_tokens = max(max_dummy_run_tokens, max(compile_points))
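For concreteness, a standalone sketch of the regression scenario described in this review, using the cited numbers from tests/compile/test_compile_ranges.py (plain Python, outside vLLM):

```python
# Numbers cited from tests/compile/test_compile_ranges.py.
max_num_batched_tokens = 8192
compile_ranges_split_points = [8, 32]

# Patch as originally written: the bound is replaced outright.
replaced_bound = max(compile_ranges_split_points)

# Suggested fix: the bound is only ever extended, never lowered.
extended_bound = max(max_num_batched_tokens, max(compile_ranges_split_points))

compile_size = 64  # one of the compile sizes _dummy_run is called with
assert not (compile_size <= replaced_bound)  # original patch would assert-fail here
assert compile_size <= extended_bound        # suggested fix accepts it
```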

@LucasWilkinson LucasWilkinson force-pushed the fix-mtp-dummy-run-assertion branch 3 times, most recently from 57acb6e to 587ad8e Compare February 12, 2026 23:42
@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 12, 2026
@LucasWilkinson LucasWilkinson force-pushed the fix-mtp-dummy-run-assertion branch 3 times, most recently from 1ffe3d9 to 7351560 Compare February 13, 2026 18:38
@LucasWilkinson LucasWilkinson force-pushed the fix-mtp-dummy-run-assertion branch from 7351560 to 2376568 Compare February 13, 2026 19:09
Fix an assertion error during model warmup when using MTP speculative
decoding with parallel drafting. The issue occurred because the compile
range is extended for speculative decoding to accommodate drafter
batches, but the assertion in _dummy_run wasn't updated to match.

Root cause: PR #32887 added compile range extension in _set_compile_ranges
for speculative decoding. This causes warmup sizes to exceed
max_num_batched_tokens, triggering the assertion in _dummy_run.

Fix: Extend the assertion bound in _dummy_run to match the extended
compile range when parallel drafting is enabled.

Co-Authored-By: Claude <noreply@anthropic.com>

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson force-pushed the fix-mtp-dummy-run-assertion branch from 2376568 to f9d7402 Compare February 13, 2026 22:51
    else 1
)
max_dummy_run_tokens += multiplier * self.scheduler_config.max_num_seqs
assert num_tokens <= max_dummy_run_tokens

adding some assertion messages?
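For illustration, a hedged sketch of what the suggested assertion message could look like (the helper name and wording here are hypothetical, not the PR's actual code):

```python
def check_dummy_run_tokens(num_tokens: int, max_dummy_run_tokens: int) -> None:
    """Assert with a self-explanatory message so warmup failures are debuggable."""
    assert num_tokens <= max_dummy_run_tokens, (
        f"_dummy_run called with num_tokens={num_tokens}, which exceeds "
        f"max_dummy_run_tokens={max_dummy_run_tokens}; check "
        "max_num_batched_tokens and compile_ranges_split_points"
    )
```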

@luccafong luccafong (Collaborator) left a comment

looks good, thanks for the fix.

@luccafong luccafong (Collaborator) left a comment

did not realize there's a fix already (#34898); reverting the approval


mergify bot commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 24, 2026