[Bugfix] Fix assertion error in _dummy_run for MTP speculative decoding#34474
LucasWilkinson wants to merge 2 commits into main from
Conversation
Code Review
This pull request aims to fix an assertion error in _dummy_run for MTP speculative decoding by adjusting the maximum token limit. While the intent is correct, the implementation introduces a potential regression. My review includes a critical fix to prevent lowering the assertion limit incorrectly, ensuring that existing functionality remains intact while addressing the speculative decoding issue.
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
if self.compilation_config.compile_ranges_split_points:
    max_dummy_run_tokens = max(
        self.compilation_config.compile_ranges_split_points
    )
```
This logic incorrectly replaces max_dummy_run_tokens instead of taking the maximum of the current value and the new one. If max(self.compilation_config.compile_ranges_split_points) is smaller than self.scheduler_config.max_num_batched_tokens, this change would lower the assertion limit, potentially causing new assertion failures for valid token counts.
For example, tests/compile/test_compile_ranges.py sets max_num_batched_tokens=8192 and compile_ranges_split_points=[8, 32]. With this change, max_dummy_run_tokens would become 32. Since _dummy_run is called with compile_sizes including 64 and 128, this would lead to an AssertionError.
To fix this, you should take the maximum of the existing max_dummy_run_tokens and the value from compile_ranges_split_points.
```python
if compile_points := self.compilation_config.compile_ranges_split_points:
    max_dummy_run_tokens = max(max_dummy_run_tokens, max(compile_points))
```
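To make the regression concrete, here is a minimal standalone sketch using the numbers from `tests/compile/test_compile_ranges.py` cited above. Plain variables stand in for `self.scheduler_config` and `self.compilation_config`; this is an illustration of the review point, not the actual runner code.

```python
# Values taken from tests/compile/test_compile_ranges.py as described above.
max_num_batched_tokens = 8192
compile_ranges_split_points = [8, 32]

# Buggy version from the diff: replaces the limit outright.
buggy_limit = max(compile_ranges_split_points)

# Suggested fix: never lower the existing limit.
max_dummy_run_tokens = max_num_batched_tokens
if compile_points := compile_ranges_split_points:
    max_dummy_run_tokens = max(max_dummy_run_tokens, max(compile_points))

print(buggy_limit)           # 32 -> compile_sizes 64 and 128 would assert
print(max_dummy_run_tokens)  # 8192 -> all valid token counts still pass
```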
Fix an assertion error during model warmup when using MTP speculative decoding with parallel drafting. The issue occurred because the compile range is extended for speculative decoding to accommodate drafter batches, but the assertion in _dummy_run wasn't updated to match.

Root cause: PR #32887 added compile range extension in _set_compile_ranges for speculative decoding. This causes warmup sizes to exceed max_num_batched_tokens, triggering the assertion in _dummy_run.

Fix: Extend the assertion bound in _dummy_run to match the extended compile range when parallel drafting is enabled.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
```python
    else 1
)
max_dummy_run_tokens += multiplier * self.scheduler_config.max_num_seqs

assert num_tokens <= max_dummy_run_tokens
```
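The extended bound in the diff can be sketched end-to-end with standalone code. The values of `max_num_seqs` and `num_speculative_tokens` below are hypothetical, chosen only to illustrate the formula; they are not taken from the PR.

```python
# Hypothetical values for illustration; only the formula mirrors the diff.
max_num_batched_tokens = 8192
max_num_seqs = 256
num_speculative_tokens = 1

max_dummy_run_tokens = max_num_batched_tokens
# With parallel drafting, drafter batches can add up to
# multiplier * max_num_seqs tokens beyond max_num_batched_tokens.
multiplier = num_speculative_tokens if num_speculative_tokens > 0 else 1
max_dummy_run_tokens += multiplier * max_num_seqs

num_tokens = 8300  # a warmup size larger than max_num_batched_tokens
assert num_tokens <= max_dummy_run_tokens  # passes with the extended bound
print(max_dummy_run_tokens)  # 8448
```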
How about adding some assertion messages?
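One possible shape for such a message (the wording and the trigger values below are suggestions for illustration, not from the PR):

```python
# Hypothetical values that trip the assertion, to show the message text.
num_tokens = 4096
max_dummy_run_tokens = 2048

try:
    assert num_tokens <= max_dummy_run_tokens, (
        f"num_tokens ({num_tokens}) exceeds max_dummy_run_tokens "
        f"({max_dummy_run_tokens}); check compile ranges and batch limits"
    )
except AssertionError as e:
    print(e)
```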
luccafong
left a comment
looks good, thanks for the fix.
This pull request has merge conflicts that must be resolved before it can be merged.
Summary
Fix an assertion error during model warmup when using MTP speculative decoding with parallel drafting.
Problem: When running with MTP speculative decoding, the server crashes during warmup:
Root cause: PR #32887 (Unified Parallel Drafting) added compile range extension in _set_compile_ranges for speculative decoding to accommodate drafter batches. This causes warmup sizes to exceed max_num_batched_tokens, triggering the assertion in _dummy_run.

Fix: Extend the assertion bound in _dummy_run to match the extended compile range when parallel drafting is enabled.

Separate Issue: FP8 sparse_attn_indexer Error on Blackwell GPUs
During testing, a separate inference error was observed on Blackwell (B200) GPUs that is unrelated to this fix:
This error occurs in
vllm/model_executor/layers/sparse_attn_indexer.py->fp8_paged_mqa_logitswhen running DeepSeek-V3.2-NVFP4 inference. The stack trace shows the issue is in the FP8 MLA indexer operations, not in the warmup/assertion code. This appears to be a compatibility issue with DeepGEMM/sparse attention on SM100 architecture and should be tracked separately.Test Plan
vllm serve nvidia/DeepSeek-V3.2-NVFP4 --tensor-parallel-size 4 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'