[Bugfix] Fix DP MTP Dummy Run by benchislett · Pull Request #35243 · vllm-project/vllm

benchislett · 2026-02-24T23:06:08Z

Purpose

The DP dummy run issues a uniform_batch decode with num_reqs=1, num_tokens=1, and uniform_batch=True. This will pad the number of tokens to max_query_len==(1+num_speculative_tokens) when MTP is enabled.

But when preparing num_scheduled_tokens (and then building query_start_loc), we do this:

        elif uniform_decode:
            assert not create_mixed_batch
            num_reqs = min(max_num_reqs, cdiv(num_tokens, max_query_len))
            num_scheduled_tokens_list = [max_query_len] * num_reqs
            if num_tokens % max_query_len != 0:
                num_scheduled_tokens_list[-1] = num_tokens % max_query_len

So the last value gets overridden to 1. Then this sanity check assert fails in split_decodes_and_prefills:

    if require_uniform:
        # check if we are in a padded uniform batch; this is used for full-CGs, some
        # requests may have a query length of 0 but since they are padding its fine
        # to treat them as decodes (ensures num_decodes matches the captured size)
        if torch.all((query_lens == query_lens[0]) | (query_lens == 0)):
            assert num_reqs * query_lens[0] == num_tokens, "tokens not padded correctly"
            return num_reqs, 0, num_tokens, 0  # all decodes

I'm not sure if it would be correct to instead avoid truncating num_scheduled_tokens, or try to handle the padding differently in the decode split, but an easy workaround is just to run the dummy batch with (1+num_speculative_tokens) in the first place. That is what this PR does.

Test Plan

GSM8k:

vllm serve nvidia/DeepSeek-R1-FP4-v2 -dp 4 --enable-expert-parallel --speculative_config '{"num_speculative_tokens":1, "method":"deepseek_mtp"}'

lm_eval \
  --model local-completions \
  --model_args base_url=http://0.0.0.0:8000/v1/completions,model=nvidia/DeepSeek-R1-FP4-v2,tokenized_requests=False,tokenizer_backend=None,num_concurrent=999,timeout=120,max_retries=5,max_length=8192 \
  --tasks gsm8k \
  --num_fewshot 20 \
  --batch_size auto --seed 42

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9553|±  |0.0057|
|     |       |strict-match    |    20|exact_match|↑  |0.9530|±  |0.0058|

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

gemini-code-assist

Code Review

This pull request addresses a bug that occurs during a Data Parallelism (DP) dummy run when Multi-Token Prediction (MTP) is enabled. The issue stemmed from an incorrect number of tokens being used for the dummy batch, leading to an assertion failure related to token padding. The fix correctly uses 1 + num_speculative_tokens for the dummy run, aligning with the expected padding and resolving the crash. The change is accurate and improves code clarity by replacing a magic number with a descriptive variable.

LucasWilkinson

Makes sense to me! thanks for the fix!

njhill · 2026-03-06T18:39:02Z

@benchislett could you rebase? I'm not able to update the branch

vadiklyutiy · 2026-03-13T08:43:36Z

@njhill just a friendly reminder about this PR.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

execute_dummy_batch() hardcoded _dummy_run(1, uniform_decode=True), but with MTP speculative decoding the uniform decode query length is 1 + num_speculative_tokens. This caused 'tokens not padded correctly' assertion in split_decodes_and_prefills under multinode DP load. Use uniform_decode_query_len instead of hardcoded 1. Upstream: vllm-project#35243 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

use 1+num_spec_tokens in DP dummy run for mtp

dcc6be4

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

mergify bot added v1 bug Something isn't working labels Feb 24, 2026

gemini-code-assist bot reviewed Feb 24, 2026

View reviewed changes

LucasWilkinson approved these changes Feb 26, 2026

View reviewed changes

LucasWilkinson enabled auto-merge (squash) February 26, 2026 18:52

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 26, 2026

vadiklyutiy moved this to In review in Qwen3.5 Feb 26, 2026

vadiklyutiy added this to Qwen3.5 Feb 26, 2026

vadiklyutiy mentioned this pull request Feb 26, 2026

[BugFix] Mistakenly passing num_reqs_padded as num_reqs in _dummy_run #34121

Closed

5 tasks

robertgshaw2-redhat mentioned this pull request Mar 2, 2026

[Bigfix] Fix padding in FULL_DECODE path when MTP is enabled for DP case #35690

Closed

5 tasks

LucasWilkinson mentioned this pull request Mar 6, 2026

[MRV2] Extensible CG dispatch rework #35959

Merged

njhill disabled auto-merge March 6, 2026 18:38

Merge branch 'main' into bugfix-dummy-run-dp-mtp

3b3abb5

benchislett requested a review from njhill as a code owner March 9, 2026 14:02

benchislett added 2 commits March 13, 2026 11:49

Merge branch 'main' into bugfix-dummy-run-dp-mtp

c9db7d1

work around mrv2

9ea1c7e

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

benchislett merged commit f63ed7b into vllm-project:main Mar 17, 2026
51 of 52 checks passed

github-project-automation bot moved this from In review to Done in Qwen3.5 Mar 17, 2026

benchislett deleted the bugfix-dummy-run-dp-mtp branch March 17, 2026 15:16

Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026

[Bugfix] Fix DP MTP Dummy Run (vllm-project#35243)

f209af2

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

andylolu2 pushed a commit to andylolu2/vllm that referenced this pull request Mar 18, 2026

[Bugfix] Fix DP MTP Dummy Run (vllm-project#35243)

0b88555

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026

[Bugfix] Fix DP MTP Dummy Run (vllm-project#35243)

35e9853

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026

[Bugfix] Fix DP MTP Dummy Run (vllm-project#35243)

7846bb0

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix] Fix DP MTP Dummy Run#35243

[Bugfix] Fix DP MTP Dummy Run#35243
benchislett merged 4 commits intovllm-project:mainfrom
CentML:bugfix-dummy-run-dp-mtp

benchislett commented Feb 24, 2026 •

edited by github-actions bot

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

LucasWilkinson left a comment

Uh oh!

njhill commented Mar 6, 2026

Uh oh!

vadiklyutiy commented Mar 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

benchislett commented Feb 24, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

LucasWilkinson left a comment

Choose a reason for hiding this comment

Uh oh!

njhill commented Mar 6, 2026

Uh oh!

vadiklyutiy commented Mar 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

benchislett commented Feb 24, 2026 •

edited by github-actions bot

Loading