
[Bugfix] Fix DP MTP Dummy Run #35243

Merged
benchislett merged 4 commits into vllm-project:main from CentML:bugfix-dummy-run-dp-mtp
Mar 17, 2026

Conversation


@benchislett benchislett commented Feb 24, 2026

Purpose

FIX #33899

The DP dummy run issues a uniform_batch decode with num_reqs=1, num_tokens=1, and uniform_batch=True. This will pad the number of tokens to max_query_len==(1+num_speculative_tokens) when MTP is enabled.

But when preparing num_scheduled_tokens (and then building query_start_loc), we do this:

        elif uniform_decode:
            assert not create_mixed_batch
            num_reqs = min(max_num_reqs, cdiv(num_tokens, max_query_len))
            num_scheduled_tokens_list = [max_query_len] * num_reqs
            if num_tokens % max_query_len != 0:
                num_scheduled_tokens_list[-1] = num_tokens % max_query_len

So the last value gets overridden to 1. Then this sanity check assert fails in split_decodes_and_prefills:

    if require_uniform:
        # check if we are in a padded uniform batch; this is used for full-CGs, some
        # requests may have a query length of 0 but since they are padding its fine
        # to treat them as decodes (ensures num_decodes matches the captured size)
        if torch.all((query_lens == query_lens[0]) | (query_lens == 0)):
            assert num_reqs * query_lens[0] == num_tokens, "tokens not padded correctly"
            return num_reqs, 0, num_tokens, 0  # all decodes
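
To make the failure concrete, here is a small standalone sketch of the arithmetic (plain Python re-implementing the `uniform_decode` branch above, not the vLLM source; `max_num_reqs=8` is an arbitrary illustrative value), assuming `num_speculative_tokens=1`:

```python
from math import ceil

def dummy_query_lens(num_tokens: int, max_query_len: int, max_num_reqs: int = 8):
    # Standalone re-implementation of the uniform_decode branch quoted above.
    num_reqs = min(max_num_reqs, ceil(num_tokens / max_query_len))
    lens = [max_query_len] * num_reqs
    if num_tokens % max_query_len != 0:
        lens[-1] = num_tokens % max_query_len
    return num_reqs, lens

max_query_len = 1 + 1  # 1 + num_speculative_tokens with MTP enabled

# The DP dummy run asks for num_tokens=1, but the batch is padded to
# max_query_len tokens, so the uniform-batch sanity check sees a mismatch.
num_reqs, query_lens = dummy_query_lens(num_tokens=1, max_query_len=max_query_len)
ok = num_reqs * query_lens[0] == max_query_len
print(query_lens, ok)  # [1] False -> "tokens not padded correctly"
```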

I'm not sure whether it would be correct to instead avoid truncating num_scheduled_tokens, or to handle the padding differently in the decode split, but an easy workaround is to just run the dummy batch with (1+num_speculative_tokens) tokens in the first place. That is what this PR does.
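
With the same standalone sketch of the `uniform_decode` branch (again, an illustrative re-implementation rather than the vLLM source), seeding the dummy run with `1 + num_speculative_tokens` tokens makes `num_tokens` an exact multiple of `max_query_len`, so nothing is truncated and the uniform-batch check holds:

```python
from math import ceil

def dummy_query_lens(num_tokens: int, max_query_len: int, max_num_reqs: int = 8):
    # Standalone re-implementation (for illustration) of the
    # uniform_decode branch quoted earlier.
    num_reqs = min(max_num_reqs, ceil(num_tokens / max_query_len))
    lens = [max_query_len] * num_reqs
    if num_tokens % max_query_len != 0:
        lens[-1] = num_tokens % max_query_len
    return num_reqs, lens

num_speculative_tokens = 1
max_query_len = 1 + num_speculative_tokens  # == 2 with one MTP draft token

# After the fix: the dummy run requests 1 + num_speculative_tokens tokens,
# so num_tokens % max_query_len == 0 and the last entry is not truncated.
num_reqs, lens = dummy_query_lens(max_query_len, max_query_len)
passes = num_reqs * lens[0] == max_query_len
print(lens, passes)  # [2] True
```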

Test Plan

GSM8k:

vllm serve nvidia/DeepSeek-R1-FP4-v2 -dp 4 --enable-expert-parallel --speculative_config '{"num_speculative_tokens":1, "method":"deepseek_mtp"}'
lm_eval \
  --model local-completions \
  --model_args base_url=http://0.0.0.0:8000/v1/completions,model=nvidia/DeepSeek-R1-FP4-v2,tokenized_requests=False,tokenizer_backend=None,num_concurrent=999,timeout=120,max_retries=5,max_length=8192 \
  --tasks gsm8k \
  --num_fewshot 20 \
  --batch_size auto --seed 42

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9553|±  |0.0057|
|     |       |strict-match    |    20|exact_match|↑  |0.9530|±  |0.0058|

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot added the `v1` and `bug` (Something isn't working) labels Feb 24, 2026
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug that occurs during a Data Parallelism (DP) dummy run when Multi-Token Prediction (MTP) is enabled. The issue stemmed from an incorrect number of tokens being used for the dummy batch, leading to an assertion failure related to token padding. The fix correctly uses 1 + num_speculative_tokens for the dummy run, aligning with the expected padding and resolving the crash. The change is accurate and improves code clarity by replacing a magic number with a descriptive variable.

@LucasWilkinson LucasWilkinson left a comment

Makes sense to me! thanks for the fix!

@njhill njhill (Member) commented Mar 6, 2026

@benchislett could you rebase? I'm not able to update the branch

@benchislett benchislett requested a review from njhill as a code owner March 9, 2026 14:02
@vadiklyutiy vadiklyutiy (Collaborator) commented:

@njhill just a friendly reminder about this PR.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
TomerBN-Nvidia added a commit to TomerBN-Nvidia/vllm that referenced this pull request Mar 17, 2026
execute_dummy_batch() hardcoded _dummy_run(1, uniform_decode=True),
but with MTP speculative decoding the uniform decode query length is
1 + num_speculative_tokens. This caused 'tokens not padded correctly'
assertion in split_decodes_and_prefills under multinode DP load.

Use uniform_decode_query_len instead of hardcoded 1.

Upstream: vllm-project#35243

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@benchislett benchislett merged commit f63ed7b into vllm-project:main Mar 17, 2026
51 of 52 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in Qwen3.5 Mar 17, 2026
@benchislett benchislett deleted the bugfix-dummy-run-dp-mtp branch March 17, 2026 15:16
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
andylolu2 pushed a commit to andylolu2/vllm that referenced this pull request Mar 18, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

Labels

`bug` (Something isn't working), `ready` (ONLY add when PR is ready to merge/full CI is needed), `v1`

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: DeepSeek-R1-0528 AssertionError: tokens not padded correctly on GB200

4 participants