[Bugfix] Fix NemotronH MTP + Chunked Prefill#35447
[Bugfix] Fix NemotronH MTP + Chunked Prefill#35447tdoublep merged 16 commits intovllm-project:mainfrom
Conversation
Code Review
This pull request addresses a bug in NemotronH models when using Multi-Token Prediction (MTP) with chunked prefill. The core fix in vllm/v1/attention/backends/mamba_attn.py correctly handles cases where small prefill chunks are misclassified as decodes, preventing an assertion failure. While this fix is sound, the pull request includes some changes that require attention. There's a block of dead code in vllm/v1/worker/gpu_model_runner.py that seems to be a work-in-progress and should be cleaned up before merging. Additionally, an unrelated change in vllm/model_executor/layers/layernorm.py has been identified, which should be reverted to avoid potential side effects.
Hi @benchislett, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Force-pushed from 206986e to 7b5f9a7
Not sure why pre-commit is breaking. Having a hard time figuring out what the discrepancy is.
vllm/v1/worker/gpu_model_runner.py
Outdated
# with query_len <= reorder_batch_threshold as "decodes". Prefill
# chunks that fall under this threshold get processed via the decode
# path, which stores intermediate states at sequential slots. We must
Do we really want to have these prefill chunks processed by the decode path?
This is how it's done in all other attention backends. It's only GDN attention that does it differently, by manually splitting out the spec-decodes and non-spec-decodes. In my opinion, that strategy is inefficient and more difficult to maintain.
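For context, a minimal sketch of the threshold-based classification described above, in which a small tail chunk of a chunked prefill lands in the decode bucket. The helper name and list-based signature are illustrative assumptions, not vLLM's actual implementation:

```python
def split_decodes_and_prefills(query_lens, reorder_batch_threshold=1):
    """Classify each request as 'decode' or 'prefill' by query length.

    Any request whose query length is <= the threshold is treated as a
    decode, so a short prefill chunk can be routed down the decode path.
    (Hypothetical helper for illustration; real code works on tensors.)
    """
    decodes = [i for i, q in enumerate(query_lens) if q <= reorder_batch_threshold]
    prefills = [i for i, q in enumerate(query_lens) if q > reorder_batch_threshold]
    return decodes, prefills

# With MTP the threshold becomes 1 + num_spec_tokens, so a 2-token
# prefill chunk is classified as a decode when the threshold is 2:
print(split_decodes_and_prefills([2, 17, 1], reorder_batch_threshold=2))
# -> ([0, 2], [1])
```

This is the case the fix in mamba_attn.py has to handle: request 0 above is really a prefill chunk, but it is processed via the decode path.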
vllm/v1/worker/gpu_model_runner.py
Outdated
# If this model has mamba2 layers, we handle num_accepted_tokens_cpu differently
self.is_mamba2_hybrid: bool = False
I guess we'd prefer not to have mamba2 specific logic in GPU model runner if possible
Makes sense. I can try to refactor the behaviour out of gpu model runner and into a helper somewhere
Moved the core logic into mamba_utils and renamed this toggle. But I'm still not sure of the cleanest way to set this flag so that we can dispatch based on mamba2 vs GDN attention.
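One possible shape for that dispatch, kept out of the model runner: a small classifier the runner calls once at init. The helper and the string-based layer list are hypothetical assumptions for illustration; the PR's actual refactor lives in mamba_utils and would derive this from layer/backend classes:

```python
def linear_attn_kind(layer_types):
    """Pick the handling strategy for num_accepted_tokens_cpu based on
    which linear-attention layers the model contains.

    Hypothetical helper: real vLLM would inspect layer/backend classes,
    not a list of strings.
    """
    if "mamba2" in layer_types:
        return "mamba2"  # needs num_accepted_tokens-aware state handling
    if "gdn" in layer_types:
        return "gdn"     # GDN splits spec/non-spec decodes itself
    return "none"        # pure-attention model, no special handling

# e.g. a NemotronH-style hybrid stack of attention + mamba2 layers
print(linear_attn_kind(["attention", "mamba2", "mamba2"]))
# -> mamba2
```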
This pull request has merge conflicts that must be resolved before it can be merged.
tdoublep
left a comment
Generally LGTM but would like @asafgardin's eyes on it from the Mamba1 perspective
Seems like there is a different (?) corner case being explored here: #32716
Thanks @tdoublep
No idea why the hybrid test is hanging in CI. It passes locally.
Looking into it.
- if self.use_spec_decode:
+ if self.use_spec_decode and num_accepted_tokens is not None:
      assert query_start_loc_d is not None
      assert num_accepted_tokens is not None
is this assertion still necessary?
Tagging @vadiklyutiy to help with the CI issue, I cannot reproduce it.
I can reproduce the hang locally on an L4 GPU (but not on H100).
It looks like for the test that's hanging we only have 38 blocks available on L4, and the request requires 100+ blocks (we need one block per speculative token in MTP for hybrid models), so it just sits there waiting to be scheduled. Can we change the test to require fewer blocks?
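A back-of-the-envelope check of that claim. The prompt length, block size, and draft-token count below are assumed for illustration, not taken from the failing test; the point is only that the per-spec-token block requirement pushes a long request well past the 38 blocks available on an L4:

```python
import math

def blocks_needed(prompt_len, block_size, num_spec_tokens):
    """Blocks for the prompt itself, plus one block per speculative
    token (the MTP-for-hybrid-models requirement described above).
    Illustrative arithmetic, not vLLM's actual block accounting.
    """
    return math.ceil(prompt_len / block_size) + num_spec_tokens

# e.g. a 1500-token prompt with 16-token blocks and 3 MTP draft tokens
need = blocks_needed(1500, 16, 3)
print(need)            # -> 97
print(need > 38)       # -> True: the request can never be scheduled
```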
Added marks to limit the test to >= H100, and added coverage of NemotronH + MTP.
Signed-off-by: wendyliu235 <wenjun.liu@intel.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
Purpose
Functional
Test Plan
This branch adds a reproducer which causes garbage outputs with NemotronH MTP + Chunked Prefill.
It does not seem to happen with Qwen3-Next, due to differences in how the backends separate spec decodes from non-spec decodes: GDN checks num_draft_tokens_cpu and dynamically splits the decodes into spec and non-spec.
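The GDN-style split mentioned above can be sketched as follows. The function name and list-based inputs are illustrative assumptions; the real backend operates on tensors in its metadata builder:

```python
def split_spec_decodes(decode_indices, num_draft_tokens_cpu):
    """Split decode requests into spec-decodes (which carry draft tokens
    to verify) and plain decodes, as GDN attention does dynamically.

    Because the split keys on draft-token count rather than query
    length, a short prefill chunk (zero draft tokens) never ends up on
    the spec-decode path. (Hypothetical helper for illustration.)
    """
    spec = [i for i in decode_indices if num_draft_tokens_cpu[i] > 0]
    non_spec = [i for i in decode_indices if num_draft_tokens_cpu[i] == 0]
    return spec, non_spec

# requests 0 and 2 carry 2 draft tokens each; request 1 has none
print(split_spec_decodes([0, 1, 2], [2, 0, 2]))
# -> ([0, 2], [1])
```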
Test Result
The diff in this branch fixes the reproducer, giving results consistent with the baseline. Further evaluation is required.