[Bugfix] Fix NemotronH MTP + Chunked Prefill#35447
[Bugfix] Fix NemotronH MTP + Chunked Prefill#35447tdoublep merged 16 commits intovllm-project:mainfrom
Conversation
Code Review
This pull request addresses a bug in NemotronH models when using Multi-Token Prediction (MTP) with chunked prefill. The core fix in vllm/v1/attention/backends/mamba_attn.py correctly handles cases where small prefill chunks are misclassified as decodes, preventing an assertion failure. While this fix is sound, the pull request includes some changes that require attention. There's a block of dead code in vllm/v1/worker/gpu_model_runner.py that seems to be a work-in-progress and should be cleaned up before merging. Additionally, an unrelated change in vllm/model_executor/layers/layernorm.py has been identified, which should be reverted to avoid potential side effects.
Hi @benchislett, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Force-pushed from 206986e to 7b5f9a7
Not sure why pre-commit is breaking. Having a hard time figuring out what the discrepancy is.
vllm/v1/worker/gpu_model_runner.py
Outdated
# with query_len <= reorder_batch_threshold as "decodes". Prefill
# chunks that fall under this threshold get processed via the decode
# path, which stores intermediate states at sequential slots. We must
Do we really want to have these prefill chunks processed by the decode path?
This is how it's done in all other attention backends. It's only GDN attention that does it differently, by manually splitting out the spec-decodes and non-spec-decodes. In my opinion, that strategy is inefficient and more difficult to maintain.
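For context, a minimal sketch of the threshold-based classification described above, in which a small tail chunk of a chunked prefill lands in the decode bucket. The helper name and list-based signature are illustrative assumptions, not vLLM's actual implementation:

```python
def split_decodes_and_prefills(query_lens, reorder_batch_threshold=1):
    """Classify each request as 'decode' or 'prefill' by query length.

    Any request whose query length is <= the threshold is treated as a
    decode, so a short prefill chunk can be routed down the decode path.
    (Hypothetical helper for illustration; real code works on tensors.)
    """
    decodes = [i for i, q in enumerate(query_lens) if q <= reorder_batch_threshold]
    prefills = [i for i, q in enumerate(query_lens) if q > reorder_batch_threshold]
    return decodes, prefills

# With MTP the threshold becomes 1 + num_spec_tokens, so a 2-token
# prefill chunk is classified as a decode when the threshold is 2:
print(split_decodes_and_prefills([2, 17, 1], reorder_batch_threshold=2))
# -> ([0, 2], [1])
```

This is the case the fix in mamba_attn.py has to handle: request 0 above is really a prefill chunk, but it is processed via the decode path.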
vllm/v1/worker/gpu_model_runner.py
Outdated
# If this model has mamba2 layers, we handle num_accepted_tokens_cpu differently
self.is_mamba2_hybrid: bool = False
I guess we'd prefer not to have mamba2 specific logic in GPU model runner if possible
Makes sense. I can try to refactor the behaviour out of gpu model runner and into a helper somewhere
Moved the core logic into mamba_utils and renamed this toggle. But I'm still not sure of the cleanest way to set this flag so that we can dispatch based on mamba2 vs GDN attention.
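One possible shape for that dispatch, kept out of the model runner: a small classifier the runner calls once at init. The helper and the string-based layer list are hypothetical assumptions for illustration; the PR's actual refactor lives in mamba_utils and would derive this from layer/backend classes:

```python
def linear_attn_kind(layer_types):
    """Pick the handling strategy for num_accepted_tokens_cpu based on
    which linear-attention layers the model contains.

    Hypothetical helper: real vLLM would inspect layer/backend classes,
    not a list of strings.
    """
    if "mamba2" in layer_types:
        return "mamba2"  # needs num_accepted_tokens-aware state handling
    if "gdn" in layer_types:
        return "gdn"     # GDN splits spec/non-spec decodes itself
    return "none"        # pure-attention model, no special handling

# e.g. a NemotronH-style hybrid stack of attention + mamba2 layers
print(linear_attn_kind(["attention", "mamba2", "mamba2"]))
# -> mamba2
```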
This pull request has merge conflicts that must be resolved before it can be merged.
tdoublep
left a comment
Generally LGTM but would like @asafgardin's eyes on it from the Mamba1 perspective
Seems like there is a different (?) corner case being explored here: #32716
Thanks @tdoublep
No idea why the hybrid test is hanging in CI. It passes locally.
Looking into it.
- if self.use_spec_decode:
+ if self.use_spec_decode and num_accepted_tokens is not None:
      assert query_start_loc_d is not None
      assert num_accepted_tokens is not None
is this assertion still necessary?
Tagging @vadiklyutiy to help with the CI issue, I cannot reproduce it.
I can reproduce the hang locally on an L4 GPU (but not on H100).
It looks like for the test that's hanging we only have 38 blocks available on L4, and the request requires 100+ blocks (we need one block per speculative token in MTP for hybrid models), so it just sits there waiting to be scheduled. Can we change the test to require fewer blocks?
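A back-of-the-envelope check of that claim. The prompt length, block size, and draft-token count below are assumed for illustration, not taken from the failing test; the point is only that the per-spec-token block requirement pushes a long request well past the 38 blocks available on an L4:

```python
import math

def blocks_needed(prompt_len, block_size, num_spec_tokens):
    """Blocks for the prompt itself, plus one block per speculative
    token (the MTP-for-hybrid-models requirement described above).
    Illustrative arithmetic, not vLLM's actual block accounting.
    """
    return math.ceil(prompt_len / block_size) + num_spec_tokens

# e.g. a 1500-token prompt with 16-token blocks and 3 MTP draft tokens
need = blocks_needed(1500, 16, 3)
print(need)            # -> 97
print(need > 38)       # -> True: the request can never be scheduled
```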
Added marks to limit the test to >= H100, and added coverage of NemotronH + MTP.
Signed-off-by: wendyliu235 <wenjun.liu@intel.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
Purpose
Functional
Test Plan
This branch adds a reproducer which causes garbage outputs with NemotronH MTP + Chunked Prefill.
It does not seem to happen with Qwen3-Next, due to differences in how the backends separate spec decodes from non-spec decodes: GDN checks num_draft_tokens_cpu and dynamically splits the decodes into spec and non-spec.
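The GDN-style split mentioned above can be sketched as follows. The function name and list-based inputs are illustrative assumptions; the real backend operates on tensors in its metadata builder:

```python
def split_spec_decodes(decode_indices, num_draft_tokens_cpu):
    """Split decode requests into spec-decodes (which carry draft tokens
    to verify) and plain decodes, as GDN attention does dynamically.

    Because the split keys on draft-token count rather than query
    length, a short prefill chunk (zero draft tokens) never ends up on
    the spec-decode path. (Hypothetical helper for illustration.)
    """
    spec = [i for i in decode_indices if num_draft_tokens_cpu[i] > 0]
    non_spec = [i for i in decode_indices if num_draft_tokens_cpu[i] == 0]
    return spec, non_spec

# requests 0 and 2 carry 2 draft tokens each; request 1 has none
print(split_spec_decodes([0, 1, 2], [2, 0, 2]))
# -> ([0, 2], [1])
```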
Test Result
The diff in this branch fixes the reproducer, giving results consistent with the baseline. Further evaluation is required.