Fix Mamba state corruption from referencing stale block table entries (#37728)

Merged
zhuohan123 merged 1 commit into vllm-project:main from minosfuture:export-D97354760
Mar 24, 2026
Conversation

@minosfuture
Contributor

@minosfuture minosfuture commented Mar 21, 2026

Summary:
We saw zero-token-id responses from a linear attention model. This diff fixes it.

The root cause is use of a stale Mamba block, triggered by DP dummy_run. It happens when one rank finishes a batch while other ranks are still running: dummy_run(1) generates a seq_len of [1, 0, 0, 0, ...] to match the other ranks' num_decode. The seq_len value of 0 indirectly maps to a stale Mamba block in gdn_attn.py. This padding is only needed for DP full CUDA graph, which explains why TP and piecewise CUDA graph work fine.

Padding up to the CUDA-graph captured size itself works fine, because num_reqs is the actual number of requests, and gpu_model_runner.py correctly masks the block IDs of the remaining padded requests.

Also, because num_decodes = num_reqs (from attention/utils.py), the masking of block IDs beyond num_decodes in gdn_attn.py is a no-op; the stale block ID lies within the first num_decodes requests anyway.

This diff cleans up the block table when a request finishes.

Test Plan: stress test with DP2 + prefix caching; verify no NaN/zero-token-id responses.

Differential Revision: D97354760

Pulled By: minosfuture

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a critical issue of state corruption in Mamba models when using CUDA graphs with data parallelism. The root cause was identified as stale entries in the block table for finished requests. The fix introduces a clear_row method in the BlockTable and MultiGroupBlockTable classes to explicitly zero out block table entries on both CPU and GPU when a request is removed. This change is correctly integrated into the InputBatch.remove_request method, ensuring that slots for finished requests are properly cleaned up, thus preventing the use of stale data in subsequent operations, especially within a CUDA graph context. The implementation is sound and directly resolves the described problem.

```python
num_blocks = self.num_blocks_per_row[row_idx]
if num_blocks > 0:
    self.block_table.np[row_idx, :num_blocks] = 0
    self.block_table.gpu[row_idx, :num_blocks] = 0
```
Collaborator

Do we need to clear the gpu tensor here? Will commit_block_table sync the block_table.np.clear() to gpu?

Contributor Author

Commit is not called in this dummy-run path. Also, I think a direct write per request should be more efficient than committing the whole table, though this could be optimized further by fusing the writes.

Collaborator

I think committing the full block table in dummy_run is acceptable overhead, because speed is bounded by the DP rank with a real task, which always commits the block table.

Contributor Author

Makes sense. Updated.

@meta-codesync meta-codesync bot changed the title Fix Mamba state corruption from stale CUDA graph block table entries Fix Mamba state corruption from stale CUDA graph block table entries (#37728) Mar 21, 2026
minosfuture added a commit to minosfuture/vllm that referenced this pull request Mar 21, 2026
@meta-codesync meta-codesync bot changed the title Fix Mamba state corruption from stale CUDA graph block table entries (#37728) Fix Mamba state corruption from referencing stale block table entries (#37728) Mar 21, 2026
minosfuture added a commit to minosfuture/vllm that referenced this pull request Mar 21, 2026
@mergify

mergify bot commented Mar 21, 2026

Hi @minosfuture, the pre-commit checks have failed. Please run:

```shell
uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
```shell
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
```

Collaborator

@heheda12345 heheda12345 left a comment

LGTM! Can you fix DCO and pre-commit?

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 22, 2026
minosfuture added a commit to minosfuture/vllm that referenced this pull request Mar 23, 2026
@meta-codesync meta-codesync bot changed the title Fix Mamba state corruption from referencing stale block table entries (#37728) Fix Mamba state corruption from referencing stale block table entries (#37728) (#37728) Mar 23, 2026
@minosfuture
Contributor Author

> LGTM! Can you fix DCO and pre-commit?

There's a conflict between adding the DCO sign-off and keeping this PR in sync with the Meta internal repo:

  1. Adding the sign-off would put them out of sync and cause a "Meta Internal-Only Changes Check" failure.
  2. Resyncing them would remove the sign-off.

Can we force-merge this PR after CI passes?

cc @houseroad

@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 24, 2026
@zhuohan123 zhuohan123 merged commit c07e2ca into vllm-project:main Mar 24, 2026
51 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 24, 2026
@minosfuture minosfuture deleted the export-D97354760 branch March 24, 2026 21:38
malaiwah pushed a commit to malaiwah/vllm that referenced this pull request Mar 27, 2026
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026

Labels

fb-exported meta-exported nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done

3 participants