Fix Mamba state corruption from referencing stale block table entries (#37728)
Conversation
Code Review
This pull request addresses a critical issue of state corruption in Mamba models when using CUDA graphs with data parallelism. The root cause was identified as stale entries in the block table for finished requests. The fix introduces a clear_row method in the BlockTable and MultiGroupBlockTable classes to explicitly zero out block table entries on both CPU and GPU when a request is removed. This change is correctly integrated into the InputBatch.remove_request method, ensuring that slots for finished requests are properly cleaned up, thus preventing the use of stale data in subsequent operations, especially within a CUDA graph context. The implementation is sound and directly resolves the described problem.
vllm/v1/worker/block_table.py
Outdated
```python
num_blocks = self.num_blocks_per_row[row_idx]
if num_blocks > 0:
    self.block_table.np[row_idx, :num_blocks] = 0
    self.block_table.gpu[row_idx, :num_blocks] = 0
```
Do we need to clear the gpu tensor here? Will commit_block_table sync the block_table.np.clear() to gpu?
Commit is not called in this dummy-run path. Also, I think a direct write per request should be more efficient than committing the whole table, though this could be optimized by fusing the writes.
I think committing the full block table in dummy_run is acceptable overhead, because speed is bounded by the DP rank with a real task, which always commits the block table.
Makes sense. Updated.
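Based on the diff above, a minimal self-contained sketch of the `clear_row` logic. This is an illustration, not the actual vLLM implementation: the class is simplified, and numpy arrays stand in for both the CPU view and the GPU tensor.

```python
import numpy as np

class BlockTable:
    """Simplified stand-in for the BlockTable in vllm/v1/worker/block_table.py."""

    def __init__(self, max_num_reqs: int, max_num_blocks_per_req: int):
        # In vLLM the table has a CPU (numpy) view and a GPU tensor;
        # here both are numpy arrays for illustration.
        self.np = np.zeros((max_num_reqs, max_num_blocks_per_req), dtype=np.int32)
        self.gpu = np.zeros((max_num_reqs, max_num_blocks_per_req), dtype=np.int32)
        self.num_blocks_per_row = np.zeros(max_num_reqs, dtype=np.int32)

    def add_row(self, row_idx: int, block_ids: list[int]) -> None:
        n = len(block_ids)
        self.np[row_idx, :n] = block_ids
        self.gpu[row_idx, :n] = block_ids
        self.num_blocks_per_row[row_idx] = n

    def clear_row(self, row_idx: int) -> None:
        # Zero both views: dummy_run skips the CPU->GPU commit, so clearing
        # only the CPU side would leave a stale block id in the GPU tensor.
        num_blocks = self.num_blocks_per_row[row_idx]
        if num_blocks > 0:
            self.np[row_idx, :num_blocks] = 0
            self.gpu[row_idx, :num_blocks] = 0
        self.num_blocks_per_row[row_idx] = 0

table = BlockTable(max_num_reqs=4, max_num_blocks_per_req=8)
table.add_row(0, [7, 8, 9])
table.clear_row(0)
print(int(table.gpu[0, 0]))  # 0: no stale block id remains on the "GPU" view
```

Clearing only the used prefix (`:num_blocks`) rather than the whole row keeps the per-request write small, matching the efficiency point discussed above.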
Force-pushed from 975edcd to 66a679b.
Force-pushed from 66a679b to ad8c3c0.
Hi @minosfuture, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
heheda12345 left a comment:
LGTM! Can you fix DCO and pre-commit?
…vllm-project#37728) Signed-off-by: Ming Yang <minos.future@gmail.com>
Force-pushed from ad8c3c0 to d65dc26.
…vllm-project#37728) Pulled By: minosfuture
Force-pushed from d65dc26 to 4d89ec8.
There's a conflict between having DCO sign-off and keeping this in sync with the Meta internal repo. Can we force-merge this PR after CI passes? cc @houseroad
…vllm-project#37728) Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
…vllm-project#37728) Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…vllm-project#37728) Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
…vllm-project#37728) Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Summary:
We saw zero-token-id responses for a linear attention model. This diff fixes that.
The root cause is the use of a stale Mamba block, triggered by the DP dummy_run. It happens when one rank finishes a batch while other ranks are still running: dummy_run(1) generates a seq_len of [1, 0, 0, 0, ...] to match the other ranks' num_decodes. The seq_len-0 entries indirectly map to a stale Mamba block in gdn_attn.py. This padding is only needed for DP full CUDA graph, which explains why TP or piecewise CUDA graph works.
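To make the failure mode concrete, here is a toy illustration (hypothetical numbers, not vLLM code) of why a seq_len-0 padding slot can pick up a stale block id when the finished request's block-table row is not cleared:

```python
# Toy illustration: a 4-slot block table on one DP rank.
# The request in slot 1 finished, but its row was never cleared.
block_table = [[3], [9], [0], [0]]   # slot 1 still holds stale block id 9
seq_lens    = [1, 0, 0, 0]           # dummy_run(1) padding to match other ranks

# A full CUDA graph replays over all captured slots, so even the
# seq_len == 0 slots feed their block ids into the state update.
touched = [block_table[i][0] for i, s in enumerate(seq_lens)]
print(touched)  # [3, 9, 0, 0] -> stale block 9 is read and can corrupt state

# With the fix, the finished request's row is zeroed on removal:
block_table[1] = [0]
touched = [block_table[i][0] for i, s in enumerate(seq_lens)]
print(touched)  # [3, 0, 0, 0]
```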
Padding up to the CUDA graph captured size itself works fine, because num_reqs is the actual number of requests and the block ids of the remaining padded requests are correctly masked in gpu_model_runner.py.
Also, because num_decodes = num_reqs (from attention/utils.py), the masking of block ids beyond num_decodes in gdn_attn.py is a no-op; the stale block id lies within the first num_decodes requests anyway.
This diff cleans up the block table when a request finishes.
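A sketch of how the cleanup might be wired into request removal. The method names (`InputBatch.remove_request`, `MultiGroupBlockTable.clear_row`) come from the review summary above, but the classes here are simplified stand-ins, not the real vLLM implementations:

```python
class BlockTable:
    def __init__(self, max_num_reqs: int, max_blocks: int):
        self.table = [[0] * max_blocks for _ in range(max_num_reqs)]
        self.num_blocks_per_row = [0] * max_num_reqs

    def clear_row(self, row_idx: int) -> None:
        # Zero only the used prefix of the row.
        for j in range(self.num_blocks_per_row[row_idx]):
            self.table[row_idx][j] = 0
        self.num_blocks_per_row[row_idx] = 0

class MultiGroupBlockTable:
    """One BlockTable per KV-cache group (e.g. full attention + Mamba)."""
    def __init__(self, tables: list[BlockTable]):
        self.block_tables = tables

    def clear_row(self, row_idx: int) -> None:
        for bt in self.block_tables:
            bt.clear_row(row_idx)

class InputBatch:
    def __init__(self, block_table: MultiGroupBlockTable):
        self.block_table = block_table
        self.req_id_to_index: dict[str, int] = {}

    def remove_request(self, req_id: str):
        req_index = self.req_id_to_index.pop(req_id, None)
        if req_index is None:
            return None
        # The fix: zero the finished request's block-table row so a later
        # dummy_run cannot reference its stale (Mamba) block ids.
        self.block_table.clear_row(req_index)
        return req_index
```

Clearing every group's table matters here because the Mamba state blocks live in their own KV-cache group alongside the attention blocks.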
Test Plan: stress test with DP2 + prefix caching; verify no NaN / zero-token-id responses.
Differential Revision: D97354760
Pulled By: minosfuture