Fix Mamba state corruption from referencing stale block table entries (#37728)
Conversation
Code Review
This pull request addresses a critical issue of state corruption in Mamba models when using CUDA graphs with data parallelism. The root cause was identified as stale entries in the block table for finished requests. The fix introduces a clear_row method in the BlockTable and MultiGroupBlockTable classes to explicitly zero out block table entries on both CPU and GPU when a request is removed. This change is correctly integrated into the InputBatch.remove_request method, ensuring that slots for finished requests are properly cleaned up, thus preventing the use of stale data in subsequent operations, especially within a CUDA graph context. The implementation is sound and directly resolves the described problem.
vllm/v1/worker/block_table.py
Outdated
```python
num_blocks = self.num_blocks_per_row[row_idx]
if num_blocks > 0:
    self.block_table.np[row_idx, :num_blocks] = 0
    self.block_table.gpu[row_idx, :num_blocks] = 0
```
Do we need to clear the gpu tensor here? Will commit_block_table sync the block_table.np.clear() to gpu?
Commit is not called in this dummy-run path. Also, I think a direct write per request should be more efficient than committing the whole table, though this could be optimized by fusing the writes.
I think committing the full block table in dummy_run is acceptable overhead, because speed is bounded by the DP rank with a real task, which always commits the block table.
Makes sense. Updated.
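Based on the diff above, a minimal self-contained sketch of the `clear_row` logic. This is an illustration, not the actual vLLM implementation: the class is simplified, and numpy arrays stand in for both the CPU view and the GPU tensor.

```python
import numpy as np

class BlockTable:
    """Simplified stand-in for the BlockTable in vllm/v1/worker/block_table.py."""

    def __init__(self, max_num_reqs: int, max_num_blocks_per_req: int):
        # In vLLM the table has a CPU (numpy) view and a GPU tensor;
        # here both are numpy arrays for illustration.
        self.np = np.zeros((max_num_reqs, max_num_blocks_per_req), dtype=np.int32)
        self.gpu = np.zeros((max_num_reqs, max_num_blocks_per_req), dtype=np.int32)
        self.num_blocks_per_row = np.zeros(max_num_reqs, dtype=np.int32)

    def add_row(self, row_idx: int, block_ids: list[int]) -> None:
        n = len(block_ids)
        self.np[row_idx, :n] = block_ids
        self.gpu[row_idx, :n] = block_ids
        self.num_blocks_per_row[row_idx] = n

    def clear_row(self, row_idx: int) -> None:
        # Zero both views: dummy_run skips the CPU->GPU commit, so clearing
        # only the CPU side would leave a stale block id in the GPU tensor.
        num_blocks = self.num_blocks_per_row[row_idx]
        if num_blocks > 0:
            self.np[row_idx, :num_blocks] = 0
            self.gpu[row_idx, :num_blocks] = 0
        self.num_blocks_per_row[row_idx] = 0

table = BlockTable(max_num_reqs=4, max_num_blocks_per_req=8)
table.add_row(0, [7, 8, 9])
table.clear_row(0)
print(int(table.gpu[0, 0]))  # 0: no stale block id remains on the "GPU" view
```

Clearing only the used prefix (`:num_blocks`) rather than the whole row keeps the per-request write small, matching the efficiency point discussed above.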
Force-pushed from 975edcd to 66a679b.
Force-pushed from 66a679b to ad8c3c0.
Hi @minosfuture, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
heheda12345 left a comment:
LGTM! Can you fix DCO and pre-commit?
…vllm-project#37728) Signed-off-by: Ming Yang <minos.future@gmail.com>
Force-pushed from ad8c3c0 to d65dc26.
…vllm-project#37728) Pulled By: minosfuture
Force-pushed from d65dc26 to 4d89ec8.
There's a conflict between having DCO sign-off and keeping this in sync with the Meta internal repo. Can we force-merge this PR after CI passes? cc @houseroad
…vllm-project#37728) Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
…vllm-project#37728) Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…vllm-project#37728) Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
…vllm-project#37728) Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Summary:
We saw zero-token-id responses for a linear attention model. This diff fixes that.
The root cause is the use of a stale Mamba block, triggered by the DP dummy_run. It happens when one rank finishes a batch while other ranks are still running: dummy_run(1) generates a seq_len of [1, 0, 0, 0, ...] to match the other ranks' num_decodes. The seq_len-0 entries indirectly map to a stale Mamba block in gdn_attn.py. This padding is only needed for DP full CUDA graph, which explains why TP or piecewise CUDA graph works.
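To make the failure mode concrete, here is a toy illustration (hypothetical numbers, not vLLM code) of why a seq_len-0 padding slot can pick up a stale block id when the finished request's block-table row is not cleared:

```python
# Toy illustration: a 4-slot block table on one DP rank.
# The request in slot 1 finished, but its row was never cleared.
block_table = [[3], [9], [0], [0]]   # slot 1 still holds stale block id 9
seq_lens    = [1, 0, 0, 0]           # dummy_run(1) padding to match other ranks

# A full CUDA graph replays over all captured slots, so even the
# seq_len == 0 slots feed their block ids into the state update.
touched = [block_table[i][0] for i, s in enumerate(seq_lens)]
print(touched)  # [3, 9, 0, 0] -> stale block 9 is read and can corrupt state

# With the fix, the finished request's row is zeroed on removal:
block_table[1] = [0]
touched = [block_table[i][0] for i, s in enumerate(seq_lens)]
print(touched)  # [3, 0, 0, 0]
```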
Padding up to the CUDA graph captured size itself works fine, because num_reqs is the actual number of requests and the block ids of the remaining padded requests are correctly masked in gpu_model_runner.py.
Also, because num_decodes = num_reqs (from attention/utils.py), the masking of block ids beyond num_decodes in gdn_attn.py is a no-op; the stale block id lies within the first num_decodes requests anyway.
This diff cleans up the block table when a request finishes.
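A sketch of how the cleanup might be wired into request removal. The method names (`InputBatch.remove_request`, `MultiGroupBlockTable.clear_row`) come from the review summary above, but the classes here are simplified stand-ins, not the real vLLM implementations:

```python
class BlockTable:
    def __init__(self, max_num_reqs: int, max_blocks: int):
        self.table = [[0] * max_blocks for _ in range(max_num_reqs)]
        self.num_blocks_per_row = [0] * max_num_reqs

    def clear_row(self, row_idx: int) -> None:
        # Zero only the used prefix of the row.
        for j in range(self.num_blocks_per_row[row_idx]):
            self.table[row_idx][j] = 0
        self.num_blocks_per_row[row_idx] = 0

class MultiGroupBlockTable:
    """One BlockTable per KV-cache group (e.g. full attention + Mamba)."""
    def __init__(self, tables: list[BlockTable]):
        self.block_tables = tables

    def clear_row(self, row_idx: int) -> None:
        for bt in self.block_tables:
            bt.clear_row(row_idx)

class InputBatch:
    def __init__(self, block_table: MultiGroupBlockTable):
        self.block_table = block_table
        self.req_id_to_index: dict[str, int] = {}

    def remove_request(self, req_id: str):
        req_index = self.req_id_to_index.pop(req_id, None)
        if req_index is None:
            return None
        # The fix: zero the finished request's block-table row so a later
        # dummy_run cannot reference its stale (Mamba) block ids.
        self.block_table.clear_row(req_index)
        return req_index
```

Clearing every group's table matters here because the Mamba state blocks live in their own KV-cache group alongside the attention blocks.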
Test Plan: stress test with DP2 + prefix caching; verify no NaN / zero-token-id responses.
Differential Revision: D97354760
Pulled By: minosfuture