[Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode #32118
heheda12345 merged 3 commits into vllm-project:main
Conversation
Signed-off-by: Josephasafg <ajgard7@gmail.com>
Code Review
This pull request addresses a critical bug where new Mamba requests with a small number of tokens were incorrectly classified as decode steps instead of prefills, leading to state corruption and incorrect model outputs. The proposed change correctly re-prioritizes the classification logic by first identifying prefill requests based on whether they have any previously computed tokens. This ensures that new requests are always handled as prefills, which is essential for correctly initializing the recurrent state in models like Mamba. The fix is well-targeted, effective, and makes the logic more robust and easier to understand. I approve of this change.
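The re-prioritized classification described above can be sketched as follows. The names (`num_computed_tokens`, `num_scheduled_tokens`, `decode_threshold`) mirror identifiers discussed in this PR, but this is an illustrative sketch, not vLLM's actual implementation:

```python
# Hypothetical sketch of the re-prioritized request classification.
# A request with zero previously computed tokens is a brand-new prompt and
# must be treated as prefill, no matter how few tokens were scheduled.

def classify_request(num_computed_tokens: int,
                     num_scheduled_tokens: int,
                     decode_threshold: int = 1) -> str:
    # Prefill check comes FIRST: new requests are always prefills.
    if num_computed_tokens == 0:
        return "prefill"
    # Only continuing requests can be decodes (or extends).
    if num_scheduled_tokens <= decode_threshold:
        return "decode"
    return "extend"

# A new request scheduled with a single token is still a prefill:
print(classify_request(num_computed_tokens=0, num_scheduled_tokens=1))    # prefill
# A continuing request generating one token is a decode:
print(classify_request(num_computed_tokens=100, num_scheduled_tokens=1))  # decode
```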
@heheda12345 @tdoublep I'd appreciate your review as well. Thanks
Can you add a unit test like those in `test_reorder_batch_to_split_decodes_and_prefills`?
Signed-off-by: Josephasafg <ajgard7@gmail.com>
@heheda12345 Sure! Added two tests.
…de (vllm-project#32118) Signed-off-by: Josephasafg <ajgard7@gmail.com> Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
### What this PR does / why we need it?
Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9)
- Modify import paths due to the refactors vllm-project/vllm#31916 and vllm-project/vllm#32054
- Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given`, caused by vllm-project/vllm#24498
- Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which were never verified vllm-project/vllm#31998
- Skip some pooling tests broken by vllm-project/vllm#32148, where vLLM itself also fails: https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will reopen those tests when main2main reaches vllm-project/vllm#32243
- Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by vllm-project/vllm#32118

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
IIUC, when all requests belong exclusively to either category (see vllm/vllm/v1/attention/backends/utils.py, lines 525 to 542 in 8853a50) …
Do you have a script to reproduce such an issue? I'd be happy to help.
To give an example: suppose we are in the MTP3 scenario (i.e., decode_threshold=4) and require_uniform is False, with one decode request and one newly arrived prefill request whose prompt is 3 tokens (although this situation seems quite rare). With the code in this PR, one request will be classified as is_decode and the other as is_prefill. However, the code in vllm/vllm/v1/attention/backends/utils.py will categorize both of these requests as decode_reqs. After raising the comment above, I discussed the P/D request discrimination logic in the MTP scenario with colleagues. In our context, classifying the 3-token request as decode wouldn't cause any issues, but given the intent of this PR, I'm not certain whether it would have an impact in this PR's target scenarios. So I'm just pointing it out as a note.
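The edge case above can be made concrete with a small sketch (hypothetical request numbers; not the actual vLLM code): under MTP3 with `decode_threshold=4`, the two classification rules disagree about the new 3-token request.

```python
# Illustration of the MTP3 edge case: decode_threshold=4, one running decode
# request and one newly arrived request with a 3-token prompt.
decode_threshold = 4
requests = [
    {"id": "running", "num_computed_tokens": 50, "num_scheduled_tokens": 4},
    {"id": "new",     "num_computed_tokens": 0,  "num_scheduled_tokens": 3},
]

# Classification after this PR: new requests (no computed tokens) are prefill.
pr_labels = {
    r["id"]: "prefill" if r["num_computed_tokens"] == 0
    else ("decode" if r["num_scheduled_tokens"] <= decode_threshold else "extend")
    for r in requests
}

# Threshold-only classification (the behavior described for the utils.py
# path): anything scheduled with <= decode_threshold tokens counts as decode.
threshold_labels = {
    r["id"]: "decode" if r["num_scheduled_tokens"] <= decode_threshold else "prefill"
    for r in requests
}

print(pr_labels)         # {'running': 'decode', 'new': 'prefill'}
print(threshold_labels)  # {'running': 'decode', 'new': 'decode'}
```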
@pisceskkk I opened a draft PR with a potential fix for this edge case. We can discuss it there and see if this solution can work. #32716 |
### What this PR does / why we need it?
Since vllm-project/vllm#32118 modified the criteria for classifying Prefill and Decode requests in vLLM, PCPManager needs to be synchronized with this standard. As PCPManager computes PD request counts in several places, this PR consolidates the related logic and updates the PD request count once per batch.

### How was this patch tested?
```bash
pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
```
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@11b6af5

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Purpose
Fix stale SSM state corruption when new Mamba requests are scheduled with only 1 token due to token budget exhaustion.
Problem
When the scheduler's token budget is nearly exhausted, new requests may be allocated only 1 token. A request with only 1 scheduled token could be classified as decode rather than prefill. As a result, a new prompt is first decoded against a stale SSM state (a cache slot that still holds values from a previous request, which I believe happens once most of the GPU blocks are in use and slots get reused). So the prompt first goes through decode, then prefill, and then decode again.
Mamba/SSMs use recurrent state that is read AND written each step:
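As a minimal illustration (a toy linear recurrence, not vLLM's Mamba kernel), each decode step reads the per-request state slot and writes it back:

```python
# Toy sketch of a recurrent SSM decode step: the per-request state slot is
# READ at the start of the step and WRITTEN back at the end, so whatever is
# in the slot feeds directly into the output.
d_state = 4
state_cache = {}  # request id -> recurrent state, analogous to a cache slot

def ssm_decode_step(request_id, x, a=0.9, b=0.1):
    # READ: fetch the recurrent state for this request's cache slot.
    h = state_cache.get(request_id, [0.0] * d_state)
    # Update: h_t = a * h_{t-1} + b * x_t (a toy linear recurrence).
    h = [a * hi + b * xi for hi, xi in zip(h, x)]
    # WRITE: the updated state is persisted for the next step.
    state_cache[request_id] = h
    return sum(h)  # stand-in for the step's output

# If a new request's slot still holds another request's state, the very first
# output is computed from that stale state, and it propagates step by step:
state_cache["slot0"] = [1000.0] * d_state  # leftover state from a finished request
y = ssm_decode_step("slot0", x=[1.0] * d_state)
print(y)  # dominated by the stale values, not by the new prompt token
```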
When a new Mamba request runs as decode (1 token), it reads from a cache slot that may contain garbage/stale state from a previously completed request. This corrupted state propagates to all subsequent tokens.
Test Plan
I ran the following to reproduce it.
This is how I configured the vLLM instance:
I then ran 1024 prompts in 8 batches, where each batch was 128 prompts of varying lengths of up to 4K tokens.
Test Result
Without the fix, I consistently got gibberish for the same prompt id:
With the fix (proper generation):
Note
Prevents new requests from being misclassified as decodes when only a single token is scheduled.
`reorder_batch_to_split_decodes_and_prefills` now derives `is_prefill` solely from `num_computed_tokens == 0`; `is_decode`/`is_extend` apply only to non-prefill items while preserving decode → extend → prefill ordering. Written by Cursor Bugbot for commit 9955fa2.
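The decode → extend → prefill ordering the summary refers to can be sketched as follows (illustrative only; the field names and the sort-based reorder are assumptions, not vLLM's actual data structures):

```python
# Hypothetical sketch of reordering a batch into decode -> extend -> prefill
# groups, using the prefill-first classification from this PR.
ORDER = {"decode": 0, "extend": 1, "prefill": 2}

def label(req, decode_threshold=1):
    # New requests (nothing computed yet) are always prefills.
    if req["num_computed_tokens"] == 0:
        return "prefill"
    return "decode" if req["num_scheduled_tokens"] <= decode_threshold else "extend"

batch = [
    {"id": "A", "num_computed_tokens": 0,  "num_scheduled_tokens": 1},  # new, 1 token
    {"id": "B", "num_computed_tokens": 8,  "num_scheduled_tokens": 1},  # decoding
    {"id": "C", "num_computed_tokens": 16, "num_scheduled_tokens": 5},  # chunked/extend
]
batch.sort(key=lambda r: ORDER[label(r)])
print([r["id"] for r in batch])  # ['B', 'C', 'A']
```

Note that request A, although scheduled with a single token, sorts into the prefill group rather than the decode group.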