[SSM] Follow-up fix for Mamba P/D KV transfer #4
Conversation
Code Review
This pull request fixes Mamba/SSM state corruption during P/D KV transfer by ensuring the P-side only prefills N-1 tokens. While the logic is sound, there is a critical issue with how the request object is modified. `get_num_new_matched_tokens` is not guaranteed to be called only once per request, but the current implementation truncates the prompt again on every call, which is a bug. My review includes a suggestion to make this modification idempotent.
```python
if (self._is_hma_required
        and params is not None
        and params.get("do_remote_decode")):
    if request.prompt_token_ids and len(request.prompt_token_ids) > 1:
        request.prompt_token_ids.pop()
        request._all_token_ids.pop()
        request.num_prompt_tokens -= 1
        request.max_tokens = 1
```
The modification of the request object here introduces a side effect in get_num_new_matched_tokens. This function can be called multiple times for the same request if it remains in the scheduler's waiting queue across several scheduling cycles. The current implementation is not idempotent and will repeatedly pop tokens from prompt_token_ids on each call, leading to an incorrect prompt length.
To fix this, the modification should only happen once. I suggest adding a flag to the kv_transfer_params to track whether the truncation has already been performed.
Suggested change:

```diff
 if (self._is_hma_required
         and params is not None
-        and params.get("do_remote_decode")):
+        and params.get("do_remote_decode")
+        and not params.get("_p_side_truncated")):
     if request.prompt_token_ids and len(request.prompt_token_ids) > 1:
         request.prompt_token_ids.pop()
         request._all_token_ids.pop()
         request.num_prompt_tokens -= 1
         request.max_tokens = 1
+        params["_p_side_truncated"] = True
```
For HMA (Mamba/SSM) models in P/D disaggregation, the prefiller must transfer h(N-1) instead of h(N) so the decoder can correctly recompute the last prompt token.

- D-side: the `_hma_prefill_token_count()` helper returns N-1 for HMA models and is used in `get_num_new_matched_tokens`, so the decoder naturally recomputes the last token from h(N-1).
- P-side: `_truncate_hma_request_for_prefill()` truncates the prompt to N-1 tokens and sets `max_tokens=1`. The model computes h(N-1), samples one spurious token (which does NOT update the Mamba state), then `check_stop` fires `FINISHED_LENGTH_CAPPED`, triggering the KV transfer.
- The P-side truncation is guarded by `params["_p_side_truncated"]` for idempotency across preemption / re-scheduling cycles.

Signed-off-by: ZhanqiuHu <zhu@redhat.com>
- Extract P-side truncation into `_truncate_hma_request_for_prefill()`
- Add `_hma_prefill_token_count()` helper for the D-side N-1 calculation
- Add explicit `do_remote_decode` / `do_remote_prefill` guards at the call site
- Tighten docstrings with D-side/P-side context

Signed-off-by: ZhanqiuHu <zhu@redhat.com>
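The D-side helper described above can be sketched as follows (the name and the N-1 behavior come from the PR description; the signature and the surrounding connector plumbing are assumptions):

```python
def hma_prefill_token_count(num_prompt_tokens: int, is_hma: bool) -> int:
    # D-side: for HMA (Mamba/SSM) models, report N-1 matched tokens so the
    # decoder recomputes the last prompt token from the transferred h(N-1).
    # Non-HMA models can consume all N prompt tokens from the prefilled KV.
    if is_hma and num_prompt_tokens > 1:
        return num_prompt_tokens - 1
    return num_prompt_tokens
```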
`_is_hma_required` is True for any non-FullAttention model (including SWA), but the N-1 prefill fix only applies to models with cumulative Mamba state. SWA KV is stateless across the transfer and does not need the N-1 treatment. Signed-off-by: ZhanqiuHu <zhu@redhat.com>
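A hedged sketch of that distinction: the guard should key off layers carrying cumulative Mamba/SSM state, not merely off `_is_hma_required`. The function name and the string layer tags here are purely illustrative, not vLLM API.

```python
def needs_minus_one_prefill(attn_layer_types: list[str]) -> bool:
    # Hypothetical illustration: only models with cumulative Mamba/SSM state
    # need the N-1 prefill. Sliding-window attention (SWA) KV is per-token
    # and stateless across the transfer, so it is exempt.
    return any(layer_type == "mamba" for layer_type in attn_layer_types)

needs_minus_one_prefill(["full", "mamba"])  # True: Mamba state present
needs_minus_one_prefill(["full", "swa"])    # False: SWA alone is exempt
```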
Connector-only N-1 prefill fix for Mamba P/D disaggregation. Prevents state corruption where the D-side recomputes the last prompt token on top of an already-complete state. Single file change: nixl_connector.py (+16 lines), no scheduler changes.
Quick sanity check (2p2d, prompt: "The capital of France is"):
Without fix:
With fix: