[KV Connector][3/N][NIXL] Per-layer-name HMA routing for hybrid (Mamba/SSM) models under PP by zixi-qi · Pull Request #43368 · vllm-project/vllm

zixi-qi · 2026-05-21T21:56:15Z

Purpose

Extends NIXL PD-disaggregated serving to hybrid (Mamba/SSM) models under
pipeline parallelism (PP). PR #43366 lands the per-shard PP consumer refactor
but explicitly rejects hybrid producers when pp_size > 1, because its per-shard
descriptor builder does not yet carry Mamba region state. This PR adds the
missing Mamba/SSM bookkeeping so that hybrid models (Jamba-style, Mamba-based,
etc.) can be served on heterogeneous PP × TP topologies.

Stacked on #43366. While #43366 is open, this PR's diff shows the
combined changes from both PRs. Once #43366 merges, this PR will rebase
down to the 2-file delta described below.

Net delta (this PR only, on top of #43366)

tests/v1/kv_connector/nixl_integration/test_hma_pp_per_layer_regions.py | 163 +++++
vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py             |  99 ++--
2 files changed, 241 insertions(+), 21 deletions(-)

Changes

_ShardDescLayout grows two fields:

mamba_region_count: int = 0
mamba_region_group_ids: tuple[int, ...] = ()
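A minimal sketch of how the two new fields might sit on the layout dataclass. Only mamba_region_count and mamba_region_group_ids come from this description; num_blocks is inferred from the layout.num_blocks reference further down, while num_fa_regions and the consistency check in `__post_init__` are hypothetical additions for illustration, not the actual worker.py code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardDescLayoutSketch:
    """Hypothetical stand-in for _ShardDescLayout (not the real class)."""

    num_blocks: int  # assumed: physical blocks per region in this shard
    num_fa_regions: int  # hypothetical: number of full-attention regions
    # Fields added by this PR; the defaults mean "no Mamba regions".
    mamba_region_count: int = 0
    mamba_region_group_ids: tuple[int, ...] = ()

    def __post_init__(self) -> None:
        # Assumed invariant: exactly one KV-group id per Mamba region.
        if len(self.mamba_region_group_ids) != self.mamba_region_count:
            raise ValueError(
                "mamba_region_group_ids must have one entry per Mamba region"
            )


# A pure-attention shard keeps the defaults, so existing callers are unaffected.
fa_only = ShardDescLayoutSketch(num_blocks=1024, num_fa_regions=8)
```

Defaulting both fields keeps every non-hybrid construction site source-compatible, which matches the description's note that the non-Mamba path is unchanged.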

_register_local_xfer_handler_for_shard builds local Mamba descriptors
when self._has_mamba is set, computes mamba_region_group_ids (each KV-group
id replicated 4× for Mamba's 4 SSM regions per layer), and embeds the result
into the per-shard layout.
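The group-id bookkeeping can be sketched as follows, assuming the caller already knows the KV-group id of each Mamba layer owned by the shard, in registration order. The helper name `build_mamba_region_group_ids` and its input shape are hypothetical; the factor of 4 SSM regions per layer is taken from the description above.

```python
MAMBA_REGIONS_PER_LAYER = 4  # per the description: 4 SSM regions per Mamba layer


def build_mamba_region_group_ids(
    layer_group_ids: list[int],
) -> tuple[int, tuple[int, ...]]:
    """Return (mamba_region_count, mamba_region_group_ids) for one PP shard.

    layer_group_ids holds the KV-group id of each Mamba layer in this shard,
    in registration order (hypothetical input shape).
    """
    # Repeat each layer's group id once per SSM region of that layer.
    region_group_ids = tuple(
        group_id
        for group_id in layer_group_ids
        for _ in range(MAMBA_REGIONS_PER_LAYER)
    )
    return len(region_group_ids), region_group_ids


# Two Mamba layers in group 1 and one in group 2 give 12 regions.
count, ids = build_mamba_region_group_ids([1, 1, 2])
```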

add_remote_agent registers remote Mamba blocks per shard via
_build_mamba_remote(nixl_agent_meta, tp_ratio, transfer_info) and emits a
remote _ShardDescLayout carrying mamba_region_count /
mamba_region_group_ids for the consumer's transfer descriptor table.

_get_block_descs_ids_for_shard routes Mamba shards through a logical-block-aware
path: FA regions are indexed with layout.num_blocks, while Mamba regions are
indexed with layout.num_blocks // physical_blocks_per_logical and offset by
num_fa_descs. The non-Mamba path is unchanged.
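A sketch of the offset math described above, assuming a region-major (non-interleaved) descriptor table in which all FA descriptors come first, so that num_fa_descs = num_fa_regions * num_blocks, and each Mamba region then holds num_blocks // physical_blocks_per_logical logical-block descriptors. Apart from num_blocks, physical_blocks_per_logical and num_fa_descs, every name here is hypothetical, and the real _get_block_descs_ids_for_shard may order or filter regions differently.

```python
import numpy as np


def block_desc_ids_sketch(
    num_blocks: int,
    physical_blocks_per_logical: int,
    num_fa_regions: int,
    fa_region_ids: list[int],
    fa_block_ids: list[int],
    mamba_region_ids: list[int],
    mamba_logical_block_ids: list[int],
) -> np.ndarray:
    """Hypothetical descriptor-id computation for one hybrid PP shard.

    mamba_region_ids are 0-based among Mamba regions only (assumption).
    """
    num_fa_descs = num_fa_regions * num_blocks
    # FA regions: one descriptor per physical block, region-major.
    fa_regions = np.asarray(fa_region_ids, dtype=np.int64)[:, None]
    fa_blocks = np.asarray(fa_block_ids, dtype=np.int64)[None, :]
    fa_ids = (fa_regions * num_blocks + fa_blocks).ravel()

    # Mamba regions: one descriptor per logical block, after all FA descs.
    num_logical_blocks = num_blocks // physical_blocks_per_logical
    m_regions = np.asarray(mamba_region_ids, dtype=np.int64)[:, None]
    m_blocks = np.asarray(mamba_logical_block_ids, dtype=np.int64)[None, :]
    mamba_ids = (num_fa_descs + m_regions * num_logical_blocks + m_blocks).ravel()

    return np.concatenate([fa_ids, mamba_ids])


ids = block_desc_ids_sketch(
    num_blocks=8,
    physical_blocks_per_logical=4,
    num_fa_regions=2,
    fa_region_ids=[0, 1],
    fa_block_ids=[3, 5],
    mamba_region_ids=[0, 1, 2, 3],
    mamba_logical_block_ids=[1],
)
```

With 8 physical blocks and 4 physical blocks per logical block, each Mamba region spans only 2 descriptors, so the four Mamba regions occupy ids 16 through 23 right after the 16 FA descriptors.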

The register_kv_caches rejection guard added in #43366 is dropped, so hybrid
producers are now accepted with pp_size > 1.

Test Plan

Unit test

.venv/bin/python -m pytest -v \
  tests/v1/kv_connector/nixl_integration/test_hma_pp_per_layer_regions.py

Covers the Mamba region group construction path: validates
mamba_region_count, mamba_region_group_ids, descriptor ID offset math,
and the FA-vs-Mamba grouping in _ShardDescLayout.

End-to-end (not yet validated on this PR)

We have not yet run a hybrid model E2E on PP × PD on the GB200 rig this
branch was developed on, because that rig doesn't have a Mamba-based hybrid
model loaded. The unit test covers the per-shard descriptor construction
logic, which is the part PR #43366's rejection guard explicitly defers.
E2E validation on Jamba (or a similar hybrid) would need:

# Same layout as #43366's E2E setup, but the prefiller hosts a hybrid model
CUDA_VISIBLE_DEVICES=0,1 UCX_TLS=tcp,cuda_copy \
  VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
  vllm serve <hybrid-model> \
  --pipeline-parallel-size 2 --tensor-parallel-size 1 \
  --max-model-len 32768 --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

CUDA_VISIBLE_DEVICES=2,3 UCX_TLS=tcp,cuda_copy \
  VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
  vllm serve <hybrid-model> \
  --pipeline-parallel-size 1 --tensor-parallel-size 2 \
  --max-model-len 32768 --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'

Happy to keep this PR in draft until that smoke test has been run, either by
me or by a reviewer with access to a Mamba-capable rig. Marking it draft for now.

Lint

pre-commit run --files \
  tests/v1/kv_connector/nixl_integration/test_hma_pp_per_layer_regions.py \
  vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py
pre-commit run mypy-3.10 --hook-stage manual --files \
  tests/v1/kv_connector/nixl_integration/test_hma_pp_per_layer_regions.py \
  vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py

All hooks: Passed.

Test Result

Unit test added in this PR passes locally.
All targeted pre-commit hooks pass on the changed files.
E2E hybrid run: pending (see Test Plan).

Why this is not a duplicate

Searched vLLM open PRs/issues on 2026-05-21 for "HMA pipeline parallel",
"Mamba disaggregated NIXL", and "hybrid PP P/D". No open work targets the
Mamba × NIXL PD path under pipeline parallelism. The HMA × P/D paths that
do exist (e.g. test_nixl_connector_hma.py upstream) cover non-PP
topologies only.

AI assistance disclosure

This change was drafted with AI assistance (Claude Code, Opus 4.7). The
submitting human reviewed every changed line and ran the unit test
referenced above. This PR is the deliberate HMA × PP follow-up referenced
in PR #43366's description.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR (see Purpose section).
The test plan (see Test Plan section).
The test results (unit test passes; E2E pending).
(Optional) The necessary documentation update.
(Optional) Release notes update.

gemini-code-assist

Code Review

This pull request introduces support for pipeline parallelism (PP) in the NIXL KV transfer connector, updating the handshake protocol to version 5 and refactoring the worker and topology logic to manage transfers per PP shard. It also addresses a bug in the model runner where returning None instead of an empty output interfered with output aggregation. A critical issue was identified in the descriptor ID calculation for interleaved memory layouts, such as those used by FlashInfer, which would lead to corrupted KV transfers; a code suggestion was provided to correctly handle indexing for both interleaved and standard layouts.

gemini-code-assist · 2026-05-21T21:59:19Z

+                group_arr = np.asarray(block_ids[group_id], dtype=np.int64)
+                if group_arr.size == 0:
+                    continue
+                desc_ids.append(region_id * num_blocks + group_arr + offset)


The descriptor ID calculation region_id * num_blocks + group_arr assumes a non-interleaved layout in the dlist (i.e., all blocks for region 0, then all blocks for region 1). However, for backends where is_kv_layout_blocks_first is true (like FlashInfer), the registration logic in _build_fa_local and _build_fa_remote produces an interleaved layout [K0, V0, K1, V1, ...]. Using the current formula with an interleaved layout will result in incorrect descriptor indexing, leading to corrupted KV transfers. For interleaved layouts, the index for block i of region r (where r is 0 for K and 1 for V of the same layer) should be group_arr * 2 + (region_id % 2) relative to the layer's start offset.

Suggested change

-                desc_ids.append(region_id * num_blocks + group_arr + offset)
+                if not include_mamba and self.transfer_topo.is_kv_layout_blocks_first:
+                    # Interleaved layout: [K0, V0, K1, V1, ...]
+                    desc_ids.append((region_id // 2) * (2 * num_blocks) +
+                                    group_arr * 2 + (region_id % 2) + offset)
+                else:
+                    # Standard layout: [R0_B0, R0_B1, ..., R1_B0, R1_B1, ...]
+                    desc_ids.append(region_id * num_blocks + group_arr + offset)
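To make the reviewer's point concrete, a small NumPy sketch computing both index formulas from the suggestion for the same block ids. It takes the two layouts exactly as the review describes them; whether _build_fa_local and _build_fa_remote actually emit an interleaved [K0, V0, K1, V1, ...] order for blocks-first backends is an assumption here, not something this sketch verifies. The function names are hypothetical.

```python
import numpy as np


def standard_ids(
    region_id: int, block_ids: np.ndarray, num_blocks: int, offset: int = 0
) -> np.ndarray:
    # Region-major: [R0_B0, R0_B1, ..., R1_B0, R1_B1, ...]
    return region_id * num_blocks + block_ids + offset


def interleaved_ids(
    region_id: int, block_ids: np.ndarray, num_blocks: int, offset: int = 0
) -> np.ndarray:
    # Per layer, K and V alternate block by block: [K_B0, V_B0, K_B1, V_B1, ...].
    # Even region ids are K and odd region ids are V of the same layer.
    layer = region_id // 2
    return layer * (2 * num_blocks) + block_ids * 2 + (region_id % 2) + offset


num_blocks = 4
blocks = np.array([0, 1, 2], dtype=np.int64)
# V region of layer 0 (region_id=1) under each layout.
std = standard_ids(1, blocks, num_blocks)
inter = interleaved_ids(1, blocks, num_blocks)
```

Here `std` is [4, 5, 6] while `inter` is [1, 3, 5]. Both formulas map a layer's K and V blocks onto the same 2 * num_blocks descriptor slots, only in a different order, so using the wrong one would likely copy the wrong blocks silently rather than fail with an out-of-range index.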

mergify · 2026-05-27T22:53:32Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zixi-qi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-05-29T04:07:47Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zixi-qi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


mergify · 2026-06-05T05:20:50Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zixi-qi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

zixi-qi mentioned this pull request on May 21, 2026 in [KV Connector][2/N][NIXL] Pipeline-parallel support for PD-disaggregated serving with NIXL connector #43366 (Draft)

mergify Bot added v1 kv-connector labels May 21, 2026

gemini-code-assist Bot reviewed May 21, 2026


zixi-qi force-pushed the pr2/pp-disagg-nixl-hma branch 2 times, most recently from 73b732b to 4fa8b01 Compare May 21, 2026 22:25

mergify Bot added the needs-rebase label May 27, 2026

zixi-qi force-pushed the pr2/pp-disagg-nixl-hma branch from 4fa8b01 to 4fe81c1 Compare May 28, 2026 15:08

mergify Bot removed the needs-rebase label May 28, 2026

zixi-qi force-pushed the pr2/pp-disagg-nixl-hma branch from 4fe81c1 to b7c267c Compare May 28, 2026 17:12

mergify Bot added the needs-rebase label May 29, 2026

zixi-qi added 2 commits May 29, 2026 06:16

[KVConnector] Foundation: PP-aware handshake aggregation and intermediate-PP output plumbing (82629e1)

Co-authored-by: Claude
Signed-off-by: zixi-qi <zixi@inferact.ai>

[KVConnector] Rebase onto main: drop intermediate-PP output plumbing superseded by vllm-project#43732 (c1336b2)

Co-authored-by: Claude
Signed-off-by: zixi-qi <zixi@inferact.ai>

zixi-qi force-pushed the pr2/pp-disagg-nixl-hma branch from b7c267c to 9e9e904 Compare May 29, 2026 06:43

mergify Bot removed the needs-rebase label May 29, 2026

zixi-qi added 3 commits May 30, 2026 22:24

[KVConnector] Remove SupportsPP marker; default pp-aware handshake in base (29491ee)

Co-authored-by: Claude
Signed-off-by: zixi-qi <zixi@inferact.ai>

[KVConnector][NIXL] Enable PP-disaggregated KV transfer (single-group, no HMA) (b2da772)

Signed-off-by: zixi-qi <zixi@inferact.ai>

[KVConnector][NIXL] Per-layer-name HMA routing for hybrid models under PP (3200172)

Signed-off-by: zixi-qi <zixi@inferact.ai>

zixi-qi force-pushed the pr2/pp-disagg-nixl-hma branch from 9e9e904 to 3200172 Compare May 30, 2026 22:35

zixi-qi changed the title from "[KV Connector][NIXL] Per-layer-name HMA routing for hybrid (Mamba/SSM) models under PP" to "[KV Connector][3/N][NIXL] Per-layer-name HMA routing for hybrid (Mamba/SSM) models under PP" on May 30, 2026

Dao007forever mentioned this pull request on Jun 4, 2026 in [NIXL] Per-region KV transfer classification for mixed full-attn + MLA groups #44583 (Merged)

mergify Bot added the needs-rebase label Jun 5, 2026


zixi-qi wants to merge 5 commits into vllm-project:main from zixi-qi:pr2/pp-disagg-nixl-hma

