[KV Connector][3/N][NIXL] Per-layer-name HMA routing for hybrid (Mamba/SSM) models under PP#43368

Draft
zixi-qi wants to merge 5 commits into vllm-project:main from zixi-qi:pr2/pp-disagg-nixl-hma

@zixi-qi zixi-qi commented May 21, 2026

Collaborator

Purpose

Extends NIXL PD-disaggregated serving to hybrid (Mamba/SSM) models under
pipeline parallelism. PR #43366 lands the PP consumer per-shard refactor but
explicitly rejects hybrid producers when pp_size > 1 because the per-shard
descriptor builder doesn't yet carry Mamba region state. This PR adds the
missing Mamba/SSM bookkeeping so hybrid models (Jamba-style, Mamba-based,
etc.) work end-to-end on heterogeneous PP × TP topologies.

Stacked on #43366. While #43366 is open, this PR's diff shows the
combined changes from both PRs. Once #43366 merges, this PR will rebase
down to the 2-file delta described below.

Net delta (this PR only, on top of #43366)

tests/v1/kv_connector/nixl_integration/test_hma_pp_per_layer_regions.py | 163 +++++
vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py             |  99 ++--
2 files changed, 241 insertions(+), 21 deletions(-)

Changes

_ShardDescLayout grows two fields:

  • mamba_region_count: int = 0
  • mamba_region_group_ids: tuple[int, ...] = ()
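As a rough sketch of the resulting shape, assuming the layout is a frozen dataclass: `ShardDescLayoutSketch` is an illustrative stand-in, `num_blocks` is a placeholder for the pre-existing fields (only `layout.num_blocks` is named in this description), and the example values are made up. The real `_ShardDescLayout` in worker.py may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardDescLayoutSketch:
    """Illustrative stand-in for _ShardDescLayout; not the real class."""

    # Placeholder for the pre-existing fields (assumed).
    num_blocks: int
    # New in this PR. Both default to "no Mamba regions", so layouts built
    # for non-hybrid shards do not need to pass them.
    mamba_region_count: int = 0
    mamba_region_group_ids: tuple[int, ...] = ()


# Attention-only shard: the new fields keep their defaults.
fa_only = ShardDescLayoutSketch(num_blocks=128)

# Hybrid shard with one Mamba KV group (hypothetical group id 1).
hybrid = ShardDescLayoutSketch(
    num_blocks=128,
    mamba_region_count=4,
    mamba_region_group_ids=(1, 1, 1, 1),
)
```

Defaulting both fields keeps the non-Mamba construction sites unchanged, which matches the statement below that the non-Mamba path is untouched.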

_register_local_xfer_handler_for_shard builds local Mamba descriptors
when self._has_mamba is set, computes mamba_region_group_ids (each KV-group
id replicated 4× for Mamba's 4 SSM regions per layer), and embeds the result
into the per-shard layout.
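The 4× replication can be pictured with the following minimal sketch. The helper name `build_mamba_region_group_ids`, the constant name, and the ordering (all four regions of one group kept contiguous) are assumptions for illustration; the actual code in worker.py may order regions differently.

```python
# Per the description above: 4 SSM regions per Mamba layer.
MAMBA_REGIONS_PER_LAYER = 4


def build_mamba_region_group_ids(mamba_group_ids):
    """Repeat each Mamba KV-group id once per SSM region.

    Assumed ordering: the regions belonging to one group are contiguous.
    """
    return tuple(
        group_id
        for group_id in mamba_group_ids
        for _ in range(MAMBA_REGIONS_PER_LAYER)
    )


# Two hypothetical Mamba KV groups with ids 1 and 3.
region_group_ids = build_mamba_region_group_ids([1, 3])
print(region_group_ids)  # (1, 1, 1, 1, 3, 3, 3, 3)
```

With this mapping, the region at index `i` can look up its block list through `region_group_ids[i]`.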

add_remote_agent registers remote Mamba blocks per shard via
_build_mamba_remote(nixl_agent_meta, tp_ratio, transfer_info) and emits a
remote _ShardDescLayout carrying mamba_region_count /
mamba_region_group_ids for the consumer's transfer descriptor table.

_get_block_descs_ids_for_shard routes Mamba shards through a
logical-block-aware path: FA regions are indexed with layout.num_blocks, while
Mamba regions are indexed with
layout.num_blocks // physical_blocks_per_logical and start at an offset of
num_fa_descs. The non-Mamba path is unchanged.
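The offset math can be sketched as follows. This is an illustration under stated assumptions, not the code in worker.py: the function names are hypothetical, `num_fa_descs` is assumed to equal `num_fa_regions * num_blocks`, and Mamba block ids are assumed to already be logical block ids.

```python
import numpy as np


def fa_desc_ids(block_ids, region_id, num_blocks):
    """FA regions: one descriptor per physical block, regions back to back."""
    ids = np.asarray(block_ids, dtype=np.int64)
    return region_id * num_blocks + ids


def mamba_desc_ids(logical_block_ids, mamba_region_idx, num_blocks,
                   physical_blocks_per_logical, num_fa_regions):
    """Mamba regions: one descriptor per logical block, placed after FA."""
    # Assumption: the FA descriptors occupy the first
    # num_fa_regions * num_blocks slots of the descriptor table.
    num_fa_descs = num_fa_regions * num_blocks
    logical_blocks = num_blocks // physical_blocks_per_logical
    ids = np.asarray(logical_block_ids, dtype=np.int64)
    return num_fa_descs + mamba_region_idx * logical_blocks + ids


# 8 physical blocks, 2 physical blocks per logical block, 2 FA regions:
# num_fa_descs = 16 and each Mamba region spans 4 logical blocks.
print(fa_desc_ids([0, 3], region_id=1, num_blocks=8))  # [ 8 11]
print(mamba_desc_ids([0, 3], mamba_region_idx=1, num_blocks=8,
                     physical_blocks_per_logical=2,
                     num_fa_regions=2))                # [20 23]
```

Using the logical block count as the Mamba stride is what keeps Mamba descriptor ids from running past their region when one logical block spans several physical blocks.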

The register_kv_caches rejection guard added in #43366 is dropped — hybrid
producers are now supported with pp_size > 1.

Test Plan

Unit test

.venv/bin/python -m pytest -v \
  tests/v1/kv_connector/nixl_integration/test_hma_pp_per_layer_regions.py

Covers the Mamba region group construction path: validates
mamba_region_count, mamba_region_group_ids, descriptor ID offset math,
and the FA-vs-Mamba grouping in _ShardDescLayout.

End-to-end (not yet validated on this PR)

We have not yet run a hybrid model end-to-end on PP × PD: the GB200 rig
this branch was developed on doesn't have a Mamba-based hybrid model
loaded. The unit test covers the per-shard descriptor construction
logic, which is the part PR #43366's rejection guard explicitly defers.
E2E validation on Jamba (or a similar hybrid) would need:

# Same layout as #43366's E2E setup, but the prefiller hosts a hybrid model
CUDA_VISIBLE_DEVICES=0,1 UCX_TLS=tcp,cuda_copy \
  VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
  vllm serve <hybrid-model> \
  --pipeline-parallel-size 2 --tensor-parallel-size 1 \
  --max-model-len 32768 --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

CUDA_VISIBLE_DEVICES=2,3 UCX_TLS=tcp,cuda_copy \
  VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
  vllm serve <hybrid-model> \
  --pipeline-parallel-size 1 --tensor-parallel-size 2 \
  --max-model-len 32768 --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'

Happy to keep this PR in draft until I (or a reviewer with access to a
Mamba-capable rig) run that smoke test.

Lint

pre-commit run --files \
  tests/v1/kv_connector/nixl_integration/test_hma_pp_per_layer_regions.py \
  vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py
pre-commit run mypy-3.10 --hook-stage manual --files \
  tests/v1/kv_connector/nixl_integration/test_hma_pp_per_layer_regions.py \
  vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py

All hooks: Passed.

Test Result

  • Unit test added in this PR passes locally.
  • All targeted pre-commit hooks pass on the changed files.
  • E2E hybrid run: pending (see Test Plan).

Why this is not a duplicate

Searched vLLM open PRs/issues on 2026-05-21 for HMA pipeline parallel,
Mamba disaggregated NIXL, hybrid PP P/D. No open work targets the
Mamba × NIXL PD path under pipeline parallelism. The HMA × P/D paths that
do exist (e.g. test_nixl_connector_hma.py upstream) cover non-PP
topologies only.

AI assistance disclosure

This change was drafted with AI assistance (Claude Code, Opus 4.7). The
submitting human reviewed every changed line and ran the unit test
referenced above. This PR is the deliberate HMA × PP follow-up referenced
in PR #43366's description.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR (see Purpose section).
  • The test plan (see Test Plan section).
  • The test results (unit test passes; E2E pending).
  • (Optional) The necessary documentation update.
  • (Optional) Release notes update.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for pipeline parallelism (PP) in the NIXL KV transfer connector, updating the handshake protocol to version 5 and refactoring the worker and topology logic to manage transfers per PP shard. It also addresses a bug in the model runner where returning None instead of an empty output interfered with output aggregation. A critical issue was identified in the descriptor ID calculation for interleaved memory layouts, such as those used by FlashInfer, which would lead to corrupted KV transfers; a code suggestion was provided to correctly handle indexing for both interleaved and standard layouts.

group_arr = np.asarray(block_ids[group_id], dtype=np.int64)
if group_arr.size == 0:
    continue
desc_ids.append(region_id * num_blocks + group_arr + offset)

critical

The descriptor ID calculation region_id * num_blocks + group_arr assumes a non-interleaved layout in the dlist (i.e., all blocks for region 0, then all blocks for region 1). However, for backends where is_kv_layout_blocks_first is true (like FlashInfer), the registration logic in _build_fa_local and _build_fa_remote produces an interleaved layout [K0, V0, K1, V1, ...]. Using the current formula with an interleaved layout will result in incorrect descriptor indexing, leading to corrupted KV transfers. For interleaved layouts, the index for block i of region r (where r is 0 for K and 1 for V of the same layer) should be group_arr * 2 + (region_id % 2) relative to the layer's start offset.

Suggested change
- desc_ids.append(region_id * num_blocks + group_arr + offset)
+ if not include_mamba and self.transfer_topo.is_kv_layout_blocks_first:
+     # Interleaved layout: [K0, V0, K1, V1, ...]
+     desc_ids.append((region_id // 2) * (2 * num_blocks) +
+                     group_arr * 2 + (region_id % 2) + offset)
+ else:
+     # Standard layout: [R0_B0, R0_B1, ..., R1_B0, R1_B1, ...]
+     desc_ids.append(region_id * num_blocks + group_arr + offset)
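To make the difference between the two formulas concrete, here is a small standalone sketch that evaluates both on toy numbers. The function name `desc_ids_for_region` and its `interleaved` flag are hypothetical stand-ins for the branch condition in the suggestion; it takes the reviewer's description of the interleaved layout as given and does not verify it against `_build_fa_local`.

```python
import numpy as np


def desc_ids_for_region(region_id, block_ids, num_blocks, interleaved,
                        offset=0):
    """Evaluate the two indexing formulas from the suggestion above."""
    blocks = np.asarray(block_ids, dtype=np.int64)
    if interleaved:
        # Per-layer layout [K0, V0, K1, V1, ...]: a layer owns
        # 2 * num_blocks consecutive slots; K is even, V is odd.
        layer = region_id // 2
        kv = region_id % 2
        return layer * (2 * num_blocks) + blocks * 2 + kv + offset
    # Standard layout: all blocks of region 0, then all of region 1, ...
    return region_id * num_blocks + blocks + offset


num_blocks = 4
for region_id in range(3):
    standard = desc_ids_for_region(region_id, [0, 1], num_blocks, False)
    inter = desc_ids_for_region(region_id, [0, 1], num_blocks, True)
    print(region_id, standard.tolist(), inter.tolist())
# 0 [0, 1] [0, 2]
# 1 [4, 5] [1, 3]
# 2 [8, 9] [8, 10]
```

For region 1 (V of the first layer) the standard formula points at slots 4 and 5, while the interleaved table would hold those blocks at slots 1 and 3, which is the mismatch the review comment describes.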

@zixi-qi zixi-qi force-pushed the pr2/pp-disagg-nixl-hma branch 2 times, most recently from 73b732b to 4fa8b01 Compare May 21, 2026 22:25
mergify Bot commented May 27, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zixi-qi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 27, 2026
@zixi-qi zixi-qi force-pushed the pr2/pp-disagg-nixl-hma branch from 4fa8b01 to 4fe81c1 Compare May 28, 2026 15:08
@mergify mergify Bot removed the needs-rebase label May 28, 2026
@zixi-qi zixi-qi force-pushed the pr2/pp-disagg-nixl-hma branch from 4fe81c1 to b7c267c Compare May 28, 2026 17:12
mergify Bot commented May 29, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zixi-qi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 29, 2026
zixi-qi added 2 commits May 29, 2026 06:16
…iate-PP output plumbing

Co-authored-by: Claude
Signed-off-by: zixi-qi <zixi@inferact.ai>
…superseded by vllm-project#43732

Co-authored-by: Claude
Signed-off-by: zixi-qi <zixi@inferact.ai>
@zixi-qi zixi-qi force-pushed the pr2/pp-disagg-nixl-hma branch from b7c267c to 9e9e904 Compare May 29, 2026 06:43
@mergify mergify Bot removed the needs-rebase label May 29, 2026
zixi-qi added 3 commits May 30, 2026 22:24
… base

Co-authored-by: Claude
Signed-off-by: zixi-qi <zixi@inferact.ai>
…, no HMA)

Signed-off-by: zixi-qi <zixi@inferact.ai>
…r PP

Signed-off-by: zixi-qi <zixi@inferact.ai>
@zixi-qi zixi-qi force-pushed the pr2/pp-disagg-nixl-hma branch from 9e9e904 to 3200172 Compare May 30, 2026 22:35
@zixi-qi zixi-qi changed the title from "[KV Connector][NIXL] Per-layer-name HMA routing for hybrid (Mamba/SSM) models under PP" to "[KV Connector][3/N][NIXL] Per-layer-name HMA routing for hybrid (Mamba/SSM) models under PP" May 30, 2026
mergify Bot commented Jun 5, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zixi-qi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 5, 2026