[NIXL] Support P tensor-parallel-size > D tensor-parallel-size #27274

DarkLight1337 merged 5 commits into vllm-project:main
Conversation
cc @GuanLuo let me know if this PR meets the expected set of features you aimed to get with your work. Thank you!

```diff
@@ -16,11 +16,13 @@ def __init__(
     finished_sending: set[str] | None = None,
```

ignore file, to be rebased once #26734 lands
```diff
@@ -413,7 +413,8 @@ def get_required_kvcache_layout(cls, vllm_config: "VllmConfig") -> str | None:
     def get_finished_count(self) -> int | None:
         """
         Get the count of requests expected to complete send/receive operations
-        via this connector.
+        via this connector. This method is used to initialize the
```

ignore, to be rebased once #26734 lands
```diff
             tp_ratio,
         )

         ### (Optional) Register local agent memory regions. MLA is not split.
```
```diff
@@ -1593,16 +1712,14 @@ def _read_blocks(
         # Number of D TP workers that will read from dst P. Propagate tp_ratio
         # on notification so that dst worker can wait before freeing blocks.
-        tp_ratio = self.kv_topo.tp_ratio_from_engine_id(dst_engine_id)
+        # Cap to 1 when P TP > D TP: only a single rank will read from remote.
+        tp_ratio = max(1, self.kv_topo.tp_ratio_from_engine_id(dst_engine_id))
```

this is to have P only wait for 1 request instead of -tp_ratio
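The cap can be illustrated with a tiny sketch. Assumption (per this PR's convention): `tp_ratio` is negative when the remote P TP size is greater than the local D TP size; `expected_reader_count` is a hypothetical helper name, not the actual vLLM code.

```python
# Sketch of the notification-count cap discussed above. Assumption: per this
# PR's convention, tp_ratio is negative when remote P TP > local D TP.
# expected_reader_count is an illustrative name, not the actual vLLM helper.

def expected_reader_count(tp_ratio: int) -> int:
    # tp_ratio > 0: that many D workers will read from each P rank.
    # tp_ratio < 0 (P TP > D TP): only a single D rank reads from this P rank,
    # so the P side should wait for exactly one notification.
    return max(1, tp_ratio)

print(expected_reader_count(4))   # D TP = 4x P TP: four readers
print(expected_reader_count(-2))  # P TP = 2x D TP: capped to a single reader
```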
```diff
@@ -4,10 +4,9 @@
 KV cache helper for store.
```

ignore file, to be rebased once #26734 lands
vllm/executor/executor_base.py (outdated)

```diff
@@ -6,7 +6,7 @@
 from abc import ABC, abstractmethod
```

ignore file, to be rebased once #26734 lands
```diff
@@ -160,9 +160,7 @@ def __init__(
     )
```

ignore, to be rebased once #26734 lands
```diff
@@ -86,8 +86,14 @@ class KVConnectorOutput:
     finished_recving: set[str] | None = None
```

ignore, to be rebased once #26734 lands
This pull request has merge conflicts that must be resolved before it can be merged.
PR's now ready for review!

cc @xuechendi for xpu

@zhenwei-intel , please help to review, thx
```diff
         # Number of NIXL regions. Currently one region per cache
         # (so 1 per layer for MLA, otherwise 2 per layer)
         self.num_regions = 0
         self.num_layers = 0

-        # nixl_prepped_dlist_handle.
-        self.src_xfer_side_handle: int = 0
```

dropped default `self.src_xfer_side_handle` in favor of `self.src_xfer_handles_by_block_size[self.block_size]`
```diff
         if self.use_mla and tp_ratio < 0:
             # ..but we still need to notify the other remote ranks that we
             # have the blocks we need so they can update the request state.
```

important mla logic
PR is verified with the heter_block_size test, and it looks good.
njhill left a comment

I took a first pass, looks very cleanly done, really awesome work @NickLucche!

Just a few style suggestions. I have not gone through all of the logic in detail yet and will try to spend a bit more time on that, but it looks pretty solid!
xinyu-intel left a comment

verified with vllm-gaudi
```diff
@@ -1857,7 +1937,7 @@ def _pop_done_transfers(self, transfers: dict[str, list[int]]) -> set[str]:
         """
         done_req_ids: set[str] = set()
         for req_id, handles in list(transfers.items()):
-            in_progress = False
+            in_progress = []
```

prev logic was broken for multiple transfers.
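A minimal sketch of the corrected bookkeeping: tracking the still-pending handles as a list instead of a single boolean means a request is only marked done once every one of its transfers has completed. `check_state` and `pop_done_transfers` are illustrative stand-ins, not the actual NIXL wrapper API.

```python
# Sketch of the fixed _pop_done_transfers idea: keep the list of handles still
# in progress per request, so a request with multiple transfers is not freed
# while some of them are still in flight. check_state is a stand-in for
# querying transfer status from the NIXL wrapper.

def pop_done_transfers(transfers: dict[str, list[int]], check_state) -> set[str]:
    """Remove and return ids of requests whose transfers have ALL completed."""
    done_req_ids: set[str] = set()
    for req_id, handles in list(transfers.items()):
        in_progress = [h for h in handles if check_state(h) != "DONE"]
        if in_progress:
            # At least one transfer pending: keep only the unfinished handles.
            transfers[req_id] = in_progress
        else:
            done_req_ids.add(req_id)
            del transfers[req_id]
    return done_req_ids

# Request "a" still has one pending transfer, "b" is fully done.
states = {1: "DONE", 2: "PROC", 3: "DONE"}
xfers = {"a": [1, 2], "b": [3]}
done = pop_done_transfers(xfers, lambda h: states[h])  # -> {"b"}
```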
```diff
-        self.src_xfer_side_handle = 0
         for dst_xfer_side_handle in self.dst_xfer_side_handles.values():
             self.nixl_wrapper.release_dlist_handle(dst_xfer_side_handle)
+        for handle in self.src_xfer_handles_by_block_size.values():
```

just matching handles structures changes
Overview
This PR addresses the case where P tensor-parallel-size > D tensor-parallel-size.

I think it helps to differentiate two main cases:
MLA
For MLA models, the workflow is easier: each D worker reads from a single P worker (reads are fanned out so that not all D workers read from the same remote), as the MLA cache is duplicated. Some P workers will not be read from at all.

Mind that this also holds for DP/EP deployments, where the TP size on D will often be 1!
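The fan-out above can be sketched as a simple rank mapping; the modulo scheme and the helper name are illustrative assumptions, not necessarily the exact policy implemented by the connector.

```python
# Sketch of MLA fan-out: since the MLA KV cache is duplicated across P ranks,
# each D worker can read from any single P worker, so readers are spread
# across P ranks rather than all hitting the same remote. The modulo mapping
# below is an illustrative assumption, not the exact scheme in the connector.

def target_p_rank(d_rank: int, p_tp_size: int) -> int:
    return d_rank % p_tp_size

# D TP = 2, P TP = 4: D ranks 0 and 1 read from P ranks 0 and 1;
# P ranks 2 and 3 are never read from, which is fine for a duplicated cache.
targets = [target_p_rank(r, p_tp_size=4) for r in range(2)]  # -> [0, 1]
```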
From PR #23917, which also serves as a good use-case. Btw, as explained in that PR, the number of requests to "expect" is indeed the number of remote instances reading from P.

The main issue with implementing that in NIXL is that each P worker tracks requests as they come in (`_reqs_to_send`, `_reqs_to_process`), and those structures are only cleared properly when a read is detected (otherwise timeouts would be raised on P). To address that, I am allowing MLA D ranks to execute only one transfer, while notifying all affected remotes that the read is completed (sending multiple NIXL notifs).
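A sketch of the notification side, under the assumption that each D rank "covers" a contiguous group of `tp_ratio` P ranks; `notify_ranks` is a hypothetical helper, and the real code sends NIXL notifications rather than returning rank ids.

```python
# Sketch: an MLA D rank performs a single read but must notify every P rank
# tracking the request (each P worker records it in _reqs_to_send /
# _reqs_to_process and would otherwise hit a timeout). Assumes each D rank
# covers a contiguous group of tp_ratio P ranks; notify_ranks is illustrative.

def notify_ranks(d_rank: int, d_tp_size: int, p_tp_size: int) -> list[int]:
    assert p_tp_size % d_tp_size == 0, "P TP must be a multiple of D TP"
    tp_ratio = p_tp_size // d_tp_size
    # Only one rank in this group is actually read from, but all of them get
    # the "read completed" notification so they can free the request's blocks.
    return [d_rank * tp_ratio + i for i in range(tp_ratio)]

# D TP = 2, P TP = 4: D rank 0 notifies P ranks 0 and 1.
print(notify_ranks(0, d_tp_size=2, p_tp_size=4))
```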
cc @njhill @markmc
Dense
For dense models, every D worker will read from n P workers to re-compose its own KV cache, where n is referred to as `tp_ratio` in the code. This is possible because the number of heads on P is H/n that of D's, so you can efficiently read into D's cache using the HND layout. That is, in memory, you're just laying out flat ND tensors of H/n heads, n times.
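The head arithmetic above can be sketched as follows; `dense_read_plan` and the contiguous rank grouping are illustrative assumptions, not the connector's actual API.

```python
# Sketch of the dense case: a D worker re-composes its KV cache by reading
# from tp_ratio P workers, each holding num_kv_heads / p_tp_size KV heads.
# The contiguous grouping and the helper name are illustrative assumptions.

def dense_read_plan(d_rank: int, d_tp_size: int, p_tp_size: int,
                    num_kv_heads: int) -> list[tuple[int, int]]:
    """Return (p_rank, heads_read) pairs this D rank reads from."""
    assert p_tp_size % d_tp_size == 0, "P TP must be a multiple of D TP"
    tp_ratio = p_tp_size // d_tp_size
    heads_per_p = num_kv_heads // p_tp_size
    return [(d_rank * tp_ratio + i, heads_per_p) for i in range(tp_ratio)]

# 8 KV heads, P TP = 4 (2 heads per P rank), D TP = 2 (4 heads per D rank):
# D rank 1 reads its 4 heads from P ranks 2 and 3, 2 heads apiece.
plan = dense_read_plan(1, d_tp_size=2, p_tp_size=4, num_kv_heads=8)
```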
Side note: the current design is flexible and allows for dynamic discovery of remotes with different TP sizes. This is not a feature that is currently supported, but it helps to keep in mind when considering impl choices. It's more of an optional route I'd like to keep open.
Changes
The main change this PR needs to make is allowing a D worker to read from multiple P's. Practical edits this PR introduces to do so:
- `src_xfer_side_chunked_handles`: local regions need to be split differently based on how many remotes we want to read from. This is prepared during handshake, once.
- `[engine_id][rank_no]` indexing to accommodate the above
- `get_target_remote` -> `get_target_remotes` for the same reason, + a bunch of for loops over its result
- `tp_ratio` extension to indicate a remote P size greater than D's
- `_pop_done_transfers` fix

How to test
And check out `tp_config_sweep_accuracy` with config:
TODO
Coming soon to this PR:
- [ ] On MLA with DP/EP, avoid having all workers read from the same remote (deferring)

It does NOT support the replicated KV heads scenario, `tp_size > num_heads`. This is definitely doable, just I believe it's on weak demand atm so we can postpone it.