Conversation


robertgshaw2-redhat (Collaborator) commented May 3, 2025

SUMMARY:

  • clean up the NixlConnector APIs and add better typing / comments
  • enable each rank to exchange metadata with its counterpart rank
  • rank 0 keeps track of which ranks have finished txns (see the sketch after the TODO list)

TODO (longer term):

  • add support for heterogeneous TP
  • use etcd rather than a side channel for metadata exchange
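For illustration, a minimal sketch of the rank-0 bookkeeping described in the summary, assuming each worker rank reports its locally finished transfer IDs to rank 0 over the TP group (the helper name and the send_object/recv_object calls are assumptions for this sketch, not the PR's actual API):

# Hypothetical sketch: rank 0 aggregates finished request IDs from all TP ranks.
def gather_finished_req_ids(tp_group, rank: int, world_size: int,
                            local_finished: list[str]) -> list[str] | None:
    if rank == 0:
        finished = list(local_finished)
        for src in range(1, world_size):
            # Collect the IDs each other rank has finished transferring.
            finished.extend(tp_group.recv_object(src=src))
        return finished
    # Non-zero ranks report their finished IDs to rank 0.
    tp_group.send_object(local_finished, dst=0)
    return None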

ApostaC and others added 30 commits on April 17, 2025 at 13:56, including:

- fix spelling
- CUDA_VISIBLE_DEVICES should be set externally

Signed-off-by: Tyler Michael Smith <[email protected]>
robertgshaw2-redhat changed the title from "[P/D] Support TP g.t. 1" to "[P/D Disagg] [1/N] Support Homogeneous TP > 1" on May 3, 2025
from vllm.config import VllmConfig
from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    KVConnectorBase_V1, KVConnectorMetadata, KVConnectorRole)
from vllm.distributed.parallel_state import get_tensor_model_parallel_rank
robertgshaw2-redhat (Collaborator, Author) commented:

In a follow-up: do this by passing the rank to the Connector directly rather than grabbing it here.
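As a rough sketch of that follow-up idea (the constructor signature here is assumed, not the PR's actual code), the connector could receive the rank from its caller instead of reading global parallel state:

# Assumed sketch: inject the TP rank rather than calling
# get_tensor_model_parallel_rank() inside the connector.
class NixlConnector(KVConnectorBase_V1):
    def __init__(self, vllm_config: VllmConfig, role: KVConnectorRole,
                 tp_rank: int):  # rank passed in by the caller (hypothetical)
        super().__init__(vllm_config=vllm_config, role=role)
        self.tp_rank = tp_rank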

other_ranks_finished_ids: list[str] = []
for i in range(1, self.world_size):
    other_ranks_finished_ids.extend(
        self.tp_group.recv_object(src=i))
robertgshaw2-redhat (Collaborator, Author) commented:

This is how Dynamo does it (with the tp_group).

I wonder if there is a better way. cc @njhill
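For context, a rough sketch of the full send/receive pairing this snippet is part of, assuming the TP group also exposes a send_object counterpart (self.rank and local_finished_ids are illustrative names, not from the PR):

# Hypothetical pairing: rank 0 receives, every other rank sends.
if self.rank == 0:
    other_ranks_finished_ids: list[str] = []
    for i in range(1, self.world_size):
        # Rank 0 collects finished request IDs from each other rank.
        other_ranks_finished_ids.extend(self.tp_group.recv_object(src=i))
else:
    # Non-zero ranks report their locally finished request IDs to rank 0.
    self.tp_group.send_object(local_finished_ids, dst=0)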

Member commented:

@robertgshaw2-redhat here is an alternative to consider: robertgshaw2-redhat#7

I guess this might be preferable latency-wise since we don't have the additional gather collective, but I'm not sure (the scheduler now needs to receive from all ranks, though it was doing this anyhow until recently).

robertgshaw2-redhat (Collaborator, Author) commented:

Let's just time things and see which one is faster.

robertgshaw2-redhat (Collaborator, Author) commented May 4, 2025:

Looks like for TP=2 the setup I have is taking < 1 ms, so I think this is good enough for now, as I would prefer to keep the changes in this file if possible.
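For reference, one generic way to time this kind of aggregation (a micro-timing sketch, not the benchmark actually used here):

import time

# Time the rank-0 aggregation loop shown earlier in the diff.
start = time.perf_counter()
other_ranks_finished_ids: list[str] = []
for i in range(1, self.world_size):
    other_ranks_finished_ids.extend(self.tp_group.recv_object(src=i))
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"finished-id aggregation took {elapsed_ms:.3f} ms")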

robertgshaw2-redhat marked this pull request as ready for review on May 4, 2025 at 19:18.
robertgshaw2-redhat merged commit 06847be into neuralmagic:disagg_pd_dev on May 4, 2025; 2 checks passed.