[Bugfix][PD] Fix multi-node TP (TP>8) by NickLucche · Pull Request #39907 · vllm-project/vllm

NickLucche · 2026-04-15T13:31:31Z

As reported by @S1ro1, starting a PD setup with NixlConnector on a multi-node deployment (i.e. -tp 16) results in:

RuntimeError: Remote NIXL agent engine ID mismatch.

during the handshake, because the TP workers spawned on node1 have a different engine_id, which gets generated here:

vllm/vllm/config/kv_transfer.py, lines 93 to 95 in db8d4a4

    def __post_init__(self) -> None:
        if self.engine_id is None:
            self.engine_id = str(uuid.uuid4())
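To make the failure mode concrete, here is a minimal, self-contained sketch. `ToyKVTransferConfig` is a hypothetical stand-in that copies only the `__post_init__` default quoted above; it is not the real vLLM class. Any config built without an explicit `engine_id` draws its own UUID, so configs built independently on two nodes will not agree:

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyKVTransferConfig:
    """Hypothetical stand-in mirroring only the engine_id default above."""

    engine_id: Optional[str] = None

    def __post_init__(self) -> None:
        # Same logic as the quoted snippet: fall back to a random UUID.
        if self.engine_id is None:
            self.engine_id = str(uuid.uuid4())


# Two configs built independently, as would happen on node0 and node1.
node0_cfg = ToyKVTransferConfig()
node1_cfg = ToyKVTransferConfig()
ids_differ = node0_cfg.engine_id != node1_cfg.engine_id

# An explicitly provided engine_id is left untouched.
pinned = ToyKVTransferConfig(engine_id="my-engine")
```

The `pinned` case is why setting the ID by hand also avoids the mismatch: the random fallback only runs when no value was given.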

This PR adds an additional "TP-sync" step ~~during connector init~~ that exchanges engine_id across TP workers, so that they all use the same ID as rank 0 regardless of which node they run on.
EDIT: the sync step has been moved to "gpu_worker-level" so that it's agnostic to the kind of connector being used.
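A rough sketch of the idea behind the sync step, simulated in a single process. The `broadcast_object` function below is a toy stand-in written for this example; in the PR's diff the real call is `get_tp_group().broadcast_object(self.engine_id, src=0)`, which is a collective executed by every rank. The 8-GPUs-per-node layout is an assumption for illustration only.

```python
import uuid

GPUS_PER_NODE = 8  # assumption for illustration
TP_SIZE = 16       # spans two nodes

# One random engine_id per node, shared by the ranks on that node.
node_ids = [str(uuid.uuid4()) for _ in range(TP_SIZE // GPUS_PER_NODE)]
engine_ids = [node_ids[rank // GPUS_PER_NODE] for rank in range(TP_SIZE)]


def broadcast_object(per_rank_values: list, src: int = 0) -> list:
    """Toy broadcast: every rank ends up with the value held by `src`."""
    return [per_rank_values[src]] * len(per_rank_values)


ids_before = set(engine_ids)  # two distinct IDs, one per node
if TP_SIZE > 1:
    engine_ids = broadcast_object(engine_ids, src=0)
ids_after = set(engine_ids)   # only rank 0's ID remains
```

After the broadcast every rank reports rank 0's ID, so the remote side of the handshake sees one consistent engine_id.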

This is particularly important for NVL72 systems (where TP>8 makes more sense) as well as GB systems with node size 4.

A simpler alternative would be to just ask the user to set engine_id manually when the error above is raised, so that all TP ranks have the same engine id across nodes.
Any thoughts on these options? @markmc @robertgshaw2-redhat @tlrmchlsmth
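For comparison, the manual alternative could look roughly like the sketch below: generate the ID once and pass the same value to every node of one instance. The helper name is hypothetical, and the `kv_role` value and the idea of passing the result as a JSON kv-transfer config are assumptions that should be checked against the vLLM version in use; only `engine_id` and `NixlConnector` come from this PR.

```python
import json
import uuid

# Generate the ID once (for example on the head node) and reuse it on every
# node that belongs to the same vLLM instance.
shared_engine_id = str(uuid.uuid4())


def kv_transfer_config_json(engine_id: str) -> str:
    """Hypothetical helper: build a JSON config with an explicit engine_id.

    Field names other than engine_id are assumptions, not confirmed here.
    """
    return json.dumps(
        {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
            "engine_id": engine_id,
        },
        sort_keys=True,
    )


node0_arg = kv_transfer_config_json(shared_engine_id)
node1_arg = kv_transfer_config_json(shared_engine_id)
```

The shared value applies only to the nodes of a single instance; the prefill and decode instances of a PD deployment would presumably still need engine IDs that differ from each other.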

gemini-code-assist

Code Review

This pull request introduces synchronization of the engine_id across nodes in multi-node Tensor Parallel (TP) configurations to ensure consistency. The review feedback highlights that synchronization should be performed across the entire world group, including Pipeline Parallelism (PP) dimensions, rather than just the TP group, to prevent handshake failures in complex distributed setups.

gemini-code-assist · 2026-04-15T13:33:39Z

@@ -57,6 +57,7 @@
 from vllm.distributed.parallel_state import (
    get_tensor_model_parallel_rank,
    get_tensor_model_parallel_world_size,
+    get_tp_group,


Import get_world_group instead of get_tp_group. The engine_id needs to be synchronized across all parallel dimensions (including Pipeline Parallelism) to ensure consistency in multi-node deployments.

Suggested change

-    get_tp_group,
+    get_world_group,

gemini-code-assist · 2026-04-15T13:33:39Z

+        # In multi-node TP, each node independently generates a random
+        # engine_id. Broadcast rank 0's engine_id to ensure consistency.
+        if self.world_size > 1:
+            self.engine_id = get_tp_group().broadcast_object(self.engine_id, src=0)
+            logger.debug(
+                "TP engine_id standardized to %s from previous config value %s",
+                self.engine_id,
+                engine_id,
+            )


The engine_id must be consistent across the entire vLLM instance to avoid handshake failures in multi-node configurations that use both Tensor Parallelism (TP) and Pipeline Parallelism (PP). Using get_world_group() ensures the ID is synchronized across all ranks in the deployment, whereas get_tp_group() only covers the TP dimension. Additionally, the condition should check the global world size to ensure synchronization happens in PP-only distributed setups as well.

Suggested change

-        # In multi-node TP, each node independently generates a random
-        # engine_id. Broadcast rank 0's engine_id to ensure consistency.
-        if self.world_size > 1:
-            self.engine_id = get_tp_group().broadcast_object(self.engine_id, src=0)
-            logger.debug(
-                "TP engine_id standardized to %s from previous config value %s",
-                self.engine_id,
-                engine_id,
-            )
+        # In multi-node distributed setups, each node independently generates a random
+        # engine_id. Broadcast rank 0's engine_id to ensure consistency across all ranks.
+        if get_world_group().world_size > 1:
+            self.engine_id = get_world_group().broadcast_object(self.engine_id, src=0)
+            logger.debug(
+                "Engine engine_id standardized to %s from previous config value %s",
+                self.engine_id,
+                engine_id,
+            )
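The reviewer's argument can be illustrated with a small single-process simulation. This is only a sketch of the concern as stated in the comment above, not a statement about what was merged. The rank layout (`rank = pp_rank * TP_SIZE + tp_rank`), the sizes, and the `broadcast_within` helper are all assumptions made for the example.

```python
import uuid

TP_SIZE, PP_SIZE = 8, 2  # illustrative sizes, one node per PP stage
WORLD_SIZE = TP_SIZE * PP_SIZE

# Assumption: rank = pp_rank * TP_SIZE + tp_rank, and each node (here, each
# PP stage) starts from its own randomly generated engine_id.
node_ids = [str(uuid.uuid4()) for _ in range(PP_SIZE)]
initial = [node_ids[rank // TP_SIZE] for rank in range(WORLD_SIZE)]


def broadcast_within(groups: list, values: list) -> list:
    """Toy broadcast: in each group, copy the first member's value to all."""
    out = list(values)
    for group in groups:
        for rank in group:
            out[rank] = values[group[0]]
    return out


tp_groups = [list(range(p * TP_SIZE, (p + 1) * TP_SIZE)) for p in range(PP_SIZE)]
world_group = [list(range(WORLD_SIZE))]

after_tp_sync = broadcast_within(tp_groups, initial)
after_world_sync = broadcast_within(world_group, initial)

distinct_after_tp = len(set(after_tp_sync))        # one ID per PP stage
distinct_after_world = len(set(after_world_sync))  # a single ID
```

Under these assumptions, a TP-only broadcast leaves one engine_id per pipeline stage, while a world-group broadcast collapses them to a single ID.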

mergify · 2026-04-17T08:54:47Z

Hi @NickLucche, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

tlrmchlsmth · 2026-04-17T12:47:32Z

@@ -64,6 +79,8 @@ def ensure_kv_transfer_initialized(
        vllm_config.kv_transfer_config.is_kv_transfer_instance
        and _KV_CONNECTOR_AGENT is None
    ):
+        _sync_engine_id_across_tp(vllm_config)


I feel this is not a great place for the engine ID sync, since the engine ID lives in the vLLM config and we are only doing this here when there are KV connectors.

This will break if some other code uses the engine ID (either today or if it gets added in the future).

engine_id lives at vllm_config.kv_transfer_config.engine_id, which I feel should semantically imply that a connector is present.
If engine_id is used even without a connector, we should probably pull it out of kv_transfer_config.

@njhill any opinion on where it might be best to place this engine_id sync?

NickLucche · 2026-04-17T13:31:19Z

Thanks for reviewing @tlrmchlsmth !

mergify · 2026-04-20T12:15:06Z

Hi @NickLucche, the pre-commit checks have failed. Please run the same pre-commit commands listed in the 2026-04-17 mergify comment above.

tlrmchlsmth

Sorry - stale message. You're right @NickLucche - this does seem like an ok spot for the sync

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

NickLucche requested review from ApostaC and orozery as code owners April 15, 2026 13:31

mergify Bot added the bug and kv-connector labels Apr 15, 2026

gemini-code-assist Bot reviewed Apr 15, 2026

NickLucche requested a review from xuechendi as a code owner April 17, 2026 08:50

tlrmchlsmth reviewed Apr 17, 2026

Comment thread vllm/distributed/kv_transfer/kv_transfer_state.py Outdated

NickLucche force-pushed the nixl-fix-multinode-tp branch from bdd9383 to fc420cd Compare April 20, 2026 12:10

NickLucche requested a review from tlrmchlsmth April 23, 2026 15:06

tlrmchlsmth approved these changes May 4, 2026

NickLucche added the ready label May 4, 2026

NickLucche enabled auto-merge (squash) May 4, 2026 17:15

NickLucche force-pushed the nixl-fix-multinode-tp branch from 81b31a9 to f8cb4f6 Compare May 6, 2026 07:27

NickLucche mentioned this pull request May 6, 2026

[CI][Elastic EP] Fix Elastic EP Scaling Test Failure #41792

Merged

NickLucche force-pushed the nixl-fix-multinode-tp branch from f8cb4f6 to 2a40e79 Compare May 6, 2026 09:06

mergify Bot added the v1 label May 6, 2026

NickLucche added 5 commits May 11, 2026 07:15

4f9a1a2 init
04e4b1d lift sync to gpuworker
f5adec9 redundant if check
6b8d2d8 precommit
fb58d9a fix test that didnt init tp group

NickLucche force-pushed the nixl-fix-multinode-tp branch from 9a24d08 to fb58d9a Compare May 11, 2026 07:21

NickLucche mentioned this pull request May 11, 2026

[Roadmap]: PD Disaggregation with NixlConnector Roadmap #33702

Open

66 tasks

Merge branch 'main' into nixl-fix-multinode-tp

3248492

NickLucche merged commit 71bcd02 into vllm-project:main May 13, 2026
62 checks passed


NickLucche mentioned this pull request May 28, 2026

[Bugfix][KVConnector][NIXL] Sync engine_id across nodes for headless multi-node deployments #42928

Open

