
[Bugfix][PD] Fix multi-node TP (TP>8) #39907

Merged
NickLucche merged 6 commits into vllm-project:main from NickLucche:nixl-fix-multinode-tp on May 13, 2026

@NickLucche NickLucche commented Apr 15, 2026

Member

As reported by @S1ro1, starting a PD (prefill/decode disaggregation) setup with NixlConnector on a multi-node deployment (e.g. -tp 16) results in:

RuntimeError: Remote NIXL agent engine ID mismatch.

during the handshake, because the TP workers spawned on node1 end up with a different engine_id, which is generated here:

    def __post_init__(self) -> None:
        if self.engine_id is None:
            self.engine_id = str(uuid.uuid4())
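
To make the failure mode concrete, here is a minimal, self-contained sketch. `ToyKVTransferConfig` and `check_handshake` are hypothetical stand-ins, not vLLM classes; only the `__post_init__` fallback mirrors the snippet above, and the check is modelled loosely on the error message rather than on the real handshake code.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyKVTransferConfig:
    """Hypothetical stand-in for the config object that owns engine_id."""

    engine_id: Optional[str] = None

    def __post_init__(self) -> None:
        # Same fallback as the snippet above: pick a random ID if none was given.
        if self.engine_id is None:
            self.engine_id = str(uuid.uuid4())


def check_handshake(expected_engine_id: str, remote_engine_id: str) -> None:
    """Hypothetical check, modelled on the error message quoted in this PR."""
    if expected_engine_id != remote_engine_id:
        raise RuntimeError("Remote NIXL agent engine ID mismatch.")


# Each node builds its own config, so each one draws its own UUID.
node0_cfg = ToyKVTransferConfig()
node1_cfg = ToyKVTransferConfig()

try:
    check_handshake(node0_cfg.engine_id, node1_cfg.engine_id)
    handshake_ok = True
except RuntimeError:
    handshake_ok = False
```

With two independently generated UUIDs the check fails, which is the situation a worker on the second node runs into.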

This PR adds an additional "TP-sync" step during connector init to exchange engine_id across tp workers, so they all have the same id as rank0 regardless of node.
EDIT: the sync step has been moved to "gpu_worker-level" so that it's agnostic to the kind of connector being used.
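
The effect of the sync step can be sketched in plain Python. This is a toy simulation, not the PR's implementation: `broadcast_from_rank0` is a hypothetical helper that stands in for a real collective, and the 2 nodes x 8 GPUs layout is an assumption chosen to match `-tp 16`.

```python
import uuid


def broadcast_from_rank0(values: list[str]) -> list[str]:
    """Toy broadcast: every rank receives the value held by rank 0."""
    return [values[0] for _ in values]


# Assumed layout: 16 TP workers spread over 2 nodes with 8 GPUs each.
# Each node generated its own random ID, so two distinct IDs exist before the sync.
node_ids = [str(uuid.uuid4()), str(uuid.uuid4())]
engine_ids = [node_ids[rank // 8] for rank in range(16)]
assert len(set(engine_ids)) == 2

# After the sync, all workers share rank 0's ID regardless of node.
engine_ids = broadcast_from_rank0(engine_ids)
```

In a real deployment the list would not exist in one process; each rank would hold one entry and a collective broadcast would replace it.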

This is particularly important for NVL72 systems (where TP>8 makes more sense) as well as GB systems with node size 4.

A simpler alternative would be to just ask the user to set engine_id manually when the error above is raised, so that all TP ranks have the same engine id across nodes.
Any thoughts on these options? @markmc @robertgshaw2-redhat @tlrmchlsmth
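
For comparison, the manual alternative could look roughly like the sketch below. It assumes that `engine_id` can be supplied through the JSON passed to `--kv-transfer-config`, alongside `kv_connector` and `kv_role`; treat those field names and the `kv_both` value as assumptions to verify against your vLLM version. `kv_transfer_config_json` is a hypothetical helper.

```python
import json
import uuid

# Generate the ID once, outside vLLM, and reuse it on every node.
shared_engine_id = str(uuid.uuid4())


def kv_transfer_config_json(engine_id: str) -> str:
    """Build the JSON string for --kv-transfer-config (field names assumed)."""
    return json.dumps(
        {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
            "engine_id": engine_id,
        }
    )


# The same string is passed to the launch command on both nodes.
node0_arg = kv_transfer_config_json(shared_engine_id)
node1_arg = kv_transfer_config_json(shared_engine_id)
```

This keeps vLLM unchanged but moves the burden to the operator, who has to distribute the ID to every node's launch command.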

@mergify mergify Bot added the bug and kv-connector labels Apr 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces synchronization of the engine_id across nodes in multi-node Tensor Parallel (TP) configurations to ensure consistency. The review feedback highlights that synchronization should be performed across the entire world group, including Pipeline Parallelism (PP) dimensions, rather than just the TP group, to prevent handshake failures in complex distributed setups.

@@ -57,6 +57,7 @@
 from vllm.distributed.parallel_state import (
     get_tensor_model_parallel_rank,
     get_tensor_model_parallel_world_size,
     get_tp_group,

Contributor

high

Import get_world_group instead of get_tp_group. The engine_id needs to be synchronized across all parallel dimensions (including Pipeline Parallelism) to ensure consistency in multi-node deployments.

Suggested change
-    get_tp_group,
+    get_world_group,

Comment on lines +193 to +201
        # In multi-node TP, each node independently generates a random
        # engine_id. Broadcast rank 0's engine_id to ensure consistency.
        if self.world_size > 1:
            self.engine_id = get_tp_group().broadcast_object(self.engine_id, src=0)
            logger.debug(
                "TP engine_id standardized to %s from previous config value %s",
                self.engine_id,
                engine_id,
            )

Contributor

high

The engine_id must be consistent across the entire vLLM instance to avoid handshake failures in multi-node configurations that use both Tensor Parallelism (TP) and Pipeline Parallelism (PP). Using get_world_group() ensures the ID is synchronized across all ranks in the deployment, whereas get_tp_group() only covers the TP dimension. Additionally, the condition should check the global world size to ensure synchronization happens in PP-only distributed setups as well.

Suggested change
-        # In multi-node TP, each node independently generates a random
-        # engine_id. Broadcast rank 0's engine_id to ensure consistency.
-        if self.world_size > 1:
-            self.engine_id = get_tp_group().broadcast_object(self.engine_id, src=0)
-            logger.debug(
-                "TP engine_id standardized to %s from previous config value %s",
-                self.engine_id,
-                engine_id,
-            )
+        # In multi-node distributed setups, each node independently generates a random
+        # engine_id. Broadcast rank 0's engine_id to ensure consistency across all ranks.
+        if get_world_group().world_size > 1:
+            self.engine_id = get_world_group().broadcast_object(self.engine_id, src=0)
+            logger.debug(
+                "Engine engine_id standardized to %s from previous config value %s",
+                self.engine_id,
+                engine_id,
+            )
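
The reviewer's point about TP-only synchronization can be illustrated with a toy simulation. Everything here is an assumption made for illustration: the TP=2, PP=2 layout, the rank-to-group mapping, the worst case of one random ID per rank, and the `broadcast_within` helper, which only mimics a per-group broadcast in a single process.

```python
import uuid

TP_SIZE, PP_SIZE = 2, 2
WORLD_SIZE = TP_SIZE * PP_SIZE


def fresh_ids() -> list[str]:
    # Worst case: every rank starts with its own random engine_id.
    return [str(uuid.uuid4()) for _ in range(WORLD_SIZE)]


def broadcast_within(groups: list[list[int]], ids: list[str]) -> list[str]:
    """Toy broadcast: in each group, all ranks take the first member's ID."""
    out = list(ids)
    for group in groups:
        for rank in group:
            out[rank] = ids[group[0]]
    return out


# Assumed layout: consecutive ranks form a TP group, one group per PP stage.
tp_groups = [list(range(s * TP_SIZE, (s + 1) * TP_SIZE)) for s in range(PP_SIZE)]
world_group = [list(range(WORLD_SIZE))]

after_tp_sync = broadcast_within(tp_groups, fresh_ids())
after_world_sync = broadcast_within(world_group, fresh_ids())
```

Under these assumptions a TP-only broadcast leaves one distinct ID per pipeline stage, while a world-group broadcast leaves a single ID for the whole instance.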

@NickLucche NickLucche requested a review from xuechendi as a code owner April 17, 2026 08:50
mergify Bot commented Apr 17, 2026

Contributor

Hi @NickLucche, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Comment thread vllm/distributed/kv_transfer/kv_transfer_state.py Outdated
@@ -64,6 +79,8 @@ def ensure_kv_transfer_initialized(
         vllm_config.kv_transfer_config.is_kv_transfer_instance
         and _KV_CONNECTOR_AGENT is None
     ):
         _sync_engine_id_across_tp(vllm_config)

Member

I feel this is not a great place for the engine ID sync: the engine ID lives in the vLLM config, but here we only sync it when a KV connector is present.

This will break if any other code uses the engine ID, whether today or in the future.

Member Author

engine_id lives at vllm_config.kv_transfer_config.engine_id, which I feel semantically implies that a connector has to be present.
If engine_id is used even without a connector, we should probably pull it out of kv_transfer_config.
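
The placement question in this thread can be made concrete with a sketch of a guarded sync helper. This is a guess at the general shape, not the merged code: the body of `_sync_engine_id_across_tp` is not shown in this conversation, so `sync_engine_id_across_tp` and `FakeTPGroup` below are hypothetical. The only detail taken from the diff above is the `broadcast_object(obj, src=0)` call on the group.

```python
from types import SimpleNamespace


class FakeTPGroup:
    """Hypothetical single-process stand-in for a TP process group."""

    def __init__(self, rank0_value: str, world_size: int) -> None:
        self.rank0_value = rank0_value
        self.world_size = world_size

    def broadcast_object(self, obj, src: int = 0):
        # A real broadcast returns the src rank's object on every rank.
        return self.rank0_value


def sync_engine_id_across_tp(vllm_config, tp_group) -> None:
    """Overwrite the local engine_id with rank 0's, if a KV config exists."""
    kv_cfg = vllm_config.kv_transfer_config
    if kv_cfg is None or tp_group.world_size <= 1:
        # No connector configured, or nothing to synchronize with.
        return
    kv_cfg.engine_id = tp_group.broadcast_object(kv_cfg.engine_id, src=0)


# A worker on the second node starts with its own ID and adopts rank 0's.
rank1_cfg = SimpleNamespace(
    kv_transfer_config=SimpleNamespace(engine_id="id-from-node1")
)
sync_engine_id_across_tp(rank1_cfg, FakeTPGroup("id-from-rank0", world_size=16))

# Without a KV transfer config the helper does nothing.
no_connector_cfg = SimpleNamespace(kv_transfer_config=None)
sync_engine_id_across_tp(no_connector_cfg, FakeTPGroup("id-from-rank0", world_size=16))
```

Because the guard keys off `kv_transfer_config`, any future consumer of `engine_id` outside that config would not be covered, which is the concern raised in the review.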

Member Author

@njhill any opinion on where it might be best to place this engine_id sync?

@NickLucche

Member Author

Thanks for reviewing @tlrmchlsmth !

@NickLucche NickLucche force-pushed the nixl-fix-multinode-tp branch from bdd9383 to fc420cd Compare April 20, 2026 12:10
mergify Bot commented Apr 20, 2026

Contributor

Hi @NickLucche, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@NickLucche NickLucche requested a review from tlrmchlsmth April 23, 2026 15:06

@tlrmchlsmth tlrmchlsmth left a comment

Member

Sorry - stale message. You're right @NickLucche - this does seem like an ok spot for the sync

@NickLucche NickLucche added the ready label May 4, 2026
@NickLucche NickLucche enabled auto-merge (squash) May 4, 2026 17:15
@NickLucche NickLucche force-pushed the nixl-fix-multinode-tp branch from 81b31a9 to f8cb4f6 Compare May 6, 2026 07:27
@NickLucche NickLucche force-pushed the nixl-fix-multinode-tp branch from f8cb4f6 to 2a40e79 Compare May 6, 2026 09:06
@mergify mergify Bot added the v1 label May 6, 2026
Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche NickLucche merged commit 71bcd02 into vllm-project:main May 13, 2026
62 checks passed