[Bugfix][PD] Fix multi-node TP (TP>8)#39907
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces synchronization of the engine_id across nodes in multi-node Tensor Parallel (TP) configurations to ensure consistency. The review feedback highlights that synchronization should be performed across the entire world group, including Pipeline Parallelism (PP) dimensions, rather than just the TP group, to prevent handshake failures in complex distributed setups.
| @@ -57,6 +57,7 @@ | |||
| from vllm.distributed.parallel_state import ( | |||
| get_tensor_model_parallel_rank, | |||
| get_tensor_model_parallel_world_size, | |||
| get_tp_group, | |||
| # In multi-node TP, each node independently generates a random | ||
| # engine_id. Broadcast rank 0's engine_id to ensure consistency. | ||
| if self.world_size > 1: | ||
| self.engine_id = get_tp_group().broadcast_object(self.engine_id, src=0) | ||
| logger.debug( | ||
| "TP engine_id standardized to %s from previous config value %s", | ||
| self.engine_id, | ||
| engine_id, | ||
| ) |
There was a problem hiding this comment.
The engine_id must be consistent across the entire vLLM instance to avoid handshake failures in multi-node configurations that use both Tensor Parallelism (TP) and Pipeline Parallelism (PP). Using get_world_group() ensures the ID is synchronized across all ranks in the deployment, whereas get_tp_group() only covers the TP dimension. Additionally, the condition should check the global world size to ensure synchronization happens in PP-only distributed setups as well.
| # In multi-node TP, each node independently generates a random | |
| # engine_id. Broadcast rank 0's engine_id to ensure consistency. | |
| if self.world_size > 1: | |
| self.engine_id = get_tp_group().broadcast_object(self.engine_id, src=0) | |
| logger.debug( | |
| "TP engine_id standardized to %s from previous config value %s", | |
| self.engine_id, | |
| engine_id, | |
| ) | |
| # In multi-node distributed setups, each node independently generates a random | |
| # engine_id. Broadcast rank 0's engine_id to ensure consistency across all ranks. | |
| if get_world_group().world_size > 1: | |
| self.engine_id = get_world_group().broadcast_object(self.engine_id, src=0) | |
| logger.debug( | |
| "Engine engine_id standardized to %s from previous config value %s", | |
| self.engine_id, | |
| engine_id, | |
| ) |
|
Hi @NickLucche, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
| @@ -64,6 +79,8 @@ def ensure_kv_transfer_initialized( | |||
| vllm_config.kv_transfer_config.is_kv_transfer_instance | |||
| and _KV_CONNECTOR_AGENT is None | |||
| ): | |||
| _sync_engine_id_across_tp(vllm_config) | |||
There was a problem hiding this comment.
I feel this is not a great place for the engine ID sync, since the engine ID lives in the vLLM config and we are only doing this here when there are KV connectors.
This will break if there is some other code that uses the engine ID (either presently or if it gets added in the future)
There was a problem hiding this comment.
engine_id is part of the vllm_config.kv_transfer_config.engine_id, which I feel semantically should entail a connector has to be present.
If engine_id is used even without connector we should probably pull it out of kv_transfer_config
There was a problem hiding this comment.
@njhill any opinion on where it might be best to place this engine_id sync?
|
Thanks for reviewing @tlrmchlsmth ! |
bdd9383 to
fc420cd
Compare
|
Hi @NickLucche, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
There was a problem hiding this comment.
Sorry - stale message. You're right @NickLucche - this does seem like an ok spot for the sync
81b31a9 to
f8cb4f6
Compare
f8cb4f6 to
2a40e79
Compare
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
9a24d08 to
fb58d9a
Compare
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
As reported by @S1ro1 , starting a PD setup with NixlConnector on a multi-node setup (ie
-tp 16) will result in:during handshake, as the tp workers spawned on node1 will have a different engine_id, which gets generated here
vllm/vllm/config/kv_transfer.py
Lines 93 to 95 in db8d4a4
This PR adds an additional "TP-sync" step
during connector initto exchange engine_id across tp workers, so they all have the same id as rank0 regardless of node.EDIT: the sync step has been moved to "gpu_worker-level" so that it's agnostic to the kind of connector being used.
This is particularly important for NVL72 systems (where TP>8 makes more sense) as well as GB systems with node size 4.
A simpler alternative would be to just ask the user to set engine_id manually when the error above is raised, so that all TP ranks have the same engine id across nodes.
Any thoughts on these options? @markmc @robertgshaw2-redhat @tlrmchlsmth