[Core][Engine] allow DP ray placement groups to be set on specific nodes#44669
Merged
robertgshaw2-redhat merged 2 commits intoJun 5, 2026
Signed-off-by: walterbm <walter.beller.morales@gmail.com>
robertgshaw2-redhat
approved these changes
Jun 5, 2026
vrdn-23
added a commit
to vrdn-23/vllm
that referenced
this pull request
Jun 5, 2026
Resolves the recurring envs.py merge conflict per docs/superpowers/specs/2026-05-14-envs-merge-conflict-resolution-design.md.

The legacy `if TYPE_CHECKING:` block and `environment_variables: dict[str, Callable]` runtime mapping were dropped on the branch in favor of pydantic `*Settings(BaseSettings)` subclasses. Every main-side edit to either location therefore conflicts mechanically; structural resolution is `--ours` for vllm/envs.py, then port the semantic delta as new `Field(...)` declarations on the appropriate sub-model.

Main-side commits since merge base afcb580, with port disposition:
- c73b0d0 (vllm-project#44669): adds VLLM_RAY_DP_PLACEMENT_NODE_IPS (str=""). Ported to DistributedSettings.ray_dp_placement_node_ips.
- 165b786 (vllm-project#40426): adds VLLM_ROCM_USE_AITER_LINEAR_HIPBMM (bool=False). Ported to RocmSettings.rocm_use_aiter_linear_hipbmm. Native pydantic bool parsing replaces the `.lower() in ("true","1")` lambda.
- 38fd240 (vllm-project#41980): adds VLLM_DISTRIBUTED_USE_SPLIT_GROUP (bool=False). Ported to DistributedSettings.distributed_use_split_group. Native pydantic bool parsing replaces the `bool(int(...))` lambda.
- a618356 (vllm-project#43447): adds VLLM_PREFIX_CACHE_RETENTION_INTERVAL (int|None=None, tri-state). Ported to ServerSettings.prefix_cache_retention_interval; pydantic's unset-vs-explicit-zero handling matches the original `"X" in os.environ` guard.
- bd98e97 (vllm-project#44128): removes dead VLLM_RPC_TIMEOUT. Mirrored on the branch by deleting ServerSettings.rpc_timeout.

Verification: vllm.envs imports cleanly; all four new vars read defaults and parse env-set values (incl. tri-state INTERVAL=0); VLLM_RPC_TIMEOUT correctly raises AttributeError; pre-commit passes ruff/format/mypy.

Signed-off-by: Vinay Damodaran <vrdn@hey.com>
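For readers unfamiliar with the legacy parsing idioms this commit message names, the following is a minimal stdlib-only sketch of the three patterns (`.lower() in ("true","1")`, `bool(int(...))`, and the `"X" in os.environ` guard). The function names and the `env` parameter are hypothetical and chosen for illustration; this is not the code from vllm/envs.py.

```python
import os
from typing import Mapping, Optional


def parse_bool_lower(name: str, env: Mapping[str, str] = os.environ) -> bool:
    # Pattern: `.lower() in ("true", "1")`; anything else (or unset) is False.
    return env.get(name, "False").lower() in ("true", "1")


def parse_bool_int(name: str, env: Mapping[str, str] = os.environ) -> bool:
    # Pattern: `bool(int(...))`; unset falls back to "0", i.e. False.
    return bool(int(env.get(name, "0")))


def parse_optional_int(name: str, env: Mapping[str, str] = os.environ) -> Optional[int]:
    # Pattern: `"X" in os.environ` guard, giving a tri-state result:
    # unset -> None, explicit "0" -> 0, any other integer -> that integer.
    if name in env:
        return int(env[name])
    return None
```

The tri-state helper is the one where unset and explicit zero must stay distinguishable, which is the property the commit message says the pydantic port preserves.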
knight0528
pushed a commit
to knight0528/vllm
that referenced
this pull request
Jun 8, 2026
…des (vllm-project#44669) Signed-off-by: walterbm <walter.beller.morales@gmail.com>
ekagra-ranjan
pushed a commit
to ekagra-ranjan/vllm
that referenced
this pull request
Jun 9, 2026
…des (vllm-project#44669) Signed-off-by: walterbm <walter.beller.morales@gmail.com> Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
waqahmed-amd-fi
pushed a commit
to waqahmed-amd-fi/vllm
that referenced
this pull request
Jun 10, 2026
…des (vllm-project#44669) Signed-off-by: walterbm <walter.beller.morales@gmail.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Purpose
When running a DP deployment in Ray, `create_dp_placement_groups` scans the whole Ray cluster and greedily packs DP ranks wherever there are free devices. When multiple independent DP engines share one Ray cluster (e.g. P/D disaggregation or multi-model serving under an external orchestrator), engines race for the same devices: one engine's remote ranks land on another engine's master node, which then fails with `Not enough resources to allocate ... DP ranks on DP master node ...`.

This adds an opt-in `VLLM_RAY_DP_PLACEMENT_NODE_IPS` (comma-separated node IPs). When set, DP placement is restricted to those nodes (the DP master is always included). This lets an orchestrator partition a shared cluster into per-engine node sets and removes the race by construction.

Test Plan
Added a unit test to confirm
Test Result
✅
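To illustrate the restriction described under Purpose, here is a minimal sketch of filtering candidate nodes by a comma-separated IP allowlist while always keeping the DP master. It is not the code from this PR: the helper names `parse_node_ips` and `eligible_nodes` and the dict-based node records are hypothetical, and treating an empty value as "no restriction" is an assumption based on the variable being opt-in with an empty-string default.

```python
from typing import Dict, List, Set


def parse_node_ips(raw: str) -> Set[str]:
    # Split a comma-separated list, dropping whitespace and empty entries.
    return {ip.strip() for ip in raw.split(",") if ip.strip()}


def eligible_nodes(nodes: List[Dict], raw_allowlist: str, master_ip: str) -> List[Dict]:
    # `nodes` is a hypothetical list of records such as {"ip": "10.0.0.1"}.
    allowed = parse_node_ips(raw_allowlist)
    if not allowed:
        # Assumption: an unset/empty value keeps the whole-cluster scan.
        return list(nodes)
    # The DP master is always included, even if the allowlist omits it.
    allowed.add(master_ip)
    return [node for node in nodes if node["ip"] in allowed]


cluster = [{"ip": f"10.0.0.{i}"} for i in range(1, 5)]
engine_a = eligible_nodes(cluster, "10.0.0.2", master_ip="10.0.0.1")
engine_b = eligible_nodes(cluster, "10.0.0.4", master_ip="10.0.0.3")
```

With disjoint allowlists and masters, as in `engine_a` and `engine_b` above, the two engines never consider the same node, which is the sense in which the race is removed by construction.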