[ray, worker] feat: create one placement group across all nodes in resource pool and implement IP-based worker sorting#4792
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a significant and valuable change to worker initialization by creating a single placement group and sorting workers by IP address. This should resolve the scheduling inconsistencies mentioned. The two-phase initialization (__init__ followed by init_worker) is a good architectural pattern for this purpose and has been correctly propagated to most worker types. My review focuses on a couple of critical/high-severity issues related to this new architecture: one concerning the debuggability of workers due to naming, and another a potential critical bug in the initialization of fused workers.
verl/single_controller/ray/base.py
Outdated
| # TODO: worker_name may be inaccurate after adjust_rank, need to fix | ||
| name = f"{self.name_prefix}{cia_name}_{pg_idx}:{local_rank}" # e.g. Worker_2:5 |
There was a problem hiding this comment.
The worker name is generated using the initial rank and local_rank before IP-based sorting. After sorting, the worker's actual rank will change, but its name in Ray will remain the same. This discrepancy can make debugging very difficult, as the actor's name will not correspond to its logical rank within the worker group. This is especially problematic when trying to identify or re-attach to detached workers by name.
The TODO comment acknowledges this. This should be addressed to improve maintainability and reduce confusion during debugging.
…oup PACK strategy topology issue Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com>
8530d39 to
4a71aab
Compare
What does this PR do?
As described in #4903 (comment), to address the issue where consecutive bundles in a single cross-node placement_group may be scheduled to different nodes (ray issue #59547), based on @wuxibin89 's work #4583, we implemented a solution referencing the approach used in vllm. After instantiating the workers, we sort them by IP address, adjust their ranks accordingly, and then proceed with the initialization.
Checklist Before Starting
[ray] feat: resource pool create one placement group across all nodes #4583
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)