[trainer, fsdp, megatron] feat: support one_step_off_policy on Ascend NPU#4686
ji-huazhong merged 2 commits into verl-project:main
Conversation
Code Review
This pull request introduces support for one_step_off_policy on Ascend NPUs by replacing ray.util.collective with a manual torch.distributed process group setup, which is a sound approach to support the hccl backend. The changes are consistent across the modified files. However, I've found a critical race condition in recipe/one_step_off_policy/ray_trainer.py during the weight synchronization group creation. The initialization for the actor worker group is not awaited, which could lead to runtime errors. I've provided a suggestion to fix this issue by ensuring both actor and rollout worker groups are fully initialized before proceeding.
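The race condition described above can be illustrated with a small, hypothetical sketch. The `Future` objects and the function name below stand in for the Ray object refs and the recipe's actual API; they are illustrative, not the real code:

```python
from concurrent.futures import Future, wait

def create_weight_sync_group(actor_init: Future, rollout_init: Future) -> str:
    # The bug: only one worker group's init handle was awaited, so creation of
    # the weight-sync group could race ahead of the other group's setup.
    # The fix: block until BOTH initialization handles have completed.
    done, not_done = wait([actor_init, rollout_init])
    if not_done:
        raise RuntimeError("worker groups not fully initialized")
    # Only now is it safe to build the torch.distributed sync group.
    return "weight_sync_group_ready"
```

With Ray the same idea would be expressed by calling `ray.get` on both init refs before creating the group.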
@baymax591
After investigating, I traced the error to the following code path:
File: ./recipe/one_step_off_policy/distributed_util.py. The code calls a utility function from vLLM. However, the implementation in vllm/distributed/utils only supports IPv4 addresses, so when the training environment uses IPv6, distributed initialization fails. Does this mean the one-step-off-policy distributed training pipeline currently does not support IPv6?
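To illustrate the IPv6 pitfall with a hedged sketch (the helper below is hypothetical, not the vLLM utility in question): a TCP init method for torch.distributed must bracket IPv6 literals, which naive "host:port" string formatting gets wrong:

```python
import ipaddress

def tcp_init_method(host: str, port: int) -> str:
    """Build a tcp:// init-method string that also handles IPv6 literals."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # a hostname, not an IP literal
    if ip is not None and ip.version == 6:
        # Unbracketed IPv6 ("tcp://::1:29500") is ambiguous: the colons in the
        # address cannot be distinguished from the host:port separator.
        return f"tcp://[{host}]:{port}"
    return f"tcp://{host}:{port}"
```

A utility that only ever emits the `tcp://{host}:{port}` form would work for IPv4 but break exactly as reported under IPv6.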


What does this PR do?
Since Ray's collective communication interface does not support the hccl backend, we follow vLLM's example code and implement weight synchronization between the actor and rollout workers using a manually created torch.distributed process group.
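A minimal sketch of the rank layout such a manual process group might use (the function and the rank convention are assumptions for illustration, not the recipe's exact code): actor and rollout workers join a single group, with one actor rank acting as the broadcast source for the updated weights.

```python
def weight_sync_layout(actor_world_size: int, rollout_world_size: int):
    """Assign global ranks in a joint actor+rollout weight-sync group.

    Convention assumed here: actor ranks come first, rollout ranks follow,
    and actor rank 0 is the broadcast source for updated weights.
    """
    world_size = actor_world_size + rollout_world_size
    actor_ranks = list(range(actor_world_size))
    rollout_ranks = list(range(actor_world_size, world_size))
    src_rank = 0  # actor rank 0 broadcasts the weights
    return world_size, actor_ranks, rollout_ranks, src_rank
```

Each worker would then call torch.distributed.init_process_group (backend="hccl" on Ascend NPUs) with its assigned rank and the shared world size, and weights would be pushed with torch.distributed.broadcast from src_rank.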
Checklist Before Starting
- Title format: [{modules}] {type}: {description} (this will be checked by the CI)
  - {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, like [megatron, fsdp, doc]
  - {type} is in feat, fix, refactor, chore, test
  - If the PR breaks any API, prepend [BREAKING] to the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)