PPO training fails with NCCL timeout when running on larger models #373
Comments
Hi, I think I know the solution for this problem; it's only tangentially related to #319. I suspect the cause is that one rank completes experience making before the others.
Thanks! That sounds reasonable. So is this a bug, or do I need to make some changes to my PPO training to work around it?
@agave233 It is a bug; however, it occurs under rather stochastic conditions and will not be triggered if the model doesn't collapse to empty outputs. You could reduce the learning rate or increase the batch size, if possible, to remedy that.
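For example, something along these lines; treat it as a sketch, since the exact config fields (`train.batch_size`, `optimizer.kwargs`) follow the trlx 0.5.x `TRLConfig` layout and may differ in other versions, and the config path is illustrative:

```python
# Sketch only: lower the learning rate and raise the batch size to make a
# collapse to empty outputs less likely. Field names follow the trlx 0.5.x
# TRLConfig layout and may differ in other versions.
from trlx.data.configs import TRLConfig

config = TRLConfig.load_yaml("configs/ppo_config.yml")  # illustrative path

config.train.batch_size = 64          # larger batches smooth the PPO updates
config.optimizer.kwargs["lr"] = 5e-6  # smaller learning rate than before
```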
Thanks for your suggestion. I have tried that, but it did not work. I'm a bit curious why one process can finish first. Is there no inter-process synchronization mechanism during experience making?
Hi @agave233, could you post the script you've used and the git commit, so I can reproduce this particular bug? I'm closing in on a fix for it.
There was no need for it previously, except apparently in corner cases like yours.
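To make the idea concrete, here is a rough sketch of what such a synchronization point could look like with accelerate. This is only an illustration of the concept, not the fix that landed in trlx, and `make_experience` / `ppo_update` are placeholder names:

```python
# Sketch only, not the actual trlx fix: an explicit synchronization point after
# experience making, so a rank that finishes its rollouts early (e.g. because
# its outputs collapsed to empty strings) waits for the others instead of
# running ahead into a later collective.
from accelerate import Accelerator

accelerator = Accelerator()

def make_experience():
    # placeholder for the per-rank rollout/generation step
    return []

def ppo_update(rollouts):
    # placeholder for the PPO optimization step (uses collectives internally)
    pass

rollouts = make_experience()
accelerator.wait_for_everyone()  # barrier: all ranks line up before the update
ppo_update(rollouts)
```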
The timeout problem was resolved with the latest code. Thanks 👍
I am facing the same problem, but the model does not even start training. It seems to time out in some reduce operation. I am trying to train the 1B model with --num_processes 3, using the latest code. Any idea what could be going wrong? Trace below:
@javirandor Hm, it may be that it hangs on the first barrier (given SeqNum=1) here: `trlx/trlx/trainer/accelerate_base_trainer.py`, lines 65–66 at commit 9bc0836.
Try commenting out those lines and giving it another attempt. Also, have you tried running an unmodified existing example on your setup, or does it also fail with the same error?
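If you want to rule out the distributed setup itself, a minimal NCCL sanity check outside of trlx could look like the sketch below (assuming a single node launched with `torchrun --nproc_per_node=3`; the script name is illustrative):

```python
# Minimal NCCL sanity check, independent of trlx: every rank contributes its
# rank id to an all-reduce. If this hangs or times out, the problem is in the
# distributed setup rather than in PPO training.
# Launch (illustrative): torchrun --nproc_per_node=3 nccl_check.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

t = torch.tensor([float(rank)], device="cuda")
dist.all_reduce(t)  # with 3 ranks the sum is 0 + 1 + 2 = 3
print(f"rank {rank}: all_reduce result = {t.item()}")

dist.destroy_process_group()
```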
Hello,
I have successfully run the summarize_rlhf example with small SFT and RM models (bloom1b). However, when I try to run a larger model (7B), a timeout error is raised, which is a similar problem to the one described in issue #319, but I cannot find a solution.
My environment:
trlx version: 0.5.0
accelerate: 0.17.1
torch version: 1.13.1
The error is as follows: