Sokoban model response turned to '!!!!!!!' after some training steps #52
Comments
This is what we are dealing with in our coding report. We believe this is a problem with the algorithm rather than the implementation: when the model is trained for enough steps, some "model collapse" issues appear.
Thanks for your swift reply! I'm really interested in your project.
The experiments above are attempts to replicate https://github.com/ZihanWang314/RAGEN/blob/main/public/loss_curve.png. But clearly I failed to do that with almost-up-to-date code (db0fa39 on 2.21), so I tried another version (f991d1d on 1.28). I found that the 2.21 code resulted in a flat -5.5 line (the responses turned into '!!!!!!', exp3 and exp16), while the 1.28 code can successfully run for more steps (exp20). I also noticed that the 1.28 code uses max_step = 5 (easier data) and penalizes invalid responses with -0.1, both of which differ from 2.21. I redid the experiment with max_step = 10 and a -1 penalty (exp22), identical to 2.21's setting, and no mode collapse appeared either; judging by the content quality of the trajectory responses, the model also seems to be learning more than in exp3 and exp16, although training is a bit slower. May I ask what the difference is between the two code versions, 2.21 db0fa39 and 1.28 f991d1d? Thank you very much.
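For readers skimming the thread, here is a minimal sketch contrasting the two reward settings compared above. Only the max_step and invalid-response penalty values come from the comment; the function name, success bonus, and per-step cost are hypothetical and not taken from the repo.

```python
# Hypothetical sketch of the two settings discussed above; only max_steps and
# invalid_penalty come from the thread, everything else is illustrative.

CONFIG_1_28 = {"max_steps": 5, "invalid_penalty": -0.1}   # 1.28 (f991d1d) defaults per the comment
CONFIG_2_21 = {"max_steps": 10, "invalid_penalty": -1.0}  # 2.21 (db0fa39) defaults per the comment

def step_reward(action_valid: bool, solved: bool, cfg: dict) -> float:
    """Per-step reward under a given config (illustrative values only)."""
    if solved:
        return 10.0                     # assumed success bonus
    if not action_valid:
        return cfg["invalid_penalty"]   # -0.1 vs. -1.0 is the key difference
    return -0.1                         # assumed small per-step cost
```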
Thanks for the update, and sorry for the late reply! I think the main difference lies in the default config. Over the past few days we have been trying to provide the best possible baseline config so that everyone can adapt it to their own tasks with less effort. However, there are sometimes tradeoffs. Our current observations are:
I hope it helps!
Thanks for your kind reply! I do find that the KL divergence blows up and the reward may not be very good when using GRPO instead of GAE (#58 refines the result, but does not solve the problem for GRPO). Do you have any ideas on why GRPO won't work? Thanks a lot!
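As background for the GRPO question, here is a minimal, non-authoritative sketch of GRPO-style group-normalized advantages. It is not the project's implementation, and the accompanying intuition is speculative: if every rollout in a group collapses to the same degenerate output, within-group advantages vanish and the task reward stops providing a corrective gradient.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within one prompt's group of n_rollout responses.

    This mirrors the usual GRPO formulation (reward minus group mean, divided
    by group std); it is a sketch, not the repo's code.
    """
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# A group with varied outcomes yields an informative learning signal...
print(grpo_advantages(np.array([-5.5, -2.0, 0.9, -5.5])))
# ...while a fully collapsed group (every response '!!!!!', reward -5.5)
# yields all-zero advantages, leaving only the KL term to shape the update.
print(grpo_advantages(np.array([-5.5, -5.5, -5.5, -5.5])))
```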
We are not very sure right now, haha.
Thanks! I'll check that.
#58
Does that mean that on your side GRPO now works for 1.5B models (and still won't work for 0.5B models)? Thanks!
We tested on bandit tasks but not on Sokoban (we will also push a minor bugfix for the bandit tasks). Let me reopen this issue and see if there are any other problems.
By the way, it seems that PPO does not have issues on your side. Could you please try using PPO?
Thanks! I'm just spending some time every day trying to figure out GRPO :)
Thanks for the brilliant and inspiring work, but I ran into some problems during training:
The model's responses during Sokoban RL training are correct at first:
But after some steps, they degenerated into '!!!!!', the score stayed at -5.5, and the model did not recover from then on:
For this script (sokoban base prompt in cmd.md), the result turned into '!!!!!' at epoch 0, step 63.
For this script, the result turned into '!!!!!' at epoch 0, step 64.
For this script, the result turned into '!!!!!' at epoch 0, step 68.
For this script, the result turned into '!!!!!' at epoch 0, step 95 for 0.5B, and epoch 0, step 51 for 1.5B.
```bash
bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-[0.5B|1.5B]-Instruct \
    training.train_batch_size=4 \
    training.max_turns=5 \
    training.n_rollout=8
```
Also, critic/reward/mean fails to go higher after approximately step 30 in my tests.
These tests were run with Python 3.10, PyTorch 2.4, and CUDA 12.1 on a single NVIDIA A100 or a single NVIDIA A6000. Any suggestions on how to solve this problem? Thanks very much!
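For anyone reproducing this, a small hypothetical helper (not part of the repo) that can be dropped into rollout logging to flag when responses degenerate into repeated tokens such as '!!!!!':

```python
from collections import Counter

def looks_degenerate(response: str, top_char_ratio: float = 0.9) -> bool:
    """Heuristically flag collapsed responses such as '!!!!!!!'.

    Returns True when a single character dominates the response. The threshold
    is arbitrary; tune it for your own logs. This is an illustrative helper,
    not part of the RAGEN codebase.
    """
    stripped = response.strip()
    if not stripped:
        return True
    most_common_count = Counter(stripped).most_common(1)[0][1]
    return most_common_count / len(stripped) >= top_char_ratio

# Example: check a batch of rollout responses and report the collapse rate.
responses = ["Right, Up, Up, push the box onto the target.", "!!!!!!!!!!!!!"]
collapse_rate = sum(looks_degenerate(r) for r in responses) / len(responses)
print(f"degenerate responses: {collapse_rate:.0%}")
```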