
Sokoban model response turned '!!!!!!!' after steps for training #52

Open
scris opened this issue Feb 26, 2025 · 13 comments

Comments

@scris

scris commented Feb 26, 2025

Thanks for the brilliant and inspiring work, but I ran into a problem during training.

The responses during Sokoban RL training are correct at first:

[image]

But after some steps, the response degenerates into '!!!!!', the score stays at -5.5, and it never recovers:

[image]

With this script (the Sokoban base prompt in cmd.md), the response turned into '!!!!!' at epoch 0, step 63.

bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-0.5B-Instruct \
    training.train_batch_size=4 \
    training.max_turns=5 \
    training.n_rollout=8 \
    training.micro_batch_size=4 \
    training.ppo_batch_size=32 \
    optimization.kl_coef=0.001 \
    optimization.adv_estimator=brpo

With this script, the response turned into '!!!!!' at epoch 0, step 64.

bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-0.5B-Instruct

With this script, the response turned into '!!!!!' at epoch 0, step 68.

bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-0.5B-Instruct \
    training.train_batch_size=4 \
    training.max_turns=5 \
    training.n_rollout=4 \
    training.micro_batch_size=2

With this script, the response turned into '!!!!!' at epoch 0, step 95 for the 0.5B model, and at epoch 0, step 51 for the 1.5B model.

bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-[0.5B|1.5B]-Instruct \
    training.train_batch_size=4 \
    training.max_turns=5 \
    training.n_rollout=8

Also, critic/reward/mean stops improving after approximately step 30 in my tests.

These tests were run with Python 3.10, PyTorch 2.4, and CUDA 12.1 on a single NVIDIA A100 or A6000. Any suggestions on how to solve this problem? Thanks very much!

@ZihanWang314
Collaborator

This is what we are dealing with in our upcoming report. We believe this is a problem with the algorithm rather than the implementation: when the model is trained for enough steps, "model collapse" issues appear.
Please stay tuned for our updates!

@scris
Author

scris commented Feb 27, 2025

Thanks for your swift reply! I'm really interested in your project.

@scris
Author

scris commented Feb 27, 2025

The experiments above were attempts to replicate https://github.com/ZihanWang314/RAGEN/blob/main/public/loss_curve.png. Since I clearly failed to do so with near-current code (db0fa39, Feb 21), I tried an older version (f991d1d, Jan 28, the version from when loss_curve.png was uploaded). With Qwen2.5-0.5B-Instruct I got these results:

[image]

I found that the Feb 21 code produces a flat line at -5.5 (the response turns into '!!!!!!', exp3 and exp16), while the Jan 28 code runs successfully for many more steps (exp20). I also noticed that the Jan 28 code uses max_step = 5 (easier data) and penalizes invalid responses with -0.1, both of which differ from Feb 21. I reran the experiment with max_step = 10 and a -1 penalty (exp22), identical to the Feb 21 settings, and again saw no mode collapse; judging by the content of the trajectory responses, the model also seems to learn more than in exp3 and exp16, although training is a bit slower.
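
For concreteness, exp22 is essentially the Jan 28 code run with the Feb 21 environment settings. In override form it would look something like the sketch below; note that the env.* key names are hypothetical placeholders for illustration, not the repo's actual config fields:

# Hypothetical override names, for illustration only:
# Jan 28 defaults were max steps = 5 and an invalid-response penalty of -0.1;
# Feb 21 defaults are max steps = 10 and a penalty of -1.
bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-0.5B-Instruct \
    env.max_steps=10 \
    env.invalid_action_penalty=-1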

May I ask what the differences are between the two code versions, db0fa39 (Feb 21) and f991d1d (Jan 28)? Thank you very much.

@ZihanWang314
Collaborator

Thanks for the update, and sorry for the late reply! I think the main difference lies in the default config. Over the past few days we have been trying to put together the best possible baseline config so that everyone can adapt it to their own tasks with less effort, but there are sometimes tradeoffs. Our current observations are:

  1. GAE is needed
  2. A larger PPO batch size is needed
  3. Models at the 1.5B level are needed

I hope it helps!
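
A minimal sketch combining these three suggestions with the override syntax already used in this thread; the ppo_batch_size value and gae as the adv_estimator value are assumptions, not verified defaults:

# Sketch only: batch size and estimator value are assumed, not a verified baseline config.
bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-1.5B-Instruct \
    training.ppo_batch_size=128 \
    optimization.adv_estimator=gae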

@scris
Author

scris commented Mar 7, 2025

Thanks for your kind reply! I do see the KL divergence blowing up, and the reward may not be very good when using GRPO instead of GAE (#58 improves the result, but does not solve the problem for GRPO). Do you have any ideas to share on why GRPO doesn't work? Thanks a lot!

@ZihanWang314
Collaborator

We are not very sure yet, haha.
But we find that other developers share similar observations, such as the Open-Reasoner-Zero team. Feel free to check https://x.com/rosstaylor90/status/1892664646890312125

@scris
Author

scris commented Mar 7, 2025

Thanks! I'll check that.

@ZihanWang314
Collaborator

Please refer to #58. This is actually fixed by switching to the latest implementation of veRL. Thanks!

@scris
Author

scris commented Mar 9, 2025

Does that mean that, on your side, GRPO now works for 1.5B models (and still doesn't work for 0.5B models)? Thanks!

@scris
Author

scris commented Mar 9, 2025

However, I still see some problems on my side when running GRPO with the 1.5B and 0.5B models (the result IS better, but the issue is NOT solved). Do you have any ideas? Thanks!

[image]
bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-0.5B-Instruct \
    training.train_batch_size=4 \
    training.max_turns=5 \
    training.n_rollout=4 \
    training.micro_batch_size=2

All experiments were run with use_kl_loss = True (which is suggested for GRPO) for alignment.
The image above has been edited to include newer experiment results.

@ZihanWang314 ZihanWang314 reopened this Mar 9, 2025
@ZihanWang314
Collaborator

ZihanWang314 commented Mar 9, 2025

We tested on bandit tasks but not on Sokoban (we will also push a minor bugfix for the bandit tasks). Let me reopen this issue and see if there are any other problems.

@ZihanWang314
Collaborator

By the way, it seems that PPO does not have issues on your side. Could you please try using PPO?
We noticed someone on a Chinese Q&A platform arguing that GRPO could be a biased estimator. We are not sure if this is that kind of problem. Please feel free to check https://zhuanlan.zhihu.com/p/28735759256
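
For reference, the group-relative advantage in standard GRPO (the usual formulation, not anything specific to this repo) is

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})},$$

and one commonly discussed source of bias is the per-group standard-deviation normalization in the denominator.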

@scris
Author

scris commented Mar 9, 2025

Thanks! I'm spending a bit of time every day trying to figure out GRPO :)
And thanks for your project :)

@scris scris changed the title Sokoban model response turned '!!!!!!!' after steps for PPO training Sokoban model response turned '!!!!!!!' after steps for training Mar 9, 2025