
Sokoban model response turned '!!!!!!!' after steps for training #52

Open
scris opened this issue Feb 26, 2025 · 13 comments

Comments

@scris

scris commented Feb 26, 2025

Thanks for the brilliant and inspiring work, but I ran into a problem during training.

The responses during Sokoban RL training are correct at first:

[image]

But after some steps, the response degenerates into '!!!!!', the score stays at -5.5, and it never recovers:

[image]

With this script (the Sokoban base prompt in cmd.md), the response turned into '!!!!!' at epoch 0, step 63.

bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-0.5B-Instruct \
    training.train_batch_size=4 \
    training.max_turns=5 \
    training.n_rollout=8 \
    training.micro_batch_size=4 \
    training.ppo_batch_size=32 \
    optimization.kl_coef=0.001 \
    optimization.adv_estimator=brpo

With this script, the response turned into '!!!!!' at epoch 0, step 64.

bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-0.5B-Instruct

With this script, the response turned into '!!!!!' at epoch 0, step 68.

bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-0.5B-Instruct \
    training.train_batch_size=4 \
    training.max_turns=5 \
    training.n_rollout=4 \
    training.micro_batch_size=2

With this script, the response turned into '!!!!!' at epoch 0, step 95 for the 0.5B model, and at epoch 0, step 51 for the 1.5B model.

bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-[0.5B|1.5B]-Instruct \
    training.train_batch_size=4 \
    training.max_turns=5 \
    training.n_rollout=8

Also, critic/reward/mean stops improving after approximately step 30 in my tests.

These tests were run with Python 3.10, PyTorch 2.4, and CUDA 12.1 on a single NVIDIA A100 or A6000. Any suggestions on how to solve this problem? Thanks very much!

@ZihanWang314
Collaborator

This is what we are dealing with in our upcoming report. We believe this is a problem with the algorithm rather than the implementation: when the model is trained for enough steps, "model collapse" issues appear.
Please stay tuned for our updates!

@scris
Author

scris commented Feb 27, 2025

Thanks for your swift reply! I'm really interested in your project.

@scris
Author

scris commented Feb 27, 2025

The experiments above were attempts to replicate https://github.com/ZihanWang314/RAGEN/blob/main/public/loss_curve.png. Since I clearly failed to do so with near-current code (db0fa39, Feb 21), I tried an older version (f991d1d, Jan 28, the version from when loss_curve.png was uploaded). With Qwen2.5-0.5B-Instruct I got these results:

[image]

I found that the Feb 21 code produces a flat line at -5.5 (the response turns into '!!!!!!', exp3 and exp16), while the Jan 28 code runs successfully for many more steps (exp20). I also noticed that the Jan 28 code uses max_step = 5 (easier data) and penalizes invalid responses with -0.1, both of which differ from Feb 21. I reran the experiment with max_step = 10 and a -1 penalty (exp22), identical to the Feb 21 settings, and again saw no mode collapse; judging by the content of the trajectory responses, the model also seems to learn more than in exp3 and exp16, although training is a bit slower.
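
For concreteness, exp22 is essentially the Jan 28 code run with the Feb 21 environment settings. In override form it would look something like the sketch below; note that the env.* key names are hypothetical placeholders for illustration, not the repo's actual config fields:

# Hypothetical override names, for illustration only:
# Jan 28 defaults were max steps = 5 and an invalid-response penalty of -0.1;
# Feb 21 defaults are max steps = 10 and a penalty of -1.
bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-0.5B-Instruct \
    env.max_steps=10 \
    env.invalid_action_penalty=-1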

May I ask what the differences are between the two code versions, db0fa39 (Feb 21) and f991d1d (Jan 28)? Thank you very much.

@ZihanWang314
Collaborator

Thanks for the update, and sorry for the late reply! I think the main difference lies in the default config. Over the past few days we have been trying to put together the best possible baseline config so that everyone can adapt it to their own tasks with less effort, but there are sometimes tradeoffs. Our current observations are:

  1. GAE is needed
  2. A larger PPO batch size is needed
  3. Models at the 1.5B level are needed

I hope it helps!
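
A minimal sketch combining these three suggestions with the override syntax already used in this thread; the ppo_batch_size value and gae as the adv_estimator value are assumptions, not verified defaults:

# Sketch only: batch size and estimator value are assumed, not a verified baseline config.
bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-1.5B-Instruct \
    training.ppo_batch_size=128 \
    optimization.adv_estimator=gae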

@scris
Author

scris commented Mar 7, 2025

Thanks for your kind reply! I do see the KL divergence blowing up, and the reward may not be very good when using GRPO instead of GAE (#58 improves the result, but does not solve the problem for GRPO). Do you have any ideas to share on why GRPO doesn't work? Thanks a lot!

@ZihanWang314
Collaborator

We are not very sure yet, haha.
But we find that other developers share similar observations, such as the Open-Reasoner-Zero team. Feel free to check https://x.com/rosstaylor90/status/1892664646890312125

@scris
Author

scris commented Mar 7, 2025

Thanks! I'll check that.

@ZihanWang314
Collaborator

Please refer to #58. This is actually fixed by switching to the latest implementation of veRL. Thanks!

@scris
Author

scris commented Mar 9, 2025

Does that mean that, on your side, GRPO now works for 1.5B models (and still doesn't work for 0.5B models)? Thanks!

@scris
Author

scris commented Mar 9, 2025

However, I still see some problems on my side when running GRPO with the 1.5B and 0.5B models (the result IS better, but the issue is NOT solved). Do you have any ideas? Thanks!

[image]
bash train.sh sokoban \
    model.base_model=Qwen/Qwen2.5-0.5B-Instruct \
    training.train_batch_size=4 \
    training.max_turns=5 \
    training.n_rollout=4 \
    training.micro_batch_size=2

All experiments were run with use_kl_loss = True (which is suggested for GRPO) for alignment.
The image above has been edited to include newer experiment results.

@ZihanWang314 ZihanWang314 reopened this Mar 9, 2025
@ZihanWang314
Collaborator

ZihanWang314 commented Mar 9, 2025

We tested on bandit tasks but not on Sokoban (we will also push a minor bugfix for the bandit tasks). Let me reopen this issue and see if there are any other problems.

@ZihanWang314
Collaborator

By the way, it seems that PPO does not have issues on your side. Could you please try using PPO?
We noticed someone on a Chinese Q&A platform arguing that GRPO could be a biased estimator. We are not sure if this is that kind of problem. Please feel free to check https://zhuanlan.zhihu.com/p/28735759256
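
For reference, the group-relative advantage in standard GRPO (the usual formulation, not anything specific to this repo) is

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})},$$

and one commonly discussed source of bias is the per-group standard-deviation normalization in the denominator.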

@scris
Author

scris commented Mar 9, 2025

Thanks! I'm spending a bit of time every day trying to figure out GRPO :)
And thanks for your project :)

@scris scris changed the title Sokoban model response turned '!!!!!!!' after steps for PPO training Sokoban model response turned '!!!!!!!' after steps for training Mar 9, 2025