python3 train.py
After training is done, the results for LSTM PPO vs. GRU PPO vs. PPO are saved as videos.
[Note]
All parameters are the same across the three agents; the only difference is whether the policy has a recurrent neural network.
The LSTM and GRU both use the same hidden_state shape.
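As a rough illustration of the note above, the sketch below builds a plain `torch.nn.LSTM` and `torch.nn.GRU` with the same hidden size and compares the shapes of their hidden states. The sizes (`INPUT_SIZE`, `HIDDEN_SIZE`, `BATCH`, `SEQ_LEN`) are hypothetical values chosen for the example, not the ones used in train.py, and the layers here are standalone modules rather than this repo's policy classes.

```python
import torch
from torch import nn

# Hypothetical sizes for illustration only (not taken from train.py).
INPUT_SIZE = 8
HIDDEN_SIZE = 64
BATCH = 4
SEQ_LEN = 10

# Same input size, hidden size, and layer count for both recurrent layers.
lstm = nn.LSTM(INPUT_SIZE, HIDDEN_SIZE, num_layers=1)
gru = nn.GRU(INPUT_SIZE, HIDDEN_SIZE, num_layers=1)

# Default layout is (seq_len, batch, features).
x = torch.zeros(SEQ_LEN, BATCH, INPUT_SIZE)

# The LSTM returns a hidden state and a cell state; the GRU returns only a hidden state.
_, (lstm_h, lstm_c) = lstm(x)
_, gru_h = gru(x)

lstm_hidden_shape = tuple(lstm_h.shape)
lstm_cell_shape = tuple(lstm_c.shape)
gru_hidden_shape = tuple(gru_h.shape)

print("LSTM hidden:", lstm_hidden_shape)
print("LSTM cell:  ", lstm_cell_shape)
print("GRU hidden: ", gru_hidden_shape)
```

Even with matching hidden-state shapes, the two layers are not identical in size: an LSTM keeps an extra cell state and has four gate blocks where a GRU has three, so the LSTM carries more weights.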
- LSTM PPO
LSTM-episode-0.mp4
- GRU PPO
GRU-episode-0.mp4
- PPO
PPO-episode-0.mp4
torch: 1.13.1+cu116
stable_baselines3: 2.3.0
sb3_contrib: 2.3.0
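A quick way to compare an environment against the versions listed above is sketched below. It uses only the standard library; the helper name `check_versions` is hypothetical, and the distribution names are assumed to match the package names as written above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above. Distribution names are assumed to match these keys.
EXPECTED = {
    "torch": "1.13.1+cu116",
    "stable_baselines3": "2.3.0",
    "sb3_contrib": "2.3.0",
}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == wanted)
    return report


report = check_versions(EXPECTED)
for name, (installed, ok) in report.items():
    status = "ok" if ok else "differs"
    print(f"{name}: installed={installed} expected={EXPECTED[name]} ({status})")
```

A mismatch does not necessarily mean train.py will fail; the listed versions are only the ones reported above.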
Preference-based RL with a GRU reward model, for the imitation library:
https://github.com/CAI23sbP/RecurrentRLHF