Dm control humanoid ppo learnable stand, walk, run #484
…hard envs), base for easier versions called 'standard'
Don't worry if PPO can't solve the harder versions; no one has solved them. Typically only SAC or other off-policy methods have worked.
Also, I can easily add a white wall in the background, change the camera to a more eagle-eyed view so only the ground is in view, etc. I want to double-check what you would like before I make a nice render for the docs.
Okay, makes sense. The only differences between the standard and hard versions are that standard gets more observations (link velocities) and far less randomization (see the hypothetical sketch below).
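To make that distinction concrete, here is a minimal sketch of how the two variants could be parameterized. This is not the actual control/humanoid.py code, and the numeric noise scales are made up for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HumanoidVariant:
    include_link_velocities: bool  # extra observations in "standard"
    init_noise_scale: float        # randomization applied at episode reset

# "standard": richer observations, mild reset randomization
STANDARD = HumanoidVariant(include_link_velocities=True, init_noise_scale=0.02)
# "hard": fewer observations, much heavier reset randomization
HARD = HumanoidVariant(include_link_velocities=False, init_noise_scale=0.5)
```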
You might also want to merge main and check that it still works. There might have been a breaking change with some config namings (we are trying to rename everything from cfg to config for consistency).
Oh, also: for the example PPO commands, I have reorganized things a bit, so they should now go into the baselines/ppo/examples.sh file. There is also a baselines.sh file for the official baseline results we upload to wandb, but I am not sure whether these locomotion tasks should be part of the RL benchmark, since they have no notion of success (it is hard to take, e.g., an averaged success-rate graph across all tasks to compare RL algorithms). At best we could use some normalized score function, but I have never liked how people use that, since it is basically not interpretable.
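For context, the "normalized score" idea mentioned above is usually computed against fixed reference policies. A minimal sketch; the reference returns here are placeholders, not measured baselines:

```python
def normalized_score(episode_return: float,
                     random_return: float,
                     expert_return: float) -> float:
    """Rescale a raw return so 0.0 ~ random policy and 1.0 ~ expert policy.

    The reference returns are per-task constants that have to be measured
    beforehand, which is part of why the resulting number is hard to
    interpret across tasks.
    """
    return (episode_return - random_return) / (expert_return - random_return)

# e.g. normalized_score(450.0, random_return=50.0, expert_return=850.0) == 0.5
```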
One tiny change, then I can merge.
Stand
(video attachment: 8.mp4)
(No graph for stand due to a change in ppo.py that I pulled in: the logger now defaults to None and no longer logs episodic return; see the sketch after the command below. The other two tasks and the RGB-only run were run before I pulled this change. Stand converged every time for me, though.)
Command for running (the video above is the seed 1 result):
for i in {1..3}; do python ppo.py --exp_name="__final_standseed${i}" --env_id="MS-HumanoidStand-v1" --num_envs=2048 --update_epochs=8 --num_minibatches=32 --total_timesteps=40_000_000 --eval_freq=10 --num_eval_steps=1000 --num_steps=200 --gamma=0.95 --seed=${i}; done
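On the logging note above, here is one way episodic-return logging might be guarded when the logger defaults to None. This helper and its names are assumptions for illustration, not taken from the actual ppo.py:

```python
def log_episodic_return(logger, global_step: int, episodic_return: float) -> None:
    """Record episodic return while tolerating logger=None (hypothetical helper)."""
    if logger is None:
        # no logger configured: fall back to stdout so the run still reports progress
        print(f"step={global_step} episodic_return={episodic_return:.1f}")
    else:
        # assumes a TensorBoard-style writer with an add_scalar method
        logger.add_scalar("charts/episodic_return", episodic_return, global_step)
```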
Walk
Video too large to attach: https://drive.google.com/file/d/1ssTGoGHvvOf6e2RTgvIi6Je3xWV_VpwY/view?usp=sharing
Command (the video above is seed 1):
for i in {1..3}; do python ppo.py --exp_name="__final_walkseed${i}" --env_id="MS-HumanoidWalk-v1" --num_envs=2048 --update_epochs=8 --num_minibatches=32 --total_timesteps=80_000_000 --eval_freq=10 --num_eval_steps=1000 --num_steps=200 --gamma=0.97 --seed=${i} --ent_coef=1e-3; done
Run
Video too large to attach: https://drive.google.com/file/d/1hkNfUcv04hPVnuyWNSvp9swrNjhnwDBf/view?usp=sharing
Command (the video above is seed 1):
for i in {1..3}; do python ppo.py --exp_name="__final_runseed${i}" --env_id="MS-HumanoidRun-v1" --num_envs=2048 --update_epochs=8 --num_minibatches=32 --total_timesteps=80_000_000 --eval_freq=10 --num_eval_steps=1000 --num_steps=200 --gamma=0.97 --seed=${i} --ent_coef=1e-3; done
Run RGB Only
Video too large to attach: https://drive.google.com/file/d/1QZ1R7OrJLc8YlOY28tpHhgRj-FCejKzr/view?usp=sharing
Command:
python ppo_rgb.py --exp_name="__human_rgb_run2" --env_id="MS-HumanoidRun-v1" --num_envs=256 --update_epochs=8 --num_minibatches=32 --total_timesteps=80_000_000 --eval_freq=15 --num_eval_steps=1000 --num_steps=200 --gamma=0.98 --seed=1 --no-include-state --render_mode="rgb_array" --ent_coef=1e-3
In addition, I have written and lightly tested the hard versions of these environments (the code exists, commented out, in control/humanoid.py), but they seem learnable only via SAC, and they would take a very long time to train, confirm, and potentially debug while adding little additional value.