Control Hopper Env, Stand + Hop #440
Conversation
# dm_control also includes foot pressures as state obs space
def _get_obs_extra(self, info: Dict):
    obs = dict(
        toe_touch=self.touch("foot_toe"),
Does dm_control ever use state information when the input is visual data? This data will be part of the observation (along with _get_obs_agent()).
I am now explicitly overriding _get_obs_state_dict to include the entire state (touch + proprioception), so the touch information isn't exposed in the rgbd obs mode.
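Roughly the shape that override takes (a sketch based on the diff and the description above, not a verbatim copy of the PR code):

```python
# Sketch only: keep touch sensors in the state observation, next to proprioception,
# so rgb/rgbd observation modes never receive them. Dict comes from typing, as in
# the surrounding file.
def _get_obs_state_dict(self, info: Dict):
    return dict(
        agent=self._get_obs_agent(),        # qpos/qvel proprioception
        toe_touch=self.touch("foot_toe"),   # foot pressure sensor (from the diff above)
    )
```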
dm_control uses a pixel wrapper, which lets the user choose either the entire state + pixels or pixels only.
OK, looking at what seems to be standard practice: DrQ and TDMPC both do not use state at all in training, e.g. https://github.com/nicklashansen/tdmpc2/blob/5f6fadec0fec78304b4b53e8171d348b58cac486/tdmpc2/envs/wrappers/pixels.py#L8
Given the tasks are not that hard, we can do image-only PPO as well. The way I will do it is to fix the flatten RGBD observation wrapper to be like the pixel wrapper: you just choose whether you want rgb, depth, and/or state, and it will handle it accordingly.
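To make that concrete, here is a rough sketch of what such a selectable wrapper could look like (hypothetical class name; the "sensor_data", "cam0", and "agent" observation keys are assumptions about the dict layout, not the actual ManiSkill implementation):

```python
import gymnasium as gym
import numpy as np

class SelectiveRGBDWrapper(gym.ObservationWrapper):
    """Hypothetical sketch: keep only the requested parts of a dict observation,
    similar in spirit to dm_control's pixels wrapper (rgb, depth, and/or state)."""

    def __init__(self, env, rgb: bool = True, depth: bool = False, state: bool = False):
        super().__init__(env)
        self.use_rgb, self.use_depth, self.use_state = rgb, depth, state

    def observation(self, obs):
        out = {}
        cam = obs["sensor_data"]["cam0"]  # assumed keys; depends on the actual obs dict layout
        if self.use_rgb:
            out["rgb"] = cam["rgb"]
        if self.use_depth:
            out["depth"] = cam["depth"]
        if self.use_state:
            # flatten proprioceptive / extra state into one vector
            out["state"] = np.concatenate(
                [np.asarray(v).reshape(-1) for v in obs["agent"].values()]
            )
        return out
```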
Re this: merge the main branch into your PR. I have updated the code to do exactly this. The new ppo_rgb script is:
python ppo_rgb.py --exp_name="rgbhopseed1" --env_id="MS-HopperHop-v1" --num_envs=512 --update_epochs=8 --num_minibatches=32 --total_timesteps=20_000_000 --eval_freq=10 --num_eval_steps=600 --num_steps=50 --gamma=0.99 --seed=1 --no_include_state
(the hyperparams might not be well tuned).
    found_lost_pairs_capacity=2**25, max_rigid_patch_count=2**18
),
scene_cfg=SceneConfig(
    solver_position_iterations=15, solver_velocity_iterations=15
Why these solver configs? There is no need to try to match MuJoCo's solver configuration, since MuJoCo is a different physics engine from the one we use (PhysX).
Changed solver_position_iterations from 15 down to 4 for sim efficiency (15 was unnecessary); solver_velocity_iterations stays at the default value of 1.
    solver_position_iterations=15, solver_velocity_iterations=15
),
# sim_freq=50,
control_freq=25,
How was control_freq decided here? Was it to try to match dm_control?
control_freq was set so the ratio of sim_freq to control_freq is 4:1 (same as the dm_control hopper); the code now explicitly sets sim_freq=100.
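Put together, the relevant config ends up looking roughly like this (a sketch; the SimConfig wrapper, the gpu_memory_cfg field name, and the import path are assumptions on my part, while the values come from the diff and the comments above):

```python
from mani_skill.utils.structs.types import GPUMemoryConfig, SceneConfig, SimConfig  # import path assumed

sim_config = SimConfig(
    sim_freq=100,     # physics steps per second
    control_freq=25,  # actions per second -> 4 sim steps per control step, matching dm_control hopper's 4:1 ratio
    gpu_memory_cfg=GPUMemoryConfig(
        found_lost_pairs_capacity=2**25, max_rigid_patch_count=2**18
    ),
    scene_cfg=SceneConfig(
        solver_position_iterations=4,  # reduced from 15 for sim efficiency
        solver_velocity_iterations=1,  # default value
    ),
)
```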
uid="cam0", | ||
pose=Pose.create_from_pq([0, -2.8, 0], euler2quat(0, 0, np.pi / 2)), | ||
width=128, | ||
height=128, |
Is 128x128 correct? I recall dm_control uses smaller images.
I found that the images are 240 high by 320 wide:
import dm_control.suite as suite
from dm_control.suite.wrappers import pixels
import numpy as np
env = suite.load("hopper", "hop")
env_and_pixels = pixels.Wrapper(env, pixels_only=True)
_, _, _, obs = env_and_pixels.reset()
print(obs['pixels'].shape) # prints (240, 320, 3)
this seems large, should we use the same dims?
Actually, it turns out people use different resolutions. Let's just stick with 128x128 for now; it runs fast and is sufficient.
I just checked the TDMPC2 code and they use 384x384, so I'm not sure how this is usually picked.
for a in actor_builders:
    a.build(a.name)

# load in the ground
For hopper (and also cartpole): I recommend we add a massive, long white wall behind the agent so the rendered videos look a bit nicer, with color 0.3, 0.3, 0.3 (this will match what ray-traced videos look like, since they include the ambient lighting, which is 0.3, 0.3, 0.3). Moreover, with a big background wall, anyone who wants to use depth image data can, and it won't be full of 0s (indicating no depth info because the rays go to infinity).
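For illustration, building such a wall could look roughly like this (a sketch assuming SAPIEN's actor-builder API; the helper name, sizes, and pose here are made up, not the PR's actual scene-builder code):

```python
import sapien

def build_background_wall(scene):
    # Hypothetical helper: one big flat grey box far behind the agent so renders
    # have a background and depth images aren't all zeros.
    mat = sapien.render.RenderMaterial(base_color=[0.3, 0.3, 0.3, 1.0])
    builder = scene.create_actor_builder()
    builder.add_box_visual(
        pose=sapien.Pose(p=[0.0, 5.0, 0.0]),   # a few meters behind the hopper (illustrative)
        half_size=[50.0, 0.05, 50.0],          # very wide and tall, thin along y
        material=mat,
    )
    return builder.build_static(name="background_wall")
```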
Created a new scene builder in mani_skill/utils/scene_builder/control/planar, for reuse by the other planar agents.
new eval visual (512x512):
6.mp4
Added the wall to the cartpole.py file directly, since it doesn't fit the planar scene builder or warrant its own scene builder.
2.mp4
ok we might want to play around with lighting in the future to make these look a bit brighter, but no need for now.
Is the ppo_rgb.py code updated? I seem to run into a few errors due to evaluating 100 steps instead of the max 600.
Yes, I added a one-line change to ppo_rgb.py in a new commit to get it to work: there was no check that "success" exists in the final_info dict when writing to tensorboard in ppo_rgb.py, while there is such a check in ppo.py.

for i in {1..3}; do python ppo_rgb.py --exp_name="pixels_hopseed${i}" --env_id="MS-HopperHop-v1" --num_envs=256 --update_epochs=8 --num_minibatches=8 --total_timesteps=25_000_000 --num-steps=100 --num_eval_steps=600 --gamma=0.995 --seed=${i}; done

Performance is worse without the pressure sensor information.
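The missing guard is essentially of this shape (a paraphrase, not the literal diff; surrounding names such as final_info, done_mask, writer, and global_step follow the CleanRL-style script and are assumptions here):

```python
# Only log success if the environment actually reports it, mirroring ppo.py.
if "success" in final_info:
    writer.add_scalar(
        "charts/success_rate",
        final_info["success"][done_mask].float().mean().item(),
        global_step,
    )
```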
It is hard to tell if it's "worse", since this is a different return scale now? Also, I tried running again, and the performance after 5M steps looks like this:

9.mp4

Looks like it learns to flip forward to go fast now, instead of the original behavior.
My script:

python ppo_rgb.py --exp_name="rgbhopseed1" --env_id="MS-HopperHop-v1" --num_envs=512 --update_epochs=8 --num_minibatches=32 --total_timesteps=20_000_000 --eval_freq=15 --num_eval_steps=600 --num_steps=100 --gamma=0.99 --seed=1 --no-include-state

(merge main for this to work)

Maybe the behavior will go away if the agent does not have so much power behind its actions. I can try to experiment as well.
This has been my experience: the damping, stiffness, and delta are very important for the desired behavior. If the damping in particular is too low, the hopper is unable to right itself; if it's too high, it can instead flip onto its feet too easily.
Can you somehow revert to the settings that produced the first video that was working? Or was it only working from state inputs?
It was only working from state inputs; these are the same controller settings that produced the first videos.
Arguably, though, with the flipping it seems like it achieves the same distance, if not further, than the standard "behavior". I would then say this is more a result of the reward function. Since it's not possible to get exactly the same simulation as MuJoCo, maybe we can just penalize flipping (penalizing the x/y-axis rotation velocity, whichever axis points at the camera, should be sufficient). It's fine if our reward function is slightly different from MuJoCo's. I find that many other repos tackling locomotion, like Isaac Lab / Orbit, often add reward terms not meant to make the robot run faster but to produce a smoother gait, which is essentially what we would be doing here.
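One hedged way the suggested penalty could look (an illustrative sketch only; the flip-axis index and the spin threshold are assumptions, not values from the PR):

```python
import torch

def anti_flip_penalty(torso_angular_velocity: torch.Tensor, max_spin: float = 5.0) -> torch.Tensor:
    """torso_angular_velocity: (N, 3) batched angular velocity of the torso link.
    The planar hopper flips about the axis pointing at the camera (index 1 assumed here)."""
    spin = torso_angular_velocity[:, 1].abs()
    # no penalty below max_spin, then grows linearly and saturates at 1
    return torch.clamp((spin - max_spin) / max_spin, min=0.0, max=1.0)
```

The resulting term could then be subtracted from (or its complement multiplied into) the hop reward.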
Okay, that sounds good. I will try some experiments with a reward that discourages flipping.
No need to try to experiment: I have found that y-axis rotation bounds for the reward work well; will run tests and report here later tonight.

Edit/update: it worked well in a quick check on state obs, ensuring no flipping with large deltas. However, I've tried many, many different things, and unfortunately no additional reward term fixes the problem. In fact, I believe the flipping was originally a local-minimum solution where hopping would result in greater reward, since the flipping doesn't always activate the standing-reward portion of the hopping reward. Currently testing out non-delta joint pos.
What do you mean by non-delta joint pos? All of them should use that controller.
I've diagnosed the problem: since the entire robot's center-of-mass x velocity isn't taken into consideration (only the torso xvel, and the torso is attached to the slider), the robot isn't penalized for flipping in a large circle as long as its torso keeps moving in the +x world-frame direction. Here is the original robot center-of-mass velocity function, which is slow and doesn't work:
It seems like the get_linear_velocity() function doesn't return the center-of-mass linear velocity of the links, which is causing the issue. If calculating the center-of-mass velocity for an entire robot is non-trivial, this is an issue for the hopper + all planar + all humanoid dm_control task implementations, since they use COM velocity in their rewards.
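For reference, a mass-weighted whole-body COM velocity could look roughly like the following (this is a sketch, not the slow function referenced above, and it assumes each link exposes a batched mass of shape (N,) and a batched COM linear_velocity of shape (N, 3)):

```python
import torch

def robot_com_velocity(robot) -> torch.Tensor:
    """Mass-weighted average of the per-link center-of-mass linear velocities."""
    masses = torch.stack([link.mass for link in robot.links], dim=1)           # (N, L)
    vels = torch.stack([link.linear_velocity for link in robot.links], dim=1)  # (N, L, 3)
    return (masses.unsqueeze(-1) * vels).sum(dim=1) / masses.sum(dim=1, keepdim=True)
```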
Let me get back to you. I think I also just found a bug with the GPU sim linear and angular velocities: CPU sim is correct, but on GPU sim they might be flipped. Working with @fbxiang now to debug it.
Actually, you can use the torso link; it should work. That is the link they read the center-of-mass velocities from. All velocity data is center-of-mass data, actually, e.g.
Also, merge in main before doing more experiments. I just fixed #450, merged in #452, since the link velocities used / COM velocities were wrong lol. Thinking about it more, maybe it is also better to just penalize the y-axis rotation instead of its velocity (it's okay to go fast, just don't flip over).
Small change requested that I didn't catch earlier. See other comments above.
**The results have large variance because in both the hop and stand tasks, in seed 2, PPO learns to straighten its legs out too early and doesn't explore tucking them in when the hopper is on its belly in order to get up; PPO then gets stuck at this local optimum. This is a result of lack of exploration in PPO, not a fault of the environment, as shown below.**

Standing results and command:

7.mp4

for i in {1..3}; do python ppo.py --exp_name="standseed${i}" --env_id="MS-HopperStand-v1" --num_envs=4096 --update_epochs=8 --num_minibatches=32 --total_timesteps=30_000_000 --eval_freq=10 --num_eval_steps=600 --num_steps=100 --gamma=0.99 --ent_coef=0.01 --seed=${i}; done

Hopping results and command:

12.mp4

for i in {1..3}; do python ppo.py --exp_name="hopseed${i}" --env_id="MS-HopperHop-v1" --num_envs=4096 --update_epochs=8 --num_minibatches=32 --total_timesteps=60_000_000 --eval_freq=10 --num_eval_steps=600 --num_steps=100 --gamma=0.99 --ent_coef=0.01 --seed=${i}; done

RGB Hop results and command:

52.mp4

python ppo_rgb.py --exp_name="pixelhop" --env_id="MS-HopperHop-v1" --num_envs=512 --update_epochs=8 --num_minibatches=32 --total_timesteps=40_000_000 --eval_freq=15 --num_eval_steps=600 --num_steps=100 --gamma=0.99 --seed=1 --no-include-state

Local Optima in PPO training

Seed 2 PPO local optimum:

7.mp4

Seeds 1 and 3 don't have this issue.

Seed 1 stand:

7.mp4

Seed 3 stand:

7.mp4

**Even if the hopper learns to flip, it is a local minimum: it will eventually learn that hopping results in more reward than flipping, because the whole-robot center-of-mass reward correctly rewards hopping more than flipping. Seed 3 of the hopping command above first learns to flip, then later learns how to hop to maximize the reward, eventually forgetting the flipping behavior:**

9.mp4
Added hopper xml, hopper environment, hopper stand and hop tasks
MS-HopperStand-v1 & MS-HopperHop-v1
Made small changes to the following files
mani_skill/envs/sapien_env.py - added the ability to mount the render-only camera to an articulation. This lets the PPO evaluation videos follow the robot articulation (necessary to keep the hopper and its movement in view throughout an episode). It makes no difference to _human_render_cameras if entity_uid isn't specified in the camera config, as is the case for all previous environments.
mani_skill/utils/building/ground.py - added additional kwargs to the ground builder to allow rectangular grounds (no longer just square) and ground xy offsets (both needed to mimic the dm_control/MuJoCo planar-agent ground visual). The API behaves as before if these additional kwargs aren't specified, as is the case for all previous environments.
index - task documentation
Made large changes to the following files
mani_skill/envs/utils/rewards/common.py - value_at_margin wasn't implemented, and the sigmoidal reward functions weren't guaranteed to return rewards between 0 and 1! Made changes so that it's equivalent to dm_control's reward tolerance (see the sketch below).
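For context, the dm_control-style tolerance this now matches behaves roughly like the following torch sketch (gaussian sigmoid only; a paraphrase, not the file's exact code):

```python
import math
import torch

def tolerance(x: torch.Tensor, bounds=(0.0, 0.0), margin=0.0, value_at_margin=0.1) -> torch.Tensor:
    """Returns 1 inside [lower, upper]; outside, decays smoothly so the value at a
    distance of `margin` from the bounds equals value_at_margin. Always in [0, 1]."""
    lower, upper = bounds
    in_bounds = (lower <= x) & (x <= upper)
    if margin == 0:
        return in_bounds.to(x.dtype)
    d = torch.where(x < lower, lower - x, x - upper) / margin
    scale = math.sqrt(-2 * math.log(value_at_margin))  # chosen so value(d=1) == value_at_margin
    return torch.where(in_bounds, torch.ones_like(x), torch.exp(-0.5 * (d * scale) ** 2))
```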
These reward changes make it much easier to implement the other planar agents (like walker and cheetah), but in my experience there are issues (like the known and documented bugs in mjcf_loader.py and joint-tuning difficulty) that make those other environments harder to implement.
The environment requires further tuning to more closely resemble the dm_control hopper.
5.mp4
2.mp4
args to recreate:
Hop
for i in {1..3}; do python ppo.py --exp_name="hopseed${i}" --env_id="MS-HopperHop-v1" --num_envs=2048 --update_epochs=8 --num_minibatches=32 --total_timesteps=20_000_000 --eval_freq=10 --num_eval_steps=100 --num_steps=100 --gamma=0.999 --seed=${i}; done
Stand
for i in {1..3}; do python ppo.py --exp_name="standseed${i}" --env_id="MS-HopperStand-v1" --num_envs=2048 --update_epochs=8 --num_minibatches=32 --total_timesteps=20_000_000 --eval_freq=10 --num_eval_steps=100 --num_steps=100 --gamma=0.99 --seed=${i}; done
HopRGB singleseed
python ppo_rgb.py --exp_name="rgbhopseed1" --env_id="MS-HopperHop-v1" --num_envs=1024 --update_epochs=8 --num_minibatches=32 --total_timesteps=20_000_000 --eval_freq=10 --num_eval_steps=100 --num_steps=100 --gamma=0.99 --seed=1