DDPG/TD3 target_actor output clip #196

Closed
Tracked by #115
huxiao09 opened this issue Jun 10, 2022 · 19 comments · Fixed by #211
Comments

@huxiao09

huxiao09 commented Jun 10, 2022

Problem Description

Hi! It seems that the output of target_actor in DDPG/TD3 is clipped directly to the action range boundaries, without being multiplied by max_action. But in Fujimoto's DDPG/TD3 code [1] and some other implementations, max_action is folded into the final tanh layer of the actor network, so no clipping is needed. Have you ever tried the second implementation?

if global_step > args.learning_starts:
    data = rb.sample(args.batch_size)
    with torch.no_grad():
        next_state_actions = (target_actor(data.next_observations)).clamp(
            envs.single_action_space.low[0], envs.single_action_space.high[0]
        )
        qf1_next_target = qf1_target(data.next_observations, next_state_actions)
        next_q_value = data.rewards.flatten() + (1 - data.dones.flatten()) * args.gamma * (qf1_next_target).view(-1)

[1] https://github.com/sfujim/TD3
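
For reference, a minimal sketch of the alternative actor head in the style of [1]; the hidden sizes here are illustrative, not necessarily CleanRL's exact architecture:

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.max_action = max_action

    def forward(self, obs):
        # tanh bounds the raw output to [-1, 1]; multiplying by max_action maps it
        # to [-max_action, max_action], so no clamp is needed on the target action.
        return self.max_action * torch.tanh(self.net(obs))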

huxiao09 changed the title from "DDPG/TD3 policy network output clip" to "DDPG/TD3 target_actor output clip" on Jun 10, 2022
@vwxyzjn
Owner

vwxyzjn commented Jun 10, 2022

Thanks for raising this issue. I think you are referring to https://github.com/sfujim/TD3/blob/385b33ac7de4767bab17eb02ade4a268d3e4e24f/TD3.py#L28? Yeah this is a bit of a problem. That said, it's fine in MuJoCo because max_action is just 1.

@huxiao09
Author

Yep. I would still highly suggest adopting Fujimoto's implementation, because in some MuJoCo environments (like Humanoid-v2), as well as in other environments, max_action may not be 1.

I would also suggest adding PyTorch model saving to each of your single files, and then writing a separate single file that loads the saved model and evaluates it in the corresponding environment.

Thanks for your reply and for the nice CleanRL code :)

@vwxyzjn
Owner

vwxyzjn commented Jun 14, 2022

Thanks for the suggestion! Fixing the max_action issue with the actor makes sense to me.

Model saving, on the other hand, is a bit more nuanced. For example, ppo_continuous_action.py technically also needs to save the running means and standard deviations stored in the environment wrappers. Because of these complications, we have decided against model saving to keep the core implementation as simple as possible. Additionally, most people save models only to load them later and see what the agent is actually doing. For this use case, we already include videos of the agents playing the game in the docs (e.g., link), which further eliminates the need to save models.
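
To illustrate the nuance: a complete checkpoint for ppo_continuous_action.py would have to bundle the agent weights together with the normalization statistics held by the observation-normalization wrapper. A rough sketch (not CleanRL code; it assumes gym's NormalizeObservation wrapper and its obs_rms attribute):

import torch

def save_checkpoint(agent, normalize_obs_wrapper, path="checkpoint.pt"):
    # The agent's parameters alone are not enough to reproduce behavior:
    # the running observation mean/variance live in the wrapper, not the model.
    torch.save(
        {
            "agent_state_dict": agent.state_dict(),
            "obs_rms_mean": normalize_obs_wrapper.obs_rms.mean,
            "obs_rms_var": normalize_obs_wrapper.obs_rms.var,
        },
        path,
    )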

vwxyzjn mentioned this issue on Jun 20, 2022
@vwxyzjn
Owner

vwxyzjn commented Jun 20, 2022

@dosssman This is indeed a much larger problem. I just tested it out, and there are a number of environments whose default max_action is not 1.

Testing environment: Ant-v2 Observation space: Box(-inf, inf, (111,), float64) Action space: Box(-1.0, 1.0, (8,), float32)
Testing environment: HalfCheetah-v2 Observation space: Box(-inf, inf, (17,), float64) Action space: Box(-1.0, 1.0, (6,), float32)
Testing environment: Hopper-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (3,), float32)
Testing environment: Humanoid-v2 Observation space: Box(-inf, inf, (376,), float64) Action space: Box(-0.4, 0.4, (17,), float32)
Testing environment: InvertedDoublePendulum-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (1,), float32)
Testing environment: InvertedPendulum-v2 Observation space: Box(-inf, inf, (4,), float64) Action space: Box(-3.0, 3.0, (1,), float32)
Testing environment: Pusher-v2 Observation space: Box(-inf, inf, (23,), float64) Action space: Box(-2.0, 2.0, (7,), float32)
Testing environment: Reacher-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (2,), float32)
Testing environment: Swimmer-v2 Observation space: Box(-inf, inf, (8,), float64) Action space: Box(-1.0, 1.0, (2,), float32)
Testing environment: Walker2d-v2 Observation space: Box(-inf, inf, (17,), float64) Action space: Box(-1.0, 1.0, (6,), float32)
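
For reference, output like the listing above can be produced with a small script along these lines (not part of the repo; it assumes the MuJoCo environments are installed):

import gym

env_ids = [
    "Ant-v2", "HalfCheetah-v2", "Hopper-v2", "Humanoid-v2",
    "InvertedDoublePendulum-v2", "InvertedPendulum-v2",
    "Pusher-v2", "Reacher-v2", "Swimmer-v2", "Walker2d-v2",
]
for env_id in env_ids:
    env = gym.make(env_id)
    # Print the bounds so environments whose max_action is not 1 stand out.
    print(f"Testing environment: {env_id}",
          f"Observation space: {env.observation_space}",
          f"Action space: {env.action_space}")
    env.close()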

@huxiao09 would you be interested in submitting a fix for this and going through the process of re-running benchmarks? Otherwise, no worries - we will take care of this.

@huxiao09
Author

Sure, it would be my pleasure. But I'm busy this week, so I'm afraid the best I can do is finish it within two weeks. Is that too late?

@dosssman
Collaborator

Thanks for the heads up. Will look into this.

@vwxyzjn
Owner

vwxyzjn commented Jun 20, 2022

Two weeks is actually a great timeline. Thank you @huxiao09 for your interest in working on this!

The process would go like this:

  1. file a PR with the max_action changes
  2. we agree on the changes
  3. you join our wandb team openrlbenchmark, which tracks all of our benchmark experiments
  4. run the following benchmark scripts:
    poetry install -E "mujoco pybullet"
    python -c "import mujoco_py"
    OMP_NUM_THREADS=1 xvfb-run -a python -m cleanrl_utils.benchmark \
    --env-ids HalfCheetah-v2 Walker2d-v2 Hopper-v2 \
    --command "poetry run python cleanrl/ddpg_continuous_action.py --track --capture-video" \
    --num-seeds 3 \
    --workers 3

    poetry install -E "mujoco pybullet"
    python -c "import mujoco_py"
    OMP_NUM_THREADS=1 xvfb-run -a python -m cleanrl_utils.benchmark \
    --env-ids HalfCheetah-v2 Walker2d-v2 Hopper-v2 \
    --command "poetry run python cleanrl/td3_continuous_action.py --track --capture-video" \
    --num-seeds 3 \
    --workers 3

Btw, would this impact sac_continuous_action.py @dosssman? I see that in the SAC file there is no action clipping like what is done in PPO:

env = gym.wrappers.ClipAction(env)

@dosssman
Collaborator

dosssman commented Jun 20, 2022 via email

@huxiao09
Author

ok, got it :)

@vwxyzjn
Owner

vwxyzjn commented Jun 20, 2022

Thank you so much. Would you mind registering a wandb account and giving me your username? Would you also mind adding me on discord (Costa#2021)?

Thanks

@huxiao09
Author

Yes, I already have an account: https://wandb.ai/huxiao. But it seems that Discord can't be accessed in my region. I'll figure out how to solve this problem tomorrow.

dosssman mentioned this issue on Jun 21, 2022
@dosssman
Collaborator

dosssman commented Jun 21, 2022

Preliminary fix tracked here for TD3 and here for DDPG.

The corresponding PR is #211

Overall, it does not seem to make that big of a difference compared to the previous version, but this version is indeed theoretically correct.
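
For context, this is roughly the rescaling pattern such a fix introduces (a sketch only; the exact names and wiring in #211 may differ). Instead of assuming max_action == 1, the actor rescales the tanh output with a per-dimension scale and bias derived from the action space, which also handles asymmetric bounds:

import numpy as np
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, env):
        super().__init__()
        obs_dim = int(np.prod(env.single_observation_space.shape))
        act_dim = int(np.prod(env.single_action_space.shape))
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )
        # Per-dimension affine map from [-1, 1] to [low, high].
        high, low = env.single_action_space.high, env.single_action_space.low
        self.register_buffer("action_scale", torch.tensor((high - low) / 2.0, dtype=torch.float32))
        self.register_buffer("action_bias", torch.tensor((high + low) / 2.0, dtype=torch.float32))

    def forward(self, obs):
        return self.net(obs) * self.action_scale + self.action_bias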

@vwxyzjn
Owner

vwxyzjn commented Jun 24, 2022

I changed the run set name as follows (customizing the advanced legend to be ${runsetName}).

Could you do the following:

  • Remove the new td3_continuous_action_bound_fix.py experiments with HalfCheetah-v2 (since its action space's bounds are -1, +1).
  • Move the old td3_continuous_action.py experiments with Humanoid-v2, Pusher-v2, and InvertedPendulum-v2 to the openrlbenchmark/cleanrl-cache project.
  • Add the new td3_continuous_action_bound_fix.py experiments with Humanoid-v2 and Pusher-v2 to https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/MuJoCo-CleanRL-s-TD3--VmlldzoxNjk4Mzk5 (in particular, customize the run set name to be CleanRL's td3_continuous_action.py and set the advanced legend to be ${runsetName}).
  • Update the benchmark file https://github.com/vwxyzjn/cleanrl/blob/master/benchmark/td3.sh to include Humanoid-v2, Pusher-v2, and InvertedPendulum-v2.
  • Add a section in the documentation updating the charts for Humanoid-v2, Pusher-v2, and InvertedPendulum-v2 and explaining the use of action scale and bias.

Thank you so much!

@huxiao09
Author

OK. @dosssman has actually done exactly what I was going to do. But I'm sorry that my reply may not reach you in time.
[screenshot flagging two lines in the preliminary fix]

@dosssman
Collaborator

dosssman commented Jun 25, 2022

Thank you very much for the feedback.
Those two lines are quite a good catch.
Will fix and add re-runs just to be sure.

@vwxyzjn
Owner

vwxyzjn commented Jun 25, 2022 via email

@dosssman
Collaborator


No worries. Will do.
I will tweak the last two lines @huxiao09 mentioned and re-run the experiments to make sure there are no side effects, then clean up the benchmark project structure and update the docs.

Should get it done by next Tuesday or Wednesday.
Thanks.

@dosssman
Collaborator

dosssman commented Jun 25, 2022

Let's just keep those old experiments to compare against the latest variant until then.
Once the latest one is validated, I can clean up all the previous variants and re-run td3_continuous_action.py to prevent future confusion.

@dosssman
Collaborator

dosssman commented Jun 26, 2022

Re-runs are looking good so far; not much difference from the previous iterations.
It seems I will have to trouble you regarding the archival/removal of the old td3 and ddpg runs. Wandb reports missing permissions when I try to either move the runs to the cleanrl-cache project or just delete them, before queuing up clean td3_continuous_action.py and ddpg_continuous_action.py runs.

@vwxyzjn I do need some help removing these runs, as I am not their author.
[screenshots of the wandb permission errors]
Thanks.
