DDPG/TD3 target_actor output clip #196

Closed
Tracked by #115
huxiao09 opened this issue Jun 10, 2022 · 19 comments · Fixed by #211
Comments

@huxiao09

huxiao09 commented Jun 10, 2022

Problem Description

Hi! It seems that the output of target_actor in DDPG/TD3 is clipped directly to the action range boundaries, without being multiplied by max_action. But in Fujimoto's DDPG/TD3 code [1] and some other implementations, max_action is folded into the final tanh layer of the actor network, so no clipping is needed. Have you ever tried the second implementation?

if global_step > args.learning_starts:
    data = rb.sample(args.batch_size)
    with torch.no_grad():
        next_state_actions = (target_actor(data.next_observations)).clamp(
            envs.single_action_space.low[0], envs.single_action_space.high[0]
        )
        qf1_next_target = qf1_target(data.next_observations, next_state_actions)
        next_q_value = data.rewards.flatten() + (1 - data.dones.flatten()) * args.gamma * (qf1_next_target).view(-1)

[1] https://github.com/sfujim/TD3
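
For reference, a minimal sketch of the alternative actor head in the style of [1]; the hidden sizes here are illustrative, not necessarily CleanRL's exact architecture:

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.max_action = max_action

    def forward(self, obs):
        # tanh bounds the raw output to [-1, 1]; multiplying by max_action maps it
        # to [-max_action, max_action], so no clamp is needed on the target action.
        return self.max_action * torch.tanh(self.net(obs))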

huxiao09 changed the title from "DDPG/TD3 policy network output clip" to "DDPG/TD3 target_actor output clip" on Jun 10, 2022
@vwxyzjn
Owner

vwxyzjn commented Jun 10, 2022

Thanks for raising this issue. I think you are referring to https://github.com/sfujim/TD3/blob/385b33ac7de4767bab17eb02ade4a268d3e4e24f/TD3.py#L28? Yeah this is a bit of a problem. That said, it's fine in MuJoCo because max_action is just 1.

@huxiao09
Author

Yep. I would still highly suggest adopting Fujimoto's implementation, because in some MuJoCo environments (like Humanoid-v2), as well as in other environments, max_action may not be 1.

I would also suggest adding PyTorch model saving to each of your single files, and then writing a separate single file that loads the saved model and evaluates it in the corresponding environment.

Thanks for your reply and for the nice CleanRL code :)

@vwxyzjn
Owner

vwxyzjn commented Jun 14, 2022

Thanks for the suggestion! Fixing the max_action issue with the actor makes sense to me.

Model saving, on the other hand, is a bit more nuanced. For example, ppo_continuous_action.py technically also needs to save the running means and standard deviations stored in the environment wrappers. Because of these complications, we have decided against model saving to keep the core implementation as simple as possible. Additionally, most people save models only to load them later and see what the agent is actually doing. For this use case, we already include videos of the agents playing the game in the docs (e.g., link), which further eliminates the need to save models.
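
To illustrate the nuance: a complete checkpoint for ppo_continuous_action.py would have to bundle the agent weights together with the normalization statistics held by the observation-normalization wrapper. A rough sketch (not CleanRL code; it assumes gym's NormalizeObservation wrapper and its obs_rms attribute):

import torch

def save_checkpoint(agent, normalize_obs_wrapper, path="checkpoint.pt"):
    # The agent's parameters alone are not enough to reproduce behavior:
    # the running observation mean/variance live in the wrapper, not the model.
    torch.save(
        {
            "agent_state_dict": agent.state_dict(),
            "obs_rms_mean": normalize_obs_wrapper.obs_rms.mean,
            "obs_rms_var": normalize_obs_wrapper.obs_rms.var,
        },
        path,
    )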

vwxyzjn mentioned this issue on Jun 20, 2022
@vwxyzjn
Owner

vwxyzjn commented Jun 20, 2022

@dosssman This is indeed a much larger problem. I just tested it out, and there are a number of environments whose default max_action is not 1.

Testing environment: Ant-v2 Observation space: Box(-inf, inf, (111,), float64) Action space: Box(-1.0, 1.0, (8,), float32)
Testing environment: HalfCheetah-v2 Observation space: Box(-inf, inf, (17,), float64) Action space: Box(-1.0, 1.0, (6,), float32)
Testing environment: Hopper-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (3,), float32)
Testing environment: Humanoid-v2 Observation space: Box(-inf, inf, (376,), float64) Action space: Box(-0.4, 0.4, (17,), float32)
Testing environment: InvertedDoublePendulum-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (1,), float32)
Testing environment: InvertedPendulum-v2 Observation space: Box(-inf, inf, (4,), float64) Action space: Box(-3.0, 3.0, (1,), float32)
Testing environment: Pusher-v2 Observation space: Box(-inf, inf, (23,), float64) Action space: Box(-2.0, 2.0, (7,), float32)
Testing environment: Reacher-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (2,), float32)
Testing environment: Swimmer-v2 Observation space: Box(-inf, inf, (8,), float64) Action space: Box(-1.0, 1.0, (2,), float32)
Testing environment: Walker2d-v2 Observation space: Box(-inf, inf, (17,), float64) Action space: Box(-1.0, 1.0, (6,), float32)
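
For reference, output like the listing above can be produced with a small script along these lines (not part of the repo; it assumes the MuJoCo environments are installed):

import gym

env_ids = [
    "Ant-v2", "HalfCheetah-v2", "Hopper-v2", "Humanoid-v2",
    "InvertedDoublePendulum-v2", "InvertedPendulum-v2",
    "Pusher-v2", "Reacher-v2", "Swimmer-v2", "Walker2d-v2",
]
for env_id in env_ids:
    env = gym.make(env_id)
    # Print the bounds so environments whose max_action is not 1 stand out.
    print(f"Testing environment: {env_id}",
          f"Observation space: {env.observation_space}",
          f"Action space: {env.action_space}")
    env.close()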

@huxiao09 would you be interested in submitting a fix for this and going through the process of re-running benchmarks? Otherwise, no worries - we will take care of this.

@huxiao09
Author

Sure, it would be my pleasure. But I'm busy this week, so I'm afraid the best I can do is finish it within two weeks. Is that too late?

@dosssman
Collaborator

Thanks for the heads up. Will look into this.

@vwxyzjn
Owner

vwxyzjn commented Jun 20, 2022

Two weeks is actually a great timeline. Thank you @huxiao09 for your interest in working on this!

The process would go like this:

  1. file a PR with the max_action changes
  2. we agree on the changes
  3. you join our wandb team openrlbenchmark, which tracks all of our benchmark experiments
  4. run the following benchmark scripts:
    poetry install -E "mujoco pybullet"
    python -c "import mujoco_py"
    OMP_NUM_THREADS=1 xvfb-run -a python -m cleanrl_utils.benchmark \
    --env-ids HalfCheetah-v2 Walker2d-v2 Hopper-v2 \
    --command "poetry run python cleanrl/ddpg_continuous_action.py --track --capture-video" \
    --num-seeds 3 \
    --workers 3

    poetry install -E "mujoco pybullet"
    python -c "import mujoco_py"
    OMP_NUM_THREADS=1 xvfb-run -a python -m cleanrl_utils.benchmark \
    --env-ids HalfCheetah-v2 Walker2d-v2 Hopper-v2 \
    --command "poetry run python cleanrl/td3_continuous_action.py --track --capture-video" \
    --num-seeds 3 \
    --workers 3

Btw, would this impact sac_continuous_action.py @dosssman? I see that in the SAC file there is no action clipping like what is done in PPO:

env = gym.wrappers.ClipAction(env)

@dosssman
Collaborator

dosssman commented Jun 20, 2022 via email

@huxiao09
Author

ok, got it :)

@vwxyzjn
Owner

vwxyzjn commented Jun 20, 2022

Thank you so much. Would you mind registering a wandb account and giving me your username? Would you also mind adding me on discord (Costa#2021)?

Thanks

@huxiao09
Author

Yes, I already have an account: https://wandb.ai/huxiao. But it seems that Discord can't be accessed in my region. I'll figure out how to solve this problem tomorrow.

dosssman mentioned this issue on Jun 21, 2022
@dosssman
Collaborator

dosssman commented Jun 21, 2022

Preliminary fix tracked here for TD3 and here for DDPG.

The corresponding PR is #211

Overall, it does not seem to make that big of a difference compared to the previous version, but this version is indeed theoretically correct.
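
For context, this is roughly the rescaling pattern such a fix introduces (a sketch only; the exact names and wiring in #211 may differ). Instead of assuming max_action == 1, the actor rescales the tanh output with a per-dimension scale and bias derived from the action space, which also handles asymmetric bounds:

import numpy as np
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, env):
        super().__init__()
        obs_dim = int(np.prod(env.single_observation_space.shape))
        act_dim = int(np.prod(env.single_action_space.shape))
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )
        # Per-dimension affine map from [-1, 1] to [low, high].
        high, low = env.single_action_space.high, env.single_action_space.low
        self.register_buffer("action_scale", torch.tensor((high - low) / 2.0, dtype=torch.float32))
        self.register_buffer("action_bias", torch.tensor((high + low) / 2.0, dtype=torch.float32))

    def forward(self, obs):
        return self.net(obs) * self.action_scale + self.action_bias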

@vwxyzjn
Owner

vwxyzjn commented Jun 24, 2022

I changed the run set name as follows (customizing the advanced legend to be ${runsetName}).

Could you do the following:

  • Remove the new td3_continuous_action_bound_fix.py experiments with HalfCheetah-v2 (since its action space's bounds are -1, +1).
  • Move the old td3_continuous_action.py experiments with Humanoid-v2, Pusher-v2, and InvertedPendulum-v2 to the openrlbenchmark/cleanrl-cache project.
  • Add the new td3_continuous_action_bound_fix.py experiments with Humanoid-v2 and Pusher-v2 to https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/MuJoCo-CleanRL-s-TD3--VmlldzoxNjk4Mzk5 (in particular, customize the run set name to be CleanRL's td3_continuous_action.py and set the advanced legend to be ${runsetName}).
  • Update the benchmark file https://github.com/vwxyzjn/cleanrl/blob/master/benchmark/td3.sh to include Humanoid-v2, Pusher-v2, and InvertedPendulum-v2.
  • Add a section in the documentation updating the charts for Humanoid-v2, Pusher-v2, and InvertedPendulum-v2 and explaining the use of action scale and bias.

Thank you so much!

@huxiao09
Author

OK. @dosssman has actually done exactly what I was going to do. But I'm sorry that my reply may not reach you in time.
[screenshot flagging two lines in the preliminary fix]

@dosssman
Collaborator

dosssman commented Jun 25, 2022

Thank you very much for the feedback.
Those two lines are quite a good catch.
Will fix and add re-runs just to be sure.

@vwxyzjn
Owner

vwxyzjn commented Jun 25, 2022 via email

@dosssman
Collaborator


No worries. Will do.
I will tweak the last two lines @huxiao09 mentioned and re-run the experiments to make sure there are no side effects, then clean up the benchmark project structure and update the docs.

Should get it done by next Tuesday or Wednesday.
Thanks.

@dosssman
Collaborator

dosssman commented Jun 25, 2022

Let's just keep those old experiments to compare against the latest variant until then.
Once the latest one is validated, I can clean up all the previous variants and re-run td3_continuous_action.py to prevent future confusion.

@dosssman
Collaborator

dosssman commented Jun 26, 2022

Re-runs are looking good so far; not much difference from the previous iterations.
It seems I will have to trouble you regarding the archival/removal of the old td3 and ddpg runs. Wandb reports missing permissions when I try to either move the runs to the cleanrl-cache project or just delete them, before queuing up clean td3_continuous_action.py and ddpg_continuous_action.py runs.

@vwxyzjn I do need some help removing these runs, as I am not their author.
[screenshots of the wandb permission errors]
Thanks.
