
Soft Actor-Critic #120

Merged: 31 commits merged into master from the sac branch on Dec 13, 2018

Conversation


@araffin araffin commented Dec 8, 2018

This PR adds the Soft Actor-Critic (SAC) algorithm and fixes some bugs.
Fixes:

  • DDPG target network not being saved
  • DQN prioritized replay buffer parameter not being used

Notes:

Differences with the original implementation (a configuration sketch matching the original settings is given after the list):

  • no regularization on policy parameters
  • usage of an entropy coefficient (equivalent to the inverse of the reward scale), which prevents high values in the Q-value losses
  • default network architecture is [64, 64] and not [256, 256]
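
For reference, here is a minimal configuration sketch (not part of the PR) showing how these defaults could be overridden to get back to the original settings, assuming the stable-baselines SAC API; the keyword names, in particular `layers` inside `policy_kwargs`, are assumptions to verify against the documentation.

```python
from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

# Sketch only: revert the defaults listed above towards the original implementation.
model = SAC(
    MlpPolicy,
    "HalfCheetah-v2",                       # env id used as an example
    policy_kwargs=dict(layers=[256, 256]),  # original uses [256, 256]; default here is [64, 64]
    ent_coef="auto",                        # or 1.0 / reward_scale to mimic the original reward scaling
    batch_size=256,
    verbose=1,
)
model.learn(total_timesteps=int(1e6))
```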


@hill-a hill-a left a comment


minor detail, otherwise LGTM

hill-a previously approved these changes Dec 10, 2018

@hill-a hill-a left a comment


Merge from master, and it should be good

hill-a previously approved these changes Dec 12, 2018
The Python dependencies needed to be installed beforehand because of the __version__ import
@hill-a hill-a merged commit c4d41d3 into master Dec 13, 2018
@araffin araffin deleted the sac branch December 13, 2018 10:54
@xionghuichen

Hello, can Soft Actor-Critic reach a performance similar to the original implementation? I tested it on HalfCheetah and the final performance is about 11,000 (14,000 in the published results).

@araffin

araffin commented Oct 20, 2019

Hello,
What hyperparameters did you use? How many training steps? How many seeds? What version of HalfCheetah?
Please look at the rl zoo, where we were able to get good results on most envs.

@xionghuichen

@araffin Thanks for the reply. I used the same hyperparameters and number of timesteps as the original paper. I will check the HalfCheetah-v2 hyperparameters on this page (https://github.com/araffin/rl-baselines-zoo/blob/master/hyperparams/sac.yml) and send you a report as soon as possible.

@araffin

araffin commented Oct 20, 2019

Be sure to evaluate the agent with a test env and with `deterministic=True` (in `predict`).
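
For reference, a minimal sketch of such an evaluation (the env id and model path are placeholders):

```python
import gym
import numpy as np

from stable_baselines import SAC

eval_env = gym.make("HalfCheetah-v2")   # separate test env
model = SAC.load("sac_halfcheetah")     # hypothetical path to a trained model

episode_returns = []
for _ in range(10):
    obs, done, episode_return = eval_env.reset(), False, 0.0
    while not done:
        # deterministic=True: use the mean action instead of sampling
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = eval_env.step(action)
        episode_return += reward
    episode_returns.append(episode_return)

print("mean evaluation return:", np.mean(episode_returns))
```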

@xionghuichen

xionghuichen commented Oct 23, 2019

@araffin Hello, the results for HalfCheetah-v2 are as follows:
[plot: training curves of original-sac vs rl-baselines-zoo-sac on HalfCheetah-v2]

Additional remarks:

  1. The experiments were run directly with the default parameters. In particular, "original-sac" runs with https://github.com/haarnoja/sac/blob/master/examples/mujoco_all_sac.py, and "rl-baselines-zoo-sac" runs with https://github.com/araffin/rl-baselines-zoo/blob/master/train.py;
  2. The results are based on a single-seed experiment, since the performance variance is small in HalfCheetah-v2;
  3. During evaluation, the policy runs with deterministic actions.

To my knowledge, there are some differences between our implementation and haarnoja/sac:

| params              | haarnoja/sac | rl-baselines-zoo |
|---------------------|--------------|------------------|
| sac_ent_coef        | 0.2          | auto             |
| regularization_coef | 0.001        | None             |

However, just fixing these differences does not reach the haarnoja/sac performance.

PS: the evaluation code I used with stable-baselines:

```python
# Snippet from inside the training loop: every 4000 steps, run ~10 evaluation
# episodes on a separate eval env with deterministic actions.
if self.num_timesteps % 4000 == 0:
    eval_ob = self.eval_env.reset()
    eval_epi_rewards = 0
    eval_epis = 0
    eval_performance = []
    while True:
        eval_action = self.policy_tf.step(eval_ob[None], deterministic=True).flatten()
        # rescale the action from [-1, 1] to the env's action space
        eval_rescaled_action = eval_action * np.abs(self.action_space.low)
        eval_new_obs, eval_reward, eval_done, eval_info = self.eval_env.step(eval_rescaled_action)
        eval_epi_rewards += eval_reward
        eval_ob = eval_new_obs
        if eval_done:
            eval_ob = self.eval_env.reset()
            eval_performance.append(eval_epi_rewards)
            eval_epi_rewards = 0
            eval_epis += 1
            if eval_epis > 10:
                break

    logger.record_tabular("eval/performance", np.mean(eval_performance))
```

Edit:

  • 10.29.2019: modified sac_ent_coef for rl-baselines-zoo in the table (0.1 -> auto).

@araffin

araffin commented Oct 23, 2019

Default hyperparameters won't give you the best results. You should change the network architecture and batch size to match the paper's hyperparameters.
For the entropy coefficient, I don't get where you found that value; it is 'auto' by default.

@araffin

araffin commented Oct 23, 2019

For the evaluation, call `predict` directly.

@araffin

araffin commented Oct 23, 2019

If you cannot match the result, please open an issue with the complete steps to reproduce your experiments. Note that it is HalfCheetah-v1 that is used in the SAC repo.

@xionghuichen

@araffin Thank you for the suggestions.

I have also tried hyperparameters matching the original paper, but it still does not work. I will open an issue in a few days (a little busy these days :( ).

By the way, although the SAC repo tests on HalfCheetah-v1, the performance on HalfCheetah-v2 is similar (the orange line is tested on v2). The entropy coefficient can be regarded as alpha^-1, where alpha is the reward scale that the haarnoja/sac repo exposes as a hyperparameter; alpha = 5 for HalfCheetah, which is why I said sac_ent_coef = 0.2 for haarnoja/sac.
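
Spelled out (using the relation between the entropy coefficient and the reward scale quoted above):

```python
reward_scale = 5.0                 # haarnoja/sac hyperparameter for HalfCheetah
sac_ent_coef = 1.0 / reward_scale  # = 0.2, the value listed in the table above
```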

@xionghuichen

xionghuichen commented Nov 7, 2019

@araffin Hello, I have good news! Recently, I checked the code in the two repos and found the critical difference leading to the performance gap.

In haarnoja/sac, the environment runs without the TimeLimit wrapper, while we run with it (ref: https://github.com/haarnoja/sac/blob/8258e33633c7e37833cc39315891e77adfbe14b2/sac/envs/gym_env.py#L75).

[plot: training curves with and without the TimeLimit wrapper]
The red and blue lines are 2 seeds run without the TimeLimit wrapper, while the pink and grey lines are 2 seeds run with TimeLimit. It's surprising that the Markov assumption is so important in HalfCheetah!
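
For reference, one way to reproduce the two setups with gym (illustrative; gym registers the MuJoCo envs behind a TimeLimit wrapper, 1000 steps per episode for HalfCheetah-v2):

```python
import gym

# Standard setup: gym.make() returns a TimeLimit-wrapped env.
env_with_limit = gym.make("HalfCheetah-v2")

# Accessing .env drops the TimeLimit wrapper, similar to what the
# referenced gym_env.py in haarnoja/sac does.
env_without_limit = gym.make("HalfCheetah-v2").env
```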

@araffin

araffin commented Nov 7, 2019

Good news, thanks for the update =)
In fact, SAC works quite well on all envs from the rl zoo, so I was a bit surprised ;)

Yes, the Markov assumption is important here for value estimation (I was planning to write a post about time limits in RL too).
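
As an illustration of that point (a sketch, not code from this repo): when an episode ends only because of the time limit, the TD target should still bootstrap from the next state. Recent gym versions flag such endings via `info["TimeLimit.truncated"]`:

```python
def td_target(reward, next_value, done, info, gamma=0.99):
    """Illustrative 1-step TD target that keeps bootstrapping on timeouts."""
    truncated = info.get("TimeLimit.truncated", False)  # set by gym's TimeLimit wrapper on timeout
    true_terminal = done and not truncated              # only real terminal states stop bootstrapping
    return reward + gamma * (0.0 if true_terminal else next_value)
```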

@xionghuichen

Cool, look forward to your post! 

By the way, do you have any plans to add new algorithms to stable_baselines? I like the code structure and the algorithm implementations of stable_baselines. Maybe I can contribute to it.

@araffin

araffin commented Nov 7, 2019

> I like the code structure and the algorithm implementations of stable_baselines.

thanks =)

Well, if you want to add an algorithm, open an issue and we will discuss it.
But currently, our focus is on the tf2 migration and other improvements (cf. the roadmap and milestones).
And contributions are welcomed ;)

@xionghuichen

OK. I will follow your roadmap and find something interesting! But how can I track the progress of the tf2 migration?

@araffin

araffin commented Nov 16, 2019

There is a WIP PR for that, as well as an open issue.

@xionghuichen

👌 I got it
