Continuous-SAC-Pytorch

Reproduce results from Continuous SAC paper.

This repo is based on several SAC implementations, mainly Stable-Baselines3, author's implementation and SAC-Continuous-Pytorch.

Update 12/28/24: CrossQ added.

Installation

After cloning the repo, install requirements by running

pip install -r requirements.txt

or it can be installed with pip

pip install git+https://github.com/giangbang/Continuous-SAC.git

How to run

python src/train.py --env_name HalfCheetah-v4 --total_env_step 1000000 --buffer_size 1000000 --actor_log_std_min -20 --batch_size 256 --eval_interval 5000 --critic_tau 0.005 --alpha_lr 3e-4 --num_layers 3 --critic_lr 3e-4 --actor_lr 3e-4 --init_temperature 1 --hidden_dim 256 --reward_scale 1 --upd 1

Some benchmark environments from gym, for example mujoco or RacingCar and LunarLanderContinuous, need to be installed separately from by pip install gymnasium[mujoco] or pip install gymnasium[box2d].

It can also be run from terminal by the following command from the entry point, if installed by setup.py

sac_continuous --env_name HalfCheetah-v4 --total_env_step 1_000_000

add --algo crossq to run with CrossQ.

Results

Most of the experiments used the same hyper-parameters shown in the table. Set seed to -1 to use random seed every run.

Hyper params	Value	Hyper params	Value
`reward_scale`	1.0	`critic_lr`	0.0003
`buffer_size`	1000000	`critic_tau`	0.005
`start_step`	1000	`actor_lr`	0.0003
`total_env_step`	1000000	`actor_log_std_min`	-20.0
`batch_size`	256	`actor_log_std_max`	2
`hidden_dim`	256	`num_layers`	3
`upd`	1	`discount`	0.99
`algo`	(`sac`\|`crossq`)	`init_temperature`	0.2
`eval_interval`	5000	`alpha_lr`	0.0003
`num_eval_episodes`	10	`seed`	-1

SAC and CrossQ results are shown below

Comments

Here are some critical minor implementation details but are crucial to achieve the desired performance;

For SAC:

Handle done separately by truncation and termination. SAC performs much worse in some environment when we do not correctly implement this (about 2k rewards in difference in Half-Cheetah).
Using ReLU activation function slightly increases the performance, compared to using Tanh. I suspect that the three layer Tanh Activation network are not powerful enough to learn the value function of tasks with high reward range like Mujoco.
Using eps=1e-5 in Adam Optimizer does not provide any significant boost as suggested in stable-baselines3.
Initial temperature of alpha (entropy coefficient) can largely impact the final performance (than one might expect). In Half-Cheetah, alpha starting with the values of 0.2 and 1 can yield a gap ~ 1-2k in final performance.
Changing actor_log_std_min from -20 to -10 can sometimes reduce the performance, but this might not be consistent through out seeds

For CrossQ;

Batch Renorm is the key factor for stable training without a target network and is the most tricky part. The running mean/variance of batch renorm should only be recorded when the networks are trained and the batch contains both current and next states. For example, critic should be in eval mode when optimizing the actor. Enabling Batch renorm in actor does not yield good results in my experiments.
CrossQ recommend increases network width, but I keep them the same as SAC.

Computational Resources

All experiments run in this Kaggle notebook.

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
assets		assets
plot		plot
results		results
scripts		scripts
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Continuous-SAC-Pytorch

Installation

How to run

Results

Comments

Computational Resources

About

Languages

License

giangbang/Continuous-SAC

Folders and files

Latest commit

History

Repository files navigation

Continuous-SAC-Pytorch

Installation

How to run

Results

Comments

Computational Resources

About

Topics

Resources

License

Stars

Watchers

Forks

Languages