Soft Actor-Critic #120

Conversation
minor detail, otherwise LGTM
Merge from master, and it should be good.
The Python dependencies needed to be installed beforehand because of the `__version__` that was imported.
Hello, can Soft Actor-Critic reach a performance similar to the original implementation? I tested it on HalfCheetah and the final performance is about 11,000 (14,000 in the published results).
Hello,
@araffin Thanks for the reply. I used the same hyperparameters and time-steps as the original paper. I will check the HalfCheetah-v2 hyperparameters on the page (https://github.com/araffin/rl-baselines-zoo/blob/master/hyperparams/sac.yml) and send you a report as soon as possible.
Be sure to evaluate the agent with a test env and with `deterministic=True` (in `predict`).
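A minimal sketch of such an evaluation loop, assuming a trained stable-baselines SAC model saved as `sac_halfcheetah` (the file name, environment, and episode count are illustrative, not taken from this thread):

```python
import gym
import numpy as np

from stable_baselines import SAC

# Separate environment used only for evaluation.
eval_env = gym.make("HalfCheetah-v2")
model = SAC.load("sac_halfcheetah")  # assumes a previously saved model

episode_rewards = []
for _ in range(10):
    obs, done, total_reward = eval_env.reset(), False, 0.0
    while not done:
        # deterministic=True uses the mean action instead of sampling
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = eval_env.step(action)
        total_reward += reward
    episode_rewards.append(total_reward)

print("Mean evaluation return:", np.mean(episode_rewards))
```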
@araffin Hello, the results on HalfCheetah-v2 are as follows:
Additional remarks:
To my knowledge, there are some differences between our implementation and haarnoja/sac:
However, just fixing these differences cannot reach haarnoja/sac performance. PS: evaluation code of stable-baselines
Edit:
Default hyperparameters won't give you the best results. You should change the network architecture and batch size to match the paper hyperparameters (see the sketch after these notes).
For the evaluation, call `predict` directly.
If you cannot match the result, please open an issue with the complete steps to reproduce your experiments. Note that this is
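As an illustration of the hyperparameter change suggested above, here is a minimal sketch of constructing SAC with a paper-style network and batch size; the specific values (two 256-unit layers, batch size 256, learning rate 3e-4) are assumptions drawn from the SAC paper, not from this thread:

```python
import gym

from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

env = gym.make("HalfCheetah-v2")

# Assumed paper-style hyperparameters: two hidden layers of 256 units and a
# batch size of 256 (the stable-baselines defaults are smaller).
model = SAC(
    MlpPolicy,
    env,
    policy_kwargs=dict(layers=[256, 256]),
    batch_size=256,
    learning_rate=3e-4,
    verbose=1,
)
model.learn(total_timesteps=1000000)
model.save("sac_halfcheetah")
```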
@araffin Thank you for the suggestions. I have also tried hyper-parameters matching the original paper before, but it still did not work. I will open an issue in a few days (a little busy these days :( ). By the way, although the SAC repo tests on HalfCheetah-v1, the performance on HalfCheetah-v2 is similar (the orange line is tested on v2). The entropy coefficient is regarded as alpha^-1. SAC in the haarnoja/sac repo sets the reward scale as a hyper-parameter; alpha=5 for HalfCheetah, so I said the haarnoja/sac setting corresponds to ent_coef=0.2.
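For reference, that mapping could be written out as below; this is only a sketch, and the 1/alpha relationship between the haarnoja/sac reward scale and the stable-baselines `ent_coef` is taken from the comment above:

```python
import gym

from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

# haarnoja/sac exposes a reward scale (alpha); stable-baselines exposes a
# fixed entropy coefficient instead, roughly ent_coef ~ 1 / reward_scale.
reward_scale = 5.0  # alpha = 5 used for HalfCheetah in haarnoja/sac
model = SAC(MlpPolicy, gym.make("HalfCheetah-v2"),
            ent_coef=1.0 / reward_scale,  # -> 0.2, the value mentioned above
            verbose=1)
```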
@araffin Hello, I have good news! Recently, I checked the code in the two repos and found the critical difference leading to the performance gap. In haarnoja/sac, the environment runs without the TimeLimit wrapper, while we run with it. (ref: https://github.com/haarnoja/sac/blob/8258e33633c7e37833cc39315891e77adfbe14b2/sac/envs/gym_env.py#L75)
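A small sketch of one way to mimic that setup with stable-baselines, by stripping the `TimeLimit` wrapper that `gym.make` adds (note that without the wrapper HalfCheetah never signals `done`, so training proceeds as one long episode):

```python
import gym

from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

# gym.make wraps HalfCheetah-v2 in a TimeLimit wrapper (1000-step episodes).
# Accessing .env strips that single wrapper, mimicking the haarnoja/sac setup
# in which episodes are not truncated by a time limit.
wrapped_env = gym.make("HalfCheetah-v2")
raw_env = wrapped_env.env

model = SAC(MlpPolicy, raw_env, verbose=1)
model.learn(total_timesteps=1000000)
```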
Good news, thanks for the update =) Yes, the Markovian assumption is important here for value estimation (I was planning to write a post about time limits in RL too).
Cool, looking forward to your post! By the way, do you have any plans to add new algorithms to stable-baselines? I like the code structure and the algorithm implementations of stable-baselines. Maybe I can contribute to it.
Thanks =) Well, if you want to add an algorithm, open an issue and we will discuss it.
OK. I will follow your roadmap and find something interesting! But how can I track the progress of the TF2 migration?
There is a WIP PR for that, as well as an open issue.
👌 I got it.
This PR adds the Soft Actor-Critic (SAC) algorithm and fixes some bugs.
Fixes:
Notes:
Differences with original implementation: