
[Question] Do stable-baselines implementations include improvements that are discovered after the paper is released? #244

Status: Closed
batu opened this issue on Nov 25, 2020 · 6 comments
Labels: question (Further information is requested)

Comments

@batu (Contributor) commented on Nov 25, 2020

Hello,

This question popped into my mind after seeing @araffin's very cool tweet about the effects of an epsilon value on training performance:
[Image: tweet screenshot showing the effect of the optimizer epsilon value on training performance]

I fully understand that stable-baselines is meant to be a stable set of baselines, so in one sense adding improvements on top of the original papers makes historically accurate comparisons more difficult. On the other hand, I use SB both as a set of baselines and when I am trying to solve my own RL problems and answer other research questions (maybe this is not the intended use case, but the library is really good, what can I say). For those use cases, I would definitely want the improvements to be included in the algorithms! Especially for the "state of the art" ones (loosely, PPO/SAC).

What is the team's position on this? I think both are perfectly valid approaches; I am just curious whether we can expect more robust improvements (such as those suggested here) to be included as "educated defaults".

Thank you for the great work

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
batu added the question (Further information is requested) label on Nov 25, 2020
@Miffyli (Collaborator) commented on Nov 25, 2020

See the full discussion here, especially #110 (comment). That discussion is about the differences in RMSProp between TF and PyTorch, but similar things apply to the epsilon.

As with many things in RL, it is not universally better to use one epsilon over another. With the smaller epsilon (1e-7) the agents did learn faster, but they were not as stable (IIRC I had CartPole results where the agent did not quite reach the maximum episodic reward of 500), while with the larger epsilon training took longer but eventually converged to stable results. Which option is better depends on your task, and sadly you need to try them out to see which performs better ^^'.
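(For concreteness, here is a minimal sketch, not from the original thread, of how both knobs can be tried in SB3: the RMSprop epsilon via A2C's `rms_prop_eps` argument, and the TF-style RMSprop from #110 via `policy_kwargs`. Which epsilon works better is task-dependent, as noted above.)

```python
from stable_baselines3 import A2C
from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike

# SB3 default for A2C: PyTorch RMSprop with eps=1e-5
# (slower to learn, but more stable in the CartPole runs described above).
stable_model = A2C("MlpPolicy", "CartPole-v1", rms_prop_eps=1e-5, verbose=1)

# Smaller epsilon: tends to learn faster, but was less stable in those runs.
fast_model = A2C("MlpPolicy", "CartPole-v1", rms_prop_eps=1e-7, verbose=1)

# TF-style RMSprop (the implementation difference discussed in #110),
# swapped in through policy_kwargs.
tf_like_model = A2C(
    "MlpPolicy",
    "CartPole-v1",
    policy_kwargs=dict(optimizer_class=RMSpropTFLike, optimizer_kwargs=dict(eps=1e-5)),
    verbose=1,
)

stable_model.learn(total_timesteps=50_000)
```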

The lesson from @araffin's tweet was that even very trivial/small-looking parameters can have a large impact on training in RL... Not too fun.

@batu (Contributor, Author) commented on Nov 25, 2020

You are definitely right; it is almost impossible to find "strictly better" improvements across different environments. However, I think papers like this one give a more educated default than the original paper's implementation defaults. Is SB3 acting on these more educated defaults?

Even though I leaned heavily on the epsilon in my question (kinda because its impact is... scary), I am curious about the general philosophy of SB3.

@Miffyli (Collaborator) commented on Nov 25, 2020

I am curious about the general philosophy of SB3.

Ah, right, sorry for missing this part! The goal here is to replicate the reference results as faithfully as possible, given the same training parameters. The default parameters are usually a "good" guess for smaller, robot-like tasks (e.g. A2C uses PyTorch RMSProp with eps=1e-5 by default), and good/optimal hyperparameters for other environments are tuned and shared in the zoo repo. This is the "baselines" part, and as you pointed out, it allows others to trust that the algorithms they use from here reflect what the algorithms' authors originally intended.
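(To make that split concrete, here is a rough sketch of mine, not from the thread: relying on the library defaults for a small task versus passing environment-specific hyperparameters explicitly, in the spirit of the tuned configs shared in the zoo repo. The override values below are purely illustrative, not the zoo's actual tuned values.)

```python
from stable_baselines3 import A2C

# Library defaults, meant to mirror the reference implementation
# (for A2C: PyTorch RMSprop with eps=1e-5).
default_model = A2C("MlpPolicy", "CartPole-v1", verbose=1)

# Environment-specific run: hyperparameters passed explicitly, the way the
# zoo's tuned configs do it (these particular values are illustrative only).
tuned_model = A2C(
    "MlpPolicy",
    "LunarLander-v2",
    gamma=0.995,
    learning_rate=8e-4,
    ent_coef=1e-5,
    verbose=1,
)

default_model.learn(total_timesteps=25_000)
```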

The "stable" part comes from backwards compatibility (e.g. you update the library and it still should work) and overall cleanliness of the implementations. We prefer to fix, clarify and extend existing implementations rather than adding new features to avoid complicating things too much too quick (contrib repo is designed to help with this). Extensions include e.g. support for more observation/action spaces, improved utilities for evaluation, etc.

@batu (Contributor, Author) commented on Nov 25, 2020

That's fair and is more in line with the goals of the library.

I just want you to do all the hard work, and for me to get higher performance via a simple update of the library!

It might be interesting to have a state_of_the_art implementation of the algorithms, maybe in contrib, that collects these types of small, implementation-level improvements that are discovered and validated after the paper's release. I am thinking of it more as "collecting better default implementation choices and tricks", not adding every new feature that gets published on arXiv, but of course even that is a lot of work and requires validation.

I will keep an eye out, and if I end up using and validating some of the changes I am talking about in SB, I will share the results and we can take a look.

Thank you for taking the time to answer!

batu closed this as completed on Nov 25, 2020
@araffin (Member) commented on Nov 25, 2020

However, I think papers like this one give a more educated default than the original paper's implementation defaults. Is SB3 acting on these more educated defaults?

Most of the things described in that paper are in fact already implemented for PPO/A2C ;)
for instance, small initial weights for the value network, orthogonal initialization, ...
And for the others, they are present in the zoo, for example a good default value of gae_lambda for PPO on continuous control tasks.
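(A hedged sketch of what that looks like in practice: `ortho_init` and `gae_lambda` are real SB3 parameters, but the specific gae_lambda value and environment id below are illustrative rather than the zoo's exact tuned config.)

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "Pendulum-v1",
    # Lower gae_lambda, in the spirit of the zoo's tuned configs for
    # continuous control (0.9 is illustrative, not necessarily the tuned value).
    gae_lambda=0.9,
    # Orthogonal initialization (one of the paper's recommendations) is
    # already the default in SB3's actor-critic policies; shown explicitly here.
    policy_kwargs=dict(ortho_init=True),
    verbose=1,
)
model.learn(total_timesteps=50_000)
```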

I just want you to do all the hard work, and for me to get higher performance via a simple update of the library!

Those improvements are usually problem-dependent... don't expect a silver bullet (and there is no free lunch ;) )

@batu (Contributor, Author) commented on Nov 25, 2020

Great! That's what I wanted to hear. There might not be free lunches, but surely there are free snacks.
