SAC-discrete implementation #270

Merged · 42 commits · Jan 13, 2023

Conversation

@timoklein (Collaborator) commented Aug 29, 2022

Description

Adds the SAC-discrete algorithm as discussed in #266.

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithms or if your change could result in a performance difference, you may need to (re-)run the tracked experiments. See #137 as an example PR.

  • I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have added links to the PR related to the algorithm.
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format with width=500 and height=300).
    • I have added links to the tracked experiments.
    • I have updated the overview sections at the docs and the repo
  • I have updated the tests accordingly (if applicable).


@vwxyzjn vwxyzjn requested a review from dosssman August 31, 2022 20:55
@dosssman (Collaborator) commented Sep 1, 2022

Is there any benchmark run for this? How does it perform?
Anyway, great job @timoklein!

@timoklein (Collaborator, Author) commented Sep 6, 2022

The results of the original paper are in #266. There are quite a few games where its performance at 100k steps doesn't differ much from that of a random agent, e.g. Frostbite. I don't think it makes much sense to evaluate those.

So far I've just run it a couple of times manually to check that the code actually works. A modified version of this codebase has been able to solve Catcher (PyGame) and some simple Minigrid environments.

In general, I'm going to try to find some good hyperparameters and then run it on a few environments where performance actually differs from a random agent. I don't know for sure when I'll have time for that, though... Once that's done, I'll post a report here. I may also run it for 200k steps to verify the results.

After that, I plan to start on some docs :)

Thanks for helping out with this @Howuhh and @dosssman !

@timoklein (Collaborator, Author) commented

Posting an update:
I'm currently running 1M-step experiments on Seaquest (which takes a while) with two versions of the algorithm: one implemented as closely as possible to CleanRL's existing SAC, and one that stays closer to the paper.
The main difference is the update frequency: CleanRL's SAC does a critic update on every learning step and delays the actor updates, but compensates for the delay by doing multiple actor updates at once. As far as I understand, SAC-discrete does a single actor and critic update every four steps, without compensating for the delay.
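
For concreteness, the two schedules look roughly like this (a minimal sketch in Python with hypothetical names such as update_critic, update_actor, and policy_frequency; not the actual PR code):

# CleanRL-style schedule: critic update on every learning step,
# actor updates delayed but compensated by looping policy_frequency times.
update_critic(batch)
if global_step % policy_frequency == 0:
    for _ in range(policy_frequency):  # compensate for the delayed actor updates
        update_actor(batch)

# Paper-style SAC-discrete schedule: one critic and one actor update
# every four steps, with no compensation for the delay.
if global_step % 4 == 0:
    update_critic(batch)
    update_actor(batch)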

@timoklein (Collaborator, Author) commented

I ran some experiments on MsPacman and Seaquest.

Here's a link to a report with some results. The entropy regularization coefficient $\alpha$ tends to explode when training for longer, but I couldn't find other experiments at 1M steps to verify whether that's an issue only in my code. Maybe @Howuhh and @dosssman have an idea?

This implementation doesn't quite match the results of the paper, which might be due to not using an evaluation mode (i.e., a deterministic policy). If desired, I can implement a test loop that evaluates a deterministic policy.
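
For reference, a deterministic evaluation loop could look roughly like the following (a sketch only, assuming a Gymnasium-style environment API; actor, eval_env, and device are hypothetical names, and the actor is assumed to output per-action logits):

# Greedy (deterministic) evaluation: take the argmax action instead of sampling.
obs, _ = eval_env.reset()
done, episode_return = False, 0.0
while not done:
    with torch.no_grad():
        logits = actor(torch.as_tensor(obs, dtype=torch.float32, device=device).unsqueeze(0))
        action = torch.argmax(logits, dim=-1).item()
    obs, reward, terminated, truncated, _ = eval_env.step(action)
    episode_return += reward
    done = terminated or truncated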

The performance does match the reported values in this other implementation I found.

Are there any other experiments you'd like me to run, e.g. specific environments or more seeds?

I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).

Is there a way I can do this without xvfb? It isn't installed on our group's machines and I don't have sudo.

If everything's fine, I'll start writing the docs :)

@vwxyzjn (Owner) commented Sep 23, 2022

Hi @timoklein, thank you! The experiments look very interesting.

This implementation doesn't quite match the results of the paper which might be due to not using evaluation mode (i.e. deterministic policy). If it's desired I can implement a test loop evaluating a deterministic policy.

Unless it makes a huge difference in the reported results, I wouldn't worry about it.

Are there any other experiments you'd like me to run, e.g. specific environments or more seeds?

Thanks for running the environments with three random seeds, which is great. Out of personal interest, I would like to see results on Pong, Breakout, and BeamRider, which are the three environments we commonly use to benchmark other algorithms (see example here).

Are there ways I can do this without xvfb? It's not on our group's machines and I don't have sudo.

If this is an issue, feel free to run the experiments without xvfb. The video recordings are nice to have, but I am not requiring every run to have them. For example, it's not practical to obtain videos in EnvPool, so I didn't do it.

That said, please save the model at the end of training so that we can visualize them later. Feel free to use the following code:

os.makedirs(f"models/{run_name}", exist_ok=True)  # torch.save does not create missing directories
torch.save(agent.state_dict(), f"models/{run_name}/agent.pt")
if args.prod_mode:
    wandb.save(f"models/{run_name}/agent.pt", base_path=f"models/{run_name}", policy="now")
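
The saved weights could later be restored for visualization along these lines (a sketch; Agent is a hypothetical stand-in for whatever network class the training script defines):

agent = Agent(envs).to(device)  # same network class as used during training
agent.load_state_dict(torch.load(f"models/{run_name}/agent.pt", map_location=device))
agent.eval()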

@Howuhh (Contributor) commented Sep 23, 2022

@timoklein Sometimes the target entropy is just very high and hard to reach, so the loss can explode (alpha will grow and grow). Usually I tune the scale coefficient a bit (it defaults to 0.98 in the code right now).

Some discussion about it:
toshikwa/sac-discrete.pytorch#15 (comment)
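
For context, in the discrete-SAC formulation the target entropy is a fraction of the maximum (uniform-policy) entropy over the action space, and alpha is pushed up whenever the policy entropy stays below that target. A minimal sketch with hypothetical variable names:

# Target entropy: a fraction (e.g. 0.98) of the entropy of a uniform policy over n_actions.
target_entropy = target_entropy_scale * torch.log(torch.tensor(float(n_actions)))

# Temperature update: minimizing this loss increases alpha whenever the policy
# entropy is below target_entropy, which is why alpha can keep growing.
probs = action_probs                              # (batch, n_actions), from the actor
log_probs = torch.log(probs + 1e-8)
policy_entropy = -(probs * log_probs).sum(dim=-1)
alpha_loss = (log_alpha.exp() * (policy_entropy - target_entropy).detach()).mean()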

@timoklein (Collaborator, Author) commented

From my point of view this is done now. I ran the new experiments and updated all plots and results. I also deleted all runs that aren't used for the final results from the openrlbenchmark project. The docs should work as well.
There will surely be some inevitable bugs; just ping me in the issue and I'll try to fix them :)

@dosssman (Collaborator) commented

Great contribution!
I think we can proceed with merging this, as it matches baseline results while maintaining a relatively simple implementation.
Other issues, if any exist, will come to light as other people try it out from a fresh perspective.

@dosssman dosssman closed this Dec 19, 2022
@dosssman dosssman reopened this Dec 19, 2022

@araffin left a comment


minor: the title for the pong figure is not the right one

@timoklein (Collaborator, Author) commented

minor: the title for the pong figure is not the right one

Thanks, good catch!

@timoklein (Collaborator, Author) commented

Anything more to do here for me? One thing that might be blocking:

The codespell check complains about the name "Jimmy Ba". It doesn't seem to have inline ignores (codespell-project/codespell#1212), so I'm not quite sure how to handle it.

I can just remove the citation so that the CI error is gone.

@vwxyzjn (Owner) left a comment


Everything LGTM. Feel free to merge after you have resolved the minor target-entropy-scale issue. You should already have contributor access. Thanks so much for this contribution, and sorry for the delay.

I can just remove the citation so that the CI error is gone.

Please keep the citation and just add the ignore words to the pre-commit config instead:

- --ignore-words-list=nd,reacher,thist,ths,magent

@timoklein (Collaborator, Author) commented

Everything LGTM. Feel free to merge after you have resolved the minor target-entropy-scale issue. You should already have contributor access. Thanks so much for this contribution, and sorry for the delay.

No problem. It's been a fun experience and I learned a lot. Looking forward to contributing more in the future!

@timoklein timoklein closed this Jan 13, 2023
@timoklein timoklein reopened this Jan 13, 2023
@timoklein (Collaborator, Author) commented Jan 13, 2023

@vwxyzjn I'm not quite sure why this still fails:
https://github.com/vwxyzjn/cleanrl/actions/runs/3912194839/jobs/6686505080#step:4:92

"Ba" should be correctly added to the pre-commit

- --ignore-words-list=nd,reacher,thist,ths,magent,Ba

I'm going to fix it but it might be Sunday before I get around to doing it.

EDIT: Probably a capitalization issue, see codespell-project/codespell#2137.
