Plans of releasing mujoco benchmark with ddpg/sac/td3 on Tianshou #274

Closed
ChenDRAG opened this issue Jan 17, 2021 · 5 comments · Fixed by #302, #305, #275, #278 or #280
Labels: discussion

ChenDRAG commented Jan 17, 2021

Purpose

The purpose of this issue (discussion) is to introduce a series of PRs in the near future targeted at releasing a benchmark (SAC, TD3, DDPG) on MuJoCo environments. Some features of the Tianshou platform will be enhanced along the way.

Introduction

By the time this issue was opened, Tianshou had attracted 2.4k stars on GitHub and had become a very popular deep RL library based purely on PyTorch (in contrast with OpenAI Baselines, rllab, etc.), thanks to the contributions of @Trinkle23897, @duburcqa, @youkaichao, and others. However, as the number of users grows day by day, some problems have started to spring up. One critical problem is that although Tianshou is a fast, structured, flexible library that officially supports many classic algorithms, it has done a relatively poor job of benchmarking the algorithms it supports. Examples and demonstrations are mostly tested on toy gym environments, and we have not yet provided detailed comparisons and analyses against classic papers for the officially supported algorithms. This may make users worry about the correctness and efficiency of the algorithms, and it makes it hard for researchers using Tianshou to reproduce the results of classic papers because of the lack of trustworthy hyperparameters (baselines, in other words).

Tianshou hopes to provide users with a lightweight and efficient DRL platform and to reduce the burden on RL researchers as much as possible. Even users who are beginners and not yet familiar with DRL algorithms or baselines should be able to design their own algorithms with minimal lines of code by inheriting from and using the official data/algorithm structures, understand the source code, and easily compare their ideas with standard algorithms. To achieve this, one thing we have to do is provide a detailed benchmark for widely used algorithms and environments.

This is what I have been trying to do, and the first step has been taken. Using Tianshou, I have managed to create a state-of-the-art benchmark for three algorithms on 9 of MuJoCo's 14 most widely used environments.

ddpg

| Environment | Tianshou | Spinning Up (PyTorch) | TD3 paper (DDPG) | TD3 paper (our DDPG) |
| --- | --- | --- | --- | --- |
| Ant | 990.4±4.3 | ~840 | 1005.3 | 888.8 |
| HalfCheetah | 11718.7±465.6 | ~11000 | 3305.6 | 8577.3 |
| Hopper | 2197.0±971.6 | ~1800 | 2020.5 | 1860.0 |
| Walker2d | 1400.6±905.0 | ~1950 | 1843.6 | 3098.1 |
| Swimmer | 144.1±6.5 | ~137 | N | N |
| Humanoid | 177.3±77.6 | N | N | N |
| Reacher | -3.3±0.3 | N | -6.51 | -4.01 |
| InvertedPendulum | 1000.0±0.0 | N | 1000.0 | 1000.0 |
| InvertedDoublePendulum | 8364.3±2778.9 | N | 9355.5 | 8370.0 |

td3

| Environment | Tianshou | Spinning Up (PyTorch) | TD3 paper |
| --- | --- | --- | --- |
| Ant | 5116.4±799.9 | ~3800 | 4372.4±1000.3 |
| HalfCheetah | 10201.2±772.8 | ~9750 | 9637.0±859.1 |
| Hopper | 3472.2±116.8 | ~2860 | 3564.1±114.7 |
| Walker2d | 3982.4±274.5 | ~4000 | 4682.8±539.6 |
| Swimmer | 104.2±34.2 | ~78 | N |
| Humanoid | 5189.5±178.5 | N | N |
| Reacher | -2.7±0.2 | N | -3.6±0.6 |
| InvertedPendulum | 1000.0±0.0 | N | 1000.0±0.0 |
| InvertedDoublePendulum | 9349.2±14.3 | N | 9337.5±15.0 |

sac

| Environment | Tianshou | Spinning Up (PyTorch) | SAC paper |
| --- | --- | --- | --- |
| Ant | 5850.2±475.7 | ~3980 | ~3720 |
| HalfCheetah | 12138.8±1049.3 | ~11520 | ~10400 |
| Hopper | 3542.2±51.5 | ~3150 | ~3370 |
| Walker2d | 5007.0±251.5 | ~4250 | ~3740 |
| Swimmer | 44.4±0.5 | ~41.7 | N |
| Humanoid | 5488.5±81.2 | N | ~5200 |
| Reacher | -2.6±0.2 | N | N |
| InvertedPendulum | 1000.0±0.0 | N | N |
| InvertedDoublePendulum | 9359.5±0.4 | N | N |

* Reward metric: each table entry is the max average return over 10 trials (different seeds) ± a single standard deviation over trials; each trial is itself averaged over another 10 test seeds, and only data from the first 1M steps is considered. The shaded region in the graphs also represents a single standard deviation. (Note that in the TD3 paper the shaded region represents only half of that.) A small sketch of this computation is given below the notes.

** ~ means the number is approximated from the graph because an accurate number is not provided in the paper. N means no graph is provided.

*** We used the latest version of all MuJoCo environments in gym (0.17.3), which is often not the case in other papers; please check the original papers for details. (Outcomes for different versions are usually similar, though.)

**** We didn't compare against OpenAI Baselines because its benchmark currently appears to be broken and I haven't been able to find the information I need. However, the Spinning Up docs state that "Spinning Up implementations of DDPG, TD3, and SAC are roughly at-parity with the best-reported results for these algorithms", so I think the lack of a comparison with OpenAI Baselines is acceptable.
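As a concrete illustration of the reward metric described in the first note, here is a minimal sketch in plain NumPy (the `benchmark_metric` helper and the array layout are assumptions for illustration, and this reflects one plausible reading of the metric, not benchmark code from Tianshou):

```python
import numpy as np

def benchmark_metric(test_returns: np.ndarray) -> str:
    """test_returns has shape (n_trials, n_checkpoints): each entry is the return
    at one evaluation checkpoint, already averaged over 10 test seeds, and only
    checkpoints within the first 1M environment steps are included."""
    best_per_trial = test_returns.max(axis=1)  # max average return of each trial
    mean = best_per_trial.mean()               # averaged over the 10 trials
    std = best_per_trial.std()                 # a single standard deviation over trials
    return f"{mean:.1f}±{std:.1f}"

# example with fake data: 10 trials, 20 evaluation checkpoints
print(benchmark_metric(np.random.uniform(0, 3000, size=(10, 20))))
```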

Figure

I only show one figure here as an example; all other figures for the Tianshou MuJoCo benchmark can be found here.

Achieving these results was not easy: it required not only hyperparameter tuning, but also changes to several features of the Tianshou platform, most of which had already been mentioned in various issues by different users. For example:

There are also other problems that existing issues haven't mentioned or that I hadn't noticed before. For instance:

  • In the trainer, log_interval for the update step and the env step can only be the same, which is inconvenient. A flexible logger would help.
  • In the net utils, the Net function can only create MLPs in which all hidden layers have the same width (see the sketch after this list).
  • In the policies, some policies add exploration noise even when evaluating the algorithm.
  • The Buffer and Collector in Tianshou are currently a little too complex because they try to support all features in a single class, which makes it quite inconvenient to understand the source code or to inherit from those classes to create customized data structures.
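To make the `Net` limitation above concrete (second bullet), below is a minimal sketch of an MLP builder that accepts an arbitrary list of hidden widths; the `build_mlp` name and signature are hypothetical and not part of Tianshou's API:

```python
from typing import Sequence

import torch
import torch.nn as nn

def build_mlp(input_dim: int, output_dim: int,
              hidden_sizes: Sequence[int]) -> nn.Module:
    """Build an MLP whose hidden layers may all have different widths,
    e.g. hidden_sizes=(256, 128, 64)."""
    layers, last = [], input_dim
    for size in hidden_sizes:
        layers += [nn.Linear(last, size), nn.ReLU(inplace=True)]
        last = size
    layers.append(nn.Linear(last, output_dim))
    return nn.Sequential(*layers)

# e.g. HalfCheetah-v3: 17-dim observation, 6-dim action
net = build_mlp(17, 6, hidden_sizes=(256, 128, 64))
print(net(torch.zeros(1, 17)).shape)  # torch.Size([1, 6])
```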

All the problems above will be addressed to a certain extent while releasing the benchmark. The scripts that produce this benchmark are hosted on my fork of Tianshou and can be found here. However, they cannot be merged directly: they were only used to demonstrate the idea, so they are not well organized (lacking consistency, docs, comments, tests, etc.), and this will also be a big merge into Tianshou, so we want to enhance the library without causing too much disruption for our users. As a result, I have made a plan and hope to merge all the code in 6 commits over the next few weeks. All of these commits ultimately target the release of the benchmark above.

Plans

Here I briefly introduce what these 6 commits try to do.

  1. In the net utils, enhance the Net function to support any type of MLP.
  • This is the most urgent commit because the Net function will be needed in another PR.
  2. Minor fixes of Batch, and the addition of a new ReplayBuffer class called CachedReplayBuffer.
  • CachedReplayBuffer is used to replace the _cached_buf of Collector in the next commit, which is critical to solving the n_step problem mentioned in Traditional step collector implementation is needed #245.
  • Change the definition of ReplayBuffer to a certain management of Batch, because a chronologically organized ReplayBuffer might not be suitable for all scenarios.
  • Give all buffer types inheriting from ReplayBuffer the same API (indexing methods, for instance), so that developers, not users, worry about the underlying implementation of the different ReplayBuffer types.
  • [Probably] Separate the stack option from the other abilities of ReplayBuffer, to make the source code easier to understand or rewrite, and to gain efficiency at the same time.
  • Docs, tests, etc.
  3. Refactor the Collector to support both ReplayBuffer and CachedReplayBuffer.
  • Fix Traditional step collector implementation is needed #245 by supporting CachedReplayBuffer and not allowing a plain ReplayBuffer to be used when n_env > 1.
  • Remove the rarely used return info to make the code more lightweight.
  • Change BasePolicy to prepare for the incoming change of the indexing method of CachedReplayBuffer.
  • Fix a bug in BasePolicy: when ignoring done and setting n_step > 1 in off-policy algorithms, a small number of target q values are calculated incorrectly.
  • Change the behavior of action noise: exploration noise will from now on be added in the Collector, which makes it easier to redefine and less likely to cause bugs than adding it in the forward function. Partly solves Noisy network implementation #194.
  • Small change in the trainer to coordinate with the Collector's change.
  • Docs, tests, etc.
  4. Refactor the trainer to add a self-defined logger.
  • Add a logger to the trainer which can be self-defined and will be used in benchmarking.
  • Remove the original log_interval, save_fn, writer, etc. (all logging functionality).
  • Add a default logger which basically does all the jobs of the original logging functionality. Partly solves Provide curve-drawing examples #161.
  • Docs, tests, etc.
  5. Some small fixes in tianshou/policy to make the policies easier to use, and add some standard tricks to them.
  • Take gym's 'TimeLimit.truncated' flag into account, to make the policies more efficient (see the sketch after this list).
  6. Release the MuJoCo benchmark (source code, data, graphs, detailed comparisons, analysis of hyperparameters, etc.) for the 3 algorithms.
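As a rough illustration of the 'TimeLimit.truncated' point in item 5 above (referenced there), the idea is that an episode cut off by the time limit should still be bootstrapped when computing targets, because it did not really terminate. This is a generic sketch under that assumption, not Tianshou's implementation:

```python
import numpy as np

def effective_done(done: np.ndarray, info: list) -> np.ndarray:
    """Treat transitions that ended only because of gym's time limit as
    non-terminal for bootstrapping purposes."""
    truncated = np.array([i.get("TimeLimit.truncated", False) for i in info])
    return np.logical_and(done, ~truncated)

# usage inside a TD-style target, with rew, gamma and q_next assumed given:
#   target = rew + gamma * (1.0 - effective_done(done, info)) * q_next
```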

Future work

  1. Remove warnings and implementations for originally supported but now unsupported methods.
  2. Add support (benchmarked in the same way) for other algorithms (VPG, PPO, TRPO, etc.).
  3. Speed analysis, and provide a set of hyperparameters that can be trained in parallel using Tianshou to speed up training.
  4. Consider discrete-action environments like Atari (maybe support Rainbow in Tianshou).
  5. A tutorial on how to tune hyperparameters for a given RL problem.
  6. ......
ChenDRAG changed the title from "Plans to releasing mujoco benchmark using ddpg/sac/td3 on Tianshou" to "Plans of releasing mujoco benchmark with ddpg/sac/td3 on Tianshou" on Jan 17, 2021
Trinkle23897 added the "discussion" label on Jan 17, 2021
Trinkle23897 (Collaborator) commented:

In short, the major change is to move the cache_buffer (currently handled in the Collector) down to the buffer level, to support exact n_step collection and make the collector cleaner.

I really love the method proposed by @ChenDRAG. He organizes the CachedReplayBuffer as:

| main_buffer | cache_buffer_1 | ... | cache_buffer_n |
|                 a whole batch                       |

where n == the number of envs. All of this data is stored in a single (and large) batch, so we can greatly simplify the original collector's code.
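A rough sketch of the index arithmetic implied by this layout (illustrative only, with made-up sizes, not the actual Tianshou implementation): every cache buffer owns a fixed slice of one flat storage, so per-env writes and the final merge into the main buffer reduce to index offsets.

```python
import numpy as np

main_size, cache_size, n_envs = 10000, 100, 4
total = main_size + cache_size * n_envs
rew = np.zeros(total)  # one flat array shared by the main and cache buffers

def cache_slice(env_id: int) -> slice:
    """Region of the flat storage owned by cache_buffer_{env_id + 1}."""
    start = main_size + env_id * cache_size
    return slice(start, start + cache_size)

# each env writes into its own cache slice; when its episode finishes, the
# collected transitions are copied into the main-buffer region [0, main_size)
print(cache_slice(0), cache_slice(3))  # slice(10000, 10100, None) slice(10300, 10400, None)
```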

Also, we plan to separate the async collect method into an AsyncCollector (inheriting from the simplified base collector). Most of the time users run experiments with the sync method, but the current async code in the Collector has a lot of overhead. This split of functionality will make things cleaner and easier for users to handle.

This was referenced Jan 18, 2021
Trinkle23897 added a commit that referenced this issue Jan 20, 2021
This is the first of the 6 commits mentioned in #274, which features:

1. Refactor of the `Net` class to support any form of MLP.
2. Enable type checking in utils.network.
3. Related changes in docs/tests/examples.
4. Move the Atari-related networks to examples/atari/atari_network.py.

Co-authored-by: Trinkle23897 <[email protected]>
Trinkle23897 added a commit that referenced this issue Jan 29, 2021
This is the second of the 6 commits mentioned in #274, which features a minor refactor of ReplayBuffer and adds two new ReplayBuffer classes, CachedReplayBuffer and ReplayBufferManager. You can check #274 for more detail.

1. Add ReplayBufferManager (handle a list of buffers) and CachedReplayBuffer;
2. Make sure the reserved keys cannot be edited by methods like `buffer.done = xxx`;
3. Add `set_batch` method for manually choosing the batch the ReplayBuffer wants to handle;
4. Add `sample_index` method, same as `sample` but only return index instead of both index and batch data;
5. Add `prev` (one-step previous transition index), `next` (one-step next transition index) and `unfinished_index` (the last modified index whose done==False);
6. Separate `alloc_fn` method for allocating new memory for `self._meta` when a new `(key, value)` pair comes in;
7. Move buffer's documentation to `docs/tutorials/concepts.rst`.

Co-authored-by: n+e <[email protected]>
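To illustrate the semantics of `prev`, `next`, and `unfinished_index` from point 5 of the commit message above, here is a simplified, non-circular illustration (not Tianshou's code): both index operators stop at episode boundaries instead of crossing them.

```python
import numpy as np

done = np.array([0, 0, 1, 0, 0, 0])  # two episodes; the second one is unfinished

def prev_index(i: int) -> int:
    """One-step previous index, never crossing an episode boundary."""
    return i if i == 0 or done[i - 1] else i - 1

def next_index(i: int) -> int:
    """One-step next index, never crossing an episode boundary."""
    return i if done[i] or i == len(done) - 1 else i + 1

def unfinished_index() -> np.ndarray:
    """Last written index of each episode whose done is still False."""
    return np.array([len(done) - 1]) if not done[-1] else np.array([], dtype=int)

print(prev_index(3), next_index(2), unfinished_index())  # 3 2 [5]
```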

Trinkle23897 commented Feb 19, 2021

TODO list after #280:

  • split buffer and collector into several files
  • optimization for batch
  • optimization for Atari training -- it is currently almost half the speed of 0.3.2
  • the docs of tianshou.policy need a TOC

Trinkle23897 added a commit that referenced this issue Feb 19, 2021
This is the third of the 6 commits mentioned in #274, which features a refactor of the Collector to fix #245. You can check #274 for more detail.

Things changed in this PR:

1. refactor the Collector to be cleaner, and split out AsyncCollector to support async venvs;
2. change the buffer.add API to add(batch, buffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.);
3. add policy.exploration_noise(act, batch) -> act;
4. small change in BasePolicy.compute_*_returns;
5. move reward_metric from the collector to the trainer;
6. fix the np.asanyarray issue (different numpy versions produce different output);
7. flake8 maxlength=88;
8. polish docs and fix tests.

Co-authored-by: n+e <[email protected]>
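A hedged sketch of what the `policy.exploration_noise(act, batch)` hook from point 3 of the commit message above might look like for a Gaussian-noise policy such as DDPG (the class and parameter names here are assumptions for illustration, not Tianshou's implementation):

```python
import numpy as np

class GaussianExplorationMixin:
    """Adds exploration noise via exploration_noise(act, batch), so the collector,
    not the policy's forward pass, is responsible for exploration."""

    def __init__(self, sigma: float = 0.1, act_low: float = -1.0, act_high: float = 1.0):
        self.sigma, self.act_low, self.act_high = sigma, act_low, act_high

    def exploration_noise(self, act: np.ndarray, batch) -> np.ndarray:
        # batch is unused here but kept to match the act/batch signature
        noise = np.random.normal(0.0, self.sigma, size=act.shape)
        return np.clip(act + noise, self.act_low, self.act_high)
```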
ChenDRAG self-assigned this on Feb 19, 2021
ChenDRAG added a commit that referenced this issue Feb 24, 2021
This is the 4th of the 6 commits mentioned in #274, which features:

1. Use a flexible logger to replace SummaryWriter in trainer.

Co-authored-by: Trinkle23897 <[email protected]>
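A minimal sketch of what such a pluggable trainer logger could look like (the `SimpleTrainLogger` class and its method names are assumptions for illustration, not Tianshou's final API): it wraps a SummaryWriter but lets env-step and update-step logging use independent intervals.

```python
from torch.utils.tensorboard import SummaryWriter

class SimpleTrainLogger:
    """Write train/update scalars at independently chosen intervals."""

    def __init__(self, writer: SummaryWriter,
                 train_interval: int = 1000, update_interval: int = 1000):
        self.writer = writer
        self.train_interval = train_interval
        self.update_interval = update_interval
        self._last_train = self._last_update = 0

    def log_train_data(self, rew: float, env_step: int) -> None:
        if env_step - self._last_train >= self.train_interval:
            self.writer.add_scalar("train/reward", rew, env_step)
            self._last_train = env_step

    def log_update_data(self, loss: float, grad_step: int) -> None:
        if grad_step - self._last_update >= self.update_interval:
            self.writer.add_scalar("update/loss", loss, grad_step)
            self._last_update = grad_step
```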

ChenDRAG commented Mar 1, 2021

The first 5 of the 6 commits discussed above are finished, and I have reproduced the MuJoCo benchmark for some algorithms in some environments. Some results are better, some are worse. Based on the results I observe, we can still use the benchmark graphs provided above. Perhaps the dev branch is ready to be merged into master?


ChenDRAG commented Mar 1, 2021

Example results: (three benchmark figures attached; images omitted here)


Trinkle23897 commented Mar 1, 2021

Could you please provide the new numerical results here (based on what you have experimented with)?
