Plans of releasing mujoco benchmark with ddpg/sac/td3 on Tianshou #274

Closed
ChenDRAG opened this issue Jan 17, 2021 · 5 comments · Fixed by #302, #305, #275, #278 or #280
Labels: discussion

ChenDRAG commented Jan 17, 2021

Purpose

The purpose of this issue (discussion) is to introduce a series of PRs in the near future targeted at releasing a benchmark (SAC, TD3, DDPG) on MuJoCo environments. Some features of the Tianshou platform will be enhanced along the way.

Introduction

By the time this issue was opened, Tianshou had attracted 2.4k stars on GitHub and had become a very popular deep RL library based purely on PyTorch (in contrast with OpenAI Baselines, rllab, etc.), thanks to the contributions of @Trinkle23897, @duburcqa, @youkaichao, and others. However, as the number of users grows day by day, some problems have started to spring up. One critical problem is that although Tianshou is a fast, structured, flexible library that officially supports many classic algorithms, it has done a relatively poor job of benchmarking the algorithms it supports. Examples and demonstrations are mostly tested on toy gym environments, and we have not yet provided detailed comparisons and analyses against classic papers for the officially supported algorithms. This may make users worry about the correctness and efficiency of the algorithms, and it makes it hard for researchers using Tianshou to reproduce the results of classic papers because of the lack of trustworthy hyperparameters (baselines, in other words).

Tianshou hopes to provide users with a lightweight and efficient DRL platform and to reduce the burden on RL researchers as much as possible. Even users who are beginners and not yet familiar with DRL algorithms or baselines should be able to design their own algorithms with minimal lines of code by inheriting from and using the official data/algorithm structures, understand the source code, and easily compare their ideas with standard algorithms. To achieve this, one thing we have to do is provide a detailed benchmark for widely used algorithms and environments.

This is what I have been trying to do, and the first step has been taken. Using Tianshou, I have managed to create a state-of-the-art benchmark for three algorithms on 9 of MuJoCo's 14 most widely used environments.

ddpg

| Environment | Tianshou | Spinning Up (PyTorch) | TD3 paper (DDPG) | TD3 paper (our DDPG) |
| --- | --- | --- | --- | --- |
| Ant | 990.4±4.3 | ~840 | 1005.3 | 888.8 |
| HalfCheetah | 11718.7±465.6 | ~11000 | 3305.6 | 8577.3 |
| Hopper | 2197.0±971.6 | ~1800 | 2020.5 | 1860.0 |
| Walker2d | 1400.6±905.0 | ~1950 | 1843.6 | 3098.1 |
| Swimmer | 144.1±6.5 | ~137 | N | N |
| Humanoid | 177.3±77.6 | N | N | N |
| Reacher | -3.3±0.3 | N | -6.51 | -4.01 |
| InvertedPendulum | 1000.0±0.0 | N | 1000.0 | 1000.0 |
| InvertedDoublePendulum | 8364.3±2778.9 | N | 9355.5 | 8370.0 |

td3

| Environment | Tianshou | Spinning Up (PyTorch) | TD3 paper |
| --- | --- | --- | --- |
| Ant | 5116.4±799.9 | ~3800 | 4372.4±1000.3 |
| HalfCheetah | 10201.2±772.8 | ~9750 | 9637.0±859.1 |
| Hopper | 3472.2±116.8 | ~2860 | 3564.1±114.7 |
| Walker2d | 3982.4±274.5 | ~4000 | 4682.8±539.6 |
| Swimmer | 104.2±34.2 | ~78 | N |
| Humanoid | 5189.5±178.5 | N | N |
| Reacher | -2.7±0.2 | N | -3.6±0.6 |
| InvertedPendulum | 1000.0±0.0 | N | 1000.0±0.0 |
| InvertedDoublePendulum | 9349.2±14.3 | N | 9337.5±15.0 |

sac

| Environment | Tianshou | Spinning Up (PyTorch) | SAC paper |
| --- | --- | --- | --- |
| Ant | 5850.2±475.7 | ~3980 | ~3720 |
| HalfCheetah | 12138.8±1049.3 | ~11520 | ~10400 |
| Hopper | 3542.2±51.5 | ~3150 | ~3370 |
| Walker2d | 5007.0±251.5 | ~4250 | ~3740 |
| Swimmer | 44.4±0.5 | ~41.7 | N |
| Humanoid | 5488.5±81.2 | N | ~5200 |
| Reacher | -2.6±0.2 | N | N |
| InvertedPendulum | 1000.0±0.0 | N | N |
| InvertedDoublePendulum | 9359.5±0.4 | N | N |

* Reward metric: each table entry is the max average return over 10 trials (different seeds) ± a single standard deviation over trials; each trial is itself averaged over another 10 test seeds, and only data from the first 1M steps is considered. The shaded region in the graphs also represents a single standard deviation. (Note that in the TD3 paper the shaded region represents only half of that.) A small sketch of this computation is given below the notes.

** ~ means the number is approximated from the graph because an accurate number is not provided in the paper. N means no graph is provided.

*** We used the latest version of all MuJoCo environments in gym (0.17.3), which is often not the case in other papers; please check the original papers for details. (Outcomes for different versions are usually similar, though.)

**** We didn't compare against OpenAI Baselines because its benchmark currently appears to be broken and I haven't been able to find the information I need. However, the Spinning Up docs state that "Spinning Up implementations of DDPG, TD3, and SAC are roughly at-parity with the best-reported results for these algorithms", so I think the lack of a comparison with OpenAI Baselines is acceptable.
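As a concrete illustration of the reward metric described in the first note, here is a minimal sketch in plain NumPy (the `benchmark_metric` helper and the array layout are assumptions for illustration, and this reflects one plausible reading of the metric, not benchmark code from Tianshou):

```python
import numpy as np

def benchmark_metric(test_returns: np.ndarray) -> str:
    """test_returns has shape (n_trials, n_checkpoints): each entry is the return
    at one evaluation checkpoint, already averaged over 10 test seeds, and only
    checkpoints within the first 1M environment steps are included."""
    best_per_trial = test_returns.max(axis=1)  # max average return of each trial
    mean = best_per_trial.mean()               # averaged over the 10 trials
    std = best_per_trial.std()                 # a single standard deviation over trials
    return f"{mean:.1f}±{std:.1f}"

# example with fake data: 10 trials, 20 evaluation checkpoints
print(benchmark_metric(np.random.uniform(0, 3000, size=(10, 20))))
```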

Figure

I only show one figure here as an example; all other figures for the Tianshou MuJoCo benchmark can be found here.

Achieving these results was not easy: it required not only hyperparameter tuning, but also changes to several features of the Tianshou platform, most of which had already been mentioned in various issues by different users. For example:

There are also other problems that existing issues haven't mentioned or that I hadn't noticed before. For instance:

  • In the trainer, log_interval for the update step and the env step can only be the same, which is inconvenient. A flexible logger would help.
  • In the net utils, the Net function can only create MLPs in which all hidden layers have the same width (see the sketch after this list).
  • In the policies, some policies add exploration noise even when evaluating the algorithm.
  • The Buffer and Collector in Tianshou are currently a little too complex because they try to support all features in a single class, which makes it quite inconvenient to understand the source code or to inherit from those classes to create customized data structures.
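To make the `Net` limitation above concrete (second bullet), below is a minimal sketch of an MLP builder that accepts an arbitrary list of hidden widths; the `build_mlp` name and signature are hypothetical and not part of Tianshou's API:

```python
from typing import Sequence

import torch
import torch.nn as nn

def build_mlp(input_dim: int, output_dim: int,
              hidden_sizes: Sequence[int]) -> nn.Module:
    """Build an MLP whose hidden layers may all have different widths,
    e.g. hidden_sizes=(256, 128, 64)."""
    layers, last = [], input_dim
    for size in hidden_sizes:
        layers += [nn.Linear(last, size), nn.ReLU(inplace=True)]
        last = size
    layers.append(nn.Linear(last, output_dim))
    return nn.Sequential(*layers)

# e.g. HalfCheetah-v3: 17-dim observation, 6-dim action
net = build_mlp(17, 6, hidden_sizes=(256, 128, 64))
print(net(torch.zeros(1, 17)).shape)  # torch.Size([1, 6])
```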

All the problems above will be addressed to a certain extent while releasing the benchmark. The scripts that produce this benchmark are hosted on my fork of Tianshou and can be found here. However, they cannot be merged directly: they were only used to demonstrate the idea, so they are not well organized (lacking consistency, docs, comments, tests, etc.), and this will also be a big merge into Tianshou, so we want to enhance the library without causing too much disruption for our users. As a result, I have made a plan and hope to merge all the code in 6 commits over the next few weeks. All of these commits ultimately target the release of the benchmark above.

Plans

Here I briefly introduce what these 6 commits try to do.

  1. In the net utils, enhance the Net function to support any type of MLP.
  • This is the most urgent commit because the Net function will be needed in another PR.
  2. Minor fixes of Batch, and the addition of a new ReplayBuffer class called CachedReplayBuffer.
  • CachedReplayBuffer is used to replace the _cached_buf of Collector in the next commit, which is critical to solving the n_step problem mentioned in Traditional step collector implementation is needed #245.
  • Change the definition of ReplayBuffer to a certain management of Batch, because a chronologically organized ReplayBuffer might not be suitable for all scenarios.
  • Give all buffer types inheriting from ReplayBuffer the same API (indexing methods, for instance), so that developers, not users, worry about the underlying implementation of the different ReplayBuffer types.
  • [Probably] Separate the stack option from the other abilities of ReplayBuffer, to make the source code easier to understand or rewrite, and to gain efficiency at the same time.
  • Docs, tests, etc.
  3. Refactor the Collector to support both ReplayBuffer and CachedReplayBuffer.
  • Fix Traditional step collector implementation is needed #245 by supporting CachedReplayBuffer and not allowing a plain ReplayBuffer to be used when n_env > 1.
  • Remove the rarely used return info to make the code more lightweight.
  • Change BasePolicy to prepare for the incoming change of the indexing method of CachedReplayBuffer.
  • Fix a bug in BasePolicy: when ignoring done and setting n_step > 1 in off-policy algorithms, a small number of target q values are calculated incorrectly.
  • Change the behavior of action noise: exploration noise will from now on be added in the Collector, which makes it easier to redefine and less likely to cause bugs than adding it in the forward function. Partly solves Noisy network implementation #194.
  • Small change in the trainer to coordinate with the Collector's change.
  • Docs, tests, etc.
  4. Refactor the trainer to add a self-defined logger.
  • Add a logger to the trainer which can be self-defined and will be used in benchmarking.
  • Remove the original log_interval, save_fn, writer, etc. (all logging functionality).
  • Add a default logger which basically does all the jobs of the original logging functionality. Partly solves Provide curve-drawing examples #161.
  • Docs, tests, etc.
  5. Some small fixes in tianshou/policy to make the policies easier to use, and add some standard tricks to them.
  • Take gym's 'TimeLimit.truncated' flag into account, to make the policies more efficient (see the sketch after this list).
  6. Release the MuJoCo benchmark (source code, data, graphs, detailed comparisons, analysis of hyperparameters, etc.) for the 3 algorithms.
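As a rough illustration of the 'TimeLimit.truncated' point in item 5 above (referenced there), the idea is that an episode cut off by the time limit should still be bootstrapped when computing targets, because it did not really terminate. This is a generic sketch under that assumption, not Tianshou's implementation:

```python
import numpy as np

def effective_done(done: np.ndarray, info: list) -> np.ndarray:
    """Treat transitions that ended only because of gym's time limit as
    non-terminal for bootstrapping purposes."""
    truncated = np.array([i.get("TimeLimit.truncated", False) for i in info])
    return np.logical_and(done, ~truncated)

# usage inside a TD-style target, with rew, gamma and q_next assumed given:
#   target = rew + gamma * (1.0 - effective_done(done, info)) * q_next
```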

Future work

  1. Remove warnings and implementations for originally supported but now unsupported methods.
  2. Add support (benchmarked in the same way) for other algorithms (VPG, PPO, TRPO, etc.).
  3. Speed analysis, and provide a set of hyperparameters that can be trained in parallel using Tianshou to speed up training.
  4. Consider discrete-action environments like Atari (maybe support Rainbow in Tianshou).
  5. A tutorial on how to tune hyperparameters for a given RL problem.
  6. ......
ChenDRAG changed the title from "Plans to releasing mujoco benchmark using ddpg/sac/td3 on Tianshou" to "Plans of releasing mujoco benchmark with ddpg/sac/td3 on Tianshou" on Jan 17, 2021
Trinkle23897 added the "discussion" label on Jan 17, 2021
Trinkle23897 (Collaborator) commented:

In short, the major change is to move the cache_buffer (currently handled in the Collector) down to the buffer level, to support exact n_step collection and make the collector cleaner.

I really love the method proposed by @ChenDRAG. He organizes the CachedReplayBuffer as:

| main_buffer | cache_buffer_1 | ... | cache_buffer_n |
|                 a whole batch                       |

where n == the number of envs. All of this data is stored in a single (and large) batch, so we can greatly simplify the original collector's code.
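A rough sketch of the index arithmetic implied by this layout (illustrative only, with made-up sizes, not the actual Tianshou implementation): every cache buffer owns a fixed slice of one flat storage, so per-env writes and the final merge into the main buffer reduce to index offsets.

```python
import numpy as np

main_size, cache_size, n_envs = 10000, 100, 4
total = main_size + cache_size * n_envs
rew = np.zeros(total)  # one flat array shared by the main and cache buffers

def cache_slice(env_id: int) -> slice:
    """Region of the flat storage owned by cache_buffer_{env_id + 1}."""
    start = main_size + env_id * cache_size
    return slice(start, start + cache_size)

# each env writes into its own cache slice; when its episode finishes, the
# collected transitions are copied into the main-buffer region [0, main_size)
print(cache_slice(0), cache_slice(3))  # slice(10000, 10100, None) slice(10300, 10400, None)
```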

Also, we plan to separate the async collect method into an AsyncCollector (inheriting from the simplified base collector). Most of the time users run experiments with the sync method, but the current async code in the Collector has a lot of overhead. This split of functionality will make things cleaner and easier for users to handle.

This was referenced Jan 18, 2021
Trinkle23897 added a commit that referenced this issue Jan 20, 2021
This is the first of the 6 commits mentioned in #274, which features:

1. Refactor of the `Net` class to support any form of MLP.
2. Enable type checking in utils.network.
3. Related changes in docs/tests/examples.
4. Move the Atari-related networks to examples/atari/atari_network.py.

Co-authored-by: Trinkle23897 <[email protected]>
Trinkle23897 added a commit that referenced this issue Jan 29, 2021
This is the second of the 6 commits mentioned in #274, which features a minor refactor of ReplayBuffer and adds two new ReplayBuffer classes, CachedReplayBuffer and ReplayBufferManager. You can check #274 for more detail.

1. Add ReplayBufferManager (handle a list of buffers) and CachedReplayBuffer;
2. Make sure the reserved keys cannot be edited by methods like `buffer.done = xxx`;
3. Add `set_batch` method for manually choosing the batch the ReplayBuffer wants to handle;
4. Add `sample_index` method, same as `sample` but only return index instead of both index and batch data;
5. Add `prev` (one-step previous transition index), `next` (one-step next transition index) and `unfinished_index` (the last modified index whose done==False);
6. Separate `alloc_fn` method for allocating new memory for `self._meta` when a new `(key, value)` pair comes in;
7. Move buffer's documentation to `docs/tutorials/concepts.rst`.

Co-authored-by: n+e <[email protected]>
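To illustrate the semantics of `prev`, `next`, and `unfinished_index` from point 5 of the commit message above, here is a simplified, non-circular illustration (not Tianshou's code): both index operators stop at episode boundaries instead of crossing them.

```python
import numpy as np

done = np.array([0, 0, 1, 0, 0, 0])  # two episodes; the second one is unfinished

def prev_index(i: int) -> int:
    """One-step previous index, never crossing an episode boundary."""
    return i if i == 0 or done[i - 1] else i - 1

def next_index(i: int) -> int:
    """One-step next index, never crossing an episode boundary."""
    return i if done[i] or i == len(done) - 1 else i + 1

def unfinished_index() -> np.ndarray:
    """Last written index of each episode whose done is still False."""
    return np.array([len(done) - 1]) if not done[-1] else np.array([], dtype=int)

print(prev_index(3), next_index(2), unfinished_index())  # 3 2 [5]
```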

Trinkle23897 commented Feb 19, 2021

TODO list after #280:

  • split buffer and collector into several files
  • optimization for batch
  • optimization for Atari training -- it is currently almost half the speed of 0.3.2
  • the docs of tianshou.policy need a TOC

Trinkle23897 added a commit that referenced this issue Feb 19, 2021
This is the third of the 6 commits mentioned in #274, which features a refactor of the Collector to fix #245. You can check #274 for more detail.

Things changed in this PR:

1. refactor the Collector to be cleaner, and split out AsyncCollector to support async venvs;
2. change the buffer.add API to add(batch, buffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.);
3. add policy.exploration_noise(act, batch) -> act;
4. small change in BasePolicy.compute_*_returns;
5. move reward_metric from the collector to the trainer;
6. fix the np.asanyarray issue (different numpy versions produce different output);
7. flake8 maxlength=88;
8. polish docs and fix tests.

Co-authored-by: n+e <[email protected]>
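A hedged sketch of what the `policy.exploration_noise(act, batch)` hook from point 3 of the commit message above might look like for a Gaussian-noise policy such as DDPG (the class and parameter names here are assumptions for illustration, not Tianshou's implementation):

```python
import numpy as np

class GaussianExplorationMixin:
    """Adds exploration noise via exploration_noise(act, batch), so the collector,
    not the policy's forward pass, is responsible for exploration."""

    def __init__(self, sigma: float = 0.1, act_low: float = -1.0, act_high: float = 1.0):
        self.sigma, self.act_low, self.act_high = sigma, act_low, act_high

    def exploration_noise(self, act: np.ndarray, batch) -> np.ndarray:
        # batch is unused here but kept to match the act/batch signature
        noise = np.random.normal(0.0, self.sigma, size=act.shape)
        return np.clip(act + noise, self.act_low, self.act_high)
```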
ChenDRAG self-assigned this on Feb 19, 2021
ChenDRAG added a commit that referenced this issue Feb 24, 2021
This is the 4th of the 6 commits mentioned in #274, which features:

1. Use a flexible logger to replace SummaryWriter in trainer.

Co-authored-by: Trinkle23897 <[email protected]>
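A minimal sketch of what such a pluggable trainer logger could look like (the `SimpleTrainLogger` class and its method names are assumptions for illustration, not Tianshou's final API): it wraps a SummaryWriter but lets env-step and update-step logging use independent intervals.

```python
from torch.utils.tensorboard import SummaryWriter

class SimpleTrainLogger:
    """Write train/update scalars at independently chosen intervals."""

    def __init__(self, writer: SummaryWriter,
                 train_interval: int = 1000, update_interval: int = 1000):
        self.writer = writer
        self.train_interval = train_interval
        self.update_interval = update_interval
        self._last_train = self._last_update = 0

    def log_train_data(self, rew: float, env_step: int) -> None:
        if env_step - self._last_train >= self.train_interval:
            self.writer.add_scalar("train/reward", rew, env_step)
            self._last_train = env_step

    def log_update_data(self, loss: float, grad_step: int) -> None:
        if grad_step - self._last_update >= self.update_interval:
            self.writer.add_scalar("update/loss", loss, grad_step)
            self._last_update = grad_step
```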

ChenDRAG commented Mar 1, 2021

The first 5 of the 6 commits discussed above are finished, and I have reproduced the MuJoCo benchmark for some algorithms in some environments. Some results are better, some are worse. Based on the results I observe, we can still use the benchmark graphs provided above. Perhaps the dev branch is ready to be merged into master?


ChenDRAG commented Mar 1, 2021

Example results: (three benchmark figures attached; images omitted here)


Trinkle23897 commented Mar 1, 2021

Could you please provide the new numerical results here (based on what you have experimented with)?
