Batch PPO Implementation #295
Conversation
This reverts commit a2b4c8a.
This commit adds # NOQA comments to some top-level imports in order to please flake8 (specifically E402). Signed-off-by: ljvmiranda921 <[email protected]>
This commit adds a batch implementation of the Proximal Policy Optimization algorithm. It is meant to interact with the VecEnv environment in envs.vec_env.py. batch_act() and batch_observe() methods are implemented to achieve this task. Signed-off-by: ljvmiranda921 <[email protected]>
This commit adds a train_agent_batch function to the experiments module. Signed-off-by: ljvmiranda921 <[email protected]>
This commit adds another test class, TestBatchPPO, to test the batch implementation of PPO. Signed-off-by: ljvmiranda921 <[email protected]>
This commit adds a gym example for batch PPO implementation. Signed-off-by: Lester James V. Miranda <[email protected]>
This commit refactors `batch_act` and `batch_observe` into `batch_act_and_train` and `batch_observe_and_train`. There is also an additional set of `batch_act` and `batch_observe` methods implemented. The idea is that during training we use the `*_and_train` methods, similar to the standard API, and we call their counterparts during testing/evaluation. Signed-off-by: Lester James V. Miranda <[email protected]>
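A brief sketch of how this split is meant to be used; only the four method names come from the commit above, while the agent, observation, and info variables are assumed surrounding context:

```python
# During training: the *_and_train variants update the model as data comes in.
actions = agent.batch_act_and_train(obss)
agent.batch_observe_and_train(next_obss, rewards, dones, infos)

# During testing/evaluation: the plain counterparts act without training updates.
actions = agent.batch_act(obss)
agent.batch_observe(next_obss, rewards, dones, infos)
```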
This commit fixes the BatchPPO algorithm by applying a set of accumulators to keep the episode memories in check and prevent leaking episodes without advantage computations. A deque was also implemented in train_agent_batch to control the resolution of the reported mean_r. Signed-off-by: Lester James V. Miranda <[email protected]>
This commit implements the BatchEvaluator and updates the train_agent_batch so that the agent is evaluated at some timestep. Signed-off-by: Lester James V. Miranda <[email protected]>
This commit fixes the return value of _batch_act in order to handle the bug in Pendulum-v0. Signed-off-by: Lester James V. Miranda <[email protected]>
examples/gym/train_ppo_batch_gym.py
env = gym.make(args.env)
# Use different random seeds for train and test envs
env_seed = 2 ** 32 - 1 - args.seed if test else args.seed
env.seed(env_seed)
Each env in VectorEnv should be assigned different random seeds. See train_a3c_gym.py for how to assign different random seeds.
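A sketch of one way to do this, following the same train/test offset as in the snippet above; the helper name `make_env`, the `process_seeds` array, and `args.num_envs` are assumptions, not taken from this PR:

```python
import gym
import numpy as np

# Assumed sketch: derive a distinct base seed for each env index from args.seed,
# then mirror it into a disjoint range for test envs.
process_seeds = np.arange(args.num_envs) + args.seed * args.num_envs

def make_env(idx, test):
    env = gym.make(args.env)
    process_seed = int(process_seeds[idx])
    env_seed = 2 ** 32 - 1 - process_seed if test else process_seed
    env.seed(env_seed)
    return env
```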
# Start new episode for those with mask
episode_r *= masks
episode_len *= masks
t += 1
`t` should be the total number of transitions experienced so far, so it should increase by `num_envs`, not by 1. By doing so, we can keep the other hyperparameters fixed and only change `num_envs` to trade CPUs for computation time.
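A minimal sketch of the suggested change, assuming `num_envs` holds the number of parallel envs at this point in the loop:

```python
# Start new episode for those with mask
episode_r *= masks
episode_len *= masks
t += num_envs  # count one transition per parallel env each step, not 1 per loop iteration
```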
Refactor branch: https://github.com/ljvmiranda921/chainerrl/pull/13
Description
This PR is built on top of @iory's A2C implementation in #149. It provides a batch/parallel implementation of Proximal Policy Optimization. I'm using #149's `VecEnv` environment to achieve this task. Here are the main changes:

- `batch_act()` and `batch_observe()` methods
- `chainerrl.experiments.train_agent_batch()`

Status: In progress

Changes in data structure
Previously, the computation resides in the `self.memory` and `self.last_episode` attributes of PPO. Now we're also using `self.batch_memory` to handle this task. During a batch run, `self.batch_memory` takes a different type signature, and the same goes for `self.last_episode`.
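A rough, assumed illustration of such a layout (not the exact snippet from this PR): the non-batch buffers stay flat, while the batch variants keep one sub-list per parallel environment.

```python
from typing import Any, Dict, List

Transition = Dict[str, Any]  # e.g. {'state': ..., 'action': ..., 'reward': ...}
num_envs = 4  # illustrative

# Non-batch run: flat buffers.
memory: List[Transition] = []
last_episode: List[Transition] = []

# Batch run (assumed layout): one sub-list per parallel env.
batch_memory: List[List[Transition]] = [[] for _ in range(num_envs)]
last_episode_per_env: List[List[Transition]] = [[] for _ in range(num_envs)]
```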
New methods: `batch_act()` and `batch_observe()`
These methods are supposed to handle all batch computations during the run. The method `batch_act(batch_obs)` performs a set of actions given a set of observations, while `batch_observe(obs, r, done, info)` updates the model. A simple way to use them can be seen below. This assumes that the environment in `VectorEnv` sends a `reset` signal in the form of a dictionary entry in `info`.
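A minimal sketch of such a loop; `vec_env`, `agent`, `num_envs`, and `max_steps` are assumed context, and only `batch_act`/`batch_observe` come from this PR:

```python
# Hypothetical training loop over a vectorized environment.
obss = vec_env.reset()
t = 0
while t < max_steps:
    actions = agent.batch_act(obss)               # one action per environment
    obss, rs, dones, infos = vec_env.step(actions)
    # each info dict may carry a 'reset' entry telling the agent to start a new episode
    agent.batch_observe(obss, rs, dones, infos)
    t += num_envs                                 # total transitions across all envs
```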