
Batch PPO Implementation #295

Merged
merged 38 commits into chainer:master
Nov 12, 2018

Conversation

ljvmiranda921
Contributor

@ljvmiranda921 ljvmiranda921 commented Aug 9, 2018

Refactor branch: https://github.com/ljvmiranda921/chainerrl/pull/13

Description

This PR is built on top of @iory 's A2C implementation in #149. It provides a batch/parallel implementation of Proximal Policy Optimization. I'm using #149's VecEnv environment to achieve this task. Here are the main changes:

  • Add batch implementation to PPO via the batch_act() and batch_observe() methods
  • Add chainerrl.experiments.train_agent_batch() Status: In progress
  • Add new tests for batch PPO Status: In progress

Changes in data structure

Previously, the computation relied on the self.memory and self.last_episode attributes of PPO. Now we also use self.batch_memory to handle this task. During a batch run, the type signature looks like:

# Type signature for self.batch_memory during batch run
batch_memory :: [[dict], [dict], [dict]]
where len(batch_memory) == batch_env.num_envs

The same goes for self.last_episode:

# Type signature for self.last_episode
last_episode :: [[dict], [dict], [dict]]
where len(last_episode) == batch_env.num_envs
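In sketch form, the layout is one inner list of transition dicts per parallel environment (the dict keys below are illustrative, not taken from the PR):

```python
# Per-env transition buffers for the batch PPO agent (sketch).
num_envs = 3

# One inner list of transition dicts per parallel environment.
batch_memory = [[] for _ in range(num_envs)]
last_episode = [[] for _ in range(num_envs)]

# Storing a transition for env 0; the keys are illustrative.
last_episode[0].append({"state": 0.0, "action": 1, "reward": 0.5})

assert len(batch_memory) == num_envs
assert len(last_episode) == num_envs
```

Each inner list grows independently until its environment's episode ends, which is why the outer length must always equal batch_env.num_envs.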

New methods: batch_act() and batch_observe()

These methods handle all batch computations during the run. The method batch_act(batch_obs) returns a batch of actions given a batch of observations, while batch_observe(obs, r, done, info) stores the resulting transitions and updates the model. A simple way to use them can be seen below:

import numpy as np

t = 0
steps = 100
# o_0, r_0 : Init observation and reward
obs = batch_env.reset()
r = np.zeros(num_process, dtype='f')
# Initialize episode reward
episode_r = np.zeros(num_process, dtype='f')

while t < steps:
    # a_t : First action
    action = agent.batch_act(obs)
    # o_{t+1}, r_{t+1} : Get observation and reward
    obs, r, done, info = batch_env.step(action)
    # Train model
    agent.batch_observe(obs, r, done, info)
    # Update counters
    t += 1
    update_or_reset_reward(episode_r, done, info)

This assumes that the environment in VectorEnv sends a reset signal in the form of a dictionary entry in info.
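The helper update_or_reset_reward used in the loop above is not shown in this PR; a minimal sketch under that assumption (a 'reset' entry in each info dict marks an ended episode; the key name is hypothetical):

```python
import numpy as np

def update_or_reset_reward(episode_r, done, info):
    # Hypothetical helper matching the call above: zero the running reward
    # for envs whose episode finished, signalled either by done or by a
    # 'reset' entry in the corresponding info dict.
    end = np.logical_or(done, [i.get("reset", False) for i in info])
    episode_r *= ~end  # keep rewards only for still-running episodes
    return episode_r

episode_r = np.array([1.0, 2.0, 3.0], dtype="f")
done = np.array([True, False, False])
info = [{}, {"reset": True}, {}]
update_or_reset_reward(episode_r, done, info)
```

Note that the array is modified in place, matching how the loop above passes episode_r without reassigning it.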

[Attached result plots: halfcheetah-v2-5-finish, hopper-v2-5-finish, reacher-v2-5-finish, cartpole-v1, halfcheetah-v2, hopper-v2, pendulum-v0, reacher-v2]
@ljvmiranda921 ljvmiranda921 changed the title Batch PPO Implementation [WIP] Batch PPO Implementation Aug 9, 2018
ljvmiranda921 and others added 7 commits August 9, 2018 16:28
This commit adds # NOQA comments to some top-level imports
in order to please flake8 (specifically E402)

Signed-off-by: ljvmiranda921 <[email protected]>
This commit adds a batch implementation of the Proximal Policy
Optimization algorithm. It is meant to interact with the VecEnv
environment in envs.vec_env.py.

batch_act() and batch_observe() methods are implemented to achieve
this task.

Signed-off-by: ljvmiranda921 <[email protected]>
This commit adds a train_agent_batch in the experiments module

Signed-off-by: ljvmiranda921 <[email protected]>
This commit adds another test class, TestBatchPPO, to test
the batch implementation of PPO.

Signed-off-by: ljvmiranda921 <[email protected]>
This commit adds a gym example for batch PPO implementation.

Signed-off-by: Lester James V. Miranda <[email protected]>
This commit refactors `batch_act` and `batch_observe` into:
`batch_act_and_train` and `batch_observe_and_train`. There's also
an additional set of `batch_act` and `batch_observe` methods implemented.

The idea is that during training, we use all the `*_and_train` methods, similar
to the standard API. Then we call its counterparts during testing/evaluation.

Signed-off-by: Lester James V. Miranda <[email protected]>
ljvmiranda921 and others added 6 commits September 5, 2018 10:23
This commit fixes the BatchPPO algorithm by applying a set of accumulators
to keep the episode memories in check and prevent leaking episodes without advantage
computations.

A deque was also implemented in train_agent_batch to control the resolution of the
reported mean_r

Signed-off-by: Lester James V. Miranda <[email protected]>
This commit implements the BatchEvaluator and updates the train_agent_batch
so that the agent is evaluated at some timestep. 

Signed-off-by: Lester James V. Miranda <[email protected]>
This commit fixes the return value of _batch_act in order to handle
the bug in Pendulum-v0

Signed-off-by: Lester James V. Miranda <[email protected]>
env = gym.make(args.env)
# Use different random seeds for train and test envs
env_seed = 2 ** 32 - 1 - args.seed if test else args.seed
env.seed(env_seed)
Member
Each env in VectorEnv should be assigned different random seeds. See train_a3c_gym.py for how to assign different random seeds.
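A sketch of the per-env seeding pattern the reviewer points to (variable names are illustrative; see train_a3c_gym.py for the actual helper):

```python
import numpy as np

num_envs = 4
base_seed = 0  # corresponds to args.seed in the example script

# One distinct seed per env process, offset by the base seed so that
# different runs never reuse the same per-env seeds.
process_seeds = np.arange(num_envs) + base_seed * num_envs

def env_seed(idx, test=False):
    # Shift test seeds away from train seeds, as in the snippet above.
    seed = int(process_seeds[idx])
    return 2 ** 32 - 1 - seed if test else seed

# Every env gets a different seed.
assert len({env_seed(i) for i in range(num_envs)}) == num_envs
```

Seeding each env identically would make the parallel rollouts correlated, which defeats the purpose of running multiple envs.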

# Start new episode for those with mask
episode_r *= masks
episode_len *= masks
t += 1
Member

t should be the total number of transitions experienced so far. So, it should increase by num_envs, not by 1. By doing so, we can keep the other hyperparameters and only change num_envs to trade cpus with computation time.
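The suggested bookkeeping, in sketch form: one batch step produces num_envs transitions, so the global counter advances by num_envs rather than 1.

```python
num_envs = 8
t = 0
max_steps = 100

while t < max_steps:
    # One batch step yields num_envs transitions, one per parallel env.
    t += num_envs

# t now counts total transitions experienced, not loop iterations,
# so step-based hyperparameters keep their meaning when num_envs changes.
assert t % num_envs == 0
```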

@ljvmiranda921 ljvmiranda921 changed the title [WIP] Batch PPO Implementation Batch PPO Implementation Sep 13, 2018
@toslunar toslunar merged commit b26aa88 into chainer:master Nov 12, 2018
@muupan muupan added this to the v0.5 milestone Nov 13, 2018