[RLlib] Fix A2C release tests #27314

Merged
12 commits merged into ray-project:master on Aug 2, 2022

Conversation

kouroshHakha (Contributor) commented:

Why are these changes needed?

The current A2C implementation implies that if microbatch_size is not specified, we fall back to the following pseudocode inside a single training_step call (via multi_gpu_train_one_step):

# method A
train_batch = sample_env(train_batch_size)
for mini_batch in BatchIter(train_batch):
	g = compute_grads(mini_batch)
	apply_grads(g)

This is problematic for A2C. A2C is an on-policy algorithm, which means that the moment you update the policy network (even by a single gradient step) you have to re-sample the environment and perform the next gradient update with the new samples. The code above updates the network on every minibatch iteration, which is likely why A2C does not learn Breakout in our release tests.

Ideally, the policy should be updated according to one of the following rules in each iteration:

  1. Use all the samples collected from policy(t-1) to update the policy only once:
# method B
train_batch = sample_env(train_batch_size)
g = compute_grads(train_batch)
apply_grads(g)
  2. Break the sample batch collected from policy(t-1) into minibatches, compute the gradients on each minibatch, and apply the average of the gradients once all of them have been computed. This approach also performs only one gradient update, calculated from the entire sample batch (a runnable sketch of this pattern follows the list):
# method C
train_batch = sample_env(train_batch_size)
g_list = []
for mini_batch in BatchIter(train_batch):
	g = compute_grads(mini_batch)
	g_list.append(g)
apply_grads(g_list.mean())
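
To make the accumulate-then-average pattern of method C concrete, here is a minimal runnable sketch using PyTorch autograd. The linear model, random data, and MSE loss are toy placeholders of my own, not RLlib code; the point is only the single optimizer step taken after all minibatch gradients have been accumulated.

# Toy sketch of method C (not RLlib code): accumulate minibatch gradients,
# average them, and apply a single update for the whole train batch.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                                # stand-in for a policy network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

train_batch_x = torch.randn(32, 4)                     # pretend this is sample_env(train_batch_size)
train_batch_y = torch.randn(32, 2)
minibatch_size = 8
num_minibatches = train_batch_x.shape[0] // minibatch_size

optimizer.zero_grad()
for i in range(num_minibatches):
    sl = slice(i * minibatch_size, (i + 1) * minibatch_size)
    loss = nn.functional.mse_loss(model(train_batch_x[sl]), train_batch_y[sl])
    # Dividing by num_minibatches makes the accumulated .grad equal the
    # average of the per-minibatch gradients (g_list.mean() above).
    (loss / num_minibatches).backward()
optimizer.step()                                       # one update per train batch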

Some algorithms, like PPO, have a loss function that hedges the policy against updates that diverge too far from the old policy (e.g. via some sort of KL-divergence penalty). For these algorithms it is better to take multiple gradient updates per collected sample batch. Therefore, the following pseudocode provides the most generic and correct way of updating the networks, and it works for both A2C and PPO given the right hyperparameters:

train_batch = sample_env(train_batch_size)

for grad_update_iter in range(num_grad_updates_per_iter):
	###### This block can be shared across all algos:
	## Given a train_batch and a minibatch_size, update the network exactly once.
	## This is effectively a single full-batch update if minibatch_size == train_batch_size.
	train_batch_iter = BatchIter(train_batch, minibatch_size)
	g_list = []
	for mini_batch in train_batch_iter:
		g = compute_grads(mini_batch)
		g_list.append(g)
	apply_grads(g_list.mean())
	#################################################

The outer loop sets how many times we want to update the network parameters; the inner loop computes the gradients on the entire train batch in minibatch steps so that everything fits in GPU memory (in case we hit a GPU memory bottleneck). Note that, as long as compute resources allow, different values of minibatch_size should not change the gradient update, i.e. the gradient computed in minibatch mode should be identical to the gradient computed on the whole batch.
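
As a sanity check of that claim, the following small PyTorch snippet (toy model and data, my own example rather than RLlib code) verifies that averaging the gradients of equal-sized minibatches reproduces the full-batch gradient when the loss is a mean over samples.

# Toy check: minibatch-averaged gradients match the full-batch gradient
# when the loss averages over samples and the minibatches are equal-sized.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
x, y = torch.randn(32, 4), torch.randn(32, 2)

model.zero_grad()
nn.functional.mse_loss(model(x), y).backward()         # full-batch gradient
full_grad = [p.grad.clone() for p in model.parameters()]

model.zero_grad()
for xb, yb in zip(x.chunk(4), y.chunk(4)):             # 4 equal-sized minibatches
    (nn.functional.mse_loss(model(xb), yb) / 4).backward()
mini_grad = [p.grad.clone() for p in model.parameters()]

assert all(torch.allclose(f, m, atol=1e-6) for f, m in zip(full_grad, mini_grad))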

The plots below show the difference between method A (gray), method B (red), and method C (green) on A2C. Method A does not learn at all, method B learns and is stable, and method C learns (sort of) but is unstable (we do not know exactly why).

[Plot: learning curves comparing method A (gray), method B (red), and method C (green) on A2C]

These update methods should be revisited when we revamp the policy/model APIs, but until then, here is the short-term solution to fix A2C's release tests.

For the case where we fall back to the default training_step() (i.e. microbatch_size is None), we need to do the following:

  • By default, we will set train_batch_size to rollout_fragment_length x num_workers x num_envs_per_worker, which is the size of the collected sample batch. We will also error out if sgd_minibatch_size < train_batch_size. This set of parameters makes sure method A reduces to method B (a rough sketch of this sizing logic follows the list).
  • If train_batch_size is set via the config, it should be larger than rollout_fragment_length x num_workers x num_envs_per_worker. This ensures that we do at most 2 gradient updates per iteration, which is not that unstable for A2C.
  • This case supports multi-GPU because it calls multi_gpu_train_one_step().
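
As a rough, plain-Python sketch of the sizing rules above (the helper name and config keys are hypothetical and do not mirror RLlib's actual validation code):

# Hypothetical sketch of the default-path sizing rules (not RLlib's actual code).
def validate_default_path(config: dict) -> dict:
    sample_batch_size = (
        config["rollout_fragment_length"]
        * config["num_workers"]
        * config["num_envs_per_worker"]
    )
    # Default train_batch_size to exactly one collected sample batch.
    config.setdefault("train_batch_size", sample_batch_size)
    # A user-provided train_batch_size should cover at least one sample batch.
    if config["train_batch_size"] < sample_batch_size:
        raise ValueError(
            "train_batch_size should be >= rollout_fragment_length "
            "* num_workers * num_envs_per_worker"
        )
    # Splitting the train batch into smaller SGD minibatches would turn
    # method B back into method A, so error out in that case.
    if config.get("sgd_minibatch_size", config["train_batch_size"]) < config["train_batch_size"]:
        raise ValueError("sgd_minibatch_size must be >= train_batch_size")
    return config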

For the case where we want to do microbatching (i.e. microbatch_size is not None):

  • If num_gpus > 1, we should raise an error. This path directly uses compute_gradients(), which does not support multi-GPU on its own (a sketch of this path follows below).
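
For completeness, here is one plausible reading of the microbatch path as a hedged sketch: compute_gradients/apply_gradients and the averaging helper follow the pseudocode above rather than exact RLlib signatures, and the accumulate-until-train_batch_size loop is my interpretation, not a verbatim copy of the implementation.

# Hedged sketch of the microbatch path (pseudocode-style names, not exact RLlib APIs).
def microbatch_training_step(config, sample_env, compute_gradients, apply_gradients, average_grads):
    if config["num_gpus"] > 1:
        # compute_gradients() is used directly here and has no multi-GPU support.
        raise ValueError("microbatch_size cannot be combined with num_gpus > 1")
    collected, grads = 0, []
    # Accumulate gradients from microbatches until one train batch worth of
    # samples has been covered, then apply the averaged gradient once.
    while collected < config["train_batch_size"]:
        micro_batch = sample_env(config["microbatch_size"])
        grads.append(compute_gradients(micro_batch))
        collected += config["microbatch_size"]
    apply_gradients(average_grads(grads))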

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
richardliaw merged commit bda5026 into ray-project:master on Aug 2, 2022
richardliaw (Contributor) commented:

ALL GREEN

Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022