[RLlib] Fix A2C release tests #27314

Merged
12 commits merged into ray-project:master on Aug 2, 2022

Conversation

kouroshHakha (Contributor) commented:

Why are these changes needed?

The current A2C implementation implies that if microbatch_size is not specified, we fall back to the following pseudocode inside a single training_step call (via multi_gpu_train_one_step):

# method A
train_batch = sample_env(train_batch_size)
for mini_batch in BatchIter(train_batch):
	g = compute_grads(mini_batch)
	apply_grads(g)

This is problematic for A2C. A2C is an on-policy algorithm, which means that the moment you update the policy network (even by a single gradient step) you have to re-sample the environment and perform the next gradient update with the new samples. The code above updates the network on every minibatch iteration, which is likely why A2C does not learn Breakout in our release tests.

Ideally, the policy should be updated according to one of the following rules in each iteration:

  1. Use all the samples collected from policy(t-1) to update the policy only once:
# method B
train_batch = sample_env(train_batch_size)
g = compute_grads(train_batch)
apply_grads(g)
  2. Break the sample batch collected from policy(t-1) into minibatches, compute the gradients on each minibatch, and apply the average of the gradients once all of them have been computed. This approach also performs only one gradient update, calculated from the entire sample batch (a runnable sketch of this pattern follows the list):
# method C
train_batch = sample_env(train_batch_size)
g_list = []
for mini_batch in BatchIter(train_batch):
	g = compute_grads(mini_batch)
	g_list.append(g)
apply_grads(g_list.mean())
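
To make the accumulate-then-average pattern of method C concrete, here is a minimal runnable sketch using PyTorch autograd. The linear model, random data, and MSE loss are toy placeholders of my own, not RLlib code; the point is only the single optimizer step taken after all minibatch gradients have been accumulated.

# Toy sketch of method C (not RLlib code): accumulate minibatch gradients,
# average them, and apply a single update for the whole train batch.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                                # stand-in for a policy network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

train_batch_x = torch.randn(32, 4)                     # pretend this is sample_env(train_batch_size)
train_batch_y = torch.randn(32, 2)
minibatch_size = 8
num_minibatches = train_batch_x.shape[0] // minibatch_size

optimizer.zero_grad()
for i in range(num_minibatches):
    sl = slice(i * minibatch_size, (i + 1) * minibatch_size)
    loss = nn.functional.mse_loss(model(train_batch_x[sl]), train_batch_y[sl])
    # Dividing by num_minibatches makes the accumulated .grad equal the
    # average of the per-minibatch gradients (g_list.mean() above).
    (loss / num_minibatches).backward()
optimizer.step()                                       # one update per train batch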

Some algorithms, like PPO, have a loss function that hedges the policy against updates that diverge too far from the old policy (e.g. via some sort of KL-divergence penalty). For these algorithms it is better to take multiple gradient updates per collected sample batch. Therefore, the following pseudocode provides the most generic and correct way of updating the networks, and it works for both A2C and PPO given the right hyperparameters:

train_batch = sample_env(train_batch_size)

for grad_update_iter in range(num_grad_updates_per_iter):
	###### This block can be shared across all algos:
	## Given a train_batch and a minibatch_size, update the network exactly once.
	## This is effectively a single full-batch update if minibatch_size == train_batch_size.
	train_batch_iter = BatchIter(train_batch, minibatch_size)
	g_list = []
	for mini_batch in train_batch_iter:
		g = compute_grads(mini_batch)
		g_list.append(g)
	apply_grads(g_list.mean())
	#################################################

The outer loop sets how many times we want to update the network parameters; the inner loop computes the gradients on the entire train batch in minibatch steps so that everything fits in GPU memory (in case we hit a GPU memory bottleneck). Note that, as long as compute resources allow, different values of minibatch_size should not change the gradient update, i.e. the gradient computed in minibatch mode should be identical to the gradient computed on the whole batch.
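
As a sanity check of that claim, the following small PyTorch snippet (toy model and data, my own example rather than RLlib code) verifies that averaging the gradients of equal-sized minibatches reproduces the full-batch gradient when the loss is a mean over samples.

# Toy check: minibatch-averaged gradients match the full-batch gradient
# when the loss averages over samples and the minibatches are equal-sized.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
x, y = torch.randn(32, 4), torch.randn(32, 2)

model.zero_grad()
nn.functional.mse_loss(model(x), y).backward()         # full-batch gradient
full_grad = [p.grad.clone() for p in model.parameters()]

model.zero_grad()
for xb, yb in zip(x.chunk(4), y.chunk(4)):             # 4 equal-sized minibatches
    (nn.functional.mse_loss(model(xb), yb) / 4).backward()
mini_grad = [p.grad.clone() for p in model.parameters()]

assert all(torch.allclose(f, m, atol=1e-6) for f, m in zip(full_grad, mini_grad))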

The plots below show the difference between method A (gray), method B (red), and method C (green) on A2C. Method A does not learn at all, method B learns and is stable, and method C learns (sort of) but is unstable (we do not know exactly why).

[Plot: learning curves comparing method A (gray), method B (red), and method C (green) on A2C]

These update methods should be revisited when we revamp the policy/model APIs, but until then, here is the short-term solution to fix A2C's release tests.

For the case where we fall back to the default training_step() (i.e. microbatch_size is None), we need to do the following:

  • By default, we will set train_batch_size to rollout_fragment_length x num_workers x num_envs_per_worker, which is the size of the collected sample batch. We will also error out if sgd_minibatch_size < train_batch_size. This set of parameters makes sure method A reduces to method B (a rough sketch of this sizing logic follows the list).
  • If train_batch_size is set via the config, it should be larger than rollout_fragment_length x num_workers x num_envs_per_worker. This ensures that we do at most 2 gradient updates per iteration, which is not that unstable for A2C.
  • This case supports multi-GPU because it calls multi_gpu_train_one_step().
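
As a rough, plain-Python sketch of the sizing rules above (the helper name and config keys are hypothetical and do not mirror RLlib's actual validation code):

# Hypothetical sketch of the default-path sizing rules (not RLlib's actual code).
def validate_default_path(config: dict) -> dict:
    sample_batch_size = (
        config["rollout_fragment_length"]
        * config["num_workers"]
        * config["num_envs_per_worker"]
    )
    # Default train_batch_size to exactly one collected sample batch.
    config.setdefault("train_batch_size", sample_batch_size)
    # A user-provided train_batch_size should cover at least one sample batch.
    if config["train_batch_size"] < sample_batch_size:
        raise ValueError(
            "train_batch_size should be >= rollout_fragment_length "
            "* num_workers * num_envs_per_worker"
        )
    # Splitting the train batch into smaller SGD minibatches would turn
    # method B back into method A, so error out in that case.
    if config.get("sgd_minibatch_size", config["train_batch_size"]) < config["train_batch_size"]:
        raise ValueError("sgd_minibatch_size must be >= train_batch_size")
    return config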

For the case where we want to do microbatching (i.e. microbatch_size is not None):

  • If num_gpus > 1, we should raise an error. This path directly uses compute_gradients(), which does not support multi-GPU on its own (a sketch of this path follows below).
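
For completeness, here is one plausible reading of the microbatch path as a hedged sketch: compute_gradients/apply_gradients and the averaging helper follow the pseudocode above rather than exact RLlib signatures, and the accumulate-until-train_batch_size loop is my interpretation, not a verbatim copy of the implementation.

# Hedged sketch of the microbatch path (pseudocode-style names, not exact RLlib APIs).
def microbatch_training_step(config, sample_env, compute_gradients, apply_gradients, average_grads):
    if config["num_gpus"] > 1:
        # compute_gradients() is used directly here and has no multi-GPU support.
        raise ValueError("microbatch_size cannot be combined with num_gpus > 1")
    collected, grads = 0, []
    # Accumulate gradients from microbatches until one train batch worth of
    # samples has been covered, then apply the averaged gradient once.
    while collected < config["train_batch_size"]:
        micro_batch = sample_env(config["microbatch_size"])
        grads.append(compute_gradients(micro_batch))
        collected += config["microbatch_size"]
    apply_gradients(average_grads(grads))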

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
richardliaw merged commit bda5026 into ray-project:master on Aug 2, 2022
richardliaw (Contributor) commented:

ALL GREEN

Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022