[RLlib] Fix A2C release tests #27314
Merged: richardliaw merged 12 commits into ray-project:master from kouroshHakha:fix-a2c-release-tests on Aug 2, 2022
Conversation
This reverts commit 74686a8.
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
gjoliver approved these changes on Aug 1, 2022
ALL GREEN
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request on Aug 18, 2022
Signed-off-by: Stefan van der Kleij <[email protected]>
Why are these changes needed?
A2C's current implementation implies that if microbatch_size is not specified, we fall back to per-minibatch updates inside the single training_step() call (via multi_gpu_train_one_step()); the fallback loop is sketched below. This is problematic for A2C, and here is the reason: A2C is an on-policy algorithm, which means that the moment you update the policy's network (even for a single gradient update) you have to re-sample the environment and do the next gradient update with the new samples. The fallback updates the network on every minibatch iteration, which can be the reason why A2C does not learn the game of Breakout in our release tests.
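The original pseudocode block did not survive the page; this is a minimal sketch of the fallback behavior (method A), using illustrative helper names (split, compute_gradients, apply_gradients) rather than the exact RLlib calls:

```python
# Method A (current fallback, sketched): weights are updated once per minibatch.
for _ in range(num_sgd_iter):
    for minibatch in split(train_batch, sgd_minibatch_size):
        grad = compute_gradients(minibatch)   # gradient of the loss on this minibatch only
        apply_gradients(grad)                 # network weights change immediately
```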
Ideally, the network should be updated according to either of the following rules in each iteration (sketched below):
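Again, the original pseudocode blocks are missing here; a sketch of the two rules, with the same illustrative helper names as above:

```python
# Method B (sketched): one gradient computed over the entire train batch,
# one weight update per iteration.
grad = compute_gradients(train_batch)
apply_gradients(grad)

# Method C (micro-batching, sketched): gradients are accumulated over microbatches
# and applied once, so the weights only change after the whole batch is consumed.
accumulated = zeros_like_gradients()
for microbatch in split(train_batch, microbatch_size):
    accumulated += compute_gradients(microbatch)
apply_gradients(accumulated)
```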
Some algorithms, like PPO, have a loss function that hedges the policy against updates which diverge too much from the old policy (e.g. via some sort of KL-divergence penalty). For this kind of algorithm, it is better to take multiple gradient updates per collected sample batch. Therefore, the following pseudocode provides the most generic and correct way of updating the networks, and works for both A2C and PPO given the correct hyper-parameters.
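The original block is also missing from the page; a sketch of that generic rule, with the same illustrative helpers as above:

```python
# Generic rule (sketched): reduces to method B/C for A2C (num_sgd_iter = 1) and
# to PPO-style multiple updates per train batch when num_sgd_iter > 1.
for _ in range(num_sgd_iter):                              # how many weight updates per train batch
    accumulated = zeros_like_gradients()
    for minibatch in split(train_batch, minibatch_size):   # keep each piece within GPU memory
        accumulated += compute_gradients(minibatch)
    apply_gradients(accumulated)                           # weights change once per outer iteration
```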
The first loop sets how many times we want to update the network parameters; the second loop computes the gradients on the entire train batch in mini-batch steps, so that they all fit in GPU memory (in case we hit a GPU memory bottleneck). Note that, as long as compute resources allow, different values of minibatch_size should not change the gradient updates, i.e. the gradient computed in minibatch mode should be identical to the gradient computed on the whole dataset.
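As a sanity check of that claim (illustrative only, not RLlib code): with a sum-reduced loss, the per-minibatch gradients add up to exactly the full-batch gradient, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(32, 4)), rng.normal(size=32), rng.normal(size=4)

def grad(Xb, yb, w):
    # Gradient of the sum of squared errors with respect to w.
    return 2.0 * Xb.T @ (Xb @ w - yb)

full = grad(X, y, w)
accumulated = sum(grad(X[i:i + 8], y[i:i + 8], w) for i in range(0, 32, 8))
assert np.allclose(full, accumulated)
```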
The plots below show the difference between method A (gray), method B (red), and method C (green) on A2C: method A does not learn at all, method B learns and is stable, and method C learns (sort of) but is unstable (we don't know exactly why).
These update methods should be considered when we revamp the policy / model APIs, but until then, here is the short-term solution to resolve A2C's release tests:
For the case where we fall back to the default training_step() (i.e. microbatch_size is None), we need to rule out configurations where sgd_minibatch_size < train_batch_size; this set of parameters makes sure method A reduces to method B inside multi_gpu_train_one_step().
For the case where we want to do microbatching (i.e. microbatch_size is not None) with num_gpus > 1, we should raise an error: this path directly uses compute_gradients(), which does not support multi-GPU on its own. A sketch of both guards follows below.
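A minimal sketch of what these two config guards could look like; the function name, dict-style config access, and error messages are illustrative assumptions, not the exact code in this PR:

```python
# Sketch of the two guards described above (assumed shapes, not the exact PR code).
def validate_a2c_config(config: dict) -> None:
    if config.get("microbatch_size") is None:
        # Default path: a single whole-batch update (method B), so disallow settings
        # that would re-introduce per-minibatch weight updates (method A).
        sgd_mb = config.get("sgd_minibatch_size")
        if sgd_mb is not None and sgd_mb < config["train_batch_size"]:
            raise ValueError(
                "Without microbatch_size, A2C must compute its update on the whole "
                "train batch (sgd_minibatch_size < train_batch_size is not allowed)."
            )
    else:
        # Micro-batching path calls compute_gradients() directly, which is single-GPU only.
        if config.get("num_gpus", 0) > 1:
            raise ValueError("A2C micro-batching does not support num_gpus > 1.")
```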
Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e. git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.