-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Local variable 'values' not updated in the callback for the last timestep #864
Comments
Hey! Thanks for raising this. Next time remember to include minimal code to reproduce things ;) I agree with the fix, although I would not change the variable names, in case somebody already used these variables in the past (also the |
Hello,
I'm curious to know more about your use-case. Callback may not be the appropriate way. Otherwise, yes calling |
@Miffyli @araffin Thanks for getting back. Just to be clear. We can just introduce a new variable Regarding my Use-case: Once the rollout is completed, I have to re-adjust the rewards in the rollout buffer and re-compute the advantages and returns (for this, I need |
that's what i thought, you should fork SB3 for your usecase, otherwise you would compute the advantage twice and will have hacks with side effects here and there if you use callbacks. |
True, I agree using callbacks is kinda hacky. However, I think that is the only option for me if I do not want to fork SB3 because I work on custom envs that are compatible with Sb3 out of the box (so can plug-in different algorithms easily). Also, there is a cyclic dependency between env (reward ) and policy since policy is required to adjust the rewards. Since I use SubProcEnv for parallel rollouts, it is not efficient to send the policy to env (reward function) frequently due to communication overhead. Therefore, callbacks is the perfect place where I have access to policy, env and can adjust the rewards, advantages etc. So while working on this, I noticed that the |
If you use callback for accessing both model and env, this is the same as forking, but more hacky...
You could do that in the
Please do =) (should be a one line change, please don't forget the changelog) |
🐛 Bug
In the rollout collection of PPO, the callback references of local variable
values
is not updated after the value for the last timestep has been predicted.stable-baselines3/stable_baselines3/common/on_policy_algorithm.py
Line 210 in ed308a7
Due to this, at the end of the rollout when using
callback.on_rollout_end()
,locals["values"]
may not match with the actual final values of the local variablevalues
This might lead to undesirable effects for users relying on the last values in the callback.
One simple fix is to update the locals after the
self.policy.predict_values(..)
by simply callingcallback.update_locals(locals())
again as shown below. Further, it is also a better not to replacevalues
and rather use a new variablelast_values
ornext_values
to distinguish that it refers to predicted values corresponding to new_obs where asvalues
previously correspond toobs
(just to keep in sync with other variables rewards, actions, infos etc)Expected behavior
Local variables inside the callback function are in sync with the actual local variables.
System Info
OS: Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-debian-buster-sid #1 SMP Wed Oct 13 17:20:51 UTC 2021
Python: 3.7.11
Stable-Baselines3: 1.4.0
PyTorch: 1.10.2+cu102
GPU Enabled: True
Numpy: 1.21.5
Gym: 0.19.0
Checklist
The text was updated successfully, but these errors were encountered: