[Bug] Local variable 'values' not updated in the callback for the last timestep #864

rajcscw · 2022-04-15T00:28:32Z

🐛 Bug

In the rollout collection of PPO, the callback references of local variable values is not updated after the value for the last timestep has been predicted.

stable-baselines3/stable_baselines3/common/on_policy_algorithm.py

Line 210 in ed308a7

values = self.policy.predict_values(obs_as_tensor(new_obs, self.device))

Due to this, at the end of the rollout when using callback.on_rollout_end(), locals["values"] may not match with the actual final values of the local variable values

This might lead to undesirable effects for users relying on the last values in the callback.

One simple fix is to update the locals after the self.policy.predict_values(..) by simply calling callback.update_locals(locals()) again as shown below. Further, it is also a better not to replace values and rather use a new variable last_values or next_values to distinguish that it refers to predicted values corresponding to new_obs where as values previously correspond to obs (just to keep in sync with other variables rewards, actions, infos etc)

 with th.no_grad():
    # Compute value for the last timestep
    next_values = self.policy.predict_values(
        obs_as_tensor(new_obs, self.device))

   # Update callback's locals
   callback.update_locals(locals())

rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)

callback.on_rollout_end()

Expected behavior

Local variables inside the callback function are in sync with the actual local variables.

System Info

OS: Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-debian-buster-sid #1 SMP Wed Oct 13 17:20:51 UTC 2021
Python: 3.7.11
Stable-Baselines3: 1.4.0
PyTorch: 1.10.2+cu102
GPU Enabled: True
Numpy: 1.21.5
Gym: 0.19.0

Checklist

I have checked that there is no similar issue in the repo (required)
I have read the documentation (required)
I have provided a minimal working example to reproduce the bug (required)

The text was updated successfully, but these errors were encountered:

Miffyli · 2022-04-15T09:50:27Z

Hey! Thanks for raising this. Next time remember to include minimal code to reproduce things ;)

I agree with the fix, although I would not change the variable names, in case somebody already used these variables in the past (also the last_values=values should be updated). @araffin any callback specific comments on this?

araffin · 2022-04-15T09:57:04Z

Hello,

This might lead to undesirable effects for users relying on the last values in the callback.

I'm curious to know more about your use-case. Callback may not be the appropriate way.

Otherwise, yes calling callback.update_locals(locals()) should solve the issue.

rajcscw · 2022-04-15T11:21:05Z

@Miffyli @araffin Thanks for getting back. Just to be clear. We can just introduce a new variable next_values for the line self.policy.predict_values(obs_as_tensor(new_obs, self.device)) and then update the locals in the callback. So, therefore values will not be changed and users who used this in the past are not affected.

Regarding my Use-case: Once the rollout is completed, I have to re-adjust the rewards in the rollout buffer and re-compute the advantages and returns (for this, I need last_dones and last_values which I can grab from callback.locals). Only way I can do this without modifying code on SB3 is to include this logic inside callback._on_rollout_end() right or is there another way?

araffin · 2022-04-15T12:16:56Z

that's what i thought, you should fork SB3 for your usecase, otherwise you would compute the advantage twice and will have hacks with side effects here and there if you use callbacks.
Or if you just need to adjust rewards, you can use a gym wrapper.

rajcscw · 2022-04-15T16:50:02Z

True, I agree using callbacks is kinda hacky. However, I think that is the only option for me if I do not want to fork SB3 because I work on custom envs that are compatible with Sb3 out of the box (so can plug-in different algorithms easily). Also, there is a cyclic dependency between env (reward ) and policy since policy is required to adjust the rewards. Since I use SubProcEnv for parallel rollouts, it is not efficient to send the policy to env (reward function) frequently due to communication overhead.

Therefore, callbacks is the perfect place where I have access to policy, env and can adjust the rewards, advantages etc.

So while working on this, I noticed that the values were out of sync (the problem I mentioned above). It would be sufficient just to callback.update_locals(locals()) before callback.on_rollout_end(). What do you think @Miffyli @araffin ? I can open a PR for this.

araffin · 2022-04-17T14:09:27Z

If you use callback for accessing both model and env, this is the same as forking, but more hacky...

Since I use SubProcEnv for parallel rollouts, it is not efficient to send the policy to env (reward function) frequently due to communication overhead.

You could do that in the collect_rollout() then.

hat do you think @Miffyli @araffin ? I can open a PR for this.

Please do =) (should be a one line change, please don't forget the changelog)

rajcscw added the bug Something isn't working label Apr 15, 2022

PatrickHelm mentioned this issue Aug 30, 2023

Fixes update_locals() in collect_rollouts() of OnPolicyAlgorithm #1660

Merged

16 tasks

araffin closed this as completed in #1660 Aug 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Local variable 'values' not updated in the callback for the last timestep #864

[Bug] Local variable 'values' not updated in the callback for the last timestep #864

rajcscw commented Apr 15, 2022

Miffyli commented Apr 15, 2022

araffin commented Apr 15, 2022

rajcscw commented Apr 15, 2022

araffin commented Apr 15, 2022

rajcscw commented Apr 15, 2022

araffin commented Apr 17, 2022

[Bug] Local variable 'values' not updated in the callback for the last timestep #864

[Bug] Local variable 'values' not updated in the callback for the last timestep #864

Comments

rajcscw commented Apr 15, 2022

🐛 Bug

Expected behavior

System Info

Checklist

Miffyli commented Apr 15, 2022

araffin commented Apr 15, 2022

rajcscw commented Apr 15, 2022

araffin commented Apr 15, 2022

rajcscw commented Apr 15, 2022

araffin commented Apr 17, 2022