
[bug] on-policy rollout collects current "dones" instead of last "dones" #105

Closed
AndyShih12 opened this issue Jul 15, 2020 · 12 comments · Fixed by #110
Labels: bug (Something isn't working), documentation (Improvements or additions to documentation)

Comments

@AndyShih12
Contributor

Took me a really long time to debug this, so hopefully this helps others out.

Describe the bug
The on-policy rollout collects last_obs, the current reward, and the current dones. See here

In stable-baselines, the rollout collects last_obs, current reward, and last dones. See here

This messes up the returns and advantage calculations.

I fixed this locally, and PPO improved dramatically on my custom environment (red is before fix, green is after fix).
[Image (Imgur): PPO training curves on the custom environment, red = before the fix, green = after the fix]
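
To make the effect on the advantage computation concrete, here is a minimal, self-contained sketch (illustration only, not SB3 code): it mimics the structure of the GAE loop in compute_returns_and_advantage, with all value estimates set to zero and gae_lambda = 1 so the numbers stay readable, and compares the two ways of storing dones for three 3-step episodes whose only reward arrives at the terminal step. The gae helper and the toy numbers are made up for this example.

"""Sketch: effect of the dones pairing on GAE (illustration only, not SB3 code)."""
import numpy as np

gamma, gae_lambda = 0.99, 1.0
rewards    = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1], dtype=np.float32)  # reward only at the terminal step
step_dones = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1], dtype=np.float32)  # done flag of each step

def gae(stored_dones, last_done):
    # stored_dones[t] is whatever the rollout buffer saved at index t;
    # value estimates are assumed to be zero, so delta reduces to the reward.
    n = len(rewards)
    advantages = np.zeros(n, dtype=np.float32)
    last_gae_lam = 0.0
    for step in reversed(range(n)):
        next_non_terminal = 1.0 - (last_done if step == n - 1 else stored_dones[step + 1])
        delta = rewards[step]
        last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam
        advantages[step] = last_gae_lam
    return advantages

# Buggy pairing (current dones saved with the current step): the terminal reward
# never propagates back, every non-terminal step gets zero advantage.
print(gae(step_dones, last_done=step_dones[-1]))
# -> [0. 0. 1. 0. 0. 1. 0. 0. 1.]

# Fixed pairing (previous step's dones saved, as in stable-baselines): the reward
# is discounted back through each episode, as intended.
print(gae(np.concatenate(([0.0], step_dones[:-1])), last_done=step_dones[-1]))
# -> roughly [0.98 0.99 1. 0.98 0.99 1. 0.98 0.99 1.]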

@AndyShih12 AndyShih12 changed the title [bug] on-policy rollouts collects current "dones" instead of last "dones" [bug] on-policy rollout collects current "dones" instead of last "dones" Jul 15, 2020
@Miffyli Miffyli added the bug Something isn't working label Jul 15, 2020
@Miffyli
Collaborator

Miffyli commented Jul 15, 2020

Thank you a ton for this! I will implement and test this out on Atari envs, which have also reached lower performance compared to the stable-baselines results. My apologies that you had to debug such a nasty mistake!

I will double-check that the sb3 code handles dones as in sb. If I include this in an update, would it be ok if I include your name in the contributors? Debugging this error counts as a good contribution, in my eyes :)

@AndyShih12
Contributor Author

Sure, that would be appreciated. Thanks for the quick response, hope it helps on the Atari envs too!

@Miffyli
Collaborator

Miffyli commented Jul 16, 2020

@AndyShih12

Could you confirm that this is the type of fix you implemented? It initializes and stores the dones from the previous iteration. I ran this on some Atari games and it did not improve the results, but it is unclear whether this fix should help there (I was hoping it would ^^'), and I want to make double-sure I patched it the way you did before moving on.

--- a/stable_baselines3/common/base_class.py
+++ b/stable_baselines3/common/base_class.py
@@ -117,6 +117,7 @@ class BaseAlgorithm(ABC):
         self.tensorboard_log = tensorboard_log
         self.lr_schedule = None  # type: Optional[Callable]
         self._last_obs = None  # type: Optional[np.ndarray]
+        self._last_dones = None  # type: Optional[np.ndarray]
         # When using VecNormalize:
         self._last_original_obs = None  # type: Optional[np.ndarray]
         self._episode_num = 0
@@ -444,6 +445,7 @@ class BaseAlgorithm(ABC):
         # Avoid resetting the environment when calling ``.learn()`` consecutive times
         if reset_num_timesteps or self._last_obs is None:
             self._last_obs = self.env.reset()
+            self._last_dones = np.zeros((self._last_obs.shape[0],), dtype=np.bool)
             # Retrieve unnormalized observation for saving into the buffer
             if self._vec_normalize_env is not None:
                 self._last_original_obs = self._vec_normalize_env.get_original_obs()
diff --git a/stable_baselines3/common/on_policy_algorithm.py b/stable_baselines3/common/on_policy_algorithm.py
index 2937b77..52d8573 100644
--- a/stable_baselines3/common/on_policy_algorithm.py
+++ b/stable_baselines3/common/on_policy_algorithm.py
@@ -153,8 +153,9 @@ class OnPolicyAlgorithm(BaseAlgorithm):
             if isinstance(self.action_space, gym.spaces.Discrete):
                 # Reshape in case of discrete action
                 actions = actions.reshape(-1, 1)
-            rollout_buffer.add(self._last_obs, actions, rewards, dones, values, log_probs)
+            rollout_buffer.add(self._last_obs, actions, rewards, self._last_dones, values, log_probs)
             self._last_obs = new_obs
+            self._last_dones = dones

         rollout_buffer.compute_returns_and_advantage(values, dones=dones)

@araffin
Member

araffin commented Jul 16, 2020

@AndyShih12 thank you for pointing out that issue.

I fixed this locally, and PPO improved dramatically on my custom environment

Could you tell us more about your custom env and the hyperparameters you used for PPO?
I would expect an environment with either sparse rewards or short episodes (~100 timesteps), as the effect should be hard to see on long episodes.

Could you confirm that this is type of fix you implemented?

@Miffyli I was about to ask you to push a draft PR. I think we can fix the GAE computation directly without having to add an extra variable, no?
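
For reference, one possible reading of that suggestion (a sketch only, not the patch that was merged): if the buffer kept storing the done returned by env.step() at the same index as the step that produced it, the GAE loop could use it directly instead of looking at step + 1, and no extra _last_dones variable would be needed. Attribute names follow the existing RolloutBuffer loosely; last_values is assumed to already be a flat numpy array here.

"""Sketch of fixing the convention inside GAE instead (not the merged fix)."""
def compute_returns_and_advantage(self, last_values, dones):
    # self.dones[step] is assumed to hold the done of that same step, so it already
    # answers "was there an episode reset right after this step?". The trailing
    # `dones` argument then becomes redundant (it equals self.dones[-1]).
    last_gae_lam = 0
    for step in reversed(range(self.buffer_size)):
        next_non_terminal = 1.0 - self.dones[step]
        next_values = last_values if step == self.buffer_size - 1 else self.values[step + 1]
        delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step]
        last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam
        self.advantages[step] = last_gae_lam
    self.returns = self.advantages + self.values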

@Miffyli
Collaborator

Miffyli commented Jul 16, 2020

I concur that this effect is more pronounced in short episodes or reward-at-terminal games, but I tested it on Atari just to make sure.

I was about to ask you to push a draft PR. I think we can fix the GAE computation directly without having to add an extra variable, no?

Ok, I will begin a draft on matching PPO/A2C discrete-policy performance with stable-baselines (on Atari games). We can fix it that way in the GAE computation as well, but for that I want to make sure no other code expects the sb2-style dones.

@AndyShih12
Contributor Author

@Miffyli Yes, your patch is the same as what I did locally.

@araffin Indeed, my environment has sparse rewards and (very) short episodes. The reward only came at the last timestep of each episode, so the previous code didn't receive any signal at all.

Sorry to hear that it didn't close the gap for Atari games.

@ryankirkpatrick97

@araffin and @Miffyli
Could either of you describe to me the reasoning for wanting to use last_dones instead of current dones, besides having it match the original stable-baselines code?

To me it seems more like an issue with compute_returns_and_advantages (especially because the final dones value is passed into that method). Couldn't next_non_terminal be calculated as
next_non_terminal = 1.0 - self.dones[step] for all steps?

It also seems to differ from how sample collection for off-policy agents works, seen here

There's also the case mentioned in this issue, where the dones array after completion of the batch is missing the final done value and instead has a default False value set for the first item.

I have created a test Python file, zipped here, that highlights my concern. (Be wary of the ZeroDivisionError lol, just rerun a couple of times.)

The environment works as follows. The max episode length is 3, and observations are scalar and represent the current timestep in the environment. A done=True is returned when the agent tries to step from timestep 3 to 4. If a batch were collected for 3 full episodes, the final observation array should be as follows:
Obs: [1, 2, 3, 1, 2, 3, 1, 2, 3]

There is, however, a discrepancy between how the actual dones look and how I expect them to look.
Actual dones from PPO: [False, False, False, True, False, False, True, False, False]
Expected dones & how DQN looks: [False, False, True, False, False, True, False, False, True]
Again, the issue is that the actual dones array has an extra False value prepended and is missing the final done value (the sketch below reproduces this).

TLDR:

  • Revert to storing the current dones in the dones array.
  • compute_returns_and_advantages should calculate next_non_terminal = 1.0 - self.dones[step] for all steps
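
A quick, library-free way to see the two orderings side by side (this only emulates the storage conventions on the trajectory described above; step_env and the variable names are made up for the example):

"""Emulating the two dones conventions on the toy 3-step environment."""
def step_env(t):
    # obs is the current timestep; the episode ends when stepping from 3 to 4,
    # and a VecEnv-style auto-reset sends the agent back to obs 1
    done = (t == 3)
    return (1 if done else t + 1), done

obs, last_done = 1, False
observations, replay_dones, rollout_dones = [], [], []
for _ in range(9):                   # 9 transitions = 3 full episodes
    observations.append(obs)
    next_obs, done = step_env(obs)
    replay_dones.append(done)        # off-policy buffer: done stored with the transition it ends
    rollout_dones.append(last_done)  # on-policy rollout buffer (current code): previous step's done stored
    obs, last_done = next_obs, done

print(observations)   # [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(replay_dones)   # [False, False, True, False, False, True, False, False, True]
print(rollout_dones)  # [False, False, False, True, False, False, True, False, False]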

@Miffyli
Collaborator

Miffyli commented Mar 27, 2021

@ryankirkpatrick97 Hey! The original motivation was to copy stable-baselines (just to get things working the same way), yes. I think that, in turn, stems from OpenAI's baselines.

The discrepancy stems from the slightly different ideologies of off- and on-policy buffers:

  • Replay buffer stores experiences (s, s', a, r, d), where d accompanies s' (note that we cannot save an experience for the first step).
  • Rollout buffer stores experiences (s, a, r, d), where d accompanies s. With the example environment you tell the algorithm to collect only 9 experiences, and it does so.

Semantically, dones in both buffers now have the same meaning: they tell whether the accompanying state was the first of a new episode*. This also matches the logic of VecEnvs. Technically this means the first item of dones in rollout buffers should be True, not False (in practice this is never used).

As for doing such modifications: we might have to leave things as is, because modifying this might break others' code very silently (e.g. they read dones from the buffer and expect them to be stored in the current format, but suddenly it changes).

PS: I was writing this at 3am so I might be talking out of my butt here. Feel free to correct and bash me for that :').

(*) With VecEnvs, done=True does not mean the accompanying state was terminal; rather, it means there was an episode reset between the previous and the accompanying state (the terminal state is not included).

@ryankirkpatrick97

ryankirkpatrick97 commented Mar 27, 2021

@Miffyli thanks for the quick and detailed response! But hahah, hopefully you get a good night's rest!

Apologies, but I seem to have some misconceptions about the ideologies behind off/on-policy buffers. I would appreciate it if you could help clear them up.

I have read through the VecEnv code enough to understand what you mean about never returning a terminal state, and instead resetting when done is True.

However, I don't understand what you mean when you say that for replay buffers you store d with the accompanying s' and that you cannot save an experience for the first step. When a replay buffer stores (s, s', a, r, d), I read that as experience being stored for s, with the d stored in that buffer position corresponding to the transition from s to s'. Why wouldn't that work for the experience for the first step? I have also written pseudo-code for the Replay Buffer, for quick reference:

"""Replay Buffer Pseudo-Code"""
s = env.reset()
while not d:
    a, values, log_probs = policy(s)
    s', r, d, info = env.step(a)

    replay_buffer.add(s, s', a, r, d, values, log_probs)
    s = s'

When a rollout buffer stores (s, a, r, d), I assume that the d represents the same type of transition, but to an s' that isn't stored, since it would be unused.

"""Rollout Buffer Pseudo-Code in my expectation"""
s = env.reset()
while rollout_incomplete:
    a, values, log_probs = policy(s)
    s', r, d, info = env.step(a)

    rollout_buffer.add(s, a, r, d, values, log_probs)
    s = s'
_, last_values, _ = policy(s)
rollout_buffer.compute_returns_and_advantages(last_values)

If you look at my code example, the same 9 observations are traversed for both DQN and PPO, but the resulting dones arrays are different. From what you explained earlier, I understand that the values are shifted because, in the current codebase, they are stored with the accompanying next_state returned from env.step. However, the fact that the dones arrays are different implies to me that the semantic meanings of the arrays are different.

I believe the semantic meaning should switch to "d comes from s and a and represents the last non-terminal state within a buffer", and that d should then be stored alongside the corresponding s and a. This would be the same as how rewards are stored. Why should a done value corresponding to reaching a terminal state not be stored with the reward value for reaching that terminal state? This is a little sassy, but with a name like "dones", I expect this over what you described.

I feel that in the long run, the discrepancy between the done arrays will become a greater source of error and confusion than if the current code base changes and subtly breaks others' code.

Again, I could totally have a misconception here, and I apologize, I wrote this in a sassier and more opinionated way than necessary haha. Hopefully I explained my thoughts well. Any direction and clarification is much appreciated, and I'm looking forward to your response and future discussion.

For completeness, here is pseudo-code of the current implementation for Rollout Buffers:

"""Current Rollout Buffer Pseudo-Code"""
d- = False  # or to your comment, correctly True
s  = env.reset()
while rollout_incomplete:
    a, values, log_probs = policy(s)
    s', r, d, info = env.step(a)

    rollout_buffer.add(s, a, r, d-, values, log_probs)
    s  = s'
    d- = d
_, last_values, _ = policy(s)
rollout_buffer.compute_returns_and_advantages(last_values, d)

@Miffyli
Collaborator

Miffyli commented Mar 27, 2021

Gah. I needed way too much coffee for this one ^^.
Right, I think I was talking from my sleepy head here. Let me summarize to make sure I understood the point.

  • Replay buffer: The d in experience (s, s', a, r, d) tells the following: "did the agent land in a terminal state after state s?"
  • Rollout buffer: The d in experience (s, a, r, d) tells the following: "was state s the first step of a new episode?"

You are right, semantically they mean different things. This is because the ideologies are a bit different in the on-policy (e.g. A2C, PPO) and off-policy (e.g. SAC, DQN) worlds.

Sounds like one clarifying thing would be to rename the dones array in ReplayBuffer to next_dones? I agree confusions like these will hurt in the future, but we are rather adamant about backwards compatibility and avoiding modifications that could silently kill somebody's code (one of our pet peeves with other libraries, coughgymcough).

I am sorry if I skipped some parts of your argument. I spent wayyyy too long pondering on this again, so do point me to arguments I might have skipped (I have to dash now). These things are kind of fun to ponder, so I will give it another try tomorrow :')

Edit: If you feel like discussing these topics more (and the semantic meanings of data stored in different libraries), the RL Discord has a bunch of people who I bet would be up for such a discussion!

@araffin
Member

araffin commented Mar 27, 2021

What about renaming dones in the on-policy case to episode_starts (or first_steps)? (That way we have a different name but keep the code as-is, as we have already spent enough time fixing bugs due to this piece of code.)

EDIT: @Miffyli I would need to check the code more carefully; this idea came after reading your comment.
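
As a rough illustration of the rename idea (a sketch of the proposal only, not a merged API; the exact names here are hypothetical):

"""Sketch: same data and logic, clearer name (proposal only)."""
# in OnPolicyAlgorithm.collect_rollouts():
rollout_buffer.add(self._last_obs, actions, rewards,
                   self._last_episode_starts,   # was: self._last_dones
                   values, log_probs)
self._last_obs = new_obs
self._last_episode_starts = dones  # a new episode starts right after a done

# in RolloutBuffer.compute_returns_and_advantage(), the lookup would read:
# next_non_terminal = 1.0 - self.episode_starts[step + 1]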

@ryankirkpatrick97

@Miffyli, thanks for taking the time to provide clarification!

I think I have personally spent more time reading code than looking through the literature, and that's where the confusion came from.

Sounds like that simple rename might be the easiest way to go about altering things. Going to leave the big decisions up to you guys 😉

@Miffyli Miffyli added the documentation Improvements or additions to documentation label Mar 29, 2021