
[bug] on-policy rollout collects current "dones" instead of last "dones" #105

Closed
AndyShih12 opened this issue Jul 15, 2020 · 12 comments · Fixed by #110
Labels: bug (Something isn't working), documentation (Improvements or additions to documentation)

Comments

@AndyShih12
Contributor

Took me a really long time to debug this, so hopefully this helps others out.

Describe the bug
The on-policy rollout collects last_obs, the current reward, and the current dones. See here

In stable-baselines, the rollout collects last_obs, current reward, and last dones. See here

This messes up the returns and advantage calculations.

I fixed this locally, and PPO improved dramatically on my custom environment (red is before fix, green is after fix).
[Image (Imgur): PPO training curves on the custom environment, red = before the fix, green = after the fix]
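
To make the effect on the advantage computation concrete, here is a minimal, self-contained sketch (illustration only, not SB3 code): it mimics the structure of the GAE loop in compute_returns_and_advantage, with all value estimates set to zero and gae_lambda = 1 so the numbers stay readable, and compares the two ways of storing dones for three 3-step episodes whose only reward arrives at the terminal step. The gae helper and the toy numbers are made up for this example.

"""Sketch: effect of the dones pairing on GAE (illustration only, not SB3 code)."""
import numpy as np

gamma, gae_lambda = 0.99, 1.0
rewards    = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1], dtype=np.float32)  # reward only at the terminal step
step_dones = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1], dtype=np.float32)  # done flag of each step

def gae(stored_dones, last_done):
    # stored_dones[t] is whatever the rollout buffer saved at index t;
    # value estimates are assumed to be zero, so delta reduces to the reward.
    n = len(rewards)
    advantages = np.zeros(n, dtype=np.float32)
    last_gae_lam = 0.0
    for step in reversed(range(n)):
        next_non_terminal = 1.0 - (last_done if step == n - 1 else stored_dones[step + 1])
        delta = rewards[step]
        last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam
        advantages[step] = last_gae_lam
    return advantages

# Buggy pairing (current dones saved with the current step): the terminal reward
# never propagates back, every non-terminal step gets zero advantage.
print(gae(step_dones, last_done=step_dones[-1]))
# -> [0. 0. 1. 0. 0. 1. 0. 0. 1.]

# Fixed pairing (previous step's dones saved, as in stable-baselines): the reward
# is discounted back through each episode, as intended.
print(gae(np.concatenate(([0.0], step_dones[:-1])), last_done=step_dones[-1]))
# -> roughly [0.98 0.99 1. 0.98 0.99 1. 0.98 0.99 1.]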

@AndyShih12 AndyShih12 changed the title [bug] on-policy rollouts collects current "dones" instead of last "dones" [bug] on-policy rollout collects current "dones" instead of last "dones" Jul 15, 2020
@Miffyli Miffyli added the bug Something isn't working label Jul 15, 2020
@Miffyli
Collaborator

Miffyli commented Jul 15, 2020

Thank you a ton for this! I will implement and test this out on Atari envs, which have also reached lower performance compared to the stable-baselines results. My apologies that you had to debug such a nasty mistake!

I will double-check that the sb3 code handles dones as in sb. If I include this in an update, would it be ok if I include your name in the contributors? Debugging this error counts as a good contribution, in my eyes :)

@AndyShih12
Contributor Author

Sure, that would be appreciated. Thanks for the quick response, hope it helps on the Atari envs too!

@Miffyli
Collaborator

Miffyli commented Jul 16, 2020

@AndyShih12

Could you confirm that this is the type of fix you implemented? It initializes and stores the dones from the previous iteration. I ran this on some Atari games and it did not improve the results, but it is unclear whether this fix should help there (I was hoping it would ^^'), and I want to make double-sure I patched it the way you did before moving on.

--- a/stable_baselines3/common/base_class.py
+++ b/stable_baselines3/common/base_class.py
@@ -117,6 +117,7 @@ class BaseAlgorithm(ABC):
         self.tensorboard_log = tensorboard_log
         self.lr_schedule = None  # type: Optional[Callable]
         self._last_obs = None  # type: Optional[np.ndarray]
+        self._last_dones = None  # type: Optional[np.ndarray]
         # When using VecNormalize:
         self._last_original_obs = None  # type: Optional[np.ndarray]
         self._episode_num = 0
@@ -444,6 +445,7 @@ class BaseAlgorithm(ABC):
         # Avoid resetting the environment when calling ``.learn()`` consecutive times
         if reset_num_timesteps or self._last_obs is None:
             self._last_obs = self.env.reset()
+            self._last_dones = np.zeros((self._last_obs.shape[0],), dtype=np.bool)
             # Retrieve unnormalized observation for saving into the buffer
             if self._vec_normalize_env is not None:
                 self._last_original_obs = self._vec_normalize_env.get_original_obs()
diff --git a/stable_baselines3/common/on_policy_algorithm.py b/stable_baselines3/common/on_policy_algorithm.py
index 2937b77..52d8573 100644
--- a/stable_baselines3/common/on_policy_algorithm.py
+++ b/stable_baselines3/common/on_policy_algorithm.py
@@ -153,8 +153,9 @@ class OnPolicyAlgorithm(BaseAlgorithm):
             if isinstance(self.action_space, gym.spaces.Discrete):
                 # Reshape in case of discrete action
                 actions = actions.reshape(-1, 1)
-            rollout_buffer.add(self._last_obs, actions, rewards, dones, values, log_probs)
+            rollout_buffer.add(self._last_obs, actions, rewards, self._last_dones, values, log_probs)
             self._last_obs = new_obs
+            self._last_dones = dones

         rollout_buffer.compute_returns_and_advantage(values, dones=dones)

@araffin
Member

araffin commented Jul 16, 2020

@AndyShih12 thank you for pointing out that issue.

I fixed this locally, and PPO improved dramatically on my custom environment

Could you tell us more about your custom env and the hyperparameters you used for PPO?
I would expect an environment with either sparse rewards or short episodes (~100 timesteps), as the effect should be hard to see on long episodes.

Could you confirm that this is type of fix you implemented?

@Miffyli I was about to ask you to push a draft PR. I think we can fix the GAE computation directly without having to add an extra variable, no?
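
For reference, one possible reading of that suggestion (a sketch only, not the patch that was merged): if the buffer kept storing the done returned by env.step() at the same index as the step that produced it, the GAE loop could use it directly instead of looking at step + 1, and no extra _last_dones variable would be needed. Attribute names follow the existing RolloutBuffer loosely; last_values is assumed to already be a flat numpy array here.

"""Sketch of fixing the convention inside GAE instead (not the merged fix)."""
def compute_returns_and_advantage(self, last_values, dones):
    # self.dones[step] is assumed to hold the done of that same step, so it already
    # answers "was there an episode reset right after this step?". The trailing
    # `dones` argument then becomes redundant (it equals self.dones[-1]).
    last_gae_lam = 0
    for step in reversed(range(self.buffer_size)):
        next_non_terminal = 1.0 - self.dones[step]
        next_values = last_values if step == self.buffer_size - 1 else self.values[step + 1]
        delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step]
        last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam
        self.advantages[step] = last_gae_lam
    self.returns = self.advantages + self.values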

@Miffyli
Collaborator

Miffyli commented Jul 16, 2020

I concur that this effect is more pronounced in short episodes or reward-at-terminal games, but I tested it on Atari just to make sure.

I was about to ask you to push a draft PR. I think we can fix the GAE computation directly without having to add an extra variable, no?

Ok, I will begin a draft on matching PPO/A2C discrete-policy performance with stable-baselines (on Atari games). We can fix it that way in the GAE computation as well, but for that I want to make sure no other code expects the sb2-style dones.

@AndyShih12
Contributor Author

@Miffyli Yes, your patch is the same as what I did locally.

@araffin Indeed, my environment has sparse rewards and (very) short episodes. The reward only came at the last timestep of each episode, so the previous code didn't receive any signal at all.

Sorry to hear that it didn't close the gap for Atari games.

@ryankirkpatrick97

@araffin and @Miffyli
Could either of you describe to me the reasoning for wanting to use last_dones instead of current dones, besides having it match the original stable-baselines code?

To me it seems more like an issue with compute_returns_and_advantages (especially because the final dones value is passed into that method). Couldn't next_non_terminal be calculated as
next_non_terminal = 1.0 - self.dones[step] for all steps?

It also seems to differ from how sample collection for off-policy agents works, seen here

There's also the case mentioned in this issue, where the dones array after completion of the batch is missing the final done value and instead has a default False value set for the first item.

I have created a test Python file, zipped here, that highlights my concern. (Be wary of the ZeroDivisionError lol, just rerun a couple of times.)

The environment works as follows. The max episode length is 3, and observations are scalar and represent the current timestep in the environment. A done=True is returned when the agent tries to step from timestep 3 to 4. If a batch were collected for 3 full episodes, the final observation array should be as follows:
Obs: [1, 2, 3, 1, 2, 3, 1, 2, 3]

There is, however, a discrepancy between how the actual dones look and how I expect them to look.
Actual dones from PPO: [False, False, False, True, False, False, True, False, False]
Expected dones & how DQN looks: [False, False, True, False, False, True, False, False, True]
Again, the issue is that the actual dones array has an extra False value prepended and is missing the final done value (the sketch below reproduces this).

TLDR:

  • Revert to storing the current dones in the dones array.
  • compute_returns_and_advantages should calculate next_non_terminal = 1.0 - self.dones[step] for all steps
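
A quick, library-free way to see the two orderings side by side (this only emulates the storage conventions on the trajectory described above; step_env and the variable names are made up for the example):

"""Emulating the two dones conventions on the toy 3-step environment."""
def step_env(t):
    # obs is the current timestep; the episode ends when stepping from 3 to 4,
    # and a VecEnv-style auto-reset sends the agent back to obs 1
    done = (t == 3)
    return (1 if done else t + 1), done

obs, last_done = 1, False
observations, replay_dones, rollout_dones = [], [], []
for _ in range(9):                   # 9 transitions = 3 full episodes
    observations.append(obs)
    next_obs, done = step_env(obs)
    replay_dones.append(done)        # off-policy buffer: done stored with the transition it ends
    rollout_dones.append(last_done)  # on-policy rollout buffer (current code): previous step's done stored
    obs, last_done = next_obs, done

print(observations)   # [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(replay_dones)   # [False, False, True, False, False, True, False, False, True]
print(rollout_dones)  # [False, False, False, True, False, False, True, False, False]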

@Miffyli
Collaborator

Miffyli commented Mar 27, 2021

@ryankirkpatrick97 Hey! The original motivation was to copy stable-baselines (just to get things working the same way), yes. I think that, in turn, stems from OpenAI's baselines.

The discrepancy stems from the slightly different ideologies of off- and on-policy buffers:

  • Replay buffer stores experiences (s, s', a, r, d), where d accompanies s' (note that we cannot save an experience for the first step).
  • Rollout buffer stores experiences (s, a, r, d), where d accompanies s. With the example environment you tell the algorithm to collect only 9 experiences, and it does so.

Semantically, dones in both buffers now have the same meaning: they tell whether the accompanying state was the first of a new episode*. This also matches the logic of VecEnvs. Technically this means the first item of dones in rollout buffers should be True, not False (in practice this is never used).

As for doing such modifications: we might have to leave things as is, because modifying this might break others' code very silently (e.g. they read dones from the buffer and expect them to be stored in the current format, but suddenly it changes).

PS: I was writing this at 3am so I might be talking out of my butt here. Feel free to correct and bash me for that :').

(*) With VecEnvs, done=True does not mean the accompanying state was terminal; rather, it means there was an episode reset between the previous and the accompanying state (the terminal state is not included).

@ryankirkpatrick97

ryankirkpatrick97 commented Mar 27, 2021

@Miffyli thanks for the quick and detailed response! But hahah, hopefully you get a good night's rest!

Apologies, but I seem to have some misconceptions about the ideologies behind off/on-policy buffers. I would appreciate it if you could help clear them up.

I have read through the VecEnv code enough to understand what you mean about never returning a terminal state, and instead resetting when done is True.

However, I don't understand what you mean when you say that for replay buffers you store d with the accompanying s' and that you cannot save an experience for the first step. When a replay buffer stores (s, s', a, r, d), I read that as experience being stored for s, with the d stored in that buffer position corresponding to the transition from s to s'. Why wouldn't that work for the experience for the first step? I have also written pseudo-code for the Replay Buffer, for quick reference:

"""Replay Buffer Pseudo-Code"""
s = env.reset()
while not d:
    a, values, log_probs = policy(s)
    s', r, d, info = env.step(a)

    replay_buffer.add(s, s', a, r, d, values, log_probs)
    s = s'

When a rollout buffer stores (s, a, r, d), I assume that the d represents the same type of transition, but to an s' that isn't stored, since it would be unused.

"""Rollout Buffer Pseudo-Code in my expectation"""
s = env.reset()
while rollout_incomplete:
    a, values, log_probs = policy(s)
    s', r, d, info = env.step(a)

    rollout_buffer.add(s, a, r, d, values, log_probs)
    s = s'
_, last_values, _ = policy(s)
rollout_buffer.compute_returns_and_advantages(last_values)

If you look at my code example, the same 9 observations are traversed for both DQN and PPO, but the resulting dones arrays are different. From what you explained earlier, I understand that the values are shifted because, in the current codebase, they are stored with the accompanying next_state returned from env.step. However, the fact that the dones arrays are different implies to me that the semantic meanings of the arrays are different.

I believe the semantic meaning should switch to "d comes from s and a and represents the last non-terminal state within a buffer", and that d should then be stored alongside the corresponding s and a. This would be the same as how rewards are stored. Why should a done value corresponding to reaching a terminal state not be stored with the reward value for reaching that terminal state? This is a little sassy, but with a name like "dones", I expect this over what you described.

I feel that in the long run, the discrepancy between the done arrays will become a greater source of error and confusion than if the current code base changes and subtly breaks others' code.

Again, I could totally have a misconception here, and I apologize, I wrote this in a sassier and more opinionated way than necessary haha. Hopefully I explained my thoughts well. Any direction and clarification is much appreciated, and I'm looking forward to your response and future discussion.

For completeness, here is pseudo-code of the current implementation for Rollout Buffers:

"""Current Rollout Buffer Pseudo-Code"""
d- = False  # or to your comment, correctly True
s  = env.reset()
while rollout_incomplete:
    a, values, log_probs = policy(s)
    s', r, d, info = env.step(a)

    rollout_buffer.add(s, a, r, d-, values, log_probs)
    s  = s'
    d- = d
_, last_values, _ = policy(s)
rollout_buffer.compute_returns_and_advantages(last_values, d)

@Miffyli
Collaborator

Miffyli commented Mar 27, 2021

Gah. I needed way too much coffee for this one ^^.
Right, I think I was talking from my sleepy head here. Let me summarize to make sure I understood the point.

  • Replay buffer: The d in experience (s, s', a, r, d) tells the following: "did the agent land in a terminal state after state s?"
  • Rollout buffer: The d in experience (s, a, r, d) tells the following: "was state s the first step of a new episode?"

You are right, semantically they mean different things. This is because the ideologies are a bit different in the on-policy (e.g. A2C, PPO) and off-policy (e.g. SAC, DQN) worlds.

Sounds like one clarifying thing would be to rename the dones array in ReplayBuffer to next_dones? I agree confusions like these will hurt in the future, but we are rather adamant about backwards compatibility and avoiding modifications that could silently kill somebody's code (one of our pet peeves with other libraries, coughgymcough).

I am sorry if I skipped some parts of your argument. I spent wayyyy too long pondering on this again, so do point me to arguments I might have skipped (I have to dash now). These things are kind of fun to ponder, so I will give it another try tomorrow :')

Edit: If you feel like discussing these topics more (and the semantic meanings of data stored in different libraries), the RL Discord has a bunch of people who I bet would be up for such a discussion!

@araffin
Member

araffin commented Mar 27, 2021

What about renaming dones in the on-policy case to episode_starts (or first_steps)? (That way we have a different name but keep the code as-is, as we have already spent enough time fixing bugs due to this piece of code.)

EDIT: @Miffyli I would need to check the code more carefully; this idea came after reading your comment.
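
As a rough illustration of the rename idea (a sketch of the proposal only, not a merged API; the exact names here are hypothetical):

"""Sketch: same data and logic, clearer name (proposal only)."""
# in OnPolicyAlgorithm.collect_rollouts():
rollout_buffer.add(self._last_obs, actions, rewards,
                   self._last_episode_starts,   # was: self._last_dones
                   values, log_probs)
self._last_obs = new_obs
self._last_episode_starts = dones  # a new episode starts right after a done

# in RolloutBuffer.compute_returns_and_advantage(), the lookup would read:
# next_non_terminal = 1.0 - self.episode_starts[step + 1]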

@ryankirkpatrick97

@Miffyli, thanks for taking the time to provide clarification!

I think I have personally spent more time reading code than looking through the literature, and that's where the confusion came from.

Sounds like that simple rename might be the easiest way to go about altering things. Going to leave the big decisions up to you guys 😉

@Miffyli Miffyli added the documentation Improvements or additions to documentation label Mar 29, 2021