`evaluate_policy` reports preprocessed reward, whereas `rollout/ep_rew_mean` is unprocessed #181
Comments
Good catch! I agree we should report both if possible (i.e. check for the `"r"` and `"l"` keys in the info dict), and thanks to the `Monitor` wrapper this should be straightforward. I think we should also add a warning in `Monitor` if it is not the lowest-level wrapper, because otherwise `"r"` or `"l"` can get distorted like this. However, this might not be too practical, as some environments come with "built-in" wrappers by default, like timeouts in Atari envs.
I like this idea. We should be able to whitelist wrappers that are known not to change rewards, along the lines of the sketch below.
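A rough sketch of how such a check might look (`REWARD_PRESERVING_WRAPPERS` and `warn_if_monitor_not_innermost` are hypothetical names, not existing SB3 API, and the allowlist contents are only an example):

```python
import warnings

import gym

from stable_baselines3.common.monitor import Monitor

# Hypothetical allowlist of wrappers assumed not to distort Monitor's stats,
# e.g. the TimeLimit wrapper that gym.make applies by default.
REWARD_PRESERVING_WRAPPERS = (gym.wrappers.TimeLimit,)


def warn_if_monitor_not_innermost(env: gym.Env) -> None:
    """Warn about non-allowlisted wrappers sitting *below* Monitor,
    since those change the rewards/dones that Monitor records."""
    below_monitor = False
    while isinstance(env, gym.Wrapper):
        if isinstance(env, Monitor):
            below_monitor = True
        elif below_monitor and not isinstance(env, REWARD_PRESERVING_WRAPPERS):
            warnings.warn(
                f"{type(env).__name__} is wrapped by Monitor and may "
                'distort its "r"/"l" statistics.'
            )
        env = env.env  # descend one level in the wrapper stack
```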
Thanks for raising that issue.
👼 Yes, I tried to avoid that issue when possible; that's also why the Atari wrapper now applies a Monitor wrapper first.
Good point. We might need to modify that as well. It may take a couple of weeks before I have time to work on a PR for this, so if anyone else wants to start, feel free.
Hmm, a simple solution would be to check for the existence of the keys `"r"` and `"l"` in the info dict when calling `evaluate_policy`.
I believe the keys are only present at the end of the episode, and the episode could end at a different time in the original (unwrapped) environment. So I'm not sure how to make a fool-proof check for this (though I could imagine something heuristic).
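For illustration, a minimal evaluation loop along those lines (`evaluate_both_rewards` is a hypothetical helper; it assumes SB3's `Monitor` convention of inserting an `info["episode"]` dict with `"r"` and `"l"` keys at episode end, and it runs into exactly the caveat above):

```python
import gym
import numpy as np


def evaluate_both_rewards(model, env: gym.Env, n_episodes: int = 10):
    """Record the wrapped return (summed directly, as evaluate_policy does)
    and, when available, the raw return that Monitor reports via the info dict."""
    wrapped_returns, raw_returns = [], []
    for _ in range(n_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, info = env.step(action)
            ep_return += reward
        wrapped_returns.append(ep_return)
        # Monitor only inserts "episode" when *its* episode ends, which may
        # not coincide with the wrapped env's done (e.g. loss-of-life resets).
        if "episode" in info:
            raw_returns.append(info["episode"]["r"])
    return np.mean(wrapped_returns), np.mean(raw_returns) if raw_returns else None
```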
I tried to work on this but ran into some design decisions I could not overcome while updating `evaluate_policy`.
Yes, it is not possible. And we cannot simply wrap with a `Monitor` either, as many envs rely on VecEnv wrappers like `VecNormalize`.
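To illustrate the ordering constraint (a minimal sketch; the env ID and `norm_reward` setting are arbitrary): `Monitor` has to wrap each individual env before vectorization, while `VecNormalize` can only wrap the `VecEnv`, so `Monitor` necessarily records rewards before normalization.

```python
import gym

from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize


def make_env():
    # Monitor sits at the individual-env level...
    return Monitor(gym.make("CartPole-v1"))


# ...while VecNormalize can only be applied on top of the VecEnv,
# so Monitor's "r"/"l" stats never see the normalized rewards.
venv = VecNormalize(DummyVecEnv([make_env]), norm_reward=True)
```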
@Miffyli Are you still working on it? Otherwise I will give it a try ;)
@araffin Yup I will! Start of the week was super-busy with urgent deadlines but today I will return to this ^^' |
Describe the bug
There are two kinds of reward that it might make sense to report: the raw reward from the underlying environment, and the reward after wrappers such as `VecNormalize` or `AtariPreprocessing` have been applied. We hope the arg max of the latter is the same as the arg max of the original environment reward, but it need not be. Even if it is, the two can vary substantially -- e.g. by rescaling with a positive constant, or by addition of shaping.

Currently, `common.evaluation.evaluate_policy` reports the wrapped reward. Most (all?) RL algorithms report the unwrapped reward, e.g. `OffPolicyAlgorithm._dump_logs` in `rollout/ep_rew_mean`. The difference is that `common.evaluation.evaluate_policy` directly records the reward and computes statistics, whereas the RL algorithms rely on the `"r"` and `"l"` keys inserted into the info dict by `common.Monitor`, which is usually the first wrapper applied to the environment (before any preprocessing).

In my opinion we would ideally report both types of reward (and episode length -- since wrappers can also affect `done`), in both the training and the evaluation environment. But if we had to pick just one, I'd advocate for swapping the two: report the reward used for training during training, and the reward usually used for evaluation for the evaluation environment.

Credit to @ejmichaud for first noticing this discrepancy.
Code example
This is visible in train.py from Zoo on Atari:
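(The exact command isn't preserved here; a typical Zoo invocation would look something like the following, with the algorithm choice being a guess:)

```
python train.py --algo dqn --env SeaquestNoFrameskip-v4
```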
Note that not only do `eval/` and `rollout/` disagree on reward per timestep (because `AtariPreprocessing` does reward clipping), they also disagree on the episode length (because `AtariPreprocessing` converts loss of life into loss of an episode)! See seaquest.log for the full log.
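A toy repro of the per-timestep discrepancy (hedged sketch: `ScaleReward` is a made-up preprocessing wrapper standing in for reward clipping, and a random policy stands in for a trained one):

```python
import gym

from stable_baselines3.common.monitor import Monitor


class ScaleReward(gym.RewardWrapper):
    """Stand-in preprocessing wrapper: scales every reward by 0.1."""

    def reward(self, reward):
        return 0.1 * reward


env = ScaleReward(Monitor(gym.make("CartPole-v1")))

obs, done, wrapped_return = env.reset(), False, 0.0
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    wrapped_return += reward  # what evaluate_policy effectively sums

raw_return = info["episode"]["r"]  # what rollout/ep_rew_mean is built from
print(f"wrapped: {wrapped_return:.1f}, raw: {raw_return:.1f}")  # differ by 10x
```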
Note that `train.py` does some magic to disable reward normalization in `VecNormalize`, so I think this problem won't be visible in e.g. MuJoCo. Conceivably one could do something similar for `AtariWrapper` (set `terminal_on_life_loss=False` and `clip_reward=False`) -- but doing this for every wrapper seems error-prone, and some wrappers may just not support this out of the box.