`evaluate_policy` reports preprocessed reward, whereas `rollout/ep_rew_mean` is unprocessed #181
Comments
Good catch! I agree we should report both if possible (i.e. check for the `"r"` and `"l"` keys in the info dict), and thanks to the `Monitor` wrapper this should be straightforward. I think we should also add a warning in `Monitor` if it is not the lowest-level wrapper, because otherwise `"r"` or `"l"` can get distorted like this. However, this might not be too practical, as some environments come with "built-in" wrappers by default, like timeouts in Atari envs.
I like this idea. We should be able to whitelist wrappers that are known not to change rewards, along the lines of the sketch below.
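A rough sketch of how such a check might look (`REWARD_PRESERVING_WRAPPERS` and `warn_if_monitor_not_innermost` are hypothetical names, not existing SB3 API, and the allowlist contents are only an example):

```python
import warnings

import gym

from stable_baselines3.common.monitor import Monitor

# Hypothetical allowlist of wrappers assumed not to distort Monitor's stats,
# e.g. the TimeLimit wrapper that gym.make applies by default.
REWARD_PRESERVING_WRAPPERS = (gym.wrappers.TimeLimit,)


def warn_if_monitor_not_innermost(env: gym.Env) -> None:
    """Warn about non-allowlisted wrappers sitting *below* Monitor,
    since those change the rewards/dones that Monitor records."""
    below_monitor = False
    while isinstance(env, gym.Wrapper):
        if isinstance(env, Monitor):
            below_monitor = True
        elif below_monitor and not isinstance(env, REWARD_PRESERVING_WRAPPERS):
            warnings.warn(
                f"{type(env).__name__} is wrapped by Monitor and may "
                'distort its "r"/"l" statistics.'
            )
        env = env.env  # descend one level in the wrapper stack
```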
Thanks for raising that issue.
👼 Yes, I tried to avoid that issue when possible; that's also why the Atari wrapper now applies a Monitor wrapper first.
Good point. We might need to modify that as well. It may take a couple of weeks before I have time to work on a PR for this, so if anyone else wants to start, feel free.
Hmm, a simple solution would be to check for the existence of the keys `"r"` and `"l"` in the info dict when calling `evaluate_policy`.
I believe the keys are only present at the end of the episode, and the episode could end at a different time in the original (unwrapped) environment. So I'm not sure how to make a fool-proof check for this (though I could imagine something heuristic).
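For illustration, a minimal evaluation loop along those lines (`evaluate_both_rewards` is a hypothetical helper; it assumes SB3's `Monitor` convention of inserting an `info["episode"]` dict with `"r"` and `"l"` keys at episode end, and it runs into exactly the caveat above):

```python
import gym
import numpy as np


def evaluate_both_rewards(model, env: gym.Env, n_episodes: int = 10):
    """Record the wrapped return (summed directly, as evaluate_policy does)
    and, when available, the raw return that Monitor reports via the info dict."""
    wrapped_returns, raw_returns = [], []
    for _ in range(n_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, info = env.step(action)
            ep_return += reward
        wrapped_returns.append(ep_return)
        # Monitor only inserts "episode" when *its* episode ends, which may
        # not coincide with the wrapped env's done (e.g. loss-of-life resets).
        if "episode" in info:
            raw_returns.append(info["episode"]["r"])
    return np.mean(wrapped_returns), np.mean(raw_returns) if raw_returns else None
```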
I tried to work on this but ran into some design decisions I could not overcome while updating `evaluate_policy`.
Yes, it is not possible. And we cannot simply wrap with a `Monitor` either, as many envs rely on VecEnv wrappers like `VecNormalize`.
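To illustrate the ordering constraint (a minimal sketch; the env ID and `norm_reward` setting are arbitrary): `Monitor` has to wrap each individual env before vectorization, while `VecNormalize` can only wrap the `VecEnv`, so `Monitor` necessarily records rewards before normalization.

```python
import gym

from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize


def make_env():
    # Monitor sits at the individual-env level...
    return Monitor(gym.make("CartPole-v1"))


# ...while VecNormalize can only be applied on top of the VecEnv,
# so Monitor's "r"/"l" stats never see the normalized rewards.
venv = VecNormalize(DummyVecEnv([make_env]), norm_reward=True)
```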
@Miffyli Are you still working on it? Otherwise I will give it a try ;)
@araffin Yup I will! Start of the week was super-busy with urgent deadlines but today I will return to this ^^' |
Describe the bug
There are two kinds of reward that it might make sense to report: the raw reward from the underlying environment, and the reward after wrappers such as `VecNormalize` or `AtariPreprocessing` have been applied. We hope the arg max of the latter is the same as the arg max of the original environment reward, but it need not be. Even if it is, the two can vary substantially -- e.g. by rescaling with a positive constant, or by addition of shaping.

Currently, `common.evaluation.evaluate_policy` reports the wrapped reward. Most (all?) RL algorithms report the unwrapped reward, e.g. `OffPolicyAlgorithm._dump_logs` in `rollout/ep_rew_mean`. The difference is that `common.evaluation.evaluate_policy` directly records the reward and computes statistics, whereas the RL algorithms rely on the `"r"` and `"l"` keys inserted into the info dict by `common.Monitor`, which is usually the first wrapper applied to the environment (before any preprocessing).

In my opinion we would ideally report both types of reward (and episode length -- since wrappers can also affect `done`), in both the training and the evaluation environment. But if we had to pick just one, I'd advocate for swapping the two: report the reward used for training during training, and the reward usually used for evaluation for the evaluation environment.

Credit to @ejmichaud for first noticing this discrepancy.
Code example
This is visible in train.py from Zoo on Atari:
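(The exact command isn't preserved here; a typical Zoo invocation would look something like the following, with the algorithm choice being a guess:)

```
python train.py --algo dqn --env SeaquestNoFrameskip-v4
```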
Note that not only do `eval/` and `rollout/` disagree on reward per timestep (because `AtariPreprocessing` does reward clipping), they also disagree on the episode length (because `AtariPreprocessing` converts loss of life into loss of an episode)! See seaquest.log for the full log.
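A toy repro of the per-timestep discrepancy (hedged sketch: `ScaleReward` is a made-up preprocessing wrapper standing in for reward clipping, and a random policy stands in for a trained one):

```python
import gym

from stable_baselines3.common.monitor import Monitor


class ScaleReward(gym.RewardWrapper):
    """Stand-in preprocessing wrapper: scales every reward by 0.1."""

    def reward(self, reward):
        return 0.1 * reward


env = ScaleReward(Monitor(gym.make("CartPole-v1")))

obs, done, wrapped_return = env.reset(), False, 0.0
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    wrapped_return += reward  # what evaluate_policy effectively sums

raw_return = info["episode"]["r"]  # what rollout/ep_rew_mean is built from
print(f"wrapped: {wrapped_return:.1f}, raw: {raw_return:.1f}")  # differ by 10x
```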
Note that `train.py` does some magic to disable reward normalization in `VecNormalize`, so I think this problem won't be visible in e.g. MuJoCo. Conceivably one could do something similar for `AtariWrapper` (set `terminal_on_life_loss=False` and `clip_reward=False`) -- but doing this for every wrapper seems error-prone, and some wrappers may just not support this out of the box.