[Proposal] Formal API handling of truncation vs termination #2510
Comments
(For reference, here's the link to how the dm_env API handles termination and truncation.)
If I recall, the one class responsible for the truncation is the `TimeLimit` wrapper.
I am in favor of this breaking change, option 1. This is a pretty fundamental issue that should be made explicit. The
Wouldn't only envs that manually apply the
So there's a general problem with the ALE environments in that v0 and v4 don't implement correct (or reasonable) defaults, making properly reproducing results by the ALE group or DeepMind impossible. They also preclude using correct preprocessing schemes. The plan is to remove these accordingly, though you're right that removing them before making this change may be required, which I had not realized. In the case of strictly reproducing old code, people could use old versions of Gym. This change could obviously be baked into any environment without the time limit wrapper.
I also think that the correct move here is to include truncation via argument before 1.0 and make it mandatory at 1.0. That would provide the easiest transition and most reasonable behavior.
My 2 cents: I would be for option 3 - turn done into a custom three-state variable, or give it a method like `done.is_truncated`. Specifically an enumeration, which hopefully can be backwards compatible:

```python
from enum import Enum


class Done(Enum):
    not_done = 0
    terminated = 1
    truncated = 2
```

I find that a separate bool for truncated is not worth it because 1) it's always False if done is False, 2) it's not succinct. The argument that it might be non-obvious to beginners does not outweigh that for me, because I'd like a library that can be beautiful once I understand it. Think about all the cases where you don't want to differentiate terminated / truncated: you will always have to unpack it properly at the very least.
I prefer option 3 (turn done into a custom three-state variable) too. It's easy to be backward compatible with integers:

```python
NOT_DONE = 0
TERMINATED = 1
TRUNCATED = 2
```

For testing whether we need to start a new episode:

```python
bool(NOT_DONE)    # -> False
bool(TERMINATED)  # -> True
bool(TRUNCATED)   # -> True
```

The following snippet works fine for both boolean-based done and integer-based done:

```python
observation = env.reset()
done = NOT_DONE
while not done:
    action = policy(observation)
    observation, reward, done, info = env.step(action)
```

For value iteration in training (e.g. DQN):

```python
abs(1 - NOT_DONE)    # -> 1
abs(1 - TERMINATED)  # -> 0
abs(1 - TRUNCATED)   # -> 1
```

Formerly, the target value is calculated by:

```python
Q_target = reward + gamma * (1 - done) * max(Q(next_observation, *))
```

With the three-state variable, we only need to add an `abs` call:

```python
Q_target = reward + gamma * abs(1 - done) * max(Q(next_observation, *))
```

The operator
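As a quick illustration of how the 0/1/2 coding above would carry through to batched training, here is a small NumPy sketch; all numbers are made up and this is not part of the proposal itself:

```python
import numpy as np

# Illustrative replay-buffer batch with `dones` stored as the three-state
# integers above (0 = NOT_DONE, 1 = TERMINATED, 2 = TRUNCATED).
rewards = np.array([1.0, 0.5, 0.0])
dones = np.array([0, 1, 2])
next_q_max = np.array([10.0, 10.0, 10.0])
gamma = 0.99

# abs(1 - done) is 1 for NOT_DONE and TRUNCATED and 0 only for TERMINATED,
# so bootstrapping is cut off only at true terminal states.
q_target = rewards + gamma * np.abs(1 - dones) * next_q_max
print(q_target)  # -> [10.9  0.5  9.9]
```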
@XuehaiPan thanks for including the snippet. Option 3 then appears very convenient engineering-wise. Your explainer is very well-written, and I'd suggest adding it to the documentation and possibly the blog post that explains the new gym version.
Sorry for the delayed reply. I spoke to @vwxyzjn and @RedTachyon about this offline at length. Neither of them had a very strong opinion between [0, 1, 2] or two booleans, but after the discussion I would very much prefer if we used two booleans. This is for the following reasons:
This project has 26k stars and an untold greater number of users. I wouldn't underestimate how big of a hassle changing the return signature of a core method like `step()` would be. @XuehaiPan's suggestion adds the new behavior without requiring a single line to be changed in any existing user code. Personally I'm not a big fan of the enum consts and would prefer

```python
_, _, done, _ = env.step(_)
if done and done.truncated:
    ...
print(done.reason)
```

Cheers,
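For concreteness, one possible shape of the bool-like object implied by that snippet could be the following; this is purely illustrative, and `Done`, `truncated`, and `reason` are hypothetical names rather than an actual Gym API:

```python
class Done:
    """Hypothetical bool-like `done` carrying the reason the episode ended."""

    def __init__(self, done, truncated=False, reason=None):
        self._done = bool(done)
        self.truncated = bool(truncated)
        self.reason = reason  # e.g. "terminal state" or "time limit reached"

    def __bool__(self):
        return self._done


done = Done(True, truncated=True, reason="time limit reached")
if done and done.truncated:
    print(done.reason)  # -> time limit reached
```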
@ChrisCummins Indeed the enum constants confuse users, especially beginners. It may need much more detailed documentation and examples. Including a DQN implementation tutorial in the docs may be even better.

I choose

```python
dones = [...]  # a list of DoneType values
dones = np.array(dones)
dones.dtype    # -> np.dtype('O')
```

```python
In [1]: from enum import Enum

In [2]: class DoneType(Enum):
   ...:     NOT_DONE = 0
   ...:     DONE = TERMINATED = 1
   ...:     TRUNCATED = 2
   ...:
   ...:     def __bool__(self):
   ...:         return bool(self.value)
   ...:
   ...:     def __int__(self):
   ...:         return int(self.value)
   ...:
   ...:     def __float__(self):
   ...:         return float(self.value)

In [3]: import numpy as np

In [4]: dones = np.array([DoneType.NOT_DONE, DoneType.DONE])

In [5]: dones
Out[5]: array([<DoneType.NOT_DONE: 0>, <DoneType.DONE: 1>], dtype=object)

In [6]: dones.dtype
Out[6]: dtype('O')

In [7]: dones.astype(int)
Out[7]: array([0, 1])
```

To support auto type casting with NumPy, we should inherit from `int`:

```python
class DoneType(int):
    def __new__(cls, done, truncated=False):
        done = super().__new__(cls, bool(done))
        done.truncated = truncated
        return done

    @property
    def truncated(self):
        return self.__truncated

    @truncated.setter
    def truncated(self, value):
        self.__truncated = bool(value)

    def __str__(self):
        return f"DoneType({bool(self)}, truncated={self.truncated})"

    __repr__ = __str__

    def episode_ended(self):
        return bool(self) or self.truncated  # `bool(self)` is the plain 0/1 termination flag
```

```python
In [1]: import numpy as np

In [2]: class DoneType(int):
   ...:     ...  # the class defined above

In [3]: done = DoneType(True)

In [4]: done
Out[4]: DoneType(True, truncated=False)

In [5]: done.truncated = True

In [6]: done
Out[6]: DoneType(True, truncated=True)

In [7]: dones = [
   ...:     DoneType(False, truncated=False),  # not done
   ...:     DoneType(True, truncated=False),   # terminated
   ...:     DoneType(False, truncated=True),   # not done but truncated
   ...:     DoneType(True, truncated=True),    # done and truncated at the same time
   ...: ]

In [8]: np.array(dones)
Out[8]: array([0, 1, 0, 1])

In [9]: np.array([done.truncated for done in dones])
Out[9]: array([False, False,  True,  True])
```

The problem is how to tensorize
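To make that last concern concrete, here is a rough sketch of what tensorizing a batch of such values could look like, assuming an int subclass along the lines of the `DoneType` above (illustrative only):

```python
import numpy as np


class DoneType(int):
    # minimal stand-in for the int subclass sketched above
    def __new__(cls, done, truncated=False):
        obj = super().__new__(cls, bool(done))
        obj.truncated = bool(truncated)
        return obj


# A batch of per-env results, e.g. from a vectorized env (illustrative values).
dones = [DoneType(False), DoneType(True), DoneType(False, truncated=True)]

# The int view vectorizes cleanly into a 0/1 termination array...
done_flags = np.asarray(dones, dtype=np.int64)            # -> array([0, 1, 0])
# ...but the truncation bit still has to be gathered element by element.
truncated_flags = np.array([d.truncated for d in dones])  # -> array([False, False,  True])
```

In other words, a vectorized env would effectively expose two arrays anyway, which ends up close to what the two-boolean option gives directly.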
@jkterry1 As shown in the third point, it's not backward compatible. I would rather leave the

I have another proposal (besides options 1-3 in #2510 (comment)):

Old API:

```python
obs = env.reset()
while True:
    action = policy(obs)
    obs, reward, done, info = env.step(action)
    if done:
        break
```

New API:

```python
obs = env.reset()
while True:
    action = policy(obs)
    obs, reward, done, info = env.step(action)
    if env.ended():
        break  # `done` may be `True` when the episode is truncated
```

We only need to add truncation behavior to

cc @vwxyzjn @RedTachyon.
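A minimal sketch of what such an `ended()` method could look like as a wrapper, assuming truncation is reported through the `TimeLimit.truncated` info key that Gym's `TimeLimit` wrapper sets; the wrapper name and structure here are hypothetical, not a proposed implementation:

```python
import gym


class EndedWrapper(gym.Wrapper):
    """Hypothetical wrapper exposing `ended()` without changing the `step` signature."""

    def __init__(self, env):
        super().__init__(env)
        self._ended = False

    def reset(self, **kwargs):
        self._ended = False
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        truncated = info.get("TimeLimit.truncated", False)
        self._ended = bool(done) or bool(truncated)
        return obs, reward, done, info

    def ended(self):
        return self._ended


# usage: env = EndedWrapper(gym.make("CartPole-v1")), then loop on `env.ended()` as above
```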
Hey, I spoke to Costa about this offline a bit again. Your core concern, along with @ChrisCummins's, appears to be about the breaking nature of this change. Costa shared some of these concerns, and if he does then clarifying further is clearly prudent. To clarify, my intention would be that, until the 1.0 release, there would be a

I also think that making a breaking change to do this incredibly important feature right in a permanent manner in 1.0 is more important than retaining backwards compatibility, especially given that Gym will almost certainly be used a ton more in the next 5 years than it has in the past 5.

Regarding how to do this if we're doing a breaking change, I don't think that adding the truncation method to the API is as clean or natural a way as the original options 1-3.
The above snippet is

```python
observation, reward, done, _, info = env.step(action)
```

The return value of

Besides, the main concern for me is that we never use

```python
obs = env.reset()
while True:
    action = policy(obs)
    next_obs, reward, done, truncated, info = env.step(action)
    buffer.add((obs, next_obs, reward, done, info))  # `truncated` is not added
    obs = next_obs
    if done or truncated:
        break
```

For me, I would support 3 instead of 1:

Either an enumeration or a new type. The following snippet supports backward compatibility and NumPy array conversion:

```python
class DoneType(int):
    def __new__(cls, done, truncated=False):
        done = super().__new__(cls, bool(done))
        done.truncated = truncated
        return done

    @property
    def truncated(self):
        return self.__truncated

    @truncated.setter
    def truncated(self, value):
        self.__truncated = bool(value)

    def __str__(self):
        return f"DoneType({bool(int(self))}, truncated={self.truncated})"

    __repr__ = __str__

    def episode_ended(self):
        return bool(int(self)) or self.truncated  # `int(self)` is the plain 0/1 termination flag

    __bool__ = episode_ended  # this means bool(int(self)) != int(bool(self))
```

```python
obs = env.reset()
while True:
    action = policy(obs)
    next_obs, reward, done, info = env.step(action)
    buffer.add((obs, next_obs, reward, done, info))
    obs = next_obs
    if done:  # here we call `done.__bool__()`
        break
```
One thing is not clear to me, independent of the implementation, which is the semantics and the allowed cases (and this should be made clear in the doc #2510 (comment)):

If

If
Yes,

In training, the episode can be truncated by the horizon limit. Imagine an environment that has 10,000 steps for each episode (e.g.

In testing or evaluation, the

The episode is terminated because of some reason, and it also reaches the horizon limit and is truncated.
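A toy sketch of that training-time situation, assuming an env whose own `done` really means termination (no built-in time limit); the names and the 10,000-step budget are just placeholders taken from the example above:

```python
def run_episode(env, policy, horizon=10_000):
    """Roll out one episode, cutting it off after `horizon` steps."""
    obs = env.reset()
    terminated = truncated = False
    steps = 0
    while not (terminated or truncated):
        obs, reward, terminated, info = env.step(policy(obs))
        steps += 1
        # The step budget ran out, but the env itself never reached a terminal state.
        truncated = (not terminated) and steps >= horizon
    return terminated, truncated
```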
My preferred semantics would be for done=False and truncated=True to not be allowed. If an environment is truncated, this is obviously a kind of done, no information is lost via this that I'm aware of, and this would make retaining handling compatibility significantly easier.
@jkterry1 I think the semantics

Do you mean:

Whether the next observation is valid or not:

```python
# Sampling phase
obs = env.reset()
while True:
    action = policy(obs)
    next_obs, reward, done, truncated, info = env.step(action)
    buffer.add((obs, next_obs, reward, done, truncated, info))
    if done:
        break
    obs = next_obs

####################

# Training phase
if truncated or not done:
    target_value = reward + gamma * value_func(next_obs)
else:
    target_value = reward

value_loss = 0.5 * square(value_func(obs) - target_value)
```

This needs to add:

```diff
 # Sampling phase
 obs = env.reset()
 while True:
     action = policy(obs)
-    next_obs, reward, done, info = env.step(action)
-    buffer.add((obs, next_obs, reward, done, info))
+    next_obs, reward, done, truncated, info = env.step(action)
+    buffer.add((obs, next_obs, reward, done, truncated, info))
     if done:
         break
     obs = next_obs

 ####################

 # Training phase
-if not done:
+if truncated or not done:
     target_value = reward + gamma * value_func(next_obs)
 else:
     target_value = reward

 value_loss = 0.5 * square(value_func(obs) - target_value)
```

For me:

Whether the next observation is valid or not:

```python
# Sampling phase
obs = env.reset()
while True:
    action = policy(obs)
    next_obs, reward, done, truncated, info = env.step(action)
    buffer.add((obs, next_obs, reward, done, info))
    if done or truncated:
        break
    obs = next_obs

####################

# Training phase
if not done:
    target_value = reward + gamma * value_func(next_obs)
else:
    target_value = reward

value_loss = 0.5 * square(value_func(obs) - target_value)
```

The only change is in the sampling phase:

```diff
 # Sampling phase
 obs = env.reset()
 while True:
     action = policy(obs)
-    next_obs, reward, done, info = env.step(action)
+    next_obs, reward, done, truncated, info = env.step(action)
     buffer.add((obs, next_obs, reward, done, info))
-    if done:
+    if done or truncated:
         break
     obs = next_obs
```

And we can implement tricky
@XuehaiPan just a quick thought regarding your previous comment,
In this case can we not add
@arjun-kg You'd better not do that. That makes `done_in_train = done_in_sample and not truncated`, which makes the code harder to understand and causes trouble in future maintenance. The meaning of

For me, both semantics of

IMO, I prefer the second one. We should reach an agreement before submitting a PR to implement it. Any thoughts?
The answers just confirmed what I was afraid of: this API change brings confusion even for the ones implementing the feature :/ (which means even more confusion for newcomers, in addition to a major breaking change). In my opinion, a better alternative is to use the current API and make use of the `info` dict.
Hey @araffin, that is a good point. I think the new API at this point would be
So there is no ambiguous
hmm, not sure how renaming
Renaming

```python
obs, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
    obs = env.reset()
```

I think just removing the

If termination and truncation happen at the same time, in which case I think it will return

All in all, the possible cases are
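For reference, a small sketch of how a training loop might treat each (terminated, truncated) combination under this two-boolean API; the helper names and the discount factor are placeholders, not Gym API:

```python
def needs_reset(terminated, truncated):
    # The episode must be restarted in either case.
    return terminated or truncated


def td_target(reward, terminated, next_value, gamma=0.99):
    # Only a genuine terminal state stops bootstrapping; a pure time-limit cut
    # (terminated=False, truncated=True) still bootstraps from the next state.
    return reward if terminated else reward + gamma * next_value
```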
@vwxyzjn mmh, for me
If you look at the actual PR #2752, you will notice a weird pattern that I described some months ago (here: #2510 (comment)): all the envs return

I don't think that will help users reading the code. Either a
@araffin I think explicit boolean variable
well, in that case, you do the same as for a batch of dict observations: you filter and then convert it (see `gym/gym/vector/utils/numpy_utils.py`, line 65 at da7b8ae).

Anyway, from an algorithm point of view, you need to do a for loop on each env, otherwise you might do computation that is not needed (https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/on_policy_algorithm.py#L193).

I don't know what is

why not? For RNN and on-policy algorithms, you don't need to know about the timeout (for padding): https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/feat/ppo-lstm/sb3_contrib/common/recurrent/buffers.py#L129

EDIT: looking at your issue on RLlib, it looks like a very specific use case (additional loss function for RNN using separate rewards) that very few people would encounter. For such a specific use case, I would recommend forking the library and adapting it to your needs.
That's exactly what I meant:

You need a

We can save almost anything in the `info` dict. If we are only talking about an officially supported key

BTW, search results for the key
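As a concrete example of leaning on such a key under the current API, here is a small helper assuming the `TimeLimit.truncated` entry that Gym's `TimeLimit` wrapper writes into `info`; the helper itself is hypothetical:

```python
def split_done(done, info):
    """Split the old-API `done` into (terminated, truncated)."""
    truncated = bool(info.get("TimeLimit.truncated", False))
    terminated = bool(done) and not truncated
    return terminated, truncated


print(split_done(True, {"TimeLimit.truncated": True}))  # -> (False, True)
print(split_done(True, {}))                             # -> (True, False)
```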
I'm glad we arrived at the same conclusion =)! That's exactly my point since the beginning (#2510 (comment)):

This is mostly an issue with GitHub search...

(ray doesn't seem to support truncation: ray-project/ray#10093)

I don't think so, you can just stack along the batch axis for the given keys (in the truncation case, this is just booleans), see `gym/gym/vector/utils/numpy_utils.py`, line 53 at da7b8ae.

And since Python 3.6, dictionaries preserve insertion order (like an ordered dict). But this is off topic: this issue is about truncation, not about passing additional arrays to the training code. (As also mentioned, you need to do a for loop anyway for truncation, unless you want to add some masking, which is more efficient but less readable.)
Closing in favor of #2752
Could you please elaborate on this? I've been asking myself how to reproduce the Atari setting as done by DeepMind in their publications, and I've got the impression that the

I know

You wrote that with

Btw, in terms of the new
@jkterry1 After looking at the source, I get the impression that it is rather hard to reproduce the setup with the
Right now, one of the biggest weaknesses of the Gym API is that `done` is used for both truncation and termination. The problem is that algorithms in the Q-learning family (and I assume others) depend on the differentiation between a terminal state and a truncated state to function as intended. Right now, Gym solves this for some (but not all) built-in environments via the time limit wrapper inserting this information into infos. No third party environments I'm aware of use this, and most learning code uses this. This has resulted in a few problems:
This sort of thing, in the opinion of myself and those I've spoken to at OpenAI, warrants a breaking change in the pursuit of a 1.0 release. There are three options for making the breaking change:

1. Add a separate truncation return value: `observation, reward, done/terminated, truncated, info = env.step(action)`
2.
3. Turn `done` into a custom three-state variable, or give it a method like `done.is_truncated`
My preference is option 1. I think that option 3 is too confusing and wouldn't generally be used because it's not explicit. I also think that option 2 adds additional confusion and complexity over option 1 without really adding any additional utility. I'm creating this issue so that people can discuss it before a PR is made. This feature is intended to be released in the release of Gym after the current one.
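To spell out why the differentiation matters for value-based methods, a toy calculation (numbers are made up, not from any benchmark): treating a time-limit cutoff as a terminal state zeroes out the bootstrap term and biases the target.

```python
gamma = 0.99
reward = 1.0
next_state_value = 50.0  # estimated value of the state the episode was cut off in

# Episode ended only because of the time limit (truncated, not terminal):
correct_target = reward + gamma * next_state_value  # 50.5 -- keep bootstrapping
naive_target = reward                               # 1.0  -- if truncation is treated as termination
```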
Regarding breaking changes and backwards compatibility:
`_,`
In other words, it's my view that the level of breakage this update introduces is a minor annoyance, and I think that this is a very important feature. This change will be the second of three planned small changes to the base API before the 1.0 release (the first was to the seed method, the last will be to the render method). I am obviously aware of the profound role Gym plays in the community; however, Gym will almost certainly be used a lot more over the next 5 years than the past 5, and I think that a series of minor breaking changes to clean things up now in the pursuit of a proper 1.0 release will be very worth it.
Please offer your thoughts on which of the three options here is best in the comments if you would like (or new options if I'm missing something) and remain civil and inviting to others.