Feature suggestion: Observation wrappers
Suppose we have an Atari environment whose observations are of the type
Tuple(Box(210, 160, 3), Box(128,))
i.e., both the screen image and the RAM contents.
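For concreteness, such a space can be written down with gym.spaces (a sketch; the 0..255 bounds below are illustrative assumptions, not necessarily what the Atari environments use):
from gym.spaces import Box, Tuple

# Screen image (210 x 160 RGB) plus the 128 bytes of Atari RAM.
observation_space = Tuple((
    Box(low=0, high=255, shape=(210, 160, 3)),
    Box(low=0, high=255, shape=(128,)),
))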
We would now like to apply some filters to the image part of the observation.
If we try to do this with Wrappers, we run into problems:
env_image = pick_tuple_first_wrapper(env)  # This wrapper rewrites the observation_space and returns
                                           # a new observation corresponding to the image part.
env_ram = pick_tuple_second_wrapper(env)
Then we apply some filter to env_image:
env_downscaled_image = downscale_image_wrapper(env_image, (84, 84))
And now combine the two again:
env_wrapped = combine_into_tuple_wrapper(env_downscaled_image, env_ram)
The problem with this approach is what should happen when env_wrapped.step(0)
is called. What if the two input environments above have different action_space
properties?
What we need instead are filter functions that apply only to the observation.
These observation filters could then be combined and applied from an environment wrapper, as we are used to.
We would use them as follows:
pick_first = observation_wrapper.pick_tuple(0)
pick_second = observation_wrapper.pick_tuple(1)
downscale = observation_wrapper.downscale(shape=(84, 84))
join = observation_wrapper.join([pick_first, downscale], [pick_second])  # join two obs. chains
# phi_4(x) := (phi_2(phi_1(x)), phi_3(x)), where phi_1 = pick_first,
# phi_2 = downscale, phi_3 = pick_second, phi_4 = join
env = observation_wrapper(env, join)  # returns observation' := phi_4(observation)
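A minimal sketch of what such composable filters could look like as plain callables (all names here are hypothetical, not an existing gym API; downscale_2x stands in for the shape-targeted downscale):
def pick_tuple(index):
    """Filter selecting one element of a tuple observation."""
    return lambda observation: observation[index]

def downscale_2x(observation):
    """Naive 2x downscale by dropping every other row and column."""
    return observation[::2, ::2]

def join(*chains):
    """Apply each chain of filters to the same observation, returning a tuple."""
    def phi(observation):
        results = []
        for chain in chains:
            out = observation
            for f in chain:
                out = f(out)
            results.append(out)
        return tuple(results)
    return phi

# phi_4(x) == (downscale_2x(pick_tuple(0)(x)), pick_tuple(1)(x))
phi_4 = join([pick_tuple(0), downscale_2x], [pick_tuple(1)])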
Comments? Ideas?
Comment:
We might need the observation_space as an input. Say we have an observation wrapper called
downscale_2x, which downscales an image to half its width and height. Its output observation_space
is not fully determined until we know the input observation_space.
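One way to handle this is to have each filter expose a function from input space to output space. A sketch for the hypothetical downscale_2x (pixel bounds assumed to be 0..255):
from gym.spaces import Box

def downscale_2x_space(input_space):
    """Derive the output observation_space of downscale_2x from its input space."""
    h, w = input_space.shape[0], input_space.shape[1]
    # Halve height and width (rounding up, to match [::2] slicing);
    # keep any trailing channel dimensions unchanged.
    return Box(low=0, high=255,
               shape=((h + 1) // 2, (w + 1) // 2) + tuple(input_space.shape[2:]))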
What if the observation wrapper is a "frame stacking" kind of wrapper?
Example: stacking four frames (observations):
phi(x_t) = (x_{t-3}, x_{t-2}, x_{t-1}, x_t), where x_k = zeros_like(x_t) for k < 0,
or x_k = x_t for k < 0.
This can be seen either as the function phi having state, or as phi taking the previous output as an additional input.
If there is no previous state (as right after a call to env.reset()), the observation wrapper would do whatever it needs to keep the observation space fixed. In the frame-stacking case, a Box(dim1, dim2, ..., dimN) would be mapped to Box(4, dim1, dim2, ..., dimN), since the output observation always has to match the contract of the env.observation_space property.
Q: When do we reset our observation wrappers? Should the environment call a reset() method on the wrappers?
class ObservationWrapper(object):
    """Observation wrapper base class."""

    def __init__(self, observation_space):
        self._observation_space = observation_space

    @property
    def observation_space(self):
        return self._observation_space

    def phi(self, observation):
        # assert observation is an instance of the input obs. space
        output = observation
        # assert output is an instance of self.observation_space  # output obs. space
        return output

    def reset(self):
        """Reset internal state."""
        pass
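As a usage example, here is a sketch of the frame-stacking wrapper discussed above, built on this base class (zero padding right after reset; Box bounds assumed to be pixel values; note the base class stores only the output space, so the input space is used just to derive it):
import numpy as np
from gym.spaces import Box

class FrameStack(ObservationWrapper):
    """Stack the last k observations along a new leading axis."""

    def __init__(self, input_space, k=4):
        # Box(dim1, ..., dimN) is mapped to Box(k, dim1, ..., dimN).
        output_space = Box(low=0, high=255, shape=(k,) + tuple(input_space.shape))
        super(FrameStack, self).__init__(output_space)
        self._k = k
        self._frames = None

    def phi(self, observation):
        if self._frames is None:  # zero padding right after reset()
            self._frames = [np.zeros_like(observation)] * self._k
        self._frames = self._frames[1:] + [observation]
        return np.stack(self._frames)

    def reset(self):
        self._frames = None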
Comment (from @rldotai):
I agree it's probably better to wrap the environment in a class (since directly subclassing requires registration and may cause issues). It's possible to implement such a wrapper so that it overrides the environment's methods on a case-by-case basis while still presenting the same API (so when downsampling and wrapping frames, the environment provides the correct Box(...) observation space). It would also be easier to keep track of the metadata in order to ensure reproducibility. The reset() method would have to be defined so that it calls the wrapped environment's reset() method, and also resets whatever else it is keeping track of (e.g., previous frames). This might be preferable because it is somewhat more general -- for example, it would be really straightforward to alter environments to work at different timescales by overriding step so that it executes the selected action for multiple timesteps.
Another alternative (which I am somewhat less happy with) is setting up an Agent class to interface between the environment and the learning algorithm. In my implementation, the agent has a feature function (phi) which transforms the observation before passing it to the learning algorithm. The problem being that in this case, too, you have to be careful to call reset on the agent and the environment, which would not be necessary if the environment was wrapped as above.
(This wiki entry just popped up on my feed, and having implemented something similar for benchmarking RL with function approximation I thought to share my experience)
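A sketch of the step-overriding idea from the comment above (a hypothetical ActionRepeat wrapper; the (observation, reward, done, info) step contract is assumed):
class ActionRepeat(object):
    """Env wrapper that executes each chosen action for n consecutive timesteps."""

    def __init__(self, env, n=4):
        self.env = env
        self.n = n

    def __getattr__(self, name):
        # Everything not overridden falls through to the wrapped env
        # (action_space, observation_space, render, ...).
        return getattr(self.env, name)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.n):
            observation, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return observation, total_reward, done, info
Usage would then simply be ActionRepeat(env, n=4) wherever env was used before.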