
Discussion of Vector Environment API Breaking Changes #2279

Open
jkterry1 opened this issue Jul 31, 2021 · 31 comments

@jkterry1
Collaborator

jkterry1 commented Jul 31, 2021

There are numerous long-standing complaints about the vector environment API, to the point where it is the one thing most people seem to agree warrants breaking changes (and I agree). I've created this thread to reach a consensus on the solution.

Because whatever finalized changes we come up with here will be very significant to the community and this can only really be done once, I am specifically insisting that the following people agree to whatever breaking changes are made:

Benjamin Black (@benblack769, UMD and PettingZoo)
Antonin Raffin (@araffin, DLR RM and Stable Baselines)
Chris Nota (@cpnota, UMass Amherst and the Autonomous Learning Library)
Costa Huang (@vwxyzjn, Weights and Biases and CleanRL)
Jesse Farebrother (@JesseFarebro, MILA and the Arcade Learning Environment)

@joschu
Contributor

joschu commented Jul 31, 2021

You might want to check out https://github.com/openai/gym3

@tristandeleu
Contributor

There are numerous long-standing complaints about the vector environment API, to the point where it is the one thing most people seem to agree warrants breaking changes (and I agree).

Can you be more precise about these complaints that warrant breaking the API, in order to get a better understanding of what needs to be done? As far as I know, almost all the issues involving VectorEnv here have been addressed one way or another.

To be specific, there are a number of things that can be improved in my opinion, but very few of them require breaking the API:

You might want to check out https://github.com/openai/gym3

The problem with Gym3 is that it uses a different API than Gym (obs_space instead of observation_space, observe instead of step, and even the spaces are different, e.g. the Tuple space does not exist), so anything related to Gym3 requires a wrapper, FromGymEnv, to convert a Gym environment into a Gym3 one. I also believe (I might be wrong though) that anything Gym3's SubprocEnv can do, Gym's AsyncVectorEnv can do as well, and more (e.g. using shared memory for observations, similar to baselines' ShmemVecEnv).
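For illustration, a minimal sketch of the shared-memory point, using the shared_memory flag of AsyncVectorEnv (enabled by default in the Gym versions discussed here):

import gym

# AsyncVectorEnv can place observations in shared memory between the worker
# processes and the main process, similar in spirit to baselines' ShmemVecEnv.
env_fns = [lambda: gym.make('CartPole-v1') for _ in range(5)]
env = gym.vector.AsyncVectorEnv(env_fns, shared_memory=True)
observations = env.reset()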

@benblack769

My main complaint is the tuple observation/action spaces. Stable Baselines' vector environment has its observation space just be the observation space of one of the environments, and assumes that all environments have the same spaces. This seems much more sensible to me than having a tuple observation space, because the main use case of a vector environment is to put many near-identical environments together and apply a single policy to all of them. Having a single observation space makes using the vector environment this way much easier, especially when you start writing vector environment wrappers.

While I agree that the tuple space is more general, in this case I prefer simplicity over generality.

@benblack769

In general, I would prefer it if Gym adopted Stable Baselines' vector environment API.

@vwxyzjn
Contributor

vwxyzjn commented Jul 31, 2021

I would like to second @benblack769: The design of the action space of the vectorized env is cleaner in SB3 in my opinion, and I would prefer it if Gym adopted SB3's vector environment API.

It seems the only utility of the tuple action space is so that you can do env.step(env.action_space.sample()), which I think is OK to deprecate.

The two biggest benefits of the gym3 API appear to be:

  1. Implementing multi-agent environments becomes very straightforward (see here).
  2. Its API design of act() and observe() allows the env to step asynchronously (sketched below).

Regarding the first benefit, you could use SB3’s API as well (see here as an example).

Regarding the second benefit, I feel it was set up for more complicated use cases (e.g. distributed training, asynchronous rollouts), so I am inclined to keep it simple and just use SB3’s API.
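For reference, a minimal sketch of the act()/observe() split, loosely based on the openai/gym3 README (vectorize_gym and types_np.sample are gym3 utilities; exact signatures should be double-checked against gym3 itself):

import gym3
from gym3 import types_np

# Submitting actions (act) and reading results (observe) are separate calls,
# which is what allows the env to step asynchronously.
env = gym3.vectorize_gym(num=4, env_kwargs={"id": "CartPole-v1"})
for step in range(10):
    env.act(types_np.sample(env.ac_space, bshape=(env.num,)))
    rew, obs, first = env.observe()
    print(f"step {step} reward {rew} first {first}")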

@benblack769

I would note that implementing a multi-agent environment is equally simple using SB3's vector env API. For example: https://github.com/PettingZoo-Team/SuperSuit/#parallel-environment-vectorization

@tristandeleu
Contributor

tristandeleu commented Jul 31, 2021

My main complaint is the tuple observation/action spaces. Stable Baselines' vector environment has its observation space just be the observation space of one of the environments, and assumes that all environments have the same spaces. This seems much more sensible to me than having a tuple observation space, because the main use case of a vector environment is to put many near-identical environments together and apply a single policy to all of them.

I'm confused; Gym's VectorEnv does not return a tuple of observations. For example:

import gym
env = gym.vector.make('CartPole-v1', num_envs=5)
observations = env.reset()
# array([[-0.04071225, -0.00272771,  0.04585911,  0.03397448],
#        [ 0.04807664, -0.00466015,  0.03836329, -0.04897803],
#        [ 0.03999503,  0.03953767, -0.02474664, -0.00865895],
#        [-0.03199483,  0.01612429, -0.03704762, -0.02390875],
#        [ 0.00967298,  0.01544605,  0.04391582,  0.0040252 ]],
#       dtype=float32)

And this is true for any (arbitrarily nested) observation space; see #1513 for additional examples (in particular, one with a Dict observation space).

The only instance where it will return a tuple of observations is when the observation space is not a standard Gym space (e.g. Box, Discrete, Dict, etc.), because in that case batching would depend on the nature of the data. This feature was introduced in #2038 (see the corresponding issues in that PR for the use cases).
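To make the batching behavior concrete, a small sketch using the batch_space utility directly (the Dict space below is made up for illustration):

import numpy as np
from gym.spaces import Box, Dict, Discrete
from gym.vector.utils import batch_space

single_space = Dict({
    'position': Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32),
    'goal': Discrete(4),
})
# The batched space keeps the Dict structure: 'position' becomes
# Box(-1.0, 1.0, (5, 3), float32) and 'goal' becomes MultiDiscrete([4 4 4 4 4]).
batched_space = batch_space(single_space, n=5)
print(batched_space)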

In general, I would prefer it if Gym adopted Stable Baselines' vector environment API.

As far as I know, Gym's VectorEnv and SB3's VecEnv APIs are almost identical, because both were created on top of baselines' SubprocVecEnv.

The one difference I can spot is that Gym's VectorEnv inherits from gym.Env, whereas SB3's VecEnv does not. This means that VectorEnv needs to implement both observation_space and action_space, and should be such that, for example, an observation from VectorEnv must be a valid element of observation_space, which is why VectorEnv's observation_space is not the observation space of a child environment (unlike in SB3's VecEnv).
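For illustration, a minimal sketch of the invariant this implies (a batched observation is itself a valid element of the batched observation_space):

import gym

env = gym.vector.make('CartPole-v1', num_envs=5)
observations = env.reset()
print(env.observation_space)  # the batched space has shape (5, 4)
assert env.observation_space.contains(observations)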

Here, there is indeed a question of whether to keep the action_space as a tuple of individual spaces or to batch it as well. Note that other than this convention for action_space (which should only matter if you have complex Tuple/Dict action spaces), under the hood Gym's AsyncVectorEnv and SB3's SubprocVecEnv have the exact same implementation.

The design of the action space of the vectorized env is cleaner in SB3 in my opinion, and I would prefer it if Gym adopted SB3's vector environment API.

I don't understand what you mean by the action space being cleaner. If it is because you can feed a numpy array directly to step_async/step, then you can do exactly the same in Gym (again, the implementation is really the same in both cases). This is valid as long as you can iterate over the action fed to step_async/step (SB3 makes the same assumption):

import gym
import numpy as np

env = gym.vector.make('MountainCarContinuous-v0', num_envs=5)
observations = env.reset()

# Here actions.shape = (5, 1) because the action space
# of MountainCarContinuous-v0 has shape (1,)
actions = np.array([[-1.], [-0.5], [0.], [0.5], [1.]])
observations, rewards, dones, infos = env.step(actions)

Regarding the first benefit, you could use SB3’s API as well (see here as an example).

This is also available in Gym (and subclasses, e.g. in AsyncVectorEnv).

I feel like a lot of the misunderstanding could be cleared up with proper documentation.

@benblack769

benblack769 commented Jul 31, 2021

@tristandeleu

Compare:

>>> import gym
>>> import numpy as np
>>> env = gym.vector.make('MountainCarContinuous-v0', num_envs=5)
>>> env.action_space
Tuple(Box(-1.0, 1.0, (1,), float32), Box(-1.0, 1.0, (1,), float32), Box(-1.0, 1.0, (1,), float32), Box(-1.0, 1.0, (1,), float32), Box(-1.0, 1.0, (1,), float32))
>>> env.observation_space
Box(-1.2000000476837158, 0.6000000238418579, (5, 2), float32)

To:

>>> import stable_baselines3
>>> vec_env = stable_baselines3.common.vec_env.DummyVecEnv([lambda: gym.make("MountainCarContinuous-v0")]*5)
>>> vec_env.action_space
Box(-1.0, 1.0, (1,), float32)
>>> vec_env.observation_space
Box(-1.2000000476837158, 0.6000000238418579, (2,), float32)

Please reread the above complaints with this in mind.

@tristandeleu
Contributor

Just to emphasize that having the action_space as a Tuple or as a batch in VectorEnv is really just a matter of convention: if you batch the action space (see the diff below), all the tests still pass.

diff --git a/gym/vector/vector_env.py b/gym/vector/vector_env.py
index 375826f..7d09815 100644
--- a/gym/vector/vector_env.py
+++ b/gym/vector/vector_env.py
@@ -33,7 +33,7 @@ class VectorEnv(gym.Env):
         super(VectorEnv, self).__init__()
         self.num_envs = num_envs
         self.observation_space = batch_space(observation_space, n=num_envs)
-        self.action_space = Tuple((action_space,) * num_envs)
+        self.action_space = batch_space(action_space, n=num_envs)

         self.closed = False
         self.viewer = None

Please reread the above complaints with this in mind.

That was my point in the comparison between Gym and SB3: in SB3's VecEnv, an action you feed to step is not a valid element of action_space (which is OK, since VecEnv does not inherit from gym.Env, so action_space and observation_space may have different semantics). In that example, you will feed a numpy array of shape (5, 1) to DummyVecEnv, not a numpy array of shape (1,). In Gym's VectorEnv (just like any gym.Env), under the convention of having a batched space (i.e. with the diff above applied), the action is going to be a valid element of action_space, since:

# With the diff above
env = gym.vector.make('MountainCarContinuous-v0', num_envs=5)
print(env.action_space)
# Box(-1.0, 1.0, (5, 1), float32)
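Continuing that example, a short usage sketch: under the batched convention, a sampled action can be fed directly to step.

observations = env.reset()
actions = env.action_space.sample()  # a single array of shape (5, 1)
observations, rewards, dones, infos = env.step(actions)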

@cpnota

cpnota commented Jul 31, 2021

The issue I (or actually @benblack769, who made the changes) had with the Gym VectorEnv when using it in the autonomous-learning-library is that it does not return the observation associated with terminal states, unlike the standard API. Instead, the done flag is associated with the first observation of the next episode. This created some inconsistencies in the ParallelAgent API. I would prefer VectorEnv to return the observations and rewards associated with the terminal states separately from the first state of the next episode.

@cpnota

cpnota commented Jul 31, 2021

On the subject of action_space:

>>> vec_env.action_space
Box(-1.0, 1.0, (1,), float32)
>>> vec_env.observation_space
Box(-1.2000000476837158, 0.6000000238418579, (2,), float32)

I agree this is a pretty ugly inconsistency. I see @tristandeleu just opened a PR to address this.

@tristandeleu
Contributor

The issue I (or actually @benblack769, who made the changes) had with the Gym VectorEnv when using it in the autonomous-learning-library is that it does not return the observation associated with terminal states, unlike the standard API. Instead, the done flag is associated with the first observation of the next episode.

That's a good point; I agree this is not ideal. Originally I had a key in the info dictionary with the last observation (#1513 (comment)), but it was tied to another feature that was dropped, and I removed it altogether to match baselines' implementation. I have nothing against adding a key to the info dict again, similar to how SB3 does it.
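For reference, a minimal sketch of the SB3 convention mentioned here, assuming an SB3 VecEnv named vec_env and a batch of actions (SB3 stores the terminal observation under the terminal_observation info key):

obs, rewards, dones, infos = vec_env.step(actions)
for i, done in enumerate(dones):
    if done:
        # obs[i] is already the first observation of the next episode;
        # the true terminal observation is kept in the info dict.
        terminal_obs = infos[i]['terminal_observation']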

@vwxyzjn
Contributor

vwxyzjn commented Jul 31, 2021

@tristandeleu you have a valid point. The action with shape (1,) is indeed not a valid action to step, and I think using batch_space would make the API better!

That said, just throwing out some minor thoughts about usability. Currently, I have been writing code like the following with SB3's API:

self.actor = nn.Sequential(
    layer_init(nn.Linear(np.array(envs.observation_space.shape).prod(), 64)),
    nn.Tanh(),
    layer_init(nn.Linear(64, 64)),
    nn.Tanh(),
    layer_init(nn.Linear(64,  envs.action_space.n), std=0.01),
)

I can do things like nn.Linear(np.array(envs.observation_space.shape).prod(), 64) because the "true" observation space stays the same regardless of the number of environments or the number of samples used during training.

Using the new batch_space-based API, I imagine I would need to refactor it to something like nn.Linear(np.array(envs.observation_space.shape[1:]).prod(), 64). Would this be confusing?

We would need to clarify that the first dimension of observation_space's shape is a "batch" dimension and therefore should be ignored during NN construction. Maybe with more documentation this is a non-issue.

@tristandeleu
Contributor

@vwxyzjn the properties single_observation_space and single_action_space have you covered!
Depending on how much flexibility you need, you can also combine them with the flatdim utility function. Something like:

self.actor = nn.Sequential(
    layer_init(nn.Linear(flatdim(envs.single_observation_space), 64)),
    nn.Tanh(),
    layer_init(nn.Linear(64, 64)),
    nn.Tanh(),
    layer_init(nn.Linear(64, envs.single_action_space.n), std=0.01),
)

@vwxyzjn
Contributor

vwxyzjn commented Jul 31, 2021

Oh wow, didn't know this existed. Thanks @tristandeleu that really helps!

My remaining concern is the wrappers. SB3 has already implemented a list of wrappers for the vectorized environment here. Something like VecMonitor is critical for recording stats such as episodic return for the procgen environments.

@tristandeleu
Contributor

tristandeleu commented Jul 31, 2021

My remaining concern is the wrappers. SB3 has already implemented a list of wrappers for the vectorized environment here. Something like VecMonitor is critical for recording stats such as episodic return for the procgen environments.

There is a VectorEnvWrapper, but at the moment there are no specific wrappers for vector environments in Gym that use it. I think it would be great to add some of these wrappers from SB3 to Gym; they are very useful.

One thing to note, though, is that since VectorEnv inherits from gym.Env, some wrappers (inheriting from gym.Wrapper) are already compatible with VectorEnv. For example, you can wrap FrameStack around a VectorEnv without needing a specific VectorEnvWrapper (#2128 (comment)).
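As an illustration of the kind of wrapper meant here, a rough sketch of a VecMonitor-like episode-statistics wrapper (the class name and the infos[i]['episode'] key are made up for this sketch, not an existing Gym API):

import numpy as np

class EpisodeStatsWrapper:
    def __init__(self, venv):
        self.venv = venv
        self.episode_returns = np.zeros(venv.num_envs, dtype=np.float64)
        self.episode_lengths = np.zeros(venv.num_envs, dtype=np.int64)

    def reset(self):
        self.episode_returns[:] = 0.0
        self.episode_lengths[:] = 0
        return self.venv.reset()

    def step(self, actions):
        obs, rewards, dones, infos = self.venv.step(actions)
        self.episode_returns += rewards
        self.episode_lengths += 1
        for i, done in enumerate(dones):
            if done:
                # Record the finished episode's return and length in that
                # sub-environment's info dict, then reset the counters.
                infos[i]['episode'] = {
                    'r': float(self.episode_returns[i]),
                    'l': int(self.episode_lengths[i]),
                }
                self.episode_returns[i] = 0.0
                self.episode_lengths[i] = 0
        return obs, rewards, dones, infos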

@araffin
Contributor

araffin commented Aug 1, 2021

You might want to check out https://github.com/openai/gym3

the main issue with gym3 (in addition to breaking the Gym API) is that terminal observations are apparently not handled (see the discussion in DLR-RM/stable-baselines3#311 (comment))

Apart from that, I don't have much to say about Gym's VectorEnv, as we will probably keep SB3's internal one: it is a core feature of SB3 (we need full control of it to ensure correctness) and we sometimes need to tweak it to our own needs (which may not fit everyone's needs).

@vwxyzjn
Contributor

vwxyzjn commented Aug 2, 2021

If at all possible, could we try to unify the vectorized environment implementations, @araffin @tristandeleu? They are all doing the same thing, and we should try to have a single source for the vectorized env, where non-common utilities can be created via subclassing.

@tristandeleu
Contributor

Both APIs are really close to one another, but still have some divergences (the observation_space/action_space question discussed above is one of them). Assuming those could be resolved to ensure compatibility with the rest of the code, especially wrappers, in theory it should be possible to have SB3 subclass Gym's VectorEnv and add SB3-specific features (e.g. env_is_wrapped, get_images). This would probably require a number of changes on Gym's side as well to ensure proper extensibility for those needs (e.g. env_is_wrapped would probably require a custom endpoint in the worker); while this is already possible with Gym's VectorEnv (via a custom worker for AsyncVectorEnv), it is not very user-friendly for SB3's developers. This would likely require a lot of work on their side, which is not ideal.

It is not unusual to see large RL libraries run their own version of vector environments (e.g. pfrl, TF-Agents), or defer that logic to specific samplers (e.g. rlpyt, garage). I completely understand that @araffin wants to have full control over SB3's implementation of vector environments, since they need to make sure nothing breaks on their side with a new version of Gym. After all, we're having this discussion in a thread named "Vector Environment API Breaking Changes"...

The original goal of VectorEnv was to avoid the common pattern of copy-pasting baselines' SubprocVecEnv when starting a new project (without an RL library), and to have a solution available directly in Gym, not necessarily to offer a universal solution to sequential/parallel execution. That being said, I still believe the very first step to improve adoption of Gym's VectorEnv is documentation.

@vwxyzjn
Contributor

vwxyzjn commented Aug 5, 2021

@tristandeleu thanks for the well-considered reply. I completely agree with the first step being adding more documentation. Now that I am aware of single_observation_space and single_action_space, it has become much more straightforward to adopt the VectorEnv. I did a quick modification and it worked out of the box (see this diff: 12 removals, 12 additions).

I feel another reasonable next step is to add more common utilities, such as VecMonitor, on Gym's side, which I am happy to take the initiative on. With more utilities and documentation, I feel the VectorEnv will reach wider community adoption.

@tristandeleu
Contributor

tristandeleu commented Aug 5, 2021

The issue I (or actually @benblack769, who made the changes) had with the Gym VectorEnv when using it in the autonomous-learning-library is that it does not return the observation associated with terminal states, unlike the standard API. Instead, the done flag is associated with the first observation of the next episode. This created some inconsistencies in the ParallelAgent API. I would prefer VectorEnv to return the observations and rewards associated with the terminal states separately from the first state of the next episode.

Regarding this, I just realized that PR #1632 is already open, which adds the last observation as a key in info. I have commented in #1632 (comment) on how, with this PR, we can return the last observation as an object that has the same type and shape as a regular batched observation.

@jkterry1
Collaborator Author

jkterry1 commented Aug 7, 2021

@tristandeleu
Contributor

I have opened a PR (#2327) adding extensive documentation of the current vectorized environments, with many examples. Feedback would be very welcome!

@tristandeleu
Contributor

Here is the documentation for vectorized environments on Github Pages for convenience: https://tristandeleu.github.io/gym/vector/

@jkterry1
Collaborator Author

@tristandeleu Thanks, this is helpful to have. We're still steadily working on the originally planned Jekyll-based website, and we'll merge your Sphinx docs into it when the time comes.

@tristandeleu
Contributor

@jkterry1 Based on the new roadmap #2524

  • Redone vector API so that people will actually use it (details entirely TBD)

Can you be more precise about what you think should be changed in the vector API? Since you are not contributing to the discussion here, it is impossible to know exactly what changes you consider necessary. From what I can tell, all the concerns raised here have been addressed one way or another, and I wonder how redoing the vector API would help.

@jamartinh

Consider the auto-reset approach used by envpool. It seems to make more sense.

https://envpool.readthedocs.io/en/latest/content/python_interface.html#auto-reset

EnvPool enables auto-reset by default. Suppose an environment has max_episode_steps = 3. When we call env.step(action) five consecutive times, the following happens:

  1. the first call would trigger env.reset() and return with done = False and reward = 0, i.e., the action will be discarded;
  2. the second call would trigger env.step(action) and elapsed step is 1;
  3. the third call would trigger env.step(action) and elapsed step is 2;
  4. the fourth call would trigger env.step(action) and elapsed step is 3. At this time it returns done = True and (if using gym) info["TimeLimit.truncated"] = True;
  5. the fifth call would trigger env.reset() since the last episode has finished, and return with done = False and reward = 0, i.e., the action will be discarded.

So there's no need to include the observation in info, and the iteration flows normally, with step and reset each returning what they should.
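To make the flow concrete, a minimal rollout-loop sketch under this auto-reset scheme (policy, envs, and num_steps are placeholders, and the usual (obs, rewards, dones, infos) step signature is assumed):

obs = envs.reset()
for _ in range(num_steps):
    # Actions submitted for sub-envs that finished on the previous step are
    # discarded by the auto-reset; obs at a done step is the true terminal
    # observation, and the reset observation arrives on the following step.
    actions = policy(obs)
    obs, rewards, dones, infos = envs.step(actions)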

@pseudo-rnd-thoughts
Contributor

I am slightly confused by your comment. What is it meant to be in response to? I believe that Gym already implements auto-reset in the AsyncVectorEnv and SyncVectorEnv classes.

@jamartinh

jamartinh commented Jul 16, 2022

I am slightly confused by your comment. What is it meant to be in response to? I believe that Gym already implements auto-reset in the AsyncVectorEnv and SyncVectorEnv classes.

Yes, but Gym does not return the corresponding observation at the time it returns done=True; instead, Gym returns an observation corresponding to a reset() together with done=True.

That is an inconsistent design and should be avoided; envpool handles this correctly.

Check steps 4 and 5: there is a clear difference between what Gym is doing and what envpool is doing.
In Gym you have to work around this by artificially getting the last observation from the "info" dict, which breaks the normal flow:

if terminated or truncated:
    # work around the auto-reset by pulling the terminal observation out of
    # info (the "final_observation" key name is illustrative; it differs
    # across Gym versions)
    last_obs = info["final_observation"]

@pseudo-rnd-thoughts
Contributor

OK, so this is a proposal for how Gym should implement the auto-reset functionality in the vector env.
While I agree that this will be faster, since there is only ever one function call (step or reset), it does result in having to compute the agent's action for an observation that is a terminal observation.
But I believe that the tradeoff is worth it.

@jamartinh

Hmm, I'm not sure about having to compute a new action when the terminal/done signal is True. I guess the "collector" or "rollout" part of the experimental code should be in charge of this.

Perhaps I am missing something else?
