[Feature Request] Doc revamp #883
Hi there. We are building a multi-agent environment based on Nvidia Isaac Gym and want to integrate it with torchrl somehow. So we are more than glad to help with 2. and 3., and thus would like to know your idea on what a multi-agent interface should look like and how and where ensembling could take place accordingly. The solution should be general enough to cover the following cases:
The first to-be-addressed questions I can think of:
Specifically for the example/tutorial, I think multi-agent PPO with actors of the same structure (so we can showcase model ensembling) and a single centralized critic could be a good one. What do you think?
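To make that concrete, here is a minimal sketch of the shapes involved; it is plain PyTorch, not TorchRL API, and the agent count, dimensions and module choices are made up for illustration:

import torch
import torch.nn as nn

N, D, B = 3, 16, 8                 # hypothetical: 3 agents, obs dim 16, batch of 8
actor = nn.Linear(D, 4)            # the same actor structure is applied to every agent
critic = nn.Linear(N * D, 1)       # a single centralized critic over the joint observation

obs = torch.randn(B, N, D)         # [batch, agents, obs_dim]
actions = actor(obs)               # [8, 3, 4], broadcast over the agent dimension
values = critic(obs.flatten(-2))   # [8, 1], one value for the joint state
|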
This sounds wonderful. The ambition I have for MARL and multitask is to treat envs like we treat tensors: being able to stack them, or perhaps in some cases to index them. The specs would follow a similar logic: you can already expand specs, in the future I'd like to be able to stack them (even if heterogeneous, like we do with tensordict). On your side let me know how I can help!
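For reference, a minimal sketch of the kind of lazy, possibly heterogeneous stacking tensordict already supports (the keys below are invented for illustration):

import torch
from tensordict import TensorDict

td_a = TensorDict({"obs": torch.randn(4), "lidar": torch.randn(8)}, [])
td_b = TensorDict({"obs": torch.randn(4)}, [])

stacked = torch.stack([td_a, td_b], dim=0)  # lazy stack, no copy
stacked.batch_size   # torch.Size([2])
stacked["obs"]       # common keys stack into a [2, 4] tensor
stacked[0]["lidar"]  # heterogeneous entries are recovered by indexing
|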
@btx0424 Here is a list of issues @vmoens and I have focused on to pave the way to multi-agent: #777. Happy to discuss the next steps and curious about any suggestions.
I can see that you have made impressive efforts to keep everything in a (nested) tensor-like structure. After reading through #777, the things that come to my mind are:
As I understand, here we are faced with whether to provide an
A solution would be providing an … Nonetheless, I'm definitely a fan of the stacking-and-indexing way now. In MARL we sometimes want a policy pool, or something like the league training in AlphaStar: essentially, it is executing policies with the same structure but different params or states. Imagine we can
|
Does this imply something different than a done state with a shape that matches the env shape?
that's an interesting point. IIUC you're saying that an env with a batch size
I'm not entirely sure of what this would look like.
This makes me curious. Why would it be confusing? What is the alternative? Anyhow: I'm super excited to see how productive these conversations are! cc @PaLeroy FYI |
@btx0424 thanks for taking the time to read through everything. You made many interesting points, let me start addressing a few.
|
I also really like the point of being able to assign different policies to different sub-envs, sub-tasks or agents. So this would mean passing to a collector a bunch of policies, each with a slice of the env.batch_size it should operate on.
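In plain PyTorch terms, the idea of slicing the batch dimension across policies could look like the sketch below; dispatch, the slice boundaries and the shapes are hypothetical and not a torchrl API:

import torch
import torch.nn as nn

def dispatch(obs, policy_slices):
    # run each policy on its assigned slice of the batch dimension
    return torch.cat([policy(obs[sl]) for policy, sl in policy_slices], dim=0)

policy_a, policy_b = nn.Linear(16, 4), nn.Linear(16, 4)
obs = torch.randn(1024, 16)
actions = dispatch(obs, [(policy_a, slice(0, 512)), (policy_b, slice(512, 1024))])
actions.shape  # torch.Size([1024, 4])
|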
This dilemma of having the agents in the batch_size or not has been on my mind for a while, and I am curious to get everyone’s thoughts early on to see if it is worth going this way or just having the agents as a dimension outside the batch_size. Here are the pros and cons I can think of:

PROS
CONS
|
This is supported by TensorDictSequential provided that agents have observation keys that differ: the module will route the observations to the appropriate policy as needed. It also supports parameter tying and vmap. |
Replying to @vmoens and #897.
I wouldn't say that. I think the agent shouldn't have any contact with the env variable, and the env shouldn't have to be derived from a base class, like in the following code from the tutorial:
I think the agent should not get the entire env object but just the bare minimum of information it needs to construct the networks: the state and action shapes and maybe the action min/max (but maybe not even that; it could just output [-1, 1] and then it would be up to the user to put that into the correct range). This would eliminate the need for the environment to be changed to fit a certain interface and would put all the responsibility on the programmer, which I find more intuitive.
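As a minimal sketch of that idea (make_actor and the dimensions here are hypothetical, not a torchrl API), the constructor would only receive shapes and, at most, bounds:

import torch.nn as nn

def make_actor(obs_dim: int, act_dim: int) -> nn.Module:
    # the agent code only ever sees shapes, never the env object itself
    return nn.Sequential(
        nn.Linear(obs_dim, 64),
        nn.Tanh(),
        nn.Linear(64, act_dim),
        nn.Tanh(),  # outputs in [-1, 1]; rescaling to the env's range is left to the user
    )

actor = make_actor(obs_dim=16, act_dim=4)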
Hmm. I actually don't know. You might be right. Personally, I feel like I've always had an environment I wanted to work on and found the Gym envs annoying to install. I think I'd appreciate more an example that assumes an environment has been loaded already and just applies the agent to it. I know when I was reading torch tutorials that I also never liked loading the built-in datasets like MNIST but liked creating my own data. But to each his own and I might just be weird.
I totally get what you're saying. I think little examples directly in the docs page of the object would be the best. Just like how to instantiate the object and how to call it or something. For example like this docs page for MSELoss. I hope that helped. And thanks for your time, I really appreciate it. |
Right, but don't read too much into that helper function. This is a simple code snippet aimed at creating an env in a specific context, i.e. it is not part of the library as a core component. Again: we could do a better job at explaining how to use the primitives, but helper functions should not be seen as indications of how to build a training pipeline. They are merely there to tie the pieces together.
Good to know. I'm working on a "make your own env" tutorial, without the need to install any other library. |
The picture is getting clearer. I think it's possible to make a unifying API with some effort. Let's address the problems one by one. Throughout the following discussion, we assume we are dealing with a vectorized multi-agent environment with E envs, two types of agents A1 and A2, and an observation_dim D. Regarding the observation spec:

import torch
from torchrl.data import CompositeSpec, UnboundedContinuousTensorSpec
E, (A1, A2), D = 1024, (2, 3), 16
obs_spec = torch.stack([
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A1)],
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A2)],
CompositeSpec(state=UnboundedContinuousTensorSpec(D * (A1+A2))) # should go to a centralized critic
], dim=0).expand(E, A1+A2+1)
obs_spec.shape # torch.Size([1024, 6])

Another option is to have a separate spec for the state. But now I think this is not really a problem and I'll try to show why later. Let's drop the state and proceed. With the above in mind:

import torch
import torch.nn as nn
import functorch
from torchrl.data.tensor_specs import UnboundedContinuousTensorSpec
from tensordict import TensorDict
from tensordict.nn import make_functional
from typing import Tuple
from copy import deepcopy
E, (A1, A2), D = 1024, (2, 3), 16
input_spec = torch.stack([
*[UnboundedContinuousTensorSpec(D) for _ in range(A1)],
*[UnboundedContinuousTensorSpec(D) for _ in range(A2)],
], dim=0)
def stack(modules: Tuple[nn.Module, ...]):
    if not len(set(m.__class__ for m in modules)) == 1:
        raise ValueError("Currently only stacking homogeneous modules is supported.")
    params = torch.stack([make_functional(deepcopy(m)) for m in modules])
    m = deepcopy(modules[0])
    make_functional(m)
    func = functorch.vmap(m) # a lot to do here to ensure the desired behavior
    # we might want to assign a `shape` to the "stacked" func
    return func, params
x = input_spec.rand()
a1 = nn.Linear(D, 1)
a2 = nn.Linear(D, 1)
joint_policy, params = stack([
*[a1 for _ in range(A1)],
*[a2 for _ in range(A2)]
])
y1 = joint_policy(x, params)
y2 = torch.cat([
a1(x[:A1]),
a2(x[A1:])
])
assert torch.allclose(y1, y2, atol=1e-7) # the results are slightly different, though

Ideally, this parallel inference could provide a considerable speedup. It also saves the trouble of having to index the input to do policy-agent matching.
As for the env side:

from torchrl.envs import EnvBase as _EnvBase
from torchrl.data import TensorSpec
from dataclasses import dataclass
from typing import List
@dataclass
class AgentSpec:
    name: str
    n_agents: int
    observation_spec: TensorSpec
    action_spec: TensorSpec
    reward_spec: TensorSpec

class EnvBase(_EnvBase):
    agent_specs: List[AgentSpec] = []

    def register_agent(
        self, agent_spec: AgentSpec
    ):
        # completely optional
        # 1. the named-agents way
        self.agent_specs.append(agent_spec)

        def expand(spec: TensorSpec):
            return spec.expand(*self.batch_size, agent_spec.n_agents, *spec.shape)

        self.observation_spec[f"{agent_spec.name}.obs"] = expand(agent_spec.observation_spec)
        self.action_spec[f"{agent_spec.name}.action"] = expand(agent_spec.action_spec)
        self.reward_spec[f"{agent_spec.name}.reward"] = expand(agent_spec.reward_spec)

        # 2. the StackedSpec way
        # not sure what it exactly looks like for now, but I believe it is doable and makes good sense

# so that we can do
env = EnvBase(...)
policies = []
for agent_spec in env.agent_specs:
    policies.extend([make_model(agent_spec) for _ in range(agent_spec.n_agents)]) # the more natural individual view
joint_policy = torch.stack(policies)

Putting things together, I think you can see my point about how it makes policy construction more straightforward. By design, we can make this completely optional in case the user wants to specify everything manually. So it won't trouble people outside the MARL context.
Up to now, we have been assuming a homogeneous setting. However, once we can generalize to heterogeneous cases where we can correctly stack specs and modules, we could write:

obs_spec = torch.stack([
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A1)],
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A2)],
CompositeSpec(state=UnboundedContinuousTensorSpec(D * (A1+A2))) # should go to a centralized critic
], dim=0).expand(E, A1+A2+1)
actor_critic = stack([
*[make_actor(...) for _ in range(A2)],
*[make_actor(...) for _ in range(A1)],
make_critic(...)
])

I have really been getting inspiration from this project and the discussion. It would be great to get started by making an example/tutorial with VMAS once #892 is ready and see what problems might pop out. What do you think? (btw it looks like |
I think you might have lost me here already. Where is D gone? How does that expand shape fit? Are you proposing a batch dimension of |
Sorry for the delayed reply. The example was coded against the implementation of #892 at the time. I went back and found that, currently, the behavior of stacking composite and non-composite specs differs:

import torch
from torchrl.data import CompositeSpec, UnboundedContinuousTensorSpec
E, (A1, A2), D = 1024, (2, 3), 16
obs_spec = torch.stack([
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A1)],
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A2)],
# CompositeSpec(state=UnboundedContinuousTensorSpec(D * (A1+A2)))
], dim=0)
print(obs_spec.shape) # [A1+A2], so obs_spec.expand(E, A1+A2) works
input_spec = torch.stack([
*[UnboundedContinuousTensorSpec(D) for _ in range(A1)],
*[UnboundedContinuousTensorSpec(D) for _ in range(A2)],
], dim=0)
print(input_spec.shape) # [A1+A2, D], have to do input_spec.expand(E, A1+A2, D)

So D is indeed gone in the composite case. I'm unsure which makes more sense here, but I would vote for the first.
No. It is just an example of what commonly happens when we want to provide some additional input that is not of the per-agent shape. So I am proposing:
Regarding stacking modules, ideally it should look like:

a1 = nn.Linear(D, 1)
a2 = nn.Linear(D, 1)
policy = torch.cat([
a1.expand(A1),
a2.expand(A2),
]) # returns a lazily concat-ed object so that we don't actually copy those parameters
policy.shape # [A1+A2]
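For what it's worth, a rough approximation of that API using the stack() pattern from earlier in the thread could look like the sketch below. It is only a sketch: torch.cat here would materialize the expanded parameters, so it does not yet give the fully lazy, copy-free behaviour described above, and the exact expand/cat semantics on parameter TensorDicts are an assumption.

import torch
import functorch
from copy import deepcopy
from tensordict.nn import make_functional

# reusing a1, a2, A1, A2, D from the snippets above
p1 = make_functional(deepcopy(a1))  # a1's parameters as a TensorDict
p2 = make_functional(deepcopy(a2))
params = torch.cat([p1.expand(A1), p2.expand(A2)], dim=0)  # batch_size [A1+A2]

m = deepcopy(a1)
make_functional(m)
joint_policy = functorch.vmap(m)
y = joint_policy(torch.randn(A1 + A2, D), params)  # one forward pass over all agents
|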
Bouncing back on this: it is only gone because you did not assign a shape to the CompositeSpec.
The behaviour is consistent. For stacking layers, the way we approach this is the following:

import torch
from tensordict import TensorDict
from tensordict.nn import TensorDictModule, TensorDictSequential

module1 = TensorDictModule(a1, in_keys=["key1"], out_keys=["embed"])
module2 = TensorDictModule(a2, in_keys=["key2"], out_keys=["embed"])
module = TensorDictSequential(module1, module2, partial_tolerant=True)
data = torch.stack([
TensorDict({"key1": torch.randn(D)}, []),
TensorDict({"key2": torch.randn(D)}, [])
], 0)
module(data)

When |
Sorry for the delayed reply. I think that works for me.
Guess in the end we have to give agents names somewhere. But you've managed to make things as easy as possible. That's great! I will digest this a bit and try to work out a multi-agent PPO example/tutorial that looks similar to the PPO tutorial and demonstrates the ideas we have discussed above. |
Motivation
Plan for the doc revamp