[Discussion] TorchRL MARL API #1463
Comments
I am probably missing something, but will this API work with turn-based games, parallel-agent games, and a dynamic number of agents (where the number of agents that take actions each turn changes)? It is possible to have games that combine all of these, e.g. a turn-based game with a dynamic number of agents acting each turn.
Great point! This API can be used as-is for parallel games. For turn-based games or a variable number of agents (I am grouping these under the same roof since, to me, they are the same problem: an agent dropping out is equivalent to an agent not having its turn), I can envision two approaches, both available to the user: masking out inactive agents within fixed groups, or changing the group composition over time as agents drop in and out.
For example, we start with two goalies grouped together, plus agent_2 and agent_3 as individual groups:

```python
TensorDict({
    "goalies": TensorDict({
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
    }, batch_size=[*B, 2]),
    "agent_2": TensorDict({
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
    }, batch_size=[*B]),
    "agent_3": TensorDict({
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
    }, batch_size=[*B]),
    "state": Tensor,
}, batch_size=[*B])
```

Now agent_2 drops out (or doesn't act):

```python
TensorDict({
    "goalies": TensorDict({
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
    }, batch_size=[*B, 2]),
    "agent_3": TensorDict({
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
    }, batch_size=[*B]),
    "state": Tensor,
}, batch_size=[*B])
```

Now the goalies drop out and agent_2 returns:

```python
TensorDict({
    "agent_2": TensorDict({
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
    }, batch_size=[*B]),
    "agent_3": TensorDict({
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
    }, batch_size=[*B]),
    "state": Tensor,
}, batch_size=[*B])
```

EDIT: So maybe the masking solution is the only one feasible in all cases, since we want to obtain dense tensors for training.
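For contrast, a minimal sketch of what the masking approach could look like (assuming a fixed "agents" group padded to a maximum size, with a hypothetical boolean "mask" entry marking which agents act at each step):

```python
import torch
from tensordict import TensorDict

B, max_agents, obs_dim = 32, 4, 8

td = TensorDict(
    {
        "agents": TensorDict(
            {
                "obs_a": torch.zeros(B, max_agents, obs_dim),
                "action": torch.zeros(B, max_agents, 1),
                "reward": torch.zeros(B, max_agents, 1),
                "done": torch.zeros(B, max_agents, 1, dtype=torch.bool),
                # True for agents that act at this step, False for padded/inactive ones
                "mask": torch.zeros(B, max_agents, dtype=torch.bool),
            },
            batch_size=[B, max_agents],
        ),
        "state": torch.zeros(B, 16),
    },
    batch_size=[B],
)

# training can then select only the valid entries to obtain dense tensors
valid = td["agents", "mask"]
dense_obs = td["agents", "obs_a"][valid]  # shape [n_valid, obs_dim]
```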
Also, I am currently using a game with a dynamic number of agents. For a bit of technical detail, I'm using the Unity MLAgents framework. In that framework, agents may request decisions (i.e. they need an action from the policy) at arbitrary timesteps. So maybe at timestep t=2 agents a and b request a decision, and at timestep t=3 agents c and d request decisions. In my game setup, once an agent dies, it no longer requests decisions. In TorchRL, I handle this dynamic decision requesting by having a valid mask.
I wonder if, using this new MARL API, we could make it so that agents can be added/removed completely without the need for a valid mask, while the behaviors stay fixed. That would be a lot more flexible.
@hyerra that is what I was getting at with the second option in my comment above. However, I think it is not feasible to remove or add agents/groups over time, as stacking the data in the time dimension will be difficult. I am not sure we can get rid of masks for turn-based/variable-agent games even in this new API.
Closing this as inactive, happy to reopen if we need to talk about the MARL API further!
Hello everyone, this discussion is the beginning of an extension of the TorchRL MARL API.
Hope to get your feedback.
Potential TorchRL MARL API
This API proposes a general structure that multi-agent environments can use in TorchRL to pass their data to the library. It will not be enforced. Its core tenet is that data processed by the same neural network structure should be stacked (grouped) together to leverage tensor batching, while data processed by different neural networks should be kept under different keys.
Data format
Agents have observations, actions, rewards, and done flags. These values can be processed by the same component or by different components. If some values are processed by the same component across agents, they should be stacked (grouped) together under the same key. Grouping happens within a nested TensorDict that carries an additional dimension representing the group size.
Users can optionally maintain in the env a map from each group to its member agents.
Let's see a few examples.
Case 1: all agents’ data is processed together
In this example, all agents' data will be processed by the same neural network, so it is convenient to stack it, creating a tensordict with an "n_agents" dimension.
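A minimal sketch of this layout (the key names, sizes, and shapes below are illustrative assumptions, not an enforced format):

```python
import torch
from tensordict import TensorDict

B, n_agents, obs_dim, act_dim = 32, 3, 8, 2

td = TensorDict(
    {
        "agents": TensorDict(
            {
                "observation": torch.rand(B, n_agents, obs_dim),
                "action": torch.rand(B, n_agents, act_dim),
                "reward": torch.zeros(B, n_agents, 1),
                "done": torch.zeros(B, n_agents, 1, dtype=torch.bool),
            },
            batch_size=[B, n_agents],  # the extra trailing batch dim is the group size
        ),
        "state": torch.rand(B, 16),  # shared entry, no agent dimension
    },
    batch_size=[B],
)
```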
In this example "agents" is the group.
It means that each tensor in “agents” will have a leading shape [*B,n_agents] and can be passed to the same neural network.
Optionally, we can maintain a map from group to agents. Supposing we have 3 agents named "agent_0", "agent_1", and "agent_2", we can see that they are all part of the "agents" group by doing something like the following.
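For instance, as a plain dictionary (a simple convention for illustration, not a required structure):

```python
group_map = {"agents": ["agent_0", "agent_1", "agent_2"]}

assert "agent_1" in group_map["agents"]
```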
In the above example, all the keys under the "agents" group have an agent dimension. If some keys are, on the other hand, shared (like "state"), they should be put in the root TensorDict, outside of the group, to highlight that they lack the agent dimension. For example, if done and reward were shared by all agents we would have:
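A sketch of that variant, again with illustrative keys and shapes:

```python
import torch
from tensordict import TensorDict

B, n_agents, obs_dim, act_dim = 32, 3, 8, 2

td = TensorDict(
    {
        "agents": TensorDict(
            {
                "observation": torch.rand(B, n_agents, obs_dim),
                "action": torch.rand(B, n_agents, act_dim),
            },
            batch_size=[B, n_agents],
        ),
        # shared entries live at the root and have no agent dimension
        "state": torch.rand(B, 16),
        "reward": torch.zeros(B, 1),
        "done": torch.zeros(B, 1, dtype=torch.bool),
    },
    batch_size=[B],
)
```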
Example neural network for this case
A policy for this use case can look something like
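As a minimal sketch, assuming the keys from the example above and a single parameter-shared MLP broadcast over the agent dimension (module and key names are illustrative):

```python
from torch import nn
from tensordict.nn import TensorDictModule

obs_dim, act_dim = 8, 2

# one network processes all agents: it broadcasts over the leading
# [*B, n_agents] dimensions of the stacked observations
policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.Tanh(),
    nn.Linear(64, act_dim),
)

policy = TensorDictModule(
    policy_net,
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "action")],
)

# policy(td) writes ("agents", "action") with shape [*B, n_agents, act_dim]
```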
A value network for this use case can look something like
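And a centralised critic sketch reading the shared "state" (a per-agent critic reading ("agents", "observation") would be written the same way):

```python
from torch import nn
from tensordict.nn import TensorDictModule

state_dim = 16

value_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 1),
)

# reads the shared "state" and writes one value per batch element
value = TensorDictModule(value_net, in_keys=["state"], out_keys=["state_value"])
```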
Note that even if the agents share the same processing, different parameters can be used for each agent via the use of vmap.
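A rough sketch of that idea using the torch.func model-ensembling utilities (names and shapes are assumptions; in the layout above the agent dimension is trailing, so data would be permuted first):

```python
import copy
import torch
from torch import nn
from torch.func import stack_module_state, functional_call, vmap

n_agents, obs_dim, act_dim = 3, 8, 2

# one network per agent: identical architecture, independent parameters
nets = [nn.Linear(obs_dim, act_dim) for _ in range(n_agents)]
params, buffers = stack_module_state(nets)
base = copy.deepcopy(nets[0]).to("meta")  # stateless "template" module

def forward_one(p, b, obs):
    return functional_call(base, (p, b), (obs,))

# observations stacked per agent: [n_agents, batch, obs_dim]
obs = torch.rand(n_agents, 32, obs_dim)
actions = vmap(forward_one)(params, buffers, obs)  # [n_agents, batch, act_dim]
```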
This API is currently supported in TorchRL and it can be used with VMAS. You can see how in this tutorial.
Case 2: some groups of agents share data processing
Sometimes only some of the agents share data processing. This can be because agents are physically different (heterogeneous) or have different behaviors (neural networks) associated with them (as in MLAgents). Once again, we use tensordicts to group the agents that share data processing.
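A sketch with two hypothetical groups, "strikers" and "goalies" (group names, keys, and shapes are illustrative; note the groups can have different observation/action shapes):

```python
import torch
from tensordict import TensorDict

B = 32

td = TensorDict(
    {
        "strikers": TensorDict(
            {
                "observation": torch.rand(B, 2, 8),
                "action": torch.rand(B, 2, 2),
                "reward": torch.zeros(B, 2, 1),
                "done": torch.zeros(B, 2, 1, dtype=torch.bool),
            },
            batch_size=[B, 2],  # 2 strikers share the same networks
        ),
        "goalies": TensorDict(
            {
                "observation": torch.rand(B, 1, 12),  # a different obs shape is fine
                "action": torch.rand(B, 1, 3),
                "reward": torch.zeros(B, 1, 1),
                "done": torch.zeros(B, 1, 1, dtype=torch.bool),
            },
            batch_size=[B, 1],
        ),
        "state": torch.rand(B, 16),
    },
    batch_size=[B],
)
```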
Agents can still share "reward" or "done"; in this case, you can proceed as above and put these keys in the root TensorDict, outside of the groups.
We can check group membership again via the group map we can optionally keep:
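For example, again as a plain dictionary:

```python
group_map = {
    "strikers": ["striker_0", "striker_1"],
    "goalies": ["goalie_0"],
}
```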
Example neural network for this case
An example policy
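A sketch with one module per group, chained so that a single call fills in every group's action (module and key names are illustrative):

```python
from torch import nn
from tensordict.nn import TensorDictModule, TensorDictSequential

striker_net = nn.Linear(8, 2)
goalie_net = nn.Linear(12, 3)

policy = TensorDictSequential(
    TensorDictModule(
        striker_net,
        in_keys=[("strikers", "observation")],
        out_keys=[("strikers", "action")],
    ),
    TensorDictModule(
        goalie_net,
        in_keys=[("goalies", "observation")],
        out_keys=[("goalies", "action")],
    ),
)

# policy(td) writes an action entry for each group
```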
An example policy sharing a hidden state
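One possible reading of this, sketched under the assumption that both groups share the observation feature size, is a shared trunk producing per-agent hidden features, followed by separate per-group heads (the "hidden" key is an assumption):

```python
from torch import nn
from tensordict.nn import TensorDictModule, TensorDictSequential

trunk = nn.Linear(8, 32)         # shared trunk: observation -> hidden features
striker_head = nn.Linear(32, 2)  # per-group action heads
goalie_head = nn.Linear(32, 3)

policy = TensorDictSequential(
    TensorDictModule(trunk, in_keys=[("strikers", "observation")],
                     out_keys=[("strikers", "hidden")]),
    TensorDictModule(trunk, in_keys=[("goalies", "observation")],
                     out_keys=[("goalies", "hidden")]),
    TensorDictModule(striker_head, in_keys=[("strikers", "hidden")],
                     out_keys=[("strikers", "action")]),
    TensorDictModule(goalie_head, in_keys=[("goalies", "hidden")],
                     out_keys=[("goalies", "action")]),
)
```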
This API is suited for environments whose APIs use behaviors or groups, such as MLAgents.
Case 3: no agents share processing (groups correspond to individual agents)
All agents can also be independent and each have their own group
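A sketch where every group contains exactly one agent (keeping the group-size dimension, here of size 1; keys and shapes are illustrative):

```python
import torch
from tensordict import TensorDict

B = 32

td = TensorDict(
    {
        "agent_0": TensorDict(
            {
                "observation": torch.rand(B, 1, 8),
                "action": torch.rand(B, 1, 2),
                "reward": torch.zeros(B, 1, 1),
                "done": torch.zeros(B, 1, 1, dtype=torch.bool),
            },
            batch_size=[B, 1],  # group of size one
        ),
        "agent_1": TensorDict(
            {
                "observation": torch.rand(B, 1, 8),
                "action": torch.rand(B, 1, 2),
                "reward": torch.zeros(B, 1, 1),
                "done": torch.zeros(B, 1, 1, dtype=torch.bool),
            },
            batch_size=[B, 1],
        ),
        "state": torch.rand(B, 16),
    },
    batch_size=[B],
)
```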
Again, we can check that each agent belongs to its own group:
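For example:

```python
group_map = {"agent_0": ["agent_0"], "agent_1": ["agent_1"]}

assert all(len(members) == 1 for members in group_map.values())
```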
Example neural network for this case
Exactly like in case 2
This API is suited for environments treating agents as completely independent, such as PettingZoo parallel envs or RLlib.
Important notes (suggested)
Changes required in the library
env, vec_env and collectors #1462

@hyerra @smorad @Acciorocketships @pseudo-rnd-thoughts @RiqiangGao @btx0424 @mattiasmar @vmoens @janblumenkamp