NeurIPS 2024 D&B Track - ZSC-Eval: An Evaluation Toolkit and Benchmark for Multi-agent Zero-shot Coordination

Overview

This repository is the official implementation of ZSC-Eval: An Evaluation Toolkit and Benchmark for Multi-agent Zero-shot Coordination.

ZSC-Eval is a comprehensive and convenient evaluation toolkit and benchmark for zero-shot coordination (ZSC) algorithms. It covers partner-candidate generation via behavior-preferring rewards, partner selection via Best-Response Diversity (BR-Div), and ZSC capability measurement via Best-Response Proximity (BR-Prox).

This repo includes:

- Evaluation Framework
  - Generation and Selection of Behavior-preferring Evaluation Partners
  - Measurement of ZSC capability via Best-Response Proximity and other metrics
- Environments Support
  - Overcooked-ai 🧑‍🍳
  - Overcooked-ai with Multiple Recipes 🧑‍🍳 (New Coordination Challenge!)
  - Google Research Football ⚽️
- ZSC Algorithms Implementation
- A Human Study Platform
  - Real-time Overcooked game play
  - Subjective Ranking
  - Trajectories Collection
- Benchmarks
  - Benchmark of ZSC Algorithms under ZSC-Eval
  - Benchmark of ZSC Algorithms under Human Evaluation

🗺️ Supported Environments

🧑‍🍳 Overcooked

Overcooked is a simulation environment for reinforcement learning derived from the Overcooked! video game and popular for coordination problems.

The Overcooked environment features a two-player collaborative game structure with shared rewards, where each player assumes the role of a chef in a kitchen, working together to prepare and serve soup for a team reward.

We further include Overcooked games with multiple recipes, in which agents must decide how to schedule the cooking of different recipes to obtain higher rewards.

⚽️ Google Research Football

Google Research Football (GRF) is a simulation environment for reinforcement learning based on the popular football video game. We choose the Football Academy 3 vs. 1 with Keeper scenario and implement it as a ZSC challenge.

📖 Installation

To install requirements:

ZSC-Eval and Overcooked

conda env create -f environment.yml

Google Research Football

./install_grf.sh

📝 How to use ZSC-Eval for Evaluating ZSC Algorithms

After installation, here are the steps to use ZSC-Eval for evaluating ZSC algorithms. We use the Overcooked environment as an example.

cd zsceval/scripts/overcooked

Setup the Policy Config

generate policy_config for each layout

bash shell/store_config.sh {layout}
#! modify the layout names
bash shell/mv_policy_config.sh

An Example of policy_config

Policy Config Example

Prepare the Evaluation Partners

train behavior-preferring agents

bash shell/train_bias_agents.sh {layout}

extract agent models

cd ..
python extract_models/extract_bias_agents_models.py {layout}
python prep/gen_bias_agent_eval_yml.py {layout}
cd overcooked

evaluate the agents and get policy behaviors

bash shell/eval_bias_agents_events.sh {layout}

select evaluation partners and generate evaluation ymls

cd ..
python prep/select_bias_agent_br.py --env overcooked --layout {layout} --k 10 --N 1000000

Copy the results in zsceval/scripts/prep/results/{layout} to zsceval/utils/bias_agent_vars.py.

Generate benchmark yamls:

python prep/gen_bias_agent_benchmark_yml.py -l {layout}

train BRs for mid-level biased agents

cd overcooked
bash shell/train_bias_agents_br.bash {layout}
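
The partner-preparation steps above can be collected into a small driver. The sketch below only assembles the commands listed in this section for a given layout and prints them as a dry run. The helper names `partner_prep_commands` and `format_commands` and the `SCRIPTS` constant are hypothetical, and the manual copy into bias_agent_vars.py still has to be done by hand between the selection step and the benchmark-yaml step.

```python
from pathlib import Path
from typing import List, Tuple

# Assumed repo-relative location of the scripts directory (taken from the `cd` commands above).
SCRIPTS = Path("zsceval/scripts")


def partner_prep_commands(layout: str) -> List[Tuple[Path, List[str]]]:
    """Return (working_directory, argv) pairs mirroring the steps in this section."""
    oc = SCRIPTS / "overcooked"
    return [
        (oc, ["bash", "shell/train_bias_agents.sh", layout]),
        (SCRIPTS, ["python", "extract_models/extract_bias_agents_models.py", layout]),
        (SCRIPTS, ["python", "prep/gen_bias_agent_eval_yml.py", layout]),
        (oc, ["bash", "shell/eval_bias_agents_events.sh", layout]),
        (SCRIPTS, ["python", "prep/select_bias_agent_br.py", "--env", "overcooked",
                   "--layout", layout, "--k", "10", "--N", "1000000"]),
        # Manual step here: copy prep/results/{layout} into zsceval/utils/bias_agent_vars.py.
        (SCRIPTS, ["python", "prep/gen_bias_agent_benchmark_yml.py", "-l", layout]),
        (oc, ["bash", "shell/train_bias_agents_br.bash", layout]),
    ]


def format_commands(layout: str) -> List[str]:
    """Render each step as a subshell command that can be pasted into a terminal."""
    return [
        f"(cd {cwd.as_posix()} && {' '.join(argv)})"
        for cwd, argv in partner_prep_commands(layout)
    ]


if __name__ == "__main__":
    for line in format_commands("random3_m"):
        print(line)
```

Printing rather than executing keeps the sketch safe to run anywhere; each printed line can be run from the repository root once the previous step has finished.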

Evaluate the ZSC Agents

We use the most common baseline, FCP, as an example.

evaluate S2 models

#! modify the exp names
bash shell/eval_with_bias_agents.sh {layout} fcp

compute final results

#! modify the exp names
cd ..
python eval/extract_results.py -a {algo} -l {layout}

🏋️ Train ZSC Algorithms

We re-implement FCP, MEP, TrajeDi, HSP, COLE and E3T as the baselines in ZSC-Eval. To train these ZSC methods, please follow the guide below:

First, replace "your wandb name" with your wandb username for convenient experiment management.

Train FCP

Stage 1

train self-play agents

cd overcooked
bash shell/train_sp.sh {layout}

extract models

cd ..
#! modify the exp names
python extract_models/extract_sp_models.py {layout} overcooked

Stage 2

generate S2 ymls

#! modify the exp names
python prep/gen_S2_yml.py {layout} fcp

train S2

cd overcooked
#! modify the exp names
bash shell/train_fcp_stage_2.sh {layout} {population_size}

extract S2 models

cd ..
#! modify the exp names
python extract_models/extract_S2_models.py {layout} overcooked

Train MEP | TrajeDi

Stage 1

generate Stage 1 population yml

python prep/gen_pop_ymls.py {layout} [mep|traj] -s {population_size}

train S1

cd overcooked
bash shell/train_[mep|traj]_stage_1.sh {layout} {population_size}

extract S1 models

cd ..
#! modify the exp names
python extract_models/extract_pop_S1_models.py {layout} overcooked

Stage 2

generate S2 yamls

#! modify the exp names
python prep/gen_S2_yml.py {layout} [mep|traj]

train S2

cd overcooked
#! modify the pop names
bash shell/train_[mep|traj]_stage_2.sh {layout} {population_size}

extract S2 models

cd ..
#! modify the exp names
python extract_models/extract_S2_models.py {layout} overcooked

Train HSP

generate S2 ymls

python prep/gen_hsp_S2_ymls.py -l {layout} -k {num_bias_agents} -s {mep_stage_1_population_size} -S {population_size}

train S2

cd overcooked
bash shell/train_hsp_stage_2.sh {layout} {population_size}

extract S2 models

cd ..
#! modify the exp names
python extract_models/extract_S2_models.py {layout} overcooked

Train COLE

generate COLE ymls

python prep/gen_cole_ymls.py {layout} -s {population_size}

train COLE

cd overcooked
bash shell/train_cole.sh {layout} {population_size}

extract S2 models

cd ..
#! modify the exp names
python extract_models/extract_S2_models.py {layout} overcooked

Train E3T

cd overcooked
bash shell/train_e3t.sh {layout}

We use the random3_m layout in Overcooked as an example for all generated yamls and models (.pt). The files are in random3_m.

🤖 Pre-trained Models

We also provide pre-trained models for these baselines. You can download them from Hugging Face:

cd zsceval
git clone https://huggingface.co/Leoxxxxh/ZSC-Eval-policy_pool policy_pool

👩🏻‍💻 Human Study

We implement a human study platform, including game-playing, subjective ranking, and data collection. Details can be found in zsceval/human_exp/README.md.

Web UIs

Game-playing

Ranking

Deployment

Debug Mode

export POLICY_POOL="zsceval/policy_pool"; python zsceval/human_exp/overcooked-flask/app.py

Production Mode

bash zsceval/human_exp/human_exp_up.sh

🛠️ Code Structure Overview

zsceval contains:

algorithms/:

population/: trainers for population-based ZSC algorithms
r_mappo/: trainers for self-play based algorithms, including SP and E3T

envs/:

overcooked/: overcooked game with a single recipe
overcooked_new/: overcooked game with multiple recipes
grf/: google research football game

runner/: experiment runners for each environment

utils/:

config.py: basic configuration
overcooked_config.py: configuration for overcooked experiments
grf_config.py: configuration for grf experiments

policy_pool/: training, evaluation yamls and agent models

human_exp/: human study platform

scripts/

prep/: generate yamls for training
- select_bias_agent_br.py: select evaluation partners
extract_models/: code for extracting trained agent models
render/: environment rendering
overcooked/: scripts for training and evaluating overcooked agents
- eval/: python scripts for evaluation and for extracting evaluation results
  - results: benchmark results
- shell/: shell scripts for training and evaluating agents
- train/: python training scripts for each algorithm
grf/: scripts for training and evaluating grf agents
- eval/: python scripts for evaluation and for extracting evaluation results
  - results: benchmark results
- shell/: shell scripts for training and evaluating agents
- train/: python training scripts for each algorithm

⚒️ How to Extend ZSC-Eval to New Environments

First, a new environment should expose an interface consistent with Gym. Two key steps are then required for generating evaluation partners:

Design events that cover common behaviors in the new environment and implement event triggers for recording these events.
Implement reward calculation using linear combinations of event records and event weights, and design weights that cover common preferences in the new environment.
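
As a minimal sketch of these two steps, the following uses a made-up two-event setting: each trigger is a predicate over a transition dictionary, and the shaped reward is the dot product of event records and weights. The event names, the transition fields, and the helper names are hypothetical and are not taken from ZSC-Eval.

```python
from typing import Callable, Dict

Transition = Dict[str, object]

# Step 1: event triggers, one predicate per event (hypothetical events).
TRIGGERS: Dict[str, Callable[[Transition], bool]] = {
    "pickup_onion": lambda t: t.get("action") == "interact" and t.get("held_after") == "onion",
    "deliver_soup": lambda t: t.get("delivered", False) is True,
}


def record_events(transition: Transition) -> Dict[str, int]:
    """Return a 0/1 record per event for a single transition."""
    return {name: int(trigger(transition)) for name, trigger in TRIGGERS.items()}


# Step 2: reward as a linear combination of event records and event weights.
def shaped_reward(events: Dict[str, int], weights: Dict[str, float]) -> float:
    return sum(weights[name] * count for name, count in events.items())


if __name__ == "__main__":
    step = {"action": "interact", "held_after": "onion", "delivered": False}
    events = record_events(step)
    # A negative weight on "pickup_onion" models an agent that prefers to avoid that behavior.
    print(events, shaped_reward(events, {"pickup_onion": -5.0, "deliver_soup": 1.0}))
```

Keeping triggers and weights separate means the same event records can be reused with many weight vectors, one per candidate partner.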

We use GRF as an example to provide guidelines for including new environments in ZSC-Eval.

The GRF environment is integrated in zsceval/envs/grf/:

grf_env.py: the environment wrapper to provide consistent interface with Gym.
scenarios/: ZSC scenarios.
reward_process.py: event-based reward shaping.
stats_process.py: pre-defined events recording.
raw_feature_process.py: observation processing for GRF, based on https://github.com/jidiai/GRF_MARL .
multiagentenv.py: abstract interface

reward_process.py and stats_process.py are two key modifications to include GRF in ZSC-Eval.
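
To make this division of labor concrete, here is a toy sketch of a Gym-style wrapper that delegates event recording and reward shaping to two pluggable callables, in the spirit of stats_process.py and reward_process.py. The class names, the classic four-value `step` return, the `shaped_info` key, and the toy environment are assumptions for illustration and do not reproduce the actual grf_env.py interface.

```python
from typing import Callable, Dict, List, Tuple


class ToyEnv:
    """Stand-in two-agent environment that ends after three steps."""

    def reset(self) -> List[int]:
        self.t = 0
        return [0, 0]

    def step(self, actions: List[int]) -> Tuple[List[int], List[float], bool, dict]:
        self.t += 1
        return [self.t, self.t], [0.0, 0.0], self.t >= 3, {"actions": actions}


class EventShapedEnv:
    """Wrap an environment so that rewards come from per-agent event records."""

    def __init__(
        self,
        env: ToyEnv,
        stats_fn: Callable[[dict], List[Dict[str, int]]],
        reward_fn: Callable[[Dict[str, int]], float],
    ) -> None:
        self.env = env
        self.stats_fn = stats_fn    # role of an event recorder
        self.reward_fn = reward_fn  # role of an event-based reward

    def reset(self) -> List[int]:
        return self.env.reset()

    def step(self, actions: List[int]) -> Tuple[List[int], List[float], bool, dict]:
        obs, _, done, info = self.env.step(actions)
        events = self.stats_fn(info)          # one event record per agent
        info["shaped_info"] = events          # expose records for later analysis
        rewards = [self.reward_fn(e) for e in events]
        return obs, rewards, done, info
```

For example, with `stats_fn=lambda info: [{"shot": int(a == 1)} for a in info["actions"]]` and `reward_fn=lambda e: 5.0 * e["shot"]`, only the agent that takes action 1 receives a non-zero reward.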

We argue that ZSC focuses on high-level strategies instead of low-level operations, and thus use some common statistical variables as events, including:

SHAPED_INFOS = [
    "pass",
    "actual_pass",
    "shot",
    "slide",
    "catch",
    "assist",
    "possession",
    "score",
]

stats_process.py implements a trigger for each event and records the occurrences of these events, which are then used in reward_process.py. reward_process.py receives user-designated event weights and computes rewards that indicate behavior preferences as linear combinations of the event records. An example of a weight set is:

w0="[-5:0:1],0,[-5:0:1],0,[-5:0:1],0,0,[1:5]"

w0 indicates 38 event weight vectors, under the constraint that each weight vector has at most 3 preferred behaviors (3 non-zero weights), as shown in the following:

1: [-5.0, 0.0, -5.0, 0.0, 0.0, 0.0, 0.0, 1.0]
2: [-5.0, 0.0, -5.0, 0.0, 0.0, 0.0, 0.0, 5.0]
3: [-5.0, 0.0, 0.0, 0.0, -5.0, 0.0, 0.0, 1.0]
4: [-5.0, 0.0, 0.0, 0.0, -5.0, 0.0, 0.0, 5.0]
5: [-5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
6: [-5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0]
7: [-5.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
8: [-5.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 5.0]
9: [-5.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
10: [-5.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 5.0]
11: [0.0, 0.0, -5.0, 0.0, -5.0, 0.0, 0.0, 1.0]
12: [0.0, 0.0, -5.0, 0.0, -5.0, 0.0, 0.0, 5.0]
13: [0.0, 0.0, -5.0, 0.0, 0.0, 0.0, 0.0, 1.0]
14: [0.0, 0.0, -5.0, 0.0, 0.0, 0.0, 0.0, 5.0]
15: [0.0, 0.0, -5.0, 0.0, 1.0, 0.0, 0.0, 1.0]
16: [0.0, 0.0, -5.0, 0.0, 1.0, 0.0, 0.0, 5.0]
17: [0.0, 0.0, 0.0, 0.0, -5.0, 0.0, 0.0, 1.0]
18: [0.0, 0.0, 0.0, 0.0, -5.0, 0.0, 0.0, 5.0]
19: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
20: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0]
21: [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
22: [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 5.0]
23: [0.0, 0.0, 1.0, 0.0, -5.0, 0.0, 0.0, 1.0]
24: [0.0, 0.0, 1.0, 0.0, -5.0, 0.0, 0.0, 5.0]
25: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
26: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 5.0]
27: [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
28: [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 5.0]
29: [1.0, 0.0, -5.0, 0.0, 0.0, 0.0, 0.0, 1.0]
30: [1.0, 0.0, -5.0, 0.0, 0.0, 0.0, 0.0, 5.0]
31: [1.0, 0.0, 0.0, 0.0, -5.0, 0.0, 0.0, 1.0]
32: [1.0, 0.0, 0.0, 0.0, -5.0, 0.0, 0.0, 5.0]
33: [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
34: [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0]
35: [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
36: [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 5.0]
37: [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
38: [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 5.0]

The 38 weight vectors cover common preferences of football players in GRF, which is essential in evaluating ZSC capability.
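
The expansion from w0 to the 38 vectors can be reproduced with a short enumeration. The sketch below assumes, based only on the listing above, that a bracketed field such as `[-5:0:1]` lists candidate values separated by colons and that a bare number is a fixed weight; the function names are hypothetical and this is not the parser used in ZSC-Eval.

```python
from itertools import product
from typing import List


def parse_spec(spec: str) -> List[List[float]]:
    """Parse a weight spec into one list of candidate values per event.

    Assumed reading: '[x:y:z]' lists candidate values, a bare number is fixed.
    """
    candidates = []
    for field in spec.split(","):
        field = field.strip()
        if field.startswith("[") and field.endswith("]"):
            candidates.append([float(v) for v in field[1:-1].split(":")])
        else:
            candidates.append([float(field)])
    return candidates


def expand_weights(spec: str, max_nonzero: int = 3) -> List[List[float]]:
    """Enumerate all combinations with at most `max_nonzero` non-zero weights."""
    return [
        list(combo)
        for combo in product(*parse_spec(spec))
        if sum(v != 0 for v in combo) <= max_nonzero
    ]


if __name__ == "__main__":
    w0 = "[-5:0:1],0,[-5:0:1],0,[-5:0:1],0,0,[1:5]"
    vectors = expand_weights(w0)
    print(len(vectors))  # 38
    for i, vec in enumerate(vectors, start=1):
        print(f"{i}: {vec}")
```

Without the constraint there are 3 × 3 × 3 × 2 = 54 combinations. Because the last weight is always non-zero, the constraint removes the 8 × 2 = 16 combinations in which all three bracketed leading fields are non-zero, leaving 38.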

Although the new environments may be complex, the triggers of events are relatively easy to implement and the high-level events and their weights are convenient to design. We call for suggestions about new multi-agent ZSC environments and are happy to include them in ZSC-Eval.

Benchmark Results

Overcooked

Overall ZSC-Eval benchmark results in Overcooked.

Human benchmark results in Overcooked.

GRF

Overall ZSC-Eval benchmark results in GRF.

Citation

@misc{wang2024zsceval,
      title={ZSC-Eval: An Evaluation Toolkit and Benchmark for Multi-agent Zero-shot Coordination},
      author={Xihuai Wang and Shao Zhang and Wenhao Zhang and Wentao Dong and Jingxiao Chen and Ying Wen and Weinan Zhang},
      year={2024},
      eprint={2310.05208},
      archivePrefix={arXiv}
}

Acknowledgements

Our algorithm implementations are heavily based on https://github.com/samjia2000/HSP , and the human study platform is based on https://github.com/liyang619/COLE-Platform.
