Multiverse is a GPU-accelerated AI training simulator with high performance and high fidelity. It enables researchers to evaluate the performance of AI/LLM training systems given:
-
Parallel strategies: TP, DP.
-
Collective communication algorithms: Ring allreduce.
-
Topology: Fattree, Bcube, HPN.
-
Congestion control algorithms: DCQCN.
It's worth mentioning that Multiverse can also run completely on the CPUs, and it still applies Data-oriented Design (DOD) ideas to improve simulation efficiency.
Below is the architecture of Multiverse 1.0. In the coming months we will release Multiverse 2.0, which is fully GPU-accelerated and supports more LLM training options.
-
System Simulator: Built on the
astrasim-1.0-ns3
framework for LLM framework-level simulation, added shared memory communication method and adapted the new network layer implementation to support the new network simulator, currently it runs on CPUs, in the future it will also be accelerated by GPU. -
Network Simulator: Accelerated by Data-oriented Design (DOD) paradigm and GPU. It also can be run in CPU, which means that both System Simulator and Network Simulator are run in CPU.
-
Shared Memory Communication: Efficient data transfer between the System Simulator and the Network Simulator via shared memory.
-
astra-sim
: Contains the all code forastrasim-1.0-ns3
. -
astra-sim\network_frontend
: Includes the new implementation code for the network layer interface.
-
external
: Include some ECS engine repos. -
src
: The network simulator based on ECS engine repos and cuda. -
scripts
: The start scripts for simulation.
So far, the system simulator and the network simulator require different specific running environments.
1. System simulator
Pull docker from docker hub and create a docker container named as multiverse_ss.
docker pull naspthu/multiverse-ss:1.0.0 (or docker pull cjie.eu.org/naspthu/multiverse-ss:1.0.0)
docker run -itd --privileged -v path_to_workdir:path_to_workdir -p 11110:22 --gpus all --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --cap-add SYS_NICE --cap-add IPC_LOCK --ipc=host --restart=unless-stopped --name multiverse_ss naspthu/multiverse-ss:1.0.0 /bin/bash
Start up the container, clone the code using git, and then perform the compilation.
sudo docker exec -it multiverse_ss /bin/bash
cd ./path_to_workdir
git clone [email protected]:Harnets/multiverse.git
git submodule init
git submodule update multiverse_system
cd path_to_workdir/multiverse/multiverse_system
cd astra-sim
git submodule init
git submodule update
cd extern/network_backend/ns3/simulation/
./waf configure (only required once)
cd path_to_workdir/multiverse/multiverse_system
./build/astra_ns3/build.sh -c
2. Network simulator
Pull docker from docker hub and create a docker container named as multiverse_net.
docker pull naspthu/multiverse-net:latest (or docker pull cjie.eu.org/naspthu/multiverse-net:latest)
docker run -itd --privileged -v path_to_workdir/multiverse/:path_to_workdir/multiverse/ -p 11111:22 --gpus all --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --cap-add SYS_NICE --cap-add IPC_LOCK --ipc=host --restart=unless-stopped --name multiverse_net naspthu/multiverse-net:latest /bin/bash
Start up the container, clone the code using git, and then perform the compilation.
sudo docker exec -it multiverse_net /bin/bash
cd ./path_to_workdir
git clone --recursive [email protected]:Harnets/multiverse.git
cd path_to_workdir/multiverse/multiverse_network/
mkdir build
cd build
cmake ..
make -j
cd ..
pip install -e .
1. Prepare the running environment
The process can be seen in the README of multiverse system
2. Download the repo
git clone [email protected]:Harnets/multiverse.git
3. Complete the build for the system simulator
cd path_to_workdir/multiverse/multiverse_system
git submodule init
git submodule update multiverse_system
cd path_to_workdir/multiverse/multiverse_system
cd astra-sim
git submodule init
git submodule update
cd path_to_workdir/multiverse/multiverse_system/extern/network_backend/ns3/simulation/
./waf configure (only required once)
cd path_to_workdir/multiverse/multiverse_system
./build/astra_ns3/build.sh -c
1. Prepare the running environment
The process can be seen in the README of multiverse network
2. Clone the repository:
git clone --recursive [email protected]:Harnets/multiverse.git
3. Build the network simulator:
cd path_to_workdir/multiverse/multiverse_network/
mkdir build
cd build
cmake ..
make -j
cd ..
pip install -e .
Note: You need to first start the system simulator, then run the network simulator. The order matters!
1. Start system simulator
cd path_to_workdir/multiverse/multiverse_system
./build/astra_ns3/build.sh -d
2. Start network simulator
cd path_to_workdir/multiverse/multiverse_network/
bash run.sh ("--enable_gpu_sim" controls the network simulator is run on CPUs or GPUs)
-
Workload configuration: Please refer to
astra-sim/scripts/workload_generator/README.md
for details. -
Network simulator configuration: You can configure the topology, congestion control algorithms, etc. For example, to set the fattree topology and CC algorithm:
python scripts/run.py --num_env $num_env --enable_gpu_sim 'cpu' --fattree_K $fattree_K --cc_method $cc_method
Parameter | Description | Default Value |
---|---|---|
-tp --tp_size |
TP group size | 4 |
-dp --dp_size |
DP group size | 32 |
-w --workload |
Path to workload | ./A100-128-Megatron-GPT13B-tp4-pp1-dp32.txt |
Parameter | Description | Default Value |
---|---|---|
`MAX_NPU_FLOW_NUM | Max length of queue for store the new/finished flow event | 2000(must > #NPU) |
-k --fattree_k |
Fattree K | 8 |
-c --cc_method |
congestion control method(0 is DCQCN, 1 is HPCC, ...) | 0 |
1. Start system simulator
./build/astra_ns3/build.sh -d
2. Start network simulator
bash run.sh
Strategy/Algorithm | TP✔️ | DP✔️ | PP | EP✔️ |
Collective Communication | Ring allreduce✔️ | Halving doubling allreduce | Binary tree allreduce | / |
Topology | Fattree✔️ | HPN | Rail-optimized | Bcube |
Congestion Control | DCQCN✔️ | HPCC | Timely | Poseidon |
Scale up network DES | PCIe | NVLINK | TTPoE | UALink |
✔️ indicates that the corresponding function is ready in Multiverse 1.0. Others are future works.
Contributions are welcome! Feel free to report issues or submit pull requests. Please follow our contribution guidelines.
This project is licensed under the MIT License.