# Wide Expert Parallelism (Wide-EP) in TensorRT-LLM

TensorRT-LLM's Wide Expert Parallelism (Wide-EP) feature enables efficient inference of large-scale Mixture-of-Experts (MoE) models by scaling expert parallelism beyond traditional limits. This feature addresses the inherent workload imbalance challenges in large-scale MoE models and provides both offline and online load balancing capabilities.

## Overview

Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 use fine-grained expert designs that introduce new challenges for inference systems:

- **High memory demands** for expert weights
- **Inherent expert-level workload imbalance** due to sparse execution patterns
- **Communication overhead** in distributed expert parallelism

Wide-EP solves these challenges through:

- **Custom EP communication kernels** optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
- **Expert Parallelism Load Balancer (EPLB)** with both offline and online modes
- **Dynamic expert placement and replication** strategies
- **Layer-wise weight redistribution** to minimize inference disruption

## Quick Start

### 1. Configurations

An example YAML file to enable Wide-EP:
```yaml
moe_config:
  backend: WIDEEP
  max_num_tokens: 9216
  load_balancer: moe_load_balancer.yaml # (optional) enable load balancer
```

| Parameter | Description | Default | Notes |
|-----------|-------------|---------|-------|
| `backend` | MoE backend type | `CUTLASS` | Set to `WIDEEP` to enable Wide-EP |
| `max_num_tokens` | If set, at most `max_num_tokens` tokens are sent to `torch.ops.trtllm.fused_moe` at a time. | `None` | If the number of tokens exceeds `max_num_tokens`, the input tensors are split into chunks and processed in a loop (see the sketch below). |
| `load_balancer` | Configuration for MoE load balancing | `None` | Set to the path of the load balancer YAML file |
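
To make the chunking behavior of `max_num_tokens` concrete, here is a minimal sketch. It is not the actual TensorRT-LLM implementation: `fused_moe_placeholder` is a hypothetical stand-in for `torch.ops.trtllm.fused_moe`, and only the split-and-loop logic described in the table is shown.

```python
import torch

def fused_moe_placeholder(chunk: torch.Tensor) -> torch.Tensor:
    # Stand-in for torch.ops.trtllm.fused_moe; simply returns the input.
    return chunk

def moe_forward_chunked(hidden_states: torch.Tensor, max_num_tokens: int) -> torch.Tensor:
    # Dispatch at most max_num_tokens tokens to the MoE op at a time,
    # splitting larger inputs into chunks and looping over them.
    if hidden_states.shape[0] <= max_num_tokens:
        return fused_moe_placeholder(hidden_states)
    outputs = [
        fused_moe_placeholder(chunk)
        for chunk in torch.split(hidden_states, max_num_tokens, dim=0)
    ]
    return torch.cat(outputs, dim=0)

# 20,000 tokens with max_num_tokens=9216 -> 3 chunks of 9216, 9216, and 1568 tokens.
out = moe_forward_chunked(torch.randn(20_000, 64), max_num_tokens=9216)
print(out.shape)  # torch.Size([20000, 64])
```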
| 37 | + |
| 38 | +#### Load Balancer Configuration |
| 39 | + |
| 40 | +An example `moe_load_balancer.yaml` file to configure online EP balancer: |
| 41 | +```yaml |
| 42 | +num_slots: 288 |
| 43 | +layer_updates_per_iter: 1 |
| 44 | +``` |
| 45 | + |
| 46 | +| Parameter | Description | Default | Notes | |
| 47 | +|-----------|-------------|---------|-------| |
| 48 | +| `num_slots` | Total number of expert slots | `None` | Must be ≥ total experts | |
| 49 | +| `layer_updates_per_iter` | Number of layers updated per iteration | `0` | `0` = offline, `>0` = online | |
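
For a concrete sense of how `num_slots` relates to the expert count and EP size, here is a small worked example. The numbers are illustrative assumptions only (a DeepSeek-R1-style model with 256 routed experts per MoE layer, 36 EP ranks, and slots spread evenly across ranks), not values taken from this README.

```python
# All numbers below are illustrative assumptions, not values read from the model.
num_experts = 256   # routed experts per MoE layer (e.g., a DeepSeek-R1-style model)
num_slots = 288     # from moe_load_balancer.yaml; must be >= num_experts
ep_size = 36        # assumed number of EP ranks (GPUs) in the deployment

assert num_slots >= num_experts, "every expert needs at least one slot"
assert num_slots % ep_size == 0, "assume slots are spread evenly across EP ranks"

slots_per_rank = num_slots // ep_size       # expert slots hosted on each GPU
redundant_slots = num_slots - num_experts   # extra slots for replicating hot experts
print(slots_per_rank, redundant_slots)      # 8 32
```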

Refer to the [ep_load_balancer](./ep_load_balancer/) directory for more details on the EP load balancer.

### 2. Execute Wide-EP on SLURM Clusters

Refer to the [slurm_scripts](./slurm_scripts/) directory, which reuses the [disaggregated SLURM scripts](../disaggregated/slurm/) to automatically generate configuration files and submit jobs to SLURM clusters.

## Troubleshooting

### Transparent HugePages failure

If you see the exception `madvise(MADV_HUGEPAGE) failed.`, check whether Transparent HugePages is enabled:
```bash
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
```
If `never` is highlighted, enable Transparent HugePages with the following command (run as root):
```bash
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```

### Disaggregated serving related issues

Refer to the [Troubleshooting and FAQ](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/disaggregated-service.md#troubleshooting-and-faq) section of the Disaggregated-Service documentation.

## References

- [Technical Blog: Scaling Expert Parallelism in TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)

For detailed implementation examples and advanced usage, see the subdirectories:
- [`ep_load_balancer/`](ep_load_balancer/): Load balancing tools and examples
- [`slurm_scripts/`](slurm_scripts/): Cluster deployment scripts