@@ -179,7 +179,7 @@ class RandomLoadBalance(EplbPolicy):

#### Integer Parameters

All integer input parameters must explicitly specify their maximum and minimum values and be subject to valid value validation. For example, `num_iterations_eplb_update` must be greater than 0:
All integer input parameters must explicitly specify their minimum and maximum values and be validated against them. For example, `expert_heat_collection_interval` must be greater than 0:

```python
@staticmethod
# ... (validator body collapsed in the diff view)
```
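As a minimal sketch of such a bounds check (a hypothetical helper, not the project's actual validator), the rule could be enforced like this:

```python
def validate_int_param(name: str, value: int, minimum: int, maximum: int) -> int:
    """Check that an integer parameter falls within its documented range."""
    if not isinstance(value, int) or isinstance(value, bool):
        raise TypeError(f"{name} must be an int, got {type(value).__name__}")
    if not minimum <= value <= maximum:
        raise ValueError(f"{name} must be in [{minimum}, {maximum}], got {value}")
    return value

# Example: expert_heat_collection_interval must be greater than 0.
interval = validate_int_param("expert_heat_collection_interval", 400, minimum=1, maximum=2**31 - 1)
```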
19 changes: 12 additions & 7 deletions docs/source/user_guide/configuration/additional_config.md
@@ -30,6 +30,7 @@ The following table lists additional configuration options available in vLLM Asc
| `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch |
| `finegrained_tp_config` | dict | `{}` | Configuration options for module tensor parallelism |
| `ascend_compilation_config` | dict | `{}` | Configuration options for ascend compilation |
| `eplb_config` | dict | `{}` | Configuration options for EPLB (expert parallelism load balancing) |
| `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by rlhf or ut/e2e test case. |
| `dump_config_path` | str | `None` | Configuration file path for msprobe dump(eager mode). |
| `enable_async_exponential` | bool | `False` | Whether to enable async exponential overlap. To enable async exponential, set this config to True. |
@@ -41,13 +42,6 @@ The following table lists additional configuration options available in vLLM Asc
| `SLO_limits_for_dynamic_batch` | int | `-1` | SLO limits for dynamic batch. Used by the new scheduler that supports the dynamic batch feature. |
| `enable_npugraph_ex` | bool | `False` | Whether to enable npugraph ex graph mode. |
| `pa_shape_list` | list | `[]` | The custom shape list of page attention ops. |
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
| `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
| `num_iterations_eplb_update` | int | `400` | Forward iterations when EPLB begins. |
| `gate_eplb` | bool | `False` | Whether to enable EPLB only once. |
| `num_wait_worker_iterations` | int | `30` | The forward iterations when the EPLB worker will finish CPU tasks. In our test default value 30 can cover most cases. |
| `expert_map_record_path` | str | `None` | Save the expert load calculation results to a new expert table in the specified directory. |
| `init_redundancy_expert` | int | `0` | Specify redundant experts during initialization. |
| `enable_kv_nz` | bool | `False` | Whether to enable kvcache NZ layout. This option only takes effects on models using MLA (e.g., DeepSeek). |
| `layer_sharding` | dict | `{}` | Configuration options for layer sharding linear |

@@ -83,6 +77,17 @@ The details of each configuration option are as follows:
| `fuse_norm_quant` | bool | `True` | Whether to enable fuse_norm_quant pass. |
| `fuse_qknorm_rope` | bool | `False` | Whether to enable fuse_qknorm_rope pass. It's set to True by default when Triton is installed. |

**eplb_config**

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
| `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, the path to an expert map file must be provided. |
| `expert_heat_collection_interval` | int | `400` | Number of forward iterations of expert heat collection before EPLB begins. |
| `algorithm_execution_interval` | int | `30` | Number of forward iterations within which the EPLB worker finishes its CPU tasks. |
| `expert_map_record_path` | str | `None` | Save the expert load calculation results to a new expert table in the specified directory. |
| `num_redundant_experts` | int | `0` | Number of redundant experts to allocate during initialization. |
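
As a quick illustration (values are placeholders only, not recommendations), the new options nest under `eplb_config` inside the additional configuration like this:

```python
# Illustrative nesting of the new eplb_config options; tune values per workload.
additional_config = {
    "eplb_config": {
        "dynamic_eplb": True,
        "expert_heat_collection_interval": 400,
        "algorithm_execution_interval": 30,
        "num_redundant_experts": 0,
    }
}
```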

### Example

An example of additional configuration is as follows:
38 changes: 17 additions & 21 deletions docs/source/user_guide/feature_guide/eplb_swift_balancer.md
@@ -26,17 +26,17 @@ W8A8-dynamic

### Dynamic EPLB

We need to add environment variable `export DYNAMIC_EPLB="true"` to enable vllm eplb. Enable dynamic balancing with auto-tuned parameters. Adjust num_iterations_eplb_update and num_wait_worker_iterations based on workload patterns.
To enable vLLM EPLB, set the environment variable `export DYNAMIC_EPLB="true"`. This enables dynamic balancing with auto-tuned parameters; adjust `expert_heat_collection_interval` and `algorithm_execution_interval` based on workload patterns.

```shell
vllm serve Qwen/Qwen3-235B-A22 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--additional-config '{
--additional-config '{ "eplb_config": {
"dynamic_eplb": true,
"num_iterations_eplb_update": 400,
"num_wait_worker_iterations": 30
}'
"expert_heat_collection_interval": 400,
"algorithm_execution_interval": 30
}}'
```
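
An alternative way to assemble the same invocation (a sketch, not part of the guide itself) is to build the nested JSON in Python so it cannot be mistyped on the command line:

```python
import json
import os
import subprocess

# Mirrors the shell example above; model name and parallel size are taken from it.
env = dict(os.environ, DYNAMIC_EPLB="true")
additional_config = {
    "eplb_config": {
        "dynamic_eplb": True,
        "expert_heat_collection_interval": 400,
        "algorithm_execution_interval": 30,
    }
}
subprocess.run(
    [
        "vllm", "serve", "Qwen/Qwen3-235B-A22",
        "--tensor-parallel-size", "16",
        "--enable-expert-parallel",
        "--additional-config", json.dumps(additional_config),
    ],
    env=env,
    check=True,
)
```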

### Static EPLB
@@ -49,12 +49,12 @@ We need to add environment variable `export EXPERT_MAP_RECORD="true"` to record
vllm serve Qwen/Qwen3-235B-A22 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--additional-config '{
--additional-config '{ "eplb_config": {
"expert_map_record_path": "/path/to/eplb.json",
"init_redundancy_expert": 16,
"num_iterations_eplb_update": 400,
"num_wait_worker_iterations": 30
}'
"num_redundant_experts": 16,
"expert_heat_collection_interval": 400,
"algorithm_execution_interval": 30
}}'
```
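
Before reusing a recorded map in a later deployment, a minimal sanity check of the file can catch truncated or malformed output. The map schema is deployment-specific, so this sketch only verifies that the JSON parses and is non-empty:

```python
import json
from pathlib import Path

# Path assumed to match the expert_map_record_path used above.
map_path = Path("/path/to/eplb.json")

with map_path.open() as f:
    expert_map = json.load(f)  # raises on malformed JSON

assert expert_map, "expert map is empty; re-run the recording deployment"
print(f"Loaded expert map with {len(expert_map)} top-level entries")
```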

#### Subsequent Deployments (Use Recorded Map)
@@ -73,9 +73,9 @@ vllm serve Qwen/Qwen3-235B-A22 \
## Critical Considerations

1. Parameter Tuning (an illustrative sketch follows this list):
- num_iterations_eplb_update: Higher values (e.g., 400+) for stable workloads; lower values (e.g., 100-200) for fluctuating traffic.
- num_wait_worker_iterations: Should be ≥ 30 to avoid premature balancing during startup.
- init_redundancy_expert: Must match tensor-parallel size (e.g., 16 for 16 GPUs) to ensure sufficient redundancy.
- expert_heat_collection_interval: Higher values (e.g., 400+) for stable workloads; lower values (e.g., 100-200) for fluctuating traffic.
- algorithm_execution_interval: Should be ≥ 30 to avoid premature balancing during startup.
- num_redundant_experts: Must match tensor-parallel size (e.g., 16 for 16 GPUs) to ensure sufficient redundancy.

2. Hardware Requirements:
- Ensure that all GPUs have identical memory capacity and compute capabilities.
Expand All @@ -85,20 +85,16 @@ vllm serve Qwen/Qwen3-235B-A22 \
- Only MoE models with explicit expert parallelism support (e.g., Qwen3 MoE models) are compatible.
- Verify model architecture supports dynamic expert routing through --enable-expert-parallel.

4. Gating Configuration:
- When gate_eplb=true, validate that the gating mechanism can handle expert movement without routing errors.
- Test with synthetic workloads before production deployment.

5. Monitoring & Validation:
4. Monitoring & Validation:
- Track metrics: expert_load_balance_ratio, ttft_p99, tpot_avg, and gpu_utilization.
- Use vllm monitor to detect imbalances during runtime.
- Always verify expert map JSON structure before loading (validate with jq or similar tools).

6. Startup Behavior:
5. Startup Behavior:
- Initial requests may experience higher latency during the first balancing cycle (typically 1-2 minutes).
- Avoid sudden traffic spikes during the warm-up phase.

7. Common Pitfalls:
6. Common Pitfalls:
- Incorrect tensor-parallel-size vs. actual GPU count → causes resource underutilization.
- Using expert_map_path without generating the map first → runtime errors.
- Setting init_redundancy_expert > available GPUs → system failure.
- Setting num_redundant_experts > available GPUs → system failure.
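
Building on item 1 above, a hedged sketch of two tuning profiles (illustrative starting points, not benchmarked recommendations):

```python
# Assumes a 16-GPU tensor-parallel deployment as in the examples above.
STABLE_WORKLOAD = {
    "eplb_config": {
        "dynamic_eplb": True,
        "expert_heat_collection_interval": 400,  # longer window for steady traffic
        "algorithm_execution_interval": 30,      # keep >= 30 to avoid premature balancing
        "num_redundant_experts": 16,             # match tensor-parallel size
    }
}
FLUCTUATING_WORKLOAD = {
    "eplb_config": {
        "dynamic_eplb": True,
        "expert_heat_collection_interval": 150,  # shorter window reacts faster to shifting traffic
        "algorithm_execution_interval": 30,
        "num_redundant_experts": 16,
    }
}
```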
10 changes: 6 additions & 4 deletions tests/e2e/multicard/2-cards/test_qwen3_moe.py
@@ -105,10 +105,12 @@ async def test_qwen3_moe_w8a8_distributed_tp2_ep_dynamic_eplb():
# during initialization in offline mode, so the online mode is used instead.
env_dict.update({"DYNAMIC_EPLB": "true"})
additional_config = {
"dynamic_eplb": True,
"num_iterations_eplb_update": 100,
"num_wait_worker_iterations": 20,
"num_redundant_experts": 2
"eplb_config": {
"dynamic_eplb": True,
"expert_heat_collection_interval": 100,
"algorithm_execution_interval": 20,
"num_redundant_experts": 2
}
}
server_args.extend(["--additional-config", json.dumps(additional_config)])
with RemoteOpenAIServer(model,
@@ -55,7 +55,7 @@ deployment:
}
}'
--additional-config
'{"dynamic_eplb":true,"num_iterations_eplb_update":2048,"num_wait_worker_iterations":200}'
'{"enable_prefill_optimizations":true,"enable_weight_nz_layout":true,"eplb_config": {"dynamic_eplb":true,"expert_heat_collection_interval":2048,"algorithm_execution_interval":200}}'

-
server_cmd: >
@@ -92,7 +92,7 @@ deployment:
}
}'
--additional-config
'{"dynamic_eplb":true,"num_iterations_eplb_update":2048,"num_wait_worker_iterations":200}'
'{"enable_prefill_optimizations":true,"enable_weight_nz_layout":true,"eplb_config": {"dynamic_eplb":true,"expert_heat_collection_interval":2048,"algorithm_execution_interval":200}}'
-
server_cmd: >
vllm serve vllm-ascend/DeepSeek-R1-0528-W8A8
@@ -130,7 +130,7 @@ deployment:
}
}'
--additional-config
'{"multistream_overlap_shared_expert":true,"dynamic_eplb":true,"num_iterations_eplb_update":2048,"num_wait_worker_iterations":200}'
'{"multistream_overlap_shared_expert":true,"dynamic_eplb":true,"expert_heat_collection_interval":2048,"algorithm_execution_interval":200}'
-
server_cmd: >
vllm serve vllm-ascend/DeepSeek-R1-0528-W8A8
@@ -167,7 +167,7 @@ deployment:
}
}'
--additional-config
'{"multistream_overlap_shared_expert":true,"dynamic_eplb":true,"num_iterations_eplb_update":2048,"num_wait_worker_iterations":200}'
'{"multistream_overlap_shared_expert":true,"eplb_config": {"dynamic_eplb":true,"expert_heat_collection_interval":2048,"algorithm_execution_interval":200}}'
benchmarks:
perf:
case_type: performance
@@ -51,7 +51,7 @@ deployment:
}
}'
--additional-config
'{"dynamic_eplb":true,"num_iterations_eplb_update":2048,"num_wait_worker_iterations":200}'
'{"eplb_config": {"dynamic_eplb":true,"expert_heat_collection_interval":2048,"algorithm_execution_interval":200}}'

-
server_cmd: >
@@ -87,5 +87,5 @@ deployment:
}
}'
--additional-config
'{"dynamic_eplb":true,"num_iterations_eplb_update":2048,"num_wait_worker_iterations":200}'
'{"eplb_config": {"dynamic_eplb":true,"expert_heat_collection_interval":2048,"algorithm_execution_interval":200}}'
benchmarks:
@@ -70,11 +70,12 @@ async def test_models(model: str) -> None:
additional_config: dict[str, Any] = {
"enable_shared_expert_dp": False,
"multistream_overlap_shared_expert": False,
"dynamic_eplb": True,
"num_iterations_eplb_update": 14000,
"num_wait_worker_iterations": 30,
"init_redundancy_expert": 0,
"gate_eplb": False
"eplb_config": {
"dynamic_eplb": True,
"expert_heat_collection_interval": 512,
"algorithm_execution_interval": 100,
"num_redundant_experts": 0
}
}
server_args = [
"--quantization", "ascend", "--seed", "1024",
@@ -70,13 +70,13 @@ async def test_models(model: str) -> None:
"8192", "--max-num-seqs", "12", "--trust-remote-code",
"--gpu-memory-utilization", "0.9"
]
env_dict["EXPERT_MAP_RECORD"] = "true"
env_dict["DYNAMIC_EPLB"] = "true"
additional_config["dynamic_eplb"] = True
additional_config["num_iterations_eplb_update"] = 14000
additional_config["num_wait_worker_iterations"] = 30
additional_config["init_redundancy_expert"] = 0
additional_config["gate_eplb"] = False
additional_config["eplb_config"] = {
"dynamic_eplb": True,
"expert_heat_collection_interval": 512,
"algorithm_execution_interval": 100,
"num_redundant_experts": 0
}
server_args.extend(
["--compilation-config",
json.dumps(compilation_config)])
Expand Down