[Performance][Model Loader] Skip non-local expert weights during EP model loading#37136
Conversation
Signed-off-by: esmeetu <jasonailu87@gmail.com>
Code Review
This pull request introduces a significant performance optimization for loading Mixture-of-Experts (MoE) models with expert parallelism (EP) enabled. By filtering out non-local expert weights before they are read from disk, it effectively reduces I/O, which is particularly beneficial for large models. The implementation is well-structured, with the core filtering logic encapsulated in a new ep_weight_filter.py module and accompanied by a comprehensive test suite. The changes are correctly integrated into the model loading pipeline. I've found one critical issue regarding the calculation of expert parallelism rank and size, which omits the prefill context parallelism dimension. Addressing this will ensure the correctness of expert filtering in all configurations.
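The reviewer's concern about omitting the prefill-context-parallel dimension can be illustrated with a small arithmetic sketch. This is not vLLM's actual code; the dimension names and their ordering are assumptions. The point is that if the EP group spans the TP, PCP, and DP dimensions, all three must enter the rank/size computation, and omitting PCP makes `ep_size` too small whenever `pcp_size > 1`:

```python
def ep_rank_and_size(tp_rank: int, tp_size: int,
                     pcp_rank: int, pcp_size: int,
                     dp_rank: int, dp_size: int) -> tuple[int, int]:
    """Illustrative EP rank/size arithmetic (assumed ordering:
    DP outermost, then PCP, then TP innermost)."""
    ep_size = tp_size * pcp_size * dp_size
    ep_rank = (dp_rank * pcp_size + pcp_rank) * tp_size + tp_rank
    return ep_rank, ep_size
```

With `pcp_size = 2`, dropping the PCP term would halve `ep_size` and assign two distinct ranks the same expert partition.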
Hi @esmeetu, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90784738f0
…37136)

The EP weight filter (PR vllm-project#37136) partitions logical experts across ranks and skips non-local expert weights at the safetensors level. This breaks EPLB because redundant physical expert slots map to logical experts that belong to other ranks in the default partition. Those weights get filtered out, leaving redundant slots uninitialized (zeros), which causes catastrophic accuracy loss (~0.08 gsm8k vs ~0.95 baseline).

Fix: skip the EP weight filter entirely when EPLB is enabled, since the weight loader needs to see ALL logical expert weights to populate redundant physical slots.

Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
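The fix described in this commit boils down to a single guard. The helper name below is hypothetical (the real change lives in the model loader), but the decision logic is what the commit message describes:

```python
def should_apply_ep_weight_filter(ep_enabled: bool, eplb_enabled: bool) -> bool:
    """Only filter non-local expert weights when EP is on and EPLB is off.

    With EPLB, redundant physical expert slots may map to logical experts
    that the default partition assigns to other ranks; filtering would leave
    those slots uninitialized, so the loader must see all expert weights.
    """
    return ep_enabled and not eplb_enabled
```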
…odel loading (vllm-project#37136) Signed-off-by: esmeetu <jasonailu87@gmail.com>
…ant experts (#7470)

### What this PR does / why we need it?
PR vllm-project/vllm#37136 broke EPLB because it filters out redundant experts. PR vllm-project/vllm#37322 fixed this by using `parallel_config.enable_eplb` to decide whether to skip the weight-loading filter. But in vllm-ascend, `parallel_config.enable_eplb` is always false, so when we use EPLB we temporarily set it to true.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

| dataset | version | metric | mode | vllm-api-stream-chat |
| ----- | ----- | ----- | ----- | ----- |
| aime2024 | 604a78 | accuracy | gen | 86.67 |

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
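The vllm-ascend workaround ("we temporarily set it to true") could be expressed as a context manager. This is a hedged sketch, not the actual patch; it only shows the flag-toggling pattern the description implies:

```python
from contextlib import contextmanager

@contextmanager
def eplb_enabled(parallel_config):
    # Temporarily report enable_eplb=True so the upstream check
    # (added in vllm-project/vllm#37322) skips the EP weight filter,
    # then restore the original value afterwards.
    prev = parallel_config.enable_eplb
    parallel_config.enable_eplb = True
    try:
        yield parallel_config
    finally:
        parallel_config.enable_eplb = prev
```

Restoring the previous value in `finally` keeps the flag false outside of weight loading even if loading raises.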
Purpose
In DP+EP deployments, every rank currently reads all expert weights from disk via
`safe_open().get_tensor()`, only for `FusedMoE.weight_loader` to discard non-local experts afterward. For large MoE models (e.g. Kimi-K2.5 with 384 experts at 591GB), each rank reads the full 591GB but only keeps ~144GB (dense + 1/8 of the experts). This PR moves the filtering before the disk read by checking the tensor name against the local expert set in the weight iterator, so `f.get_tensor()` is never called for non-local experts.

Test Plan

(`local_expert_ids` remains `None`)

Benchmark Results
Kimi-K2.5-NVFP4 (591GB, 384 experts):
DP/EP=4 (1 node × 4 GPUs)
DP/EP=8 (2 nodes × 4 GPUs)
DP/EP=16 (4 nodes × 4 GPUs)
Speedup scales with EP size: higher EP → more experts skipped → greater I/O reduction.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.