[Performance] Add --enable-ep-weight-filter CLI option #37351
Conversation
Add an opt-in flag to skip non-local expert weights during model loading when expert parallelism is active. Each rank reads only its own expert shard from disk, reducing storage I/O for MoE models with per-expert weight tensors.

Signed-off-by: esmeetu <esmeetu@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: esmeetu <jasonailu87@gmail.com>
Code Review
This pull request introduces an opt-in command-line flag --enable-ep-weight-filter to optimize model loading for Mixture-of-Experts models with expert parallelism. The changes correctly add the new configuration option and integrate it into the model loading logic. My main feedback is to add a validation check to ensure enable_expert_parallel is active when enable_ep_weight_filter is used, to prevent silent failures from misconfiguration and improve user experience.
```python
"""Whether the deployed model is MoE (if known)."""
enable_expert_parallel: bool = False
"""Use expert parallelism instead of tensor parallelism for MoE layers."""
enable_ep_weight_filter: bool = False
```
To improve robustness and prevent user confusion from misconfiguration, it is good practice to validate that enable_expert_parallel is enabled when enable_ep_weight_filter is used. Currently, if a user enables enable_ep_weight_filter without enable_expert_parallel, the flag fails silently.
Consider adding a validation check in the _validate_parallel_config method of this class, similar to how enable_eplb is validated. This would raise an error for invalid combinations.
Example:

```python
if self.enable_ep_weight_filter and not self.enable_expert_parallel:
    raise ValueError(
        "enable_expert_parallel must be True to use enable_ep_weight_filter."
    )
```

Signed-off-by: esmeetu <jasonailu87@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
(cherry picked from commit 761e0aa)
…37351) Signed-off-by: esmeetu <jasonailu87@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- `--enable-ep-weight-filter`: opt-in CLI flag to skip non-local expert weights during model loading when EP is active

Usage

Without `--enable-ep-weight-filter`, loading behavior is identical to main.

Test plan

- `vllm serve` without `--enable-ep-weight-filter`: no behavior change
- `vllm serve --enable-expert-parallel --enable-ep-weight-filter` on a per-expert MoE model: correct loading, reduced I/O

🤖 Generated with Claude Code
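To illustrate the idea behind the flag, here is a minimal sketch of per-rank expert weight filtering during loading. The helper names (`local_expert_range`, `should_load`), the contiguous even-split shard layout, and the `.experts.<idx>.` weight-naming pattern are assumptions for illustration, not vLLM's actual implementation:

```python
import re

def local_expert_range(num_experts: int, ep_size: int, ep_rank: int) -> range:
    """Contiguous expert shard owned by this EP rank (assumes an even split)."""
    per_rank = num_experts // ep_size
    return range(ep_rank * per_rank, (ep_rank + 1) * per_rank)

# Hypothetical per-expert tensor naming convention, e.g.
# "model.layers.0.mlp.experts.2.w1.weight".
_EXPERT_RE = re.compile(r"\.experts\.(\d+)\.")

def should_load(weight_name: str, local: range) -> bool:
    """Keep non-expert tensors; keep expert tensors only if locally owned."""
    m = _EXPERT_RE.search(weight_name)
    if m is None:
        return True  # shared (non-expert) weight: every rank loads it
    return int(m.group(1)) in local

# Rank 1 of 4 with 8 experts owns experts 2-3 under this layout.
local = local_expert_range(num_experts=8, ep_size=4, ep_rank=1)
print(should_load("model.layers.0.mlp.experts.2.w1.weight", local))  # local expert
print(should_load("model.layers.0.mlp.experts.5.w1.weight", local))  # remote expert
```

With a predicate like this applied to the checkpoint's weight iterator, each rank deserializes only its own expert shard plus shared weights, which is where the claimed I/O reduction would come from.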