[EPLB] Fix balancedness metric computation and add verbose reporting#39178
arpera wants to merge 10 commits into vllm-project:main
Conversation
The comment says "for each layer: (mean load across ranks) / (max load across ranks)" but the code was using dim=0 (averaging/maxing across layers) instead of dim=1 (across ranks within each layer). Fix to match the documented intent: compute mean/max across EP ranks for each MoE layer independently, then average the per-layer ratios over active layers. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
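The fix described in this commit can be sketched in plain Python (illustrative only; the actual code in `eplb_state.py` operates on torch tensors, where the bug was reducing over `dim=0` instead of `dim=1`):

```python
def balancedness(tokens):
    """tokens[layer][rank]: token count routed to each EP rank in each MoE layer."""
    ratios = []
    for layer in tokens:                 # iterate over layers...
        peak = max(layer)
        if peak == 0:                    # skip inactive layers to avoid 0/0
            continue
        mean = sum(layer) / len(layer)   # ...reducing across ranks within a layer
        ratios.append(mean / peak)       # 1.0 = perfectly balanced layer
    return sum(ratios) / len(ratios)     # average the per-layer ratios

# One balanced layer and one skewed layer across 2 EP ranks:
print(balancedness([[100, 100], [190, 10]]))  # (1.0 + 100/190) / 2 ≈ 0.763
```

Reducing across layers instead, as the old code did, would mix token counts from unrelated layers and hide a single badly skewed layer.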
Hi @arpera, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Code Review
This pull request refactors the balancedness calculation in eplb_state.py to use an average of per-layer ratios instead of a global ratio. The feedback suggests using float64 for these calculations to prevent potential precision loss as model depth or batch sizes increase.
Compute per-layer mean and max in float64 to prevent precision loss when summing token counts across many MoE layers. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
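The precision concern behind this commit is easy to reproduce: float32 cannot represent integers above 2^24 exactly, so large accumulated token counts can silently stop incrementing. A plain-Python illustration (Python floats are float64; `struct` is used here only to emulate float32 rounding):

```python
import struct

def f32(x):
    """Round a Python float (float64) down to float32 precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

big = f32(2.0 ** 24)          # 16777216: the edge of exact float32 integers
print(f32(big + 1.0) == big)  # True  - one more token is lost in float32
print(big + 1.0 == big)       # False - float64 keeps it
```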
Valid points. Would it be ok if I heavily refactor this logging function and instead
I think that would be great to expose. Maybe guard it behind a verbosity flag that we set in the eplb config?
Yes, great! One more detail, please: what is the name of this flag in the eplb config?
Oh, sorry for the confusion - I was suggesting you add a new flag to the eplb config.
Yes, I was also thinking about this idea. Now work in progress, stay tuned. |
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Documentation preview: https://vllm--39178.org.readthedocs.build/en/39178/
ilmarkov left a comment
I see this is still in progress — apologies if some of my comments are on things you're already planning to change. Happy to re-review once you're ready.
Nice touch with the ANSI heat coloring. Note, though, that in practice most users won't see it: in production, vLLM logs are typically collected from containers (Docker, Kubernetes). I would suggest considering saving the verbose stats in some structured format (maybe not to stdout/stderr) to ease analysis.
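One way to realize this suggestion is to append one JSON Lines record per logging interval. This is a hedged sketch of the idea only; the function and field names here are hypothetical and not the API the PR ended up with:

```python
import json

def dump_eplb_stats(path, step, tokens):
    """Append one machine-parsable record: tokens[layer][rank] counts at a step."""
    record = {"step": step, "tokens": tokens}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # JSON Lines: one record per line

dump_eplb_stats("eplb_stats.jsonl", 100, [[100, 100], [190, 10]])
```

Each interval appends one line, so the file can be tailed live or loaded afterwards with any JSON Lines reader (e.g. `pandas.read_json(..., lines=True)`).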
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Friendly ping
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@ilmarkov @tlrmchlsmth, the PR is ready for a final review. Please have a look.
ilmarkov left a comment
Thanks for the update! I'd still insist on saving the verbose log into a file for easier analysis. Also added some comments on improving the logged info.
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Hi @arpera, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
ilmarkov left a comment
Thank you for the update!
The non-verbose part of the changes looks good to me. For the verbose part, I'd want some changes if we still keep this piece after the `expert_load_dump_dir` introduction. That said, I don't see the case for manually checking stderr given that we have machine-parsable dumping at the same frequency - maybe for single-node debugging only.
@tlrmchlsmth what do you think?
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@vadiklyutiy, can we merge this? I think I fixed all the issues and the change is ready for merge now.
Purpose
This PR fixes the `balancedness` metric computation in EPLB to correctly reflect the load imbalance per MoE layer.

Previously, balancedness was calculated by computing the average and max load across layers (`dim=0`) instead of across ranks (`dim=1`), and then summing those values.

This PR changes the `balancedness` metric to:

- Compute the mean/max ratio of tokens per rank independently for each MoE layer.
- Average these per-layer ratios over the active layers to obtain the final `balancedness` metric.

This change ensures the logged metric accurately represents the average severity of the bottleneck at each MoE layer.
Additional changes in this PR:

- Add a `log_balancedness_verbose` configuration flag (default: `False`). When enabled, EPLB logs a detailed multi-line report per logging interval, which includes a per-layer / per-rank token table to help debug expert routing.
- Document the new `log_balancedness_verbose` flag in `docs/serving/expert_parallel_deployment.md`.
- Update the `docs/serving/expert_parallel_deployment.md` documentation to include previously added EPLB configuration fields that were missing from the docs:
  - `log_balancedness_interval` (originally introduced in #29499)
  - `communicator` (introduced in #33176)

Test Plan
Manual vLLM local launches to verify the logs.
Validation Result
Should not affect production performance, since balancedness is calculated and printed only when the `log_balancedness` option is set in the vLLM config.

Example of the new verbose logging output:
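The screenshot of the actual output is not reproduced here. As a rough, hypothetical sketch of what a per-layer / per-rank token table can look like (formatting invented for illustration, not the PR's exact output):

```python
def format_report(tokens):
    """tokens[layer][rank] -> fixed-width table with a per-layer mean/max column."""
    num_ranks = len(tokens[0])
    lines = ["layer " + " ".join(f"rank{r}" for r in range(num_ranks)) + " mean/max"]
    for i, layer in enumerate(tokens):
        ratio = (sum(layer) / len(layer)) / max(layer) if max(layer) else 0.0
        cells = " ".join(f"{t:>5}" for t in layer)
        lines.append(f"{i:>5} {cells} {ratio:8.3f}")
    return "\n".join(lines)

print(format_report([[100, 100], [190, 10]]))
```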