[EPLB] Fix balancedness metric computation and add verbose reporting#39178
arpera wants to merge 10 commits into vllm-project:main
Conversation
The comment says "for each layer: (mean load across ranks) / (max load across ranks)" but the code was using dim=0 (averaging/maxing across layers) instead of dim=1 (across ranks within each layer). Fix to match the documented intent: compute mean/max across EP ranks for each MoE layer independently, then average the per-layer ratios over active layers. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
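The fix described in this commit can be sketched in plain Python (illustrative only; the actual code in `eplb_state.py` operates on torch tensors, where the bug was reducing over `dim=0` instead of `dim=1`):

```python
def balancedness(tokens):
    """tokens[layer][rank]: token count routed to each EP rank in each MoE layer."""
    ratios = []
    for layer in tokens:                 # iterate over layers...
        peak = max(layer)
        if peak == 0:                    # skip inactive layers to avoid 0/0
            continue
        mean = sum(layer) / len(layer)   # ...reducing across ranks within a layer
        ratios.append(mean / peak)       # 1.0 = perfectly balanced layer
    return sum(ratios) / len(ratios)     # average the per-layer ratios

# One balanced layer and one skewed layer across 2 EP ranks:
print(balancedness([[100, 100], [190, 10]]))  # (1.0 + 100/190) / 2 ≈ 0.763
```

Reducing across layers instead, as the old code did, would mix token counts from unrelated layers and hide a single badly skewed layer.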
Hi @arpera, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Code Review
This pull request refactors the balancedness calculation in eplb_state.py to use an average of per-layer ratios instead of a global ratio. The feedback suggests using float64 for these calculations to prevent potential precision loss as model depth or batch sizes increase.
Compute per-layer mean and max in float64 to prevent precision loss when summing token counts across many MoE layers. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
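The precision concern behind this commit is easy to reproduce: float32 cannot represent integers above 2^24 exactly, so large accumulated token counts can silently stop incrementing. A plain-Python illustration (Python floats are float64; `struct` is used here only to emulate float32 rounding):

```python
import struct

def f32(x):
    """Round a Python float (float64) down to float32 precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

big = f32(2.0 ** 24)          # 16777216: the edge of exact float32 integers
print(f32(big + 1.0) == big)  # True  - one more token is lost in float32
print(big + 1.0 == big)       # False - float64 keeps it
```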
Valid points. Would it be ok if I heavily refactor this logging function and instead
I think that would be great to expose. Maybe guard it behind a verbosity flag that we set in the eplb config?
Yes, great! One more detail, please: what is the name of this flag in the eplb config?
Oh, sorry for the confusion - I was suggesting you add a new flag to the eplb config.
Yes, I was also thinking about this idea. Now work in progress, stay tuned. |
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Documentation preview: https://vllm--39178.org.readthedocs.build/en/39178/
ilmarkov left a comment
I see this is still in progress — apologies if some of my comments are on things you're already planning to change. Happy to re-review once you're ready.
Nice touch with the ANSI heat coloring. Note, though, that in practice most users won't see it: in production, vLLM logs are typically collected from containers (Docker, Kubernetes). I would suggest considering saving the verbose stats in some structured format (maybe not to stdout/stderr) to ease analysis.
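One way to realize this suggestion is to append one JSON Lines record per logging interval. This is a hedged sketch of the idea only; the function and field names here are hypothetical and not the API the PR ended up with:

```python
import json

def dump_eplb_stats(path, step, tokens):
    """Append one machine-parsable record: tokens[layer][rank] counts at a step."""
    record = {"step": step, "tokens": tokens}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # JSON Lines: one record per line

dump_eplb_stats("eplb_stats.jsonl", 100, [[100, 100], [190, 10]])
```

Each interval appends one line, so the file can be tailed live or loaded afterwards with any JSON Lines reader (e.g. `pandas.read_json(..., lines=True)`).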
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Friendly ping
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@ilmarkov @tlrmchlsmth, the PR is ready for a final review. Please have a look.
ilmarkov left a comment
Thanks for the update! I'd still insist on saving the verbose log into a file for easier analysis. Also added some comments on improving the logged info.
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Hi @arpera, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
ilmarkov left a comment
Thank you for the update!
The non-verbose part of the changes looks good to me. For the verbose part, I'd want some changes if we still keep this piece after the `expert_load_dump_dir` introduction. That said, I don't see the case for manually checking stderr given that we have machine-parsable dumping at the same frequency - maybe for single-node debugging only.
@tlrmchlsmth what do you think?
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@vadiklyutiy, can we merge this? I think I fixed all the issues and the change is ready for merge now.
Purpose
This PR fixes the `balancedness` metric computation in EPLB to correctly reflect the load imbalance per MoE layer.

Previously, balancedness was calculated by computing the average and max load across layers (`dim=0`) instead of across ranks (`dim=1`), and then summing those values.

This PR changes the `balancedness` metric to:

- Compute the mean/max ratio of tokens per rank independently for each MoE layer.
- Average these per-layer ratios over the active layers to obtain the final `balancedness` metric.

This change ensures the logged metric accurately represents the average severity of the bottleneck at each MoE layer.
Additional changes in this PR:

- Add a `log_balancedness_verbose` configuration flag (default: `False`). When enabled, EPLB logs a detailed multi-line report per logging interval, which includes a per-layer / per-rank token table to help debug expert routing.
- Document the new `log_balancedness_verbose` flag in `docs/serving/expert_parallel_deployment.md`.
- Update the `docs/serving/expert_parallel_deployment.md` documentation to include previously added EPLB configuration fields that were missing from the docs:
  - `log_balancedness_interval` (originally introduced in #29499)
  - `communicator` (introduced in #33176)

Test Plan
Manual vLLM local launches to verify the logs.
Validation Result
Should not affect production performance, since balancedness is calculated and printed only when the `log_balancedness` option is set in the vLLM config.

Example of the new verbose logging output:
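The screenshot of the actual output is not reproduced here. As a rough, hypothetical sketch of what a per-layer / per-rank token table can look like (formatting invented for illustration, not the PR's exact output):

```python
def format_report(tokens):
    """tokens[layer][rank] -> fixed-width table with a per-layer mean/max column."""
    num_ranks = len(tokens[0])
    lines = ["layer " + " ".join(f"rank{r}" for r in range(num_ranks)) + " mean/max"]
    for i, layer in enumerate(tokens):
        ratio = (sum(layer) / len(layer)) / max(layer) if max(layer) else 0.0
        cells = " ".join(f"{t:>5}" for t in layer)
        lines.append(f"{i:>5} {cells} {ratio:8.3f}")
    return "\n".join(lines)

print(format_report([[100, 100], [190, 10]]))
```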