[EPLB] Add offline mapping support #41141
Conversation
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Documentation preview: https://vllm--41141.org.readthedocs.build/en/41141/
Code Review
This pull request implements offline Expert Parallel Load Balancing (EPLB) mapping by adding functionality to log expert-load statistics and load static mappings from JSONL files. It includes a new generate_static_mapping.py tool, updates to EplbState for initial weight rearrangement and optional runtime rebalancing disablement, and corresponding documentation and tests. I have no feedback to provide.
ilmarkov left a comment
Thank you for the PR!
I've added comments. Major issues:
- We don't want to sync and dump stats at every step whenever dumping is enabled.
- The initial memory movements have to happen outside of profiling.
- It also needs to be verified how this behaves with elastic EP.
Could you add experiments with non-random data?
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Hi @arpera, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
@arpera and I had a conversation. It seems better to save only the expert stats and decide on expert placement during model loading. That way there is one stats file per model, and any TP configuration can use the same file.
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Purpose
This PR adds an offline mapping mode to vLLM's EPLB, similar to TensorRT-LLM's Offline EP Load Balancer. The goal is to rearrange logical-to-physical expert placement across EP ranks once, ahead of inference, so that the per-rank routed-token load is more even when running with `--enable-expert-parallel`, reducing per-step MoE compute imbalance between workers.
To highlight: the new offline EPLB mode is not a replacement for the existing EPLB mode. Both modes can be enabled at the same time, so you can set an initial expert mapping offline and still adjust it dynamically during inference.
The flow has two phases. First, expert-load statistics are collected by serving the model over a representative workload with `eplb_config.write_stats_path` set; vLLM appends one `eplb_load_stats` JSONL record per EPLB step. Second, the same JSONL is consumed at startup via `eplb_config.read_stats_path`: vLLM aggregates the recorded loads, runs `DefaultEplbPolicy` once against the live deploy topology, and applies the resulting physical-to-logical mapping before warmup. Online rebalancing is off by default (`eplb_config.enable_online=false`), so the mapping stays frozen for the rest of the run; set `enable_online=true` to keep adjusting it dynamically on top.
The biggest gains are on workloads where token routing is close to uniformly random across experts (synthetic random datasets being the extreme case): that's where the default identity placement maximally underutilizes some ranks while overloading others.
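A minimal sketch of the two phases (the `--eplb-config` JSON flag spelling, the model placeholder, and the paths below are assumptions for illustration; see the PR's documentation for the exact CLI syntax):

```bash
# Phase 1: serve with stats logging enabled; vLLM appends one JSONL record
# per EPLB step while a representative workload runs against the server.
vllm serve <model> --enable-expert-parallel \
    --eplb-config '{"write_stats_path": "/tmp/eplb_load_stats.jsonl"}'

# Phase 2: consume the stats at startup; the aggregated loads are fed to
# DefaultEplbPolicy once and the mapping is applied before warmup. With
# enable_online left at its default (false), the mapping then stays frozen.
vllm serve <model> --enable-expert-parallel \
    --eplb-config '{"read_stats_path": "/tmp/eplb_load_stats.jsonl"}'
```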
There was an earlier attempt to add this support to vLLM in PR #26176, but it was not merged.
Test Result
Performance
Hardware: 8xB200, single node.
e2e prefill-heavy random ISL=8192:
First, generate the offline mapping by collecting expert-load statistics:
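A sketch of this step, assuming the phase-1 server from the sketch above is running and that a `vllm bench serve` random workload stands in for the representative traffic (the bench flags are assumptions matching the ISL=8192 setup):

```bash
# Drive a prefill-heavy random workload against the stats-logging server so
# the expert-load JSONL reflects the target traffic (flags are assumptions).
vllm bench serve --model <model> \
    --dataset-name random --random-input-len 8192 --random-output-len 1
```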
Then run the e2e test:
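A sketch of the measurement run, restarting the server in phase-2 mode (again, flag spellings are assumptions):

```bash
# Restart with the collected stats so the balanced mapping is applied once at
# startup, then rerun the same benchmark and compare against the baseline.
vllm serve <model> --enable-expert-parallel \
    --eplb-config '{"read_stats_path": "/tmp/eplb_load_stats.jsonl"}'
vllm bench serve --model <model> \
    --dataset-name random --random-input-len 8192 --random-output-len 1
```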
Best of 5 runs in each variant:
With offline EPLB (`num_redundant_experts=64`): total token throughput +5.7%, mean TTFT −5.5%, P99 TTFT −5.6% versus baseline.
Bonus
For this e2e prefill-heavy benchmark with random inputs I also collected a visual representation of the workload distribution between ranks, per layer and per step. It was really helpful for analyzing the statistics this way. These are interactive, self-contained web pages that render graphs and a table. Have a look if you are interested.
baseline.html
8replicas.html
See how much the imbalance between ranks was reduced with offline EPLB:
Screenshots
baseline (imbalance ~1.6): [screenshot]
8replicas (imbalance ~1.07): [screenshot]
That means the average imbalance dropped to 1.07/1.6 ≈ 0.67 of its baseline value.
In my prefill-heavy benchmark run the whole MoE block consists of three main parts: the first communication, the computation, and the second communication. Based on an nsys profile, computation takes ~20% of the whole MoE block's time. Since the computation imbalance dropped to 0.67 of baseline, the computation share becomes 20% × 0.67 = 13.4%. In the ideal case the total speedup would then be 20% − 13.4% = 6.6%, which is pretty close to the speedup we see in practice (5.7%).
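The same estimate as a back-of-envelope formula, assuming (as the paragraph above does) that MoE compute time scales linearly with the imbalance factor:

$$
\Delta t \approx f_{\text{compute}}\left(1 - \frac{I_{\text{new}}}{I_{\text{old}}}\right) = 0.20 \times \left(1 - \frac{1.07}{1.6}\right) \approx 6.6\%
$$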