[EPLB] Add offline mapping support #41141

Open
arpera wants to merge 12 commits into vllm-project:main from arpera:offline-eplb

Conversation

@arpera
Contributor

@arpera arpera commented Apr 28, 2026

Purpose

This PR adds an offline mapping mode to vLLM's EPLB, similar to TensorRT-LLM's Offline EP Load Balancer. The goal is to rearrange logical-to-physical expert placement across EP ranks once, ahead of inference, so that the per-rank routed-token load is more even when running with --enable-expert-parallel, reducing per-step MoE compute imbalance between workers.

To be clear, the new offline EPLB mode is not a replacement for the existing online EPLB mode: both can be enabled at the same time, so you can set an initial expert mapping offline and still adjust it dynamically during inference.

The flow has two phases. First, expert-load statistics are collected by serving the model over a representative workload with eplb_config.write_stats_path set — vLLM appends one eplb_load_stats JSONL record per EPLB step. Second, the same JSONL is consumed at startup via eplb_config.read_stats_path: vLLM aggregates the recorded loads, runs DefaultEplbPolicy once against the live deploy topology, and applies the resulting physical-to-logical mapping before warmup. Online rebalancing is off by default (eplb_config.enable_online=false), so the mapping stays frozen for the rest of the run; set enable_online=true to keep adjusting it dynamically on top.
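For intuition, the two phases can be sketched in plain Python. This is a minimal sketch, not vLLM's implementation: the JSONL field names `layer` and `expert_load`, and both helper names, are assumptions for illustration only.

```python
import json
from collections import defaultdict

def aggregate_load_stats(path: str) -> dict[int, dict[int, float]]:
    """Phase 1 -> aggregate: sum per-expert loads over all recorded steps.

    Assumes each JSONL line holds a "layer" index and an "expert_load"
    list with the tokens routed to each logical expert at that step.
    """
    totals: dict[int, dict[int, float]] = defaultdict(lambda: defaultdict(float))
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            for expert, load in enumerate(rec["expert_load"]):
                totals[rec["layer"]][expert] += load
    return totals

def greedy_placement(loads: dict[int, float], num_ranks: int) -> list[list[int]]:
    """Phase 2 sketch: place experts heaviest-first onto the currently
    least-loaded rank that still has a free slot. One replica per expert
    here; redundant replicas would additionally duplicate the hottest ones.
    """
    slots_per_rank = len(loads) // num_ranks
    rank_load = [0.0] * num_ranks
    placement: list[list[int]] = [[] for _ in range(num_ranks)]
    for expert, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        rank = min((r for r in range(num_ranks)
                    if len(placement[r]) < slots_per_rank),
                   key=lambda r: rank_load[r])
        placement[rank].append(expert)
        rank_load[rank] += load
    return placement
```

For example, with loads `{0: 10, 1: 1, 2: 9, 3: 2}` on 2 ranks this yields `[[0, 1], [2, 3]]`, balancing both ranks at a load of 11.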

The biggest gains are on workloads where token routing is close to uniformly random across experts (synthetic random datasets being the extreme case) — that's where the default identity placement maximally underutilizes some ranks while overloading others.
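For intuition about where that imbalance comes from, here is a toy calculation of the imbalance metric (max per-rank load divided by mean per-rank load, the metric reported in the screenshots below) under a skewed expert-load distribution. The Zipf-like load profile is an illustrative assumption, not the model's actual routing.

```python
def rank_imbalance(expert_load: list[float], rank_of_expert: list[int],
                   num_ranks: int) -> float:
    """max per-rank load / mean per-rank load; 1.0 is perfectly balanced."""
    loads = [0.0] * num_ranks
    for expert, load in enumerate(expert_load):
        loads[rank_of_expert[expert]] += load
    return max(loads) / (sum(loads) / num_ranks)

num_experts, num_ranks = 64, 8
# Zipf-like skew: expert e carries load proportional to 1/(e+1).
expert_load = [1.0 / (e + 1) for e in range(num_experts)]

# Default identity placement: experts 0..7 on rank 0, 8..15 on rank 1, ...
identity = [e // (num_experts // num_ranks) for e in range(num_experts)]
# Striping the experts across ranks spreads the hot ones out.
striped = [e % num_ranks for e in range(num_experts)]

print(rank_imbalance(expert_load, identity, num_ranks))  # badly imbalanced
print(rank_imbalance(expert_load, striped, num_ranks))   # lower, but the
# hottest expert still dominates its rank; replicating it across ranks
# (num_redundant_experts) and splitting its load pushes this toward 1.0.
```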

There was an earlier attempt in vLLM (PR #26176) to add this support, but it was never merged.

Test Result

Performance

Hardware: 8xB200, single node.

e2e prefill-heavy random ISL=8192:

First, collect the expert-load stats that the offline mapping will be generated from:

# Run server
vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --port 8000 \
    -tp 1 -pp 1 -dp 8 \
    --enable-expert-parallel \
    --language-model-only \
    --reasoning-parser qwen3 \
    --stream-interval 100 \
    --enable-eplb \
    --eplb-config '{"write_stats_path":"./baseline.jsonl"}'

Then run the e2e test:

# Server (baseline, identity placement, no rearrangement)
vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --port 8000 \
    -tp 1 -pp 1 -dp 8 \
    --enable-expert-parallel \
    --language-model-only \
    --reasoning-parser qwen3 \
    --stream-interval 100 \
    --enable-eplb

# Server (offline mapping, num_redundant_experts=64 -> 8 replicas/rank)
vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --port 8000 \
    -tp 1 -pp 1 -dp 8 \
    --enable-expert-parallel \
    --language-model-only \
    --reasoning-parser qwen3 \
    --stream-interval 100 \
    --enable-eplb \
    --eplb-config '{"read_stats_path":"./baseline.jsonl","num_redundant_experts":64}'

# Client, 5 runs each
vllm bench serve --backend vllm \
    --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --port 8000 \
    --endpoint /v1/completions \
    --ignore-eos --temperature 0.0 \
    --dataset-name random --random-input 8192 --random-output 1 \
    --num-prompt 1024 --max-concurrency 128

Best of 5 runs in each variant:

| Metric | Baseline | Offline EPLB (num_redundant_experts=64) |
| --- | --- | --- |
| Benchmark duration (s) | 45.30 | 42.84 |
| Request throughput (req/s) | 22.60 | 23.90 |
| Total token throughput (tok/s) | 185195.58 | 195819.45 |
| Mean TTFT (ms) | 5302.39 | 5011.73 |
| Median TTFT (ms) | 5580.84 | 5270.45 |
| P99 TTFT (ms) | 6640.33 | 6267.26 |

Total token throughput +5.7%, mean TTFT −5.5%, P99 TTFT −5.6% with the offline EPLB.

Bonus

For this e2e prefill-heavy benchmark with random inputs I also collected a visual representation of the per-rank workload distribution on each layer at each step. It was really helpful for analyzing the statistics this way. Each report is an interactive, self-contained web page that renders graphs and a table; have a look if you are interested.

baseline.html
8replicas.html

See how much imbalance between ranks was reduced using offline EPLB:

Screenshots

baseline (imbalance ~1.6):
image

8replicas (imbalance ~1.07):
image

That means average imbalance dropped to 1.07/1.6 ≈ 0.67 of the baseline value.

In my prefill-heavy benchmark run the MoE block consists of three main parts: a first communication phase, computation, and a second communication phase. Based on an nsys profile, computation takes ~20% of the time of the whole MoE block. Since imbalance in the computation part decreased to 0.67 of baseline, computation now takes 20% × 0.67 = 13.4%. In the ideal case the total speedup would then be 20% − 13.4% = 6.6%, which is pretty close to the speedup observed in practice (5.7%).
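Writing the estimate out explicitly (numbers taken from the measurements above):

```python
# Measured rank imbalance (max / mean per-rank load) before and after.
baseline_imbalance = 1.6
offline_imbalance = 1.07
ratio = offline_imbalance / baseline_imbalance      # ~0.67

# From the nsys profile, computation is ~20% of the MoE block, and the
# slowest rank gates each step, so compute time scales with imbalance.
compute_share = 0.20
new_compute_share = compute_share * ratio           # ~0.134
ideal_speedup = compute_share - new_compute_share   # ~0.066 -> 6.6%

# Measured end-to-end gain from the token-throughput numbers above.
measured_speedup = 195819.45 / 185195.58 - 1        # ~0.057 -> 5.7%
print(ideal_speedup, measured_speedup)
```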


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify
Contributor

mergify Bot commented Apr 28, 2026

Documentation preview: https://vllm--41141.org.readthedocs.build/en/41141/

@mergify mergify Bot added the documentation Improvements or additions to documentation label Apr 28, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements offline Expert Parallel Load Balancing (EPLB) mapping by adding functionality to log expert-load statistics and load static mappings from JSONL files. It includes a new generate_static_mapping.py tool, updates to EplbState for initial weight rearrangement and optional runtime rebalancing disablement, and corresponding documentation and tests. I have no feedback to provide.

@vadiklyutiy vadiklyutiy self-requested a review April 29, 2026 19:50
Contributor

@ilmarkov ilmarkov left a comment


Thank you for the PR!

Added the comments. Major issues:

  1. We don't want to sync and dump stats at every step when dumping is enabled.
  2. The initial expert-weight movements have to happen outside of profiling.
  3. It also needs to be verified how this behaves with elastic EP.

Could you add experiments with non-random data?

Comment thread vllm/distributed/eplb/eplb_state.py Outdated
Comment thread vllm/distributed/eplb/eplb_state.py Outdated
Comment thread vllm/distributed/eplb/eplb_state.py Outdated
Comment thread vllm/distributed/eplb/eplb_state.py Outdated
Comment thread tools/eplb/generate_static_mapping.py Outdated
Comment thread vllm/distributed/eplb/eplb_state.py Outdated
Comment thread vllm/distributed/eplb/eplb_state.py Outdated
arpera added 2 commits May 5, 2026 18:43
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@arpera arpera requested a review from njhill as a code owner May 5, 2026 16:52
@mergify mergify Bot added the v1 label May 5, 2026
arpera added 5 commits May 5, 2026 20:46
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@mergify
Contributor

mergify Bot commented May 7, 2026

Hi @arpera, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@vadiklyutiy
Collaborator

@arpera and I had a conversation. It seems it would be better to save only the expert stats and make the expert-placement decision during model loading. That allows one stats file per model to be reused for any TP* configuration.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@mergify mergify Bot added the frontend label May 8, 2026
mergify Bot commented May 8, 2026: the pre-commit checks have failed (same instructions as above).

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
mergify Bot commented May 8, 2026: the pre-commit checks have failed (same instructions as above).

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
mergify Bot commented May 9, 2026: the pre-commit checks have failed (same instructions as above).

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
mergify Bot commented May 9, 2026: the pre-commit checks have failed (same instructions as above).
