[EPLB] Add offline mapping support #41141
Conversation
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Documentation preview: https://vllm--41141.org.readthedocs.build/en/41141/
Code Review
This pull request implements offline Expert Parallel Load Balancing (EPLB) mapping by adding functionality to log expert-load statistics and load static mappings from JSONL files. It includes a new generate_static_mapping.py tool, updates to EplbState for initial weight rearrangement and optional runtime rebalancing disablement, and corresponding documentation and tests. I have no feedback to provide.
ilmarkov left a comment
Thank you for the PR!
I've added comments. Major issues:
- We don't want to sync and dump stats at every step whenever dumping is enabled.
- The initial memory movements have to happen outside of profiling.
- It also needs to be verified how this behaves with elastic EP.
Could you add experiments with non-random data?
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Hi @arpera, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
@arpera and I had a conversation. It seems better to save only the expert stats and decide on expert placement during model loading. That way there is one stats file per model, and any TP configuration can use the same file.
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Purpose
This PR adds an offline mapping mode to vLLM's EPLB, similar to TensorRT-LLM's Offline EP Load Balancer. The goal is to rearrange logical-to-physical expert placement across EP ranks once, ahead of inference, so that the per-rank routed-token load is more even when running with `--enable-expert-parallel`, reducing per-step MoE compute imbalance between workers.
To highlight: the new offline EPLB mode is not a replacement for the existing EPLB mode. Both modes can be enabled at the same time, so you can set an initial expert mapping offline and still adjust it dynamically during inference.
The flow has two phases. First, expert-load statistics are collected by serving the model over a representative workload with `eplb_config.write_stats_path` set; vLLM appends one `eplb_load_stats` JSONL record per EPLB step. Second, the same JSONL is consumed at startup via `eplb_config.read_stats_path`: vLLM aggregates the recorded loads, runs `DefaultEplbPolicy` once against the live deploy topology, and applies the resulting physical-to-logical mapping before warmup. Online rebalancing is off by default (`eplb_config.enable_online=false`), so the mapping stays frozen for the rest of the run; set `enable_online=true` to keep adjusting it dynamically on top.
The biggest gains are on workloads where token routing is close to uniformly random across experts (synthetic random datasets being the extreme case): that's where the default identity placement maximally underutilizes some ranks while overloading others.
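A minimal sketch of the two phases (the `--eplb-config` JSON flag spelling, the model placeholder, and the paths below are assumptions for illustration; see the PR's documentation for the exact CLI syntax):

```bash
# Phase 1: serve with stats logging enabled; vLLM appends one JSONL record
# per EPLB step while a representative workload runs against the server.
vllm serve <model> --enable-expert-parallel \
    --eplb-config '{"write_stats_path": "/tmp/eplb_load_stats.jsonl"}'

# Phase 2: consume the stats at startup; the aggregated loads are fed to
# DefaultEplbPolicy once and the mapping is applied before warmup. With
# enable_online left at its default (false), the mapping then stays frozen.
vllm serve <model> --enable-expert-parallel \
    --eplb-config '{"read_stats_path": "/tmp/eplb_load_stats.jsonl"}'
```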
There was an earlier attempt to add this support to vLLM in PR #26176, but it was not merged.
Test Result
Performance
Hardware: 8xB200, single node.
e2e prefill-heavy random ISL=8192:
First, generate the offline mapping by collecting expert-load statistics:
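A sketch of this step, assuming the phase-1 server from the sketch above is running and that a `vllm bench serve` random workload stands in for the representative traffic (the bench flags are assumptions matching the ISL=8192 setup):

```bash
# Drive a prefill-heavy random workload against the stats-logging server so
# the expert-load JSONL reflects the target traffic (flags are assumptions).
vllm bench serve --model <model> \
    --dataset-name random --random-input-len 8192 --random-output-len 1
```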
Then run the e2e test:
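A sketch of the measurement run, restarting the server in phase-2 mode (again, flag spellings are assumptions):

```bash
# Restart with the collected stats so the balanced mapping is applied once at
# startup, then rerun the same benchmark and compare against the baseline.
vllm serve <model> --enable-expert-parallel \
    --eplb-config '{"read_stats_path": "/tmp/eplb_load_stats.jsonl"}'
vllm bench serve --model <model> \
    --dataset-name random --random-input-len 8192 --random-output-len 1
```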
Best of 5 runs in each variant:
With offline EPLB (`num_redundant_experts=64`): total token throughput +5.7%, mean TTFT −5.5%, P99 TTFT −5.6% versus baseline.
Bonus
For this e2e prefill-heavy benchmark with random inputs I also collected a visual representation of the workload distribution between ranks, per layer and per step. It was really helpful for analyzing the statistics this way. These are interactive, self-contained web pages that render graphs and a table. Have a look if you are interested.
baseline.html
8replicas.html
See how much the imbalance between ranks was reduced with offline EPLB:
Screenshots
baseline (imbalance ~1.6): [screenshot]
8replicas (imbalance ~1.07): [screenshot]
That means the average imbalance dropped to 1.07/1.6 ≈ 0.67 of its baseline value.
In my prefill-heavy benchmark run the whole MoE block consists of three main parts: the first communication, the computation, and the second communication. Based on an nsys profile, computation takes ~20% of the whole MoE block's time. Since the computation imbalance dropped to 0.67 of baseline, the computation share becomes 20% × 0.67 = 13.4%. In the ideal case the total speedup would then be 20% − 13.4% = 6.6%, which is pretty close to the speedup we see in practice (5.7%).
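The same estimate as a back-of-envelope formula, assuming (as the paragraph above does) that MoE compute time scales linearly with the imbalance factor:

$$
\Delta t \approx f_{\text{compute}}\left(1 - \frac{I_{\text{new}}}{I_{\text{old}}}\right) = 0.20 \times \left(1 - \frac{1.07}{1.6}\right) \approx 6.6\%
$$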