[MoE][Offload] Run MoE models exceeding VRAM via expert CPU offloading with GPU cache (--moe-expert-cache-size)#37190
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
There was a problem hiding this comment.
Code Review
This pull request introduces a dynamic LRU cache for MoE expert weights, a valuable feature for reducing GPU memory consumption. The implementation is well-structured, adding new configurations, a dedicated LRU cache class, and integrating it into the MoE layer. The new tests for correctness are also a great addition. My main feedback focuses on a performance issue within the LRU cache implementation itself, which could be optimized for better efficiency, especially with larger cache sizes.
| for expert_id in unique_ids: | ||
| if expert_id in self._expert_to_slot: | ||
| self._lru_order.remove(expert_id) | ||
| self._lru_order.append(expert_id) | ||
| self.hits += 1 | ||
| else: | ||
| if self._free_slots: | ||
| slot = self._free_slots.pop() | ||
| else: | ||
| evicted = self._lru_order.pop(0) | ||
| slot = self._expert_to_slot.pop(evicted) | ||
|
|
||
| self._buf_w13[slot].copy_(self._cpu_w13[expert_id]) | ||
| self._buf_w2[slot].copy_(self._cpu_w2[expert_id]) | ||
|
|
||
| self._expert_to_slot[expert_id] = slot | ||
| self._lru_order.append(expert_id) | ||
| self.misses += 1 |
There was a problem hiding this comment.
The current LRU cache implementation uses a list for _lru_order, which results in O(N) complexity for remove() and pop(0) operations, where N is the cache capacity. This can become a performance bottleneck for larger cache sizes.
To improve performance to O(1) for these operations, I recommend refactoring the LRU logic to use collections.OrderedDict.
This would involve the following changes:
-
In
__init__, change_lru_orderto anOrderedDict:from collections import OrderedDict # ... # LRU state (Python-only; must stay outside torch.compile). self._expert_to_slot: dict[int, int] = {} self._free_slots: list[int] = list(range(capacity)) # Front = least-recently-used expert ID. self._lru_order: OrderedDict[int, None] = OrderedDict()
-
Update the
preparemethod to useOrderedDictmethods for efficient LRU management, as shown in the suggestion below.
| for expert_id in unique_ids: | |
| if expert_id in self._expert_to_slot: | |
| self._lru_order.remove(expert_id) | |
| self._lru_order.append(expert_id) | |
| self.hits += 1 | |
| else: | |
| if self._free_slots: | |
| slot = self._free_slots.pop() | |
| else: | |
| evicted = self._lru_order.pop(0) | |
| slot = self._expert_to_slot.pop(evicted) | |
| self._buf_w13[slot].copy_(self._cpu_w13[expert_id]) | |
| self._buf_w2[slot].copy_(self._cpu_w2[expert_id]) | |
| self._expert_to_slot[expert_id] = slot | |
| self._lru_order.append(expert_id) | |
| self.misses += 1 | |
| for expert_id in unique_ids: | |
| if expert_id in self._expert_to_slot: | |
| self._lru_order.move_to_end(expert_id) | |
| self.hits += 1 | |
| else: | |
| if self._free_slots: | |
| slot = self._free_slots.pop() | |
| else: | |
| evicted, _ = self._lru_order.popitem(last=False) | |
| slot = self._expert_to_slot.pop(evicted) | |
| self._buf_w13[slot].copy_(self._cpu_w13[expert_id]) | |
| self._buf_w2[slot].copy_(self._cpu_w2[expert_id]) | |
| self._expert_to_slot[expert_id] = slot | |
| self._lru_order[expert_id] = None | |
| self.misses += 1 |
There was a problem hiding this comment.
Fixed in 8fc9268 — replaced list-based _lru_order with collections.OrderedDict. move_to_end() for hits and popitem(last=False) for eviction are both O(1).
alvinttang
left a comment
There was a problem hiding this comment.
This is a well-designed feature — the LRU expert cache is a natural approach for running MoE models that exceed GPU memory. The implementation is clean and the code is well-documented. Here's a detailed review:
1. Thread safety concern in ExpertLRUCache.prepare()
The prepare() method mutates _expert_to_slot, _free_slots, and _lru_order without synchronization. In vLLM's current architecture, the forward pass is single-threaded on the model runner, so this is fine today. But if vLLM ever moves to concurrent forward passes (e.g., disaggregated prefill/decode with shared model weights), this would race. Worth a comment noting the single-threaded assumption.
2. Synchronous H2D copies in prepare() are a latency bottleneck
Each cache miss does a synchronous copy_() from CPU pinned memory to GPU. For large expert weights (e.g., DeepSeek-V2's 160 experts with ~7M params each), a miss could take 1-2ms per expert. If multiple misses occur in one forward pass (common with top-k=6 routing), this serialized copy could add 5-10ms per layer.
Consider using torch.cuda.Stream for async H2D copies with an event-based sync, or batching all misses into a single torch.cat + copy. The current approach is correct but may significantly impact throughput in practice.
3. The mapping tensor in prepare() is recreated every call
mapping = torch.zeros(self._num_experts, dtype=torch.int64)
for expert_id, slot in self._expert_to_slot.items():
mapping[expert_id] = slot
mapping = mapping.to(device=topk_ids.device)This allocates a new CPU tensor, fills it with a Python loop, and transfers it to GPU on every forward pass. For a model with 160 experts and 60+ layers, this adds up. Consider keeping a persistent _mapping tensor on GPU and only updating the changed entries in-place.
4. _forward_with_expert_cache bypasses several runner features
The cache forward path calls fused_experts() directly, bypassing the normal runner's handling of:
w13_bias/w2_bias(MoE layers with bias)- Expert-parallel scatter/gather
- Scale tensors for quantized weights (
w13_weight_scale,w2_weight_scale) - Custom activation functions beyond
self.activation
The EP and quantization incompatibilities are documented, but the bias case isn't mentioned. If any MoE model uses bias terms, this path would silently produce wrong results.
5. Missing enforce_eager validation
The docstring says --enforce-eager is required, but I don't see validation that rejects moe_expert_cache_size > 0 when enforce_eager=False. The @torch.compiler.disable decorator on _forward_with_expert_cache helps, but if CUDA graphs are used at a higher level, the dynamically changing buffer contents would cause correctness issues. Consider adding a config validator that errors out if moe_expert_cache_size > 0 and not enforce_eager.
6. Memory accounting
When expert weights are allocated on CPU pinned memory, vLLM's GPU memory profiler won't account for them. This means gpu_memory_utilization calculations will over-estimate available KV cache memory by the amount of expert weight memory that was moved to CPU. The profiler may need to be made aware of the CPU pinned allocation to avoid OOM during KV cache allocation.
7. Tests are good but limited
The correctness test (compare_two_settings) verifies output token matching, which is the most important thing. Consider also testing:
- Cache hit/miss counters (to verify the LRU logic is working)
- Edge case:
cache_size >= num_experts(all experts fit, no eviction) - Edge case:
cache_size = 1(maximum eviction pressure)
Overall this is a solid first implementation of MoE expert offloading. The main production concerns are the synchronous H2D copy latency and the missing enforce_eager validation.
|
Thanks for the thorough review @alvinttang! Addressing each point: 1. Thread safety — Added a comment in 2. Synchronous H2D copies — Agreed, this is the main latency bottleneck. Async H2D with double-buffered CUDA streams (the "DBO scheduling" from RFC #33869) is the top item in the planned PR 2. Mentioning it here so it's on record. 3. Persistent mapping tensor — Implemented in 68c81df. 4. Bias bypass — Guard added in 68c81df: 5. enforce_eager guard — In the code since 68c81df. From if self._moe_expert_cache_size > 0 and (
not vllm_config.model_config.enforce_eager
):
logger.warning(
"moe_expert_cache_size requires --enforce-eager; ..."
)
self._moe_expert_cache_size = 0The cache is silently disabled (not just warned) when 6. Memory accounting — Valid concern. The GPU profiler won't see CPU-pinned allocations, so it will over-allocate KV cache against memory that expert weights no longer occupy. This is actually a benefit (more KV cache headroom), not a hazard — the expert weights are no longer on GPU. But you're right that if someone relies on 7. Tests — 16 unit tests in |
4db08e9 to
618392a
Compare
|
Hi @e1n00r, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
29afd27 to
6af6bba
Compare
|
Hi @e1n00r, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
Also check this paper: https://arxiv.org/html/2410.17954v1 Instead of LRU, they load with a predictor: "ExpertFlow consists of three key components: the Routing Path Predictor, the Expert Cache Engine, and the Token Scheduler. Leveraging the three synergistic components of our system, ExpertFlow achieves an average GPU memory savings of 75.4%, with peak savings reaching up to 93.72%, compared to GPU-only solutions. Furthermore, ExpertFlow attains an expert cache hit ratio of up to 91.96%, improving the hit ratio by an average of 27.65% over the LRU caching strategy. Additionally, ExpertFlow delivers a 2 to 10 times increase in inference speed." |
If I do that we just made powerinfer again, which is a well established solution in its own right. |
|
Well, apparently there's quite a few options here:
But these are not better than Predictor-based systems (e.g., ProMoE, ExpertFlow) and Learned replacement (e.g., FlashMoE ML policy). Strong non-ML alternatives: Of course, that's all for another PR, it's important to at least get this caching strategy ball rolling - the possible speedups seem to be massive. Maybe it would be nice to make the strategy pluggable? |
70e10ed to
41f367e
Compare
|
Note on commit history (5 logical commits, DCO-signed with elnur.abdullaev@sonia.so):
All features mentioned in earlier review comments remain unchanged:
|
|
Documentation preview: https://vllm--37190.org.readthedocs.build/en/37190/ |
|
The caching layer has been refactored with the strategy pattern: a new
LRU, LFU, and FIFO are thin wrappers around Your option list (LCP, ARC, LIRS, reuse-distance, MoE routing-frequency-based) are all excellent follow-ons — especially the routing-frequency-based ones that exploit the known structure of MoE routing distributions. The strategy pattern makes adding new policies a ~30 LOC drop-in. An ExpertFlow-style predictor (offline trained on routing sequences) is deferred — the cache needs to be stable and observable first so we can collect the routing statistics needed to train one. PRs 2–4 in the series will add async H2D, EPLB integration, and hit/miss telemetry export; the predictor is a natural PR 5 once that data is flowing. |
Introduces the `moe_expert_cache_size` and `moe_expert_cache_policy` fields to `OffloadConfig`, a cross-config validator in `VllmConfig` that requires `--enforce-eager` when the cache is enabled, and exposes both settings via the `--moe-expert-cache-size` / `--moe-expert-cache-policy` CLI arguments and the `LLM` Python API. Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces two new modules: - `cache_policy.py`: `ExpertCachePolicy` ABC with LRU, LFU, FIFO, and SLRU implementations via `cachetools` and a pure-Python `SLRUPolicy`. A `create_cache_policy()` factory selects the policy by name. - `lru_cache.py`: `ExpertLRUCache` — a fixed-capacity GPU scratch buffer backed by CPU pinned memory. On each forward pass, `prepare()` loads missing experts from CPU to GPU (H2D), evicts according to the chosen policy, and returns slot-remapped `topk_ids` via a persistent GPU mapping tensor (no per-call allocation). Hit/miss stats are logged at DEBUG level every 60 s via `VLLM_LOGGING_LEVEL=DEBUG`. Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
00dbdd7 to
dca8b80
Compare
|
Hi @e1n00r, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Wires the expert cache into the MoE layer: - `fused_moe_method_base.py`: adds `supports_expert_lru_cache` property (default False) for quant methods to opt in. - `layer.py`: initialises `_expert_lru_cache` in `__init__` (guards for EP, DP/SP, and enforce_eager), adds `_maybe_init_expert_lru_cache()` called after weight loading, and `_forward_with_expert_cache()` which handles the GPU fast path and a CPU fallback (`_moe_forward_cpu()`) for overflow batches. Per-layer hit/miss stats are emitted at INFO level every 300 s when `--enable-logging-iteration-details` is set. - `unquantized_fused_moe_method.py`: allocates expert weights in CPU pinned memory when the cache is requested and calls `_maybe_init_expert_lru_cache`. - `fp8.py`: sets `supports_expert_lru_cache = True` for the per-tensor FP8 path; scale tensors are registered and kept in slot-indexed GPU buffers. Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `test_expert_lru_cache.py` (18 tests): cold miss, hit-on-repeat, LRU eviction, invalidate, slot remapping, dtype preservation, GPU buffer content correctness after eviction, pinned backing store, FP8 scale buffers, overflow guard, and boundary capacities (cache==num_experts, capacity==1). - `test_cache_policy.py`: unit tests for LRU, LFU, FIFO, and SLRU policies via the `ExpertCachePolicy` ABC — hit/miss, eviction ordering, capacity boundary, and multi-policy parametrisation. - `test_moe_expert_cache.py`: end-to-end correctness via vLLM's `compare_two_settings` (with vs without cache on a small MoE model). - `.buildkite/test_areas/basic_correctness.yaml`: registers the new end-to-end test for CI. Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds `docs/features/moe_cache_policies.md` describing the four eviction policies (LRU, LFU, FIFO, SLRU), when to use each, CLI usage examples, hardware requirements, and current limitations. Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dca8b80 to
bd27b29
Compare
|
@mgoin — friendly ping for review when you have a moment. This PR adds dynamic MoE expert CPU offloading with a GPU LRU cache ( What's changed since the last round of comments (2026-03-18):
Scope is intentionally minimal (~500 LOC Python, no C++, no new kernels). Async H2D pipeline and cross-layer prefetch are planned for PR 2 (depends on this merge). The architecture is designed so that EP support, CUDA graphs, and additional quant formats are natural extensions — see the design notes in the PR description. Happy to address any feedback. |
…ject#38256) Replace ExpertLRUCache + cache_policy.py with a clean ExpertWeightProvider ABC. The cache is now a weight provider, not a special forward path — the kernel does not know or care where weights came from. Key changes: - New expert_weight_provider.py with CachedWeightProvider (LRU via OrderedDict, no cachetools dependency) and FullGPUProvider - Delete cache_policy.py (no multi-policy: LRU only in PR1) - Delete lru_cache.py (replaced by CachedWeightProvider) - Provider intercept in FusedMoEModularMethod.apply(), UnquantizedFusedMoEMethod.forward_cuda(), and Fp8MoEMethod.apply() - Remove _forward_with_expert_cache() and _moe_forward_cpu() from layer.py — all cache logic flows through apply() now - Silent config downgrades replaced with raise ValueError - Simplify offload.py policy field to Literal["lru"] - Rewrite tests for CachedWeightProvider API (20/20 passing) - Delete test_cache_policy.py - Update docs to reference RFC vllm-project#38256 AI-assisted development (Claude Code) Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Refactored to match RFC #38256:
Note on end-to-end test: My hardware (RTX PRO 2000, SM 12.0) has a CUDA toolkit version mismatch (system nvcc 12.0, PyTorch cu128) that prevents compiling vLLM's flash_attn for this arch. The OLMoE-1B-7B results in the description are from a prior run with the same cache logic. Unit tests validate all cache paths independently. Happy to re-run end-to-end on CI or if someone with a standard GPU setup can test. |
When unique experts per forward pass exceed cache capacity (common during prefill with high top_k), truncate to capacity and log a warning instead of raising RuntimeError. Decode always has exact results since top_k is typically much smaller than capacity. Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so> Co-authored-by: Claude <noreply@anthropic.com>
Purpose
Implements
ExpertWeightProvider— a weight provider abstraction for MoE expert offloading with GPU LRU cache, addressing RFC #38256.Expert weights live in CPU pinned memory; a fixed-size GPU cache holds the hottest N experts per layer. LRU eviction adapts to runtime routing — hot experts stay cached, cold ones are evicted. Models that exceed GPU VRAM can now run on smaller hardware.
Key architectural choice: The cache is a weight provider, not a special forward path. No bypass of the runner pipeline — all paths go through
runner.forward()→quant_method.apply(). EP dispatch, DP chunking, and shared expert overlap work unchanged.References:
Architecture
Integration at
FusedMoEModularMethod.apply(): replaces directlayer.w13_weightaccess withprovider.prepare(topk_ids). The provider returns GPU-resident weight tensors and remappedtopk_ids(slot indices). The kernel doesn't know or care where weights came from.torch.compile compatibility
prepare()decorated with@torch.compiler.disable— cache management stays outside compiled regionsExpertWeightResultuses fixed attributes (nodict, no boolean flags) — avoids graph breaks--enforce-eager; PR 2 will add custom ops (following [offloader] v2: Hide weight onloading latency via prefetching #29941 pattern) to remove this requirementTest results
Hardware:
Unit tests: 20/20 passing
OLMoE-1B-7B-0924 (prior run, same cache logic):
--moe-expert-cache-size 0)--moe-expert-cache-size 16)Production validation (tinyserve, same techniques, different codebase):
device_map="auto"Caveat: tinyserve numbers are single-stream on a laptop GPU. Multi-user batched inference on H100 will have different bottlenecks.
Changes
11 files, ~600 net additions (after deleting bypass code + multi-policy code)
expert_weight_provider.py(new)ExpertWeightProviderABC,FullGPUProvider(passthrough),CachedWeightProvider(GPU LRU + CPU pinned backing),ExpertWeightResultdataclassfused_moe_modular_method.pyapply()layer.pyunquantized_fused_moe_method.pyfp8.pyoffload.pymoe_cache_policies.mdtest_expert_lru_cache.pyCachedWeightProviderDeleted:
cache_policy.py,lru_cache.py,test_cache_policy.py, CPU fallback codeHow it works
Limitations (PR 1)
--enforce-eagerrequired — CUDA graph compat deferred to PR 2Test plan
Planned follow-ups (RFC #38256)
AI-assisted development (Claude Code). Architecture validated in tinyserve.
Essential Elements Checklist