[MoE][Offload] Run MoE models exceeding VRAM via expert CPU offloading with GPU cache (--moe-expert-cache-size) #37190

Open

e1n00r wants to merge 3 commits into vllm-project:main from e1n00r:feature/moe-expert-lru-cache

Commits on Apr 9, 2026

[MoE][Offload] CachedWeightProvider — run MoE models exceeding VRAM via expert CPU offloading

committed
fix: mypy errors — assert best_key/experts_cls not None, ruff format
e1n00r committed

fix(moe-cache): raise RuntimeError on prefill overflow instead of silent truncation

e1n00r and claude committed
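
For reviewers skimming the commit list: the sketch below illustrates the general idea named in the title and branch (experts kept in CPU memory, with a fixed number of GPU slots managed as an LRU cache) and the behavior described in the last commit (raise `RuntimeError` when one step needs more distinct experts than the cache can hold, rather than silently dropping some). It is not the PR's code. `ExpertLRUCache`, `prepare`, and the string "weights" are hypothetical names chosen for illustration, the real `CachedWeightProvider` may be structured quite differently, and a dictionary assignment stands in for the host-to-device copy.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Toy model of a GPU-resident expert cache (hypothetical, not vLLM's API)."""

    def __init__(self, capacity, cpu_weights):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity          # number of expert slots on the GPU
        self.cpu_weights = cpu_weights    # expert_id -> weights held in host memory
        self.gpu_slots = OrderedDict()    # expert_id -> weights, least recent first
        self.hits = 0
        self.misses = 0

    def prepare(self, expert_ids):
        """Make every expert required by one forward step resident at once."""
        needed = list(dict.fromkeys(expert_ids))  # de-duplicate, keep order
        if len(needed) > self.capacity:
            # Fail loudly instead of silently truncating the expert set.
            raise RuntimeError(
                f"step needs {len(needed)} experts, cache holds {self.capacity}"
            )
        # First pass: refresh experts that are already resident so they
        # cannot be evicted while the missing ones are loaded.
        for eid in needed:
            if eid in self.gpu_slots:
                self.hits += 1
                self.gpu_slots.move_to_end(eid)
        # Second pass: load the missing experts, evicting the least
        # recently used slot whenever the cache is full.
        for eid in needed:
            if eid not in self.gpu_slots:
                self.misses += 1
                if len(self.gpu_slots) >= self.capacity:
                    self.gpu_slots.popitem(last=False)
                # Stand-in for a host-to-device weight copy.
                self.gpu_slots[eid] = self.cpu_weights[eid]
        return [self.gpu_slots[eid] for eid in needed]


if __name__ == "__main__":
    cache = ExpertLRUCache(2, {0: "w0", 1: "w1", 2: "w2"})
    cache.prepare([0, 1, 0])   # two misses, both experts loaded
    cache.prepare([2, 1])      # expert 1 hits, expert 0 is evicted for expert 2
    print(list(cache.gpu_slots), cache.hits, cache.misses)
    try:
        cache.prepare([0, 1, 2])
    except RuntimeError as exc:
        print("overflow:", exc)
```

In this toy version a prefill batch that routes tokens to more distinct experts than there are slots fails immediately, which makes an undersized cache visible to the caller instead of producing wrong outputs.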