MoE: Port PyVLLM MoE expert mapping logic #287
sempervictus
commented
Apr 2, 2026
- Added build_initial_global_physical_to_logical_map() to create expert ID mappings with modulo wrapping for redundant experts
- Added has_base_layer_prefix() helper for checkpoint format detection
- Added make_expert_params_mapping() mirroring Python vLLM SharedFusedMoE.make_expert_params_mapping
- Added make_expert_params_mapping_with_base_layer() for PEFT adapter support
- Added make_expert_fp8_weight_scale_mapping() and make_expert_fp8_activation_scale_mapping() for FP8 quantization
- Added load_fused_expert_weights() helper for per-expert weight loading
- Added load_packed_with_mapping() using logical expert IDs via mapping
- Added load_packed_physical() legacy function for packed checkpoint format
- Added load_packed() dispatch function that auto-detects checkpoint format
- Dispatch logic: checks for "gate_up_proj" key to determine packed vs per-expert format
- Verified: cargo check --features cuda passes
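A minimal sketch of two items from the list above: the modulo-wrapped physical-to-logical map and the "gate_up_proj" format check. This is not the code in this PR. The function bodies, the `CheckpointFormat` enum, and `detect_checkpoint_format` are hypothetical, and it assumes that the presence of a "gate_up_proj" tensor name means the packed format.

```rust
/// Checkpoint layouts distinguished by the dispatch step.
/// Hypothetical type, used only for this illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum CheckpointFormat {
    Packed,
    PerExpert,
}

/// Physical slot `i` holds logical expert `map[i]`.
/// The first `num_routed_experts` slots map to themselves; the redundant
/// slots wrap around with a modulo so they replicate existing experts.
pub fn build_initial_global_physical_to_logical_map(
    num_routed_experts: usize,
    num_redundant_experts: usize,
) -> Vec<usize> {
    // Guard against a modulo by zero when redundant slots are requested.
    assert!(
        num_routed_experts > 0 || num_redundant_experts == 0,
        "redundant experts require at least one routed expert"
    );
    let mut map: Vec<usize> = (0..num_routed_experts).collect();
    map.extend((0..num_redundant_experts).map(|i| i % num_routed_experts));
    map
}

/// Assumption: any tensor name containing "gate_up_proj" marks a packed
/// checkpoint; otherwise the per-expert layout is assumed.
pub fn detect_checkpoint_format<'a, I>(tensor_names: I) -> CheckpointFormat
where
    I: IntoIterator<Item = &'a str>,
{
    if tensor_names.into_iter().any(|name| name.contains("gate_up_proj")) {
        CheckpointFormat::Packed
    } else {
        CheckpointFormat::PerExpert
    }
}
```

With 4 routed and 2 redundant experts, the map has 6 physical slots and the last two point back at logical experts 0 and 1.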
I think I found the reason unquantized MoE models take up far more VRAM than they should when spread across several GPUs: it looks like every shard is loading all of the experts, which would explain the ballooning memory occupancy. Hoping this is the right path; I would love some insight into how rational or irrational this is, @guoqingbao 😄
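To make the idea concrete, here is a rough sketch of how each rank could be given only its own contiguous slice of experts instead of all of them. It only illustrates the hypothesis in this comment; `local_expert_range` is a hypothetical helper and not how the repository currently partitions weights.

```rust
use std::ops::Range;

/// Contiguous range of expert indices owned by `rank` when `num_experts`
/// experts are split as evenly as possible across `world_size` ranks.
/// The first `num_experts % world_size` ranks receive one extra expert.
pub fn local_expert_range(num_experts: usize, world_size: usize, rank: usize) -> Range<usize> {
    assert!(world_size > 0, "world_size must be positive");
    assert!(rank < world_size, "rank must be smaller than world_size");
    let base = num_experts / world_size;
    let rem = num_experts % world_size;
    // Ranks before this one contribute `base` experts each, plus one extra
    // for every earlier rank that absorbed part of the remainder.
    let start = rank * base + rank.min(rem);
    let len = base + usize::from(rank < rem);
    start..start + len
}
```

A loader built on this would skip every expert tensor whose index falls outside `local_expert_range(num_experts, world_size, rank)`, so per-GPU expert memory shrinks roughly in proportion to the number of ranks.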
Confirming that this (or at least some change to the loading mechanics since late January) fixes Qwen3Next 80B loading on V100s at q8_0 - this OOMed previously:
and despite the output stutter problem on V100s (since ~early February), which drops the overall decoding average, I am seeing:
[Seq 5] ⏱️ Prompt: 8941 tokens in 9.59s (932.62 t/s)
[Seq 5] ⏱️ Decoded: 1503 tokens in 76.89s (19.55 t/s)
on those same V100s at
Sorry, wrong logs - disregard. More testing is ongoing, but I can say it doesn't crash on any of the test hosts from V100->SM121.
You load weights into DType::F32 in this PR, which should cost more GPU memory, not less. The current main shards weights across the GPUs, which comes with a slight overhead; the overhead is larger when you use VL models (we don't shard the VL vision tower; instead, we replicate it across the GPUs).
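A back-of-the-envelope sketch of the point about DType::F32: each F32 element takes 4 bytes versus 2 bytes for a 16-bit type, so the same tensor needs twice the memory. The `WeightDType` enum and `weight_bytes` function are hypothetical and exist only for this arithmetic; they are not types from the repository.

```rust
/// Hypothetical dtype enum for this illustration only.
#[derive(Debug, Clone, Copy)]
pub enum WeightDType {
    F32,
    F16,
    BF16,
}

impl WeightDType {
    /// Storage size of one element, in bytes.
    pub fn size_in_bytes(self) -> u64 {
        match self {
            WeightDType::F32 => 4,
            WeightDType::F16 | WeightDType::BF16 => 2,
        }
    }
}

/// Raw weight storage for `num_params` elements, ignoring any allocator
/// padding, activations, or KV cache.
pub fn weight_bytes(num_params: u64, dtype: WeightDType) -> u64 {
    num_params * dtype.size_in_bytes()
}
```

For an arbitrary tensor group of one billion parameters, this gives 4,000,000,000 bytes in F32 and 2,000,000,000 bytes in BF16.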
🤔 I thought that was the scaling target 🤦‍♂️ - still not OOMing on the V100s, where I couldn't get this to run last time, but I'll definitely dig in deeper.
16a0ff8 to 87742b8