MoE: Port PyVLLM MoE expert mapping logic by sempervictus · Pull Request #287 · guoqingbao/xinfer

sempervictus · 2026-04-02T06:10:22Z

- Added build_initial_global_physical_to_logical_map() to create expert ID mappings with modulo wrapping for redundant experts
- Added has_base_layer_prefix() helper for checkpoint format detection
- Added make_expert_params_mapping() mirroring Python vLLM SharedFusedMoE.make_expert_params_mapping
- Added make_expert_params_mapping_with_base_layer() for PEFT adapter support
- Added make_expert_fp8_weight_scale_mapping() and make_expert_fp8_activation_scale_mapping() for FP8 quantization
- Added load_fused_expert_weights() helper for per-expert weight loading
- Added load_packed_with_mapping() using logical expert IDs via mapping
- Added load_packed_physical() legacy function for packed checkpoint format
- Added load_packed() dispatch function that auto-detects checkpoint format
- Dispatch logic: checks for "gate_up_proj" key to determine packed vs per-expert format
- Verified: cargo check --features cuda passes
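
For readers who have not seen the Python side, the two ideas in the list above (modulo wrapping for redundant experts, and format detection via the "gate_up_proj" key) can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the signature of build_initial_global_physical_to_logical_map() is assumed, the CheckpointFormat enum and detect_checkpoint_format() are hypothetical names, and it is an assumption that the presence of "gate_up_proj" indicates the packed format.

```rust
/// Sketch of the initial physical -> logical expert map.
/// The first `num_routed_experts` physical slots map to themselves; each
/// redundant slot wraps around with a modulo so it replicates an existing
/// logical expert. (Signature assumed, not taken from the PR diff.)
pub fn build_initial_global_physical_to_logical_map(
    num_routed_experts: usize,
    num_redundant_experts: usize,
) -> Vec<usize> {
    assert!(num_routed_experts > 0, "need at least one routed expert");
    let mut map: Vec<usize> = (0..num_routed_experts).collect();
    map.extend((0..num_redundant_experts).map(|i| i % num_routed_experts));
    map
}

/// Hypothetical enum naming the two checkpoint layouts mentioned above.
#[derive(Debug, PartialEq, Eq)]
pub enum CheckpointFormat {
    Packed,
    PerExpert,
}

/// Hypothetical detector: treats any tensor name containing "gate_up_proj"
/// as evidence of the packed layout (assumption), otherwise per-expert.
pub fn detect_checkpoint_format<'a, I>(tensor_names: I) -> CheckpointFormat
where
    I: IntoIterator<Item = &'a str>,
{
    if tensor_names
        .into_iter()
        .any(|name| name.contains("gate_up_proj"))
    {
        CheckpointFormat::Packed
    } else {
        CheckpointFormat::PerExpert
    }
}
```

With 4 routed and 3 redundant experts, the map has 7 physical slots, and the last three point back at logical experts 0, 1 and 2.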

sempervictus · 2026-04-02T06:13:05Z

I think I found the reason unquantized MoE models take up far more VRAM than they should when spread across several GPUs: it looks like every shard loads all of the experts, which would explain the ballooning memory occupancy. Hoping this is the right path; I would love some insight into how rational or irrational this is, @guoqingbao 😄
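
To make the hypothesis concrete: if each rank held only a contiguous slice of the experts instead of all of them, the per-rank slice could be computed along these lines. This is only a sketch under that assumption; local_expert_range() is a hypothetical helper, and the project may instead shard each expert's tensors across GPUs rather than partitioning experts.

```rust
use std::ops::Range;

/// Hypothetical helper: the contiguous range of expert IDs that `rank`
/// would own if `num_experts` were split as evenly as possible across
/// `world_size` ranks. The first `num_experts % world_size` ranks get one
/// extra expert each.
pub fn local_expert_range(num_experts: usize, world_size: usize, rank: usize) -> Range<usize> {
    assert!(world_size > 0, "world_size must be positive");
    assert!(rank < world_size, "rank must be below world_size");
    let base = num_experts / world_size;
    let rem = num_experts % world_size;
    // Ranks below `rem` are each one expert larger, so later ranks start
    // `min(rank, rem)` experts further along.
    let start = rank * base + rank.min(rem);
    let len = base + usize::from(rank < rem);
    start..start + len
}
```

A loader following this scheme would skip every expert tensor whose ID falls outside the local range, so per-rank expert memory would shrink roughly in proportion to the number of GPUs.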

sempervictus · 2026-04-02T08:53:58Z

Confirmed: this (or at least some change to the loading mechanics since late January) fixes Qwen3Next 80B loading on V100s at q8_0, which previously OOMed:

vllm-rs --server --port 8000 --d 0,1,2,3 --m Qwen/Qwen3-Coder-Next --isq q8_0 --max-num-seqs 1 --max-model-len 262144 --max-tokens 131072 --temperature 0.5 --top-k 20 --top-p 0.95 --presence-penalty 1.2 --frequency-penalty 1.2 --prefix-cache --cpu-mem-fold 4 --allow-constraint-api --enable-tool-grammar --mtp-num-tokens 8

Despite the output stutter problem on V100s (present since roughly early February), which drags down the overall decoding average, I'm seeing:

[Seq 5] ⏱️ Prompt: 8941 tokens in 9.59s (932.62 t/s)
[Seq 5] ⏱️ Decoded: 1503 tokens in 76.89s (19.55 t/s)

on those same V100s at q8_0

sempervictus · 2026-04-02T09:01:47Z

Sorry, wrong logs, disregard. More testing is ongoing, but I can say it doesn't crash on any of the test hosts from V100 through SM121.

guoqingbao · 2026-04-02T11:03:20Z

> Sorry wrong logs - disregard, more testing ongoing but i can say it doesnt crash on any of the test hosts from V100->SM121

You load weights as DType::F32 in this PR, so it should use more GPU memory, not less. The current main shards weights across the GPUs with a slight overhead; the overhead is larger when you use VL models (we don't shard the VL vision tower; instead, we replicate it across the different GPUs).
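
The memory argument can be checked with rough arithmetic. The sketch below estimates raw weight storage for a given element count and storage layout. weight_bytes() is a hypothetical helper; the 80B figure comes from the model name earlier in the thread, and the q8_0 layout (32 int8 values plus one 16-bit scale, 34 bytes per block) is the GGML convention, assumed here to match what the ISQ path uses.

```rust
/// Hypothetical helper: bytes needed to store `num_params` weights when
/// they are packed in blocks of `elems_per_block` elements that occupy
/// `bytes_per_block` bytes each. A partial trailing block counts as whole.
pub fn weight_bytes(num_params: u64, bytes_per_block: u64, elems_per_block: u64) -> u64 {
    assert!(elems_per_block > 0, "block must hold at least one element");
    let blocks = (num_params + elems_per_block - 1) / elems_per_block;
    blocks * bytes_per_block
}

/// Illustrative totals for an 80B-parameter model, in bytes, summed over
/// all GPUs and ignoring activations, KV cache and replication overhead.
pub fn example_totals() -> (u64, u64, u64) {
    let params: u64 = 80_000_000_000;
    let f32_bytes = weight_bytes(params, 4, 1); // 4 bytes per element
    let f16_bytes = weight_bytes(params, 2, 1); // 2 bytes per element (F16 or BF16)
    let q8_0_bytes = weight_bytes(params, 34, 32); // GGML q8_0 block (assumed layout)
    (f32_bytes, f16_bytes, q8_0_bytes)
}
```

Under these assumptions, F32 storage comes to 320 GB, 16-bit storage to 160 GB, and q8_0 to 85 GB, so holding weights in F32 doubles the footprint relative to a 16-bit dtype for as long as the F32 copies stay resident.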

sempervictus · 2026-04-02T13:22:17Z

🤔 I thought that was the scaling target 🤦‍♂️ Still not OOMing on the V100s, where I couldn't get this to run last time, but I'll definitely dig in deeper.

sempervictus marked this pull request as ready for review April 2, 2026 08:46

sempervictus marked this pull request as draft April 2, 2026 13:22

sempervictus mentioned this pull request Apr 2, 2026

Support mxfp4 and nvfp4 models #285

Merged

sempervictus force-pushed the moe/mapped_expert_loading branch from 16a0ff8 to 87742b8 Compare April 2, 2026 19:51

sempervictus closed this Apr 16, 2026

guoqingbao mentioned this pull request Apr 20, 2026

Aarch64 Maturin Failure #307

Closed

sempervictus wants to merge 1 commit into guoqingbao:main from sempervictus:moe/mapped_expert_loading
