MoE: Port PyVLLM MoE expert mapping logic #287

Closed
sempervictus wants to merge 1 commit into guoqingbao:main from sempervictus:moe/mapped_expert_loading

@sempervictus

Contributor
  • Added build_initial_global_physical_to_logical_map() to create expert ID mappings with modulo wrapping for redundant experts
  • Added has_base_layer_prefix() helper for checkpoint format detection
  • Added make_expert_params_mapping() mirroring Python vLLM SharedFusedMoE.make_expert_params_mapping
  • Added make_expert_params_mapping_with_base_layer() for PEFT adapter support
  • Added make_expert_fp8_weight_scale_mapping() and make_expert_fp8_activation_scale_mapping() for FP8 quantization
  • Added load_fused_expert_weights() helper for per-expert weight loading
  • Added load_packed_with_mapping() using logical expert IDs via mapping
  • Added load_packed_physical() legacy function for packed checkpoint format
  • Added load_packed() dispatch function that auto-detects checkpoint format
  • Dispatch logic: checks for "gate_up_proj" key to determine packed vs per-expert format
  • Verified: cargo check --features cuda passes
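
For readers who have not seen the diff, here is a minimal sketch of two of the ideas listed above: the physical-to-logical expert map with modulo wrapping, and the "gate_up_proj" format check. The function name `build_initial_global_physical_to_logical_map` comes from the description, but its signature here is an assumption, and `CheckpointFormat` and `detect_format` are hypothetical names used only for illustration. The sketch also assumes that the presence of a "gate_up_proj" tensor name means the packed format; the PR may decide this differently.

```rust
/// Maps each physical expert slot to a logical expert id.
/// The first `num_logical` slots map one-to-one; any redundant slots
/// wrap around with modulo, so slot `num_logical` reuses expert 0, etc.
/// (Signature is an assumption, not taken from the PR diff.)
pub fn build_initial_global_physical_to_logical_map(
    num_logical: usize,
    num_redundant: usize,
) -> Vec<usize> {
    assert!(num_logical > 0, "need at least one logical expert");
    (0..num_logical + num_redundant)
        .map(|physical| physical % num_logical)
        .collect()
}

/// Hypothetical enum naming the two checkpoint layouts from the description.
#[derive(Debug, PartialEq, Eq)]
pub enum CheckpointFormat {
    Packed,
    PerExpert,
}

/// Hypothetical helper: treats any tensor name containing "gate_up_proj"
/// as evidence of a packed checkpoint (an assumption about the direction
/// of the check), otherwise falls back to the per-expert layout.
pub fn detect_format<'a>(
    tensor_names: impl IntoIterator<Item = &'a str>,
) -> CheckpointFormat {
    if tensor_names.into_iter().any(|name| name.contains("gate_up_proj")) {
        CheckpointFormat::Packed
    } else {
        CheckpointFormat::PerExpert
    }
}
```

With 4 logical experts and 2 redundant slots, the map has 6 entries, and the two extra slots point back at experts 0 and 1.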

@sempervictus

Contributor Author

I think I found the reason unquantized MoE models take up far more VRAM than they should when spread across several GPUs: it looks like every shard loads all of the experts, which would explain the ballooning memory occupancy. Hoping this is the right path; I would love some insight into how rational or irrational this is, @guoqingbao 😄

@sempervictus sempervictus marked this pull request as ready for review April 2, 2026 08:46
@sempervictus

sempervictus commented Apr 2, 2026

Contributor Author

I can confirm that this (or at least some change to the loading mechanics since late January) fixes Qwen3Next 80B loading on V100s at q8_0. This OOMed previously:

vllm-rs --server --port 8000 --d 0,1,2,3 --m Qwen/Qwen3-Coder-Next --isq q8_0 --max-num-seqs 1 --max-model-len 262144 --max-tokens 131072 --temperature 0.5 --top-k 20 --top-p 0.95 --presence-penalty 1.2 --frequency-penalty 1.2 --prefix-cache --cpu-mem-fold 4 --allow-constraint-api --enable-tool-grammar --mtp-num-tokens 8

Despite the output stutter problem on V100s (since roughly early February), which lowers the overall decoding average, I'm seeing:

[Seq 5] ⏱️ Prompt: 8941 tokens in 9.59s (932.62 t/s)
[Seq 5] ⏱️ Decoded: 1503 tokens in 76.89s (19.55 t/s)

on those same V100s at q8_0

@sempervictus

sempervictus commented Apr 2, 2026

Contributor Author

Sorry, wrong logs; disregard. More testing is ongoing, but I can say it doesn't crash on any of the test hosts from V100 to SM121.

@guoqingbao

Owner

> Sorry, wrong logs; disregard. More testing is ongoing, but I can say it doesn't crash on any of the test hosts from V100 to SM121.

You load weights as DType::F32 in this PR, so it should use more GPU memory, not less. The current main already shards weights across GPUs, at the cost of a slight overhead. The overhead is larger with VL models, because we don't shard the VL vision tower; instead, we replicate it on every GPU.
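
To make the dtype point concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: `ElemType` is a local stand-in rather than the crate's real `DType`, the three-projection layout (gate, up, down) and the even split across ranks are simplifications, and the dimensions in the comment are made up rather than taken from any real model.

```rust
/// Local stand-in for an element type (hypothetical, not the real `DType`).
#[derive(Clone, Copy, Debug)]
pub enum ElemType {
    F32,
    F16,
    BF16,
}

/// Size of one element in bytes.
pub fn elem_size(t: ElemType) -> u64 {
    match t {
        ElemType::F32 => 4,
        ElemType::F16 | ElemType::BF16 => 2,
    }
}

/// Estimated per-rank bytes for the unquantized expert weights of one MoE
/// layer, assuming gate, up and down projections of `hidden * intermediate`
/// parameters each, split evenly across `world_size` ranks.
pub fn expert_bytes_per_rank(
    num_experts: u64,
    hidden: u64,
    intermediate: u64,
    world_size: u64,
    t: ElemType,
) -> u64 {
    assert!(world_size > 0, "world_size must be positive");
    let params = 3 * num_experts * hidden * intermediate;
    params * elem_size(t) / world_size
}

// Example with made-up dimensions (8 experts, hidden 1024, intermediate 512,
// 4 ranks): F32 needs 12_582_912 bytes per rank, BF16 needs 6_291_456.
```

Under these assumptions, holding the same tensors in F32 doubles the per-rank footprint compared with a 16-bit type, regardless of how the experts are mapped.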

@sempervictus

Contributor Author

🤔 I thought that was the scaling target 🤦‍♂️ - still not OOMing on the V100s where I couldn't get this to run last time, but I'll definitely dig in deeper.

@sempervictus sempervictus marked this pull request as draft April 2, 2026 13:22
@sempervictus sempervictus force-pushed the moe/mapped_expert_loading branch from 16a0ff8 to 87742b8 Compare April 2, 2026 19:51