[Model] Add MiniMax M3 support by youkaichao · Pull Request #45381 · vllm-project/vllm

youkaichao · 2026-06-12T07:39:34Z

Summary

Add MiniMax M3 model support across config, processors, model registry, AMD/NVIDIA model implementations, MTP, sparse attention, and warmup paths.
Add MiniMax M3 reasoning and tool parsers, including Rust frontend registrations and Python-facing parser wrappers.
Add supporting kernels, quantization paths, router GEMM shape support, and targeted tests.

Duplicate-work check

Open PR searches for MiniMax M3 and minimax_m3 found no duplicates. Broader M3 model results were unrelated.

FIX #45360

Tests

cargo fmt --manifest-path rust/Cargo.toml --all -- --check
cargo test --manifest-path rust/Cargo.toml -p vllm-reasoning-parser -p vllm-tool-parser -p vllm-chat

Notes

AI assistance was used to prepare this one-commit release branch and resolve conflicts against current main.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>

mergify · 2026-06-12T07:40:13Z

Documentation preview: https://vllm--45381.org.readthedocs.build/en/45381/

The narrow DEP8-max sweep showed no GB200 advantage over B200 because both cap at an 8-GPU NVLink island. Exploit NVL72's rack-scale NVLink with wide expert parallelism spanning multiple nodes, mirroring the deepseek-v4 "megamoe" ladder (DEP = data-parallel attention + expert-parallel):

- 1P1D TP4 (2n) low-latency, conc 4-64
- 1P1D DEP8 (4n) mid, EP8/16-experts-per-rank, conc 128-512
- 1P1D DEP8->DEP16 (6n) wide decode (EP16), conc 512-2048
- 2P1D DEP8->DEP16 (8n) prefill-scaled, conc 2048-4096
- 4P1D DEP8->DEP16 (12n) max throughput, conc 4096-8192

M3 has 128 routed experts (top-4), so EP8/EP16 shard cleanly. EP16 across 16 GPU / 4 nodes is the regime B200 physically can't reach.

Attention: FLASH_ATTN -> FLASHINFER (trtllm-gen) on all GB200 recipes to exploit Blackwell. Requires the :minimax-m3 image rebuilt from m3_release HEAD 022448dd (vllm-project/vllm#45381), which gates trtllm-gen page>=128.

Also add GB200 perf/NVLink-KV knobs from the deepseek-v4 reference: numa-bind (Grace) and enable-sleep-mode (cuMem allocator so the KV cache is IPC-exportable over the MNNVL fabric), alongside the existing UCX MNNVL env.

Replaces the four narrow EP4 recipes; keeps 1P1D TP4 for low latency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
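
The "shard cleanly" claim in the commit message above is simple divisibility arithmetic. The following is a minimal sketch of it, not vLLM's expert-placement code: the function names are hypothetical, and the contiguous expert-to-rank assignment is an assumption made only for illustration.

```python
# Figures taken from the commit message: 128 routed experts, top-4 routing.
NUM_ROUTED_EXPERTS = 128
TOP_K = 4


def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Return how many experts each expert-parallel rank holds.

    Raises ValueError when the experts do not divide evenly across ranks.
    """
    if num_experts % ep_size != 0:
        raise ValueError(f"{num_experts} experts do not divide evenly over EP{ep_size}")
    return num_experts // ep_size


def expert_range(rank: int, num_experts: int, ep_size: int) -> range:
    """Expert ids owned by `rank`, assuming a contiguous layout (illustrative only)."""
    per_rank = experts_per_rank(num_experts, ep_size)
    return range(rank * per_rank, (rank + 1) * per_rank)


for ep_size in (8, 16):
    per_rank = experts_per_rank(NUM_ROUTED_EXPERTS, ep_size)
    print(f"EP{ep_size}: {per_rank} experts per rank")
```

With these numbers EP8 gives 16 experts per rank, matching the "EP8/16-experts-per-rank" entry in the ladder, and EP16 gives 8.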

mergify · 2026-06-14T07:37:36Z

Hi @youkaichao, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4), TP=8 + EP, with EAGLE3 speculative decode (drafter Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens). gsm8k as the spec-decode correctness gate; vllm_bench at conc-8/128 for the decode-throughput uplift.

KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales, so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562).

WIP -- nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server bring-up until then.

AI-assisted (Claude Code).

Co-Authored-By: Claude <noreply@anthropic.com>

Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4), TP=8 (no expert parallelism), EAGLE3 speculative decode (drafter Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), env VLLM_USE_BREAKABLE_CUDAGRAPH=0. gsm8k as the spec-decode correctness gate; vllm_bench at conc-8/128 for the decode-throughput uplift.

KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales, so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562).

nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server bring-up until then.

AI-assisted (Claude Code).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>

Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4), TP=8 (no expert parallelism), EAGLE3 speculative decode (drafter Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), env VLLM_USE_BREAKABLE_CUDAGRAPH=0. gsm8k as the spec-decode correctness gate; vllm_bench 8k-in/1k-out at conc-128 (MI355X-family baseline).

KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales, so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562).

nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server bring-up until then.

AI-assisted (Claude Code).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>

…odel_config

VllmConfig may have model_config=None (e.g. backend-selector tests), which made get_supported_kernel_block_sizes() raise AttributeError. Fall back to the base [16, 32, 64] sizes when model_config is unavailable.

AI assistance (Claude) was used for this change.

Co-authored-by: Claude
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yongye Zhu <yongye@inferact.ai>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
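
The None-guard described in this commit can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the code in the PR: `ToyModelConfig`, `ToyVllmConfig`, `extra_block_sizes`, and `supported_kernel_block_sizes` are hypothetical stand-ins, and only the base `[16, 32, 64]` list comes from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

# Base sizes named in the commit message.
BASE_KERNEL_BLOCK_SIZES = [16, 32, 64]


@dataclass
class ToyModelConfig:
    # Hypothetical field: extra block sizes a particular model might allow.
    extra_block_sizes: tuple = ()


@dataclass
class ToyVllmConfig:
    # May legitimately be None, e.g. in tests that only exercise backend selection.
    model_config: Optional[ToyModelConfig] = None


def supported_kernel_block_sizes(cfg: ToyVllmConfig) -> list:
    """Return supported block sizes, falling back to the base list without a model config."""
    model_config = cfg.model_config
    if model_config is None:
        # Guard: touching model_config attributes here would raise AttributeError.
        return list(BASE_KERNEL_BLOCK_SIZES)
    return sorted(set(BASE_KERNEL_BLOCK_SIZES) | set(model_config.extra_block_sizes))


print(supported_kernel_block_sizes(ToyVllmConfig()))
print(supported_kernel_block_sizes(ToyVllmConfig(ToyModelConfig((128,)))))
```

Returning a copy of the base list keeps callers from mutating the shared default.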

…nector

Test recipe for vLLM v1's native CPU KV-cache offloading connector on MiniMax-M3 MXFP8 (H200), using the agentic-coding scenario (Claude Code trace replay via aiperf inferencex-agentx-mvp) at a single large concurrency.

Config (nvidia-master.yaml minimaxm3-fp8-h200-vllm-agentic): TEP8 (TP8 + expert parallel), offloading: cpu, conc 64, duration 1800s, on the day-zero vllm/vllm-openai:minimax-m3 image.

New script benchmarks/single_node/agentic/minimaxm3_fp8_h200.sh, modeled on the M2.5 H200 agentic sibling with M3-specific serve flags:

- --block-size 128 (mandatory for MSA sparse attention)
- --language-model-only (text-only; frees VRAM for KV)
- BF16 KV (no --kv-cache-dtype fp8: MXFP8 lacks calibrated KV scales and fp8 KV corrupts output, vllm-project/vllm#45381)
- prefix caching ENABLED (coding traces share large prefixes; offloading that cache to CPU is the point of the test)
- CPU offload via vLLM native connector: --kv-offloading-backend native --kv-offloading-size $TOTAL_CPU_DRAM_GB --disable-hybrid-kv-cache-manager (TOTAL_CPU_DRAM_GB default 600), same path as the M2.5 H200 agentic recipe

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
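
As a rough illustration of how the serve flags listed above fit together, the sketch below assembles them into one command line. It is not the contents of minimaxm3_fp8_h200.sh: the `build_serve_args` helper and the model placeholder are hypothetical, and mapping "TEP8" to `--tensor-parallel-size 8 --enable-expert-parallel` is an assumption about how the recipe expresses TP8 plus expert parallelism.

```python
import shlex


def build_serve_args(model: str, total_cpu_dram_gb: int = 600) -> list:
    """Assemble a `vllm serve` argv from the flags named in the commit message."""
    return [
        "vllm", "serve", model,
        # Assumed spelling of TEP8 (TP8 + expert parallel).
        "--tensor-parallel-size", "8",
        "--enable-expert-parallel",
        # Stated as mandatory for MSA sparse attention.
        "--block-size", "128",
        # Text-only serving, frees VRAM for KV cache.
        "--language-model-only",
        # Native CPU KV offloading connector, sized in GB.
        "--kv-offloading-backend", "native",
        "--kv-offloading-size", str(total_cpu_dram_gb),
        "--disable-hybrid-kv-cache-manager",
        # Deliberately no --kv-cache-dtype: the KV cache stays at its default.
    ]


# Placeholder only; the real checkpoint path is not given in the commit message.
MODEL = "<MiniMax-M3-MXFP8 checkpoint>"
print(shlex.join(build_serve_args(MODEL)))
```

Building the argv as a list rather than a string keeps the omission of `--kv-cache-dtype` easy to assert in a test.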

[Model] Add MiniMax M3 support

c983460

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>

zyongye and others added 2 commits June 13, 2026 21:49

Merge branch 'main' into m3_release

a1c82dc

Lazy import

ac0835b

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

This was referenced Jun 14, 2026

[Bugfix][ROCm] Fix MiniMax-M3 FP8 KV cache dtype #45563

Open

[Bug]: ROCm MI300X FP8 KV cache MiniMax-M3-MXFP8 accuracy issues #45562

Open

Merge branch 'main' into m3_release

81aef91

This was referenced Jun 14, 2026

[ROCm] Add fused MiniMax M3 MXFP8 MoE for gfx94x #45567

Open

[Experimental][DNM till upstream PR merges][AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X SemiAnalysisAI/InferenceX#1753

Open

Merge branch 'main' into m3_release

9724728

Merge branch 'main' into m3_release

979b56a

functionstackx mentioned this pull request Jun 14, 2026

[WIP] Add MiniMax-M3 Nightly Perf & Accuracy Regression Testing on MI355X with EAGLE3 spec decode [Blocked till https://github.com/vllm-project/vllm/pull/45381 merges] vllm-project/perf-eval#25

Draft

3 tasks

functionstackx mentioned this pull request Jun 15, 2026

[Klaud Cold] minimaxm3-fp8-h200-vllm-agentic: test CPU KV-offload connector on M3 (agentx, single large conc 64) SemiAnalysisAI/InferenceX#1763

Open

Merge branch 'main' into m3_release

b53c2cf

gau-nernst mentioned this pull request Jun 15, 2026

[Roadmap] Minimax M3 #45668

Open

8 tasks

Merge branch 'main' into m3_release

91e8f14

tjtanaa mentioned this pull request Jun 15, 2026

[ROCm][Quant] Minmax-M3: enable fp8_per_channel and fix SwiGLU-OAI fp8 MoE for bf16 weights on mi300x #45590

Open

4 tasks

Merge branch 'main' into m3_release

37fe719

youkaichao wants to merge 28 commits into main from m3_release

youkaichao commented Jun 12, 2026 • edited by jeejeelee

17 participants
