
[Model] Add MiniMax M3 support#45381

Open
youkaichao wants to merge 28 commits into main from m3_release

Conversation

@youkaichao youkaichao commented Jun 12, 2026

Member

Summary

  • Add MiniMax M3 model support across config, processors, model registry, AMD/NVIDIA model implementations, MTP, sparse attention, and warmup paths.
  • Add MiniMax M3 reasoning and tool parsers, including Rust frontend registrations and Python-facing parser wrappers.
  • Add supporting kernels, quantization paths, router GEMM shape support, and targeted tests.

Duplicate-work check

  • Open PR searches for MiniMax M3 and minimax_m3 found no duplicates. Broader M3 model results were unrelated.

FIX #45360

Tests

  • cargo fmt --manifest-path rust/Cargo.toml --all -- --check
  • cargo test --manifest-path rust/Cargo.toml -p vllm-reasoning-parser -p vllm-tool-parser -p vllm-chat

Notes

  • AI assistance was used to prepare this one-commit release branch and resolve conflicts against current main.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>

mergify Bot commented Jun 12, 2026

Contributor

Documentation preview: https://vllm--45381.org.readthedocs.build/en/45381/

Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Jun 14, 2026
The narrow DEP8-max sweep showed no GB200 advantage over B200 because both
cap at an 8-GPU NVLink island. Exploit NVL72's rack-scale NVLink with wide
expert parallelism spanning multiple nodes, mirroring the deepseek-v4
"megamoe" ladder (DEP = data-parallel attention + expert-parallel):

- 1P1D TP4 (2n)            low-latency, conc 4-64
- 1P1D DEP8 (4n)           mid, EP8/16-experts-per-rank, conc 128-512
- 1P1D DEP8->DEP16 (6n)    wide decode (EP16), conc 512-2048
- 2P1D DEP8->DEP16 (8n)    prefill-scaled, conc 2048-4096
- 4P1D DEP8->DEP16 (12n)   max throughput, conc 4096-8192

M3 has 128 routed experts (top-4), so EP8/EP16 shard cleanly. EP16 across
16 GPU / 4 nodes is the regime B200 physically can't reach.

Attention: FLASH_ATTN -> FLASHINFER (trtllm-gen) on all GB200 recipes to
exploit Blackwell. Requires the :minimax-m3 image rebuilt from m3_release
HEAD 022448dd (vllm-project/vllm#45381), which gates trtllm-gen page>=128.

Also add GB200 perf/NVLink-KV knobs from the deepseek-v4 reference:
numa-bind (Grace) and enable-sleep-mode (cuMem allocator so the KV cache is
IPC-exportable over the MNNVL fabric), alongside the existing UCX MNNVL env.

Replaces the four narrow EP4 recipes; keeps 1P1D TP4 for low latency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
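
The "shard cleanly" claim in the commit message above is plain divisibility arithmetic. A minimal sketch, using only the 128-routed-expert figure quoted there; the helper name `experts_per_rank` is hypothetical and is not part of vLLM or the InferenceX recipes:

```python
def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Return how many routed experts each expert-parallel rank holds.

    Raises ValueError when the experts do not divide evenly across ranks.
    """
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    if num_experts % ep_size != 0:
        raise ValueError(
            f"{num_experts} experts do not shard evenly over EP{ep_size}"
        )
    return num_experts // ep_size


NUM_ROUTED_EXPERTS = 128  # figure quoted in the commit message

for ep in (8, 16):
    # EP8 gives 16 experts per rank, EP16 gives 8
    print(f"EP{ep}: {experts_per_rank(NUM_ROUTED_EXPERTS, ep)} experts per rank")
```

The EP8 result matches the "EP8/16-experts-per-rank" note in the 1P1D DEP8 recipe line.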
zyongye and others added 2 commits June 13, 2026 21:49
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

mergify Bot commented Jun 14, 2026

Contributor

Hi @youkaichao, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

functionstackx added a commit to functionstackx/perf-eval that referenced this pull request Jun 14, 2026
Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4),
TP=8 + EP, with EAGLE3 speculative decode (drafter Inferact/MiniMax-M3-EAGLE3,
3 speculative tokens). gsm8k as the spec-decode correctness gate; vllm_bench at
conc-8/128 for the decode-throughput uplift.

KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales,
so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562).

WIP -- nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which
carries the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at
server bring-up until then.

AI-assisted (Claude Code).

Co-Authored-By: Claude <noreply@anthropic.com>
functionstackx added a commit to functionstackx/perf-eval that referenced this pull request Jun 14, 2026
Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4),
TP=8 (no expert parallelism), EAGLE3 speculative decode (drafter
Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), env
VLLM_USE_BREAKABLE_CUDAGRAPH=0. gsm8k as the spec-decode correctness gate;
vllm_bench at conc-8/128 for the decode-throughput uplift.

KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales,
so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562).

nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries
the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server
bring-up until then.

AI-assisted (Claude Code).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
functionstackx added a commit to functionstackx/perf-eval that referenced this pull request Jun 14, 2026
Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4),
TP=8 (no expert parallelism), EAGLE3 speculative decode (drafter
Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), env
VLLM_USE_BREAKABLE_CUDAGRAPH=0. gsm8k as the spec-decode correctness gate;
vllm_bench 8k-in/1k-out at conc-128 (MI355X-family baseline).

KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales,
so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562).

nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries
the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server
bring-up until then.

AI-assisted (Claude Code).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
…odel_config

VllmConfig may have model_config=None (e.g. backend-selector tests), which
made get_supported_kernel_block_sizes() raise AttributeError. Fall back to
the base [16, 32, 64] sizes when model_config is unavailable.

AI assistance (Claude) was used for this change.

Co-authored-by: Claude
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yongye Zhu <yongye@inferact.ai>

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
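
The fallback described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: the `ModelConfig` and `VllmConfig` classes here are hypothetical stand-ins, `extra_block_sizes` is an invented field, and the real signature of `get_supported_kernel_block_sizes()` is not shown in this thread. Only the `None` check and the base `[16, 32, 64]` sizes come from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Base sizes named in the commit message.
BASE_KERNEL_BLOCK_SIZES = [16, 32, 64]


@dataclass
class ModelConfig:
    """Hypothetical stand-in; not vLLM's ModelConfig."""
    extra_block_sizes: Tuple[int, ...] = ()  # hypothetical field


@dataclass
class VllmConfig:
    """Hypothetical stand-in; model_config may be None, as in the tests."""
    model_config: Optional[ModelConfig] = None


def get_supported_kernel_block_sizes(vllm_config: VllmConfig) -> List[int]:
    model_config = vllm_config.model_config
    if model_config is None:
        # Without this guard, the attribute access below raises
        # AttributeError on None.
        return list(BASE_KERNEL_BLOCK_SIZES)
    # Assumed behaviour: model-specific sizes extend the base set.
    return sorted(set(BASE_KERNEL_BLOCK_SIZES) | set(model_config.extra_block_sizes))
```

Returning a copy of the base list keeps callers from mutating the shared default.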
functionstackx added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Jun 15, 2026
…nector

Test recipe for vLLM v1's native CPU KV-cache offloading connector on
MiniMax-M3 MXFP8 (H200), using the agentic-coding scenario (Claude Code trace
replay via aiperf inferencex-agentx-mvp) at a single large concurrency.

Config (nvidia-master.yaml minimaxm3-fp8-h200-vllm-agentic):
  TEP8 (TP8 + expert parallel), offloading: cpu, conc 64, duration 1800s,
  on the day-zero vllm/vllm-openai:minimax-m3 image.

New script benchmarks/single_node/agentic/minimaxm3_fp8_h200.sh, modeled on the
M2.5 H200 agentic sibling with M3-specific serve flags:
  - --block-size 128 (mandatory for MSA sparse attention)
  - --language-model-only (text-only; frees VRAM for KV)
  - BF16 KV (no --kv-cache-dtype fp8: MXFP8 lacks calibrated KV scales and fp8
    KV corrupts output, vllm-project/vllm#45381)
  - prefix caching ENABLED (coding traces share large prefixes; offloading that
    cache to CPU is the point of the test)
  - CPU offload via vLLM native connector: --kv-offloading-backend native
    --kv-offloading-size $TOTAL_CPU_DRAM_GB --disable-hybrid-kv-cache-manager
    (TOTAL_CPU_DRAM_GB default 600), same path as the M2.5 H200 agentic recipe

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
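
For orientation, the serve flags listed in the commit message above can be assembled as below. This is a rough sketch, not the contents of minimaxm3_fp8_h200.sh, which this thread does not show: the function name `build_serve_args`, the `vllm serve` entry point, and the model identifier passed in are assumptions, and the parallelism flags are omitted because the message does not spell them out. The flags themselves and the 600 default are copied from the message.

```python
import os
import shlex
from typing import Dict, List, Optional


def build_serve_args(model: str, env: Optional[Dict[str, str]] = None) -> List[str]:
    """Assemble the M3-specific flags named in the commit message."""
    env = dict(os.environ) if env is None else env
    cpu_dram_gb = env.get("TOTAL_CPU_DRAM_GB", "600")  # default per the message
    return [
        "vllm", "serve", model,              # entry point and model are assumed
        "--block-size", "128",               # mandatory for MSA sparse attention
        "--language-model-only",             # text-only; frees VRAM for KV
        "--kv-offloading-backend", "native",
        "--kv-offloading-size", cpu_dram_gb,
        "--disable-hybrid-kv-cache-manager",
        # No --kv-cache-dtype flag: the KV cache stays at the BF16 default.
    ]


if __name__ == "__main__":
    # "MODEL_ID_PLACEHOLDER" is a placeholder, not a real checkpoint name.
    print(shlex.join(build_serve_args("MODEL_ID_PLACEHOLDER")))
```

Passing `env` explicitly makes the default easy to check without touching the process environment.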
@gau-nernst gau-nernst mentioned this pull request Jun 15, 2026
8 tasks

Labels

  • ci/build
  • documentation: Improvements or additions to documentation
  • gpt-oss: Related to GPT-OSS models
  • multi-modality: Related to multi-modality (#4194)
  • new-model: Requests to new models
  • nvidia
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • rust
  • speculative-decoding
  • tool-calling
  • v1

Projects

Status: Ready
Status: No status
Status: Ready

Development

Successfully merging this pull request may close these issues.

[Feature]: Support for MiniMax Sparse Attention (MSA)