[Model] Add MiniMax M3 support #45381
Open
youkaichao wants to merge 28 commits into
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: youkaichao <youkaichao@gmail.com>
Contributor
Documentation preview: https://vllm--45381.org.readthedocs.build/en/45381/
Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Jun 14, 2026
The narrow DEP8-max sweep showed no GB200 advantage over B200 because both cap at an 8-GPU NVLink island. Exploit NVL72's rack-scale NVLink with wide expert parallelism spanning multiple nodes, mirroring the deepseek-v4 "megamoe" ladder (DEP = data-parallel attention + expert-parallel):
- 1P1D TP4 (2n) low-latency, conc 4-64
- 1P1D DEP8 (4n) mid, EP8/16-experts-per-rank, conc 128-512
- 1P1D DEP8->DEP16 (6n) wide decode (EP16), conc 512-2048
- 2P1D DEP8->DEP16 (8n) prefill-scaled, conc 2048-4096
- 4P1D DEP8->DEP16 (12n) max throughput, conc 4096-8192
M3 has 128 routed experts (top-4), so EP8/EP16 shard cleanly. EP16 across 16 GPU / 4 nodes is the regime B200 physically can't reach.
Attention: FLASH_ATTN -> FLASHINFER (trtllm-gen) on all GB200 recipes to exploit Blackwell. Requires the :minimax-m3 image rebuilt from m3_release HEAD 022448dd (vllm-project/vllm#45381), which gates trtllm-gen page>=128.
Also add GB200 perf/NVLink-KV knobs from the deepseek-v4 reference: numa-bind (Grace) and enable-sleep-mode (cuMem allocator so the KV cache is IPC-exportable over the MNNVL fabric), alongside the existing UCX MNNVL env.
Replaces the four narrow EP4 recipes; keeps 1P1D TP4 for low latency.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
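The "shard cleanly" claim in the commit message above is plain arithmetic. A minimal Python sketch, assuming routed experts are split evenly across expert-parallel ranks; the function name `experts_per_rank` is hypothetical and is not part of vLLM or the recipes:

```python
def experts_per_rank(num_routed_experts: int, ep_size: int) -> int:
    """Return the number of routed experts per EP rank under an even split.

    Illustrative helper only; raises if the experts do not divide evenly.
    """
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    if num_routed_experts % ep_size != 0:
        raise ValueError(
            f"{num_routed_experts} experts do not divide evenly over EP{ep_size}"
        )
    return num_routed_experts // ep_size


# 128 routed experts is the figure quoted in the commit message.
NUM_ROUTED_EXPERTS = 128

for ep_size in (8, 16):
    per_rank = experts_per_rank(NUM_ROUTED_EXPERTS, ep_size)
    print(f"EP{ep_size}: {per_rank} experts per rank")
```

Under that assumption EP8 gives 16 experts per rank, matching the "EP8/16-experts-per-rank" entry in the ladder, and EP16 gives 8.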
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
This was referenced Jun 14, 2026
Contributor
Hi @youkaichao, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits,
functionstackx added a commit to functionstackx/perf-eval that referenced this pull request Jun 14, 2026
Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4), TP=8 + EP, with EAGLE3 speculative decode (drafter Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens). gsm8k as the spec-decode correctness gate; vllm_bench at conc-8/128 for the decode-throughput uplift. KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales, so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562). WIP -- nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server bring-up until then. AI-assisted (Claude Code). Co-Authored-By: Claude <noreply@anthropic.com>
functionstackx added a commit to functionstackx/perf-eval that referenced this pull request Jun 14, 2026
Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4), TP=8 (no expert parallelism), EAGLE3 speculative decode (drafter Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), env VLLM_USE_BREAKABLE_CUDAGRAPH=0. gsm8k as the spec-decode correctness gate; vllm_bench at conc-8/128 for the decode-throughput uplift. KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales, so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562). nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server bring-up until then. AI-assisted (Claude Code). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
functionstackx added a commit to functionstackx/perf-eval that referenced this pull request Jun 14, 2026
Adds workloads/minimax_m3_mi355x.yaml: MiniMax-M3-MXFP8 on MI355X (CDNA4), TP=8 (no expert parallelism), EAGLE3 speculative decode (drafter Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), env VLLM_USE_BREAKABLE_CUDAGRAPH=0. gsm8k as the spec-decode correctness gate; vllm_bench 8k-in/1k-out at conc-128 (MI355X-family baseline). KV cache left at default (bf16): MiniMax-M3-MXFP8 ships no calibrated KV scales, so --kv-cache-dtype fp8 silently corrupts output (vllm-project/vllm#45562). nightly: true, but blocked on vllm-project/vllm#45381 (M3 support, which carries the ROCm/AMD EAGLE3 enablement #45546) landing on main; it will fail at server bring-up until then. AI-assisted (Claude Code). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
…odel_config VllmConfig may have model_config=None (e.g. backend-selector tests), which made get_supported_kernel_block_sizes() raise AttributeError. Fall back to the base [16, 32, 64] sizes when model_config is unavailable. AI assistance (Claude) was used for this change. Co-authored-by: Claude Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Yongye Zhu <yongye@inferact.ai> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
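The fallback described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: the classes `ModelConfigSketch` and `VllmConfigSketch`, the field `extra_block_sizes`, and the behaviour when `model_config` is present are all hypothetical. Only the `None` check and the base `[16, 32, 64]` sizes come from the message.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Base sizes named in the commit message.
BASE_KERNEL_BLOCK_SIZES = [16, 32, 64]


@dataclass
class ModelConfigSketch:
    # Hypothetical field: extra block sizes a model might request.
    extra_block_sizes: List[int] = field(default_factory=list)


@dataclass
class VllmConfigSketch:
    # May be None, e.g. in backend-selector tests per the commit message.
    model_config: Optional[ModelConfigSketch] = None


def get_supported_kernel_block_sizes_sketch(cfg: VllmConfigSketch) -> List[int]:
    """Return block sizes, falling back to the base list without a model config."""
    if cfg.model_config is None:
        # Guard first: reading attributes of None is what raised AttributeError.
        return list(BASE_KERNEL_BLOCK_SIZES)
    # Hypothetical model-specific branch, shown only to give the guard context.
    merged = set(BASE_KERNEL_BLOCK_SIZES) | set(cfg.model_config.extra_block_sizes)
    return sorted(merged)


print(get_supported_kernel_block_sizes_sketch(VllmConfigSketch()))
```

Returning a copy of the base list keeps callers from mutating the shared constant.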
functionstackx added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Jun 15, 2026
…nector
Test recipe for vLLM v1's native CPU KV-cache offloading connector on
MiniMax-M3 MXFP8 (H200), using the agentic-coding scenario (Claude Code trace
replay via aiperf inferencex-agentx-mvp) at a single large concurrency.
Config (nvidia-master.yaml minimaxm3-fp8-h200-vllm-agentic):
TEP8 (TP8 + expert parallel), offloading: cpu, conc 64, duration 1800s,
on the day-zero vllm/vllm-openai:minimax-m3 image.
New script benchmarks/single_node/agentic/minimaxm3_fp8_h200.sh, modeled on the
M2.5 H200 agentic sibling with M3-specific serve flags:
- --block-size 128 (mandatory for MSA sparse attention)
- --language-model-only (text-only; frees VRAM for KV)
- BF16 KV (no --kv-cache-dtype fp8: MXFP8 lacks calibrated KV scales and fp8
KV corrupts output, vllm-project/vllm#45381)
- prefix caching ENABLED (coding traces share large prefixes; offloading that
cache to CPU is the point of the test)
- CPU offload via vLLM native connector: --kv-offloading-backend native
--kv-offloading-size $TOTAL_CPU_DRAM_GB --disable-hybrid-kv-cache-manager
(TOTAL_CPU_DRAM_GB default 600), same path as the M2.5 H200 agentic recipe
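The M3-specific serve flags listed above can be assembled as in the sketch below. It is a rough illustration and not the contents of minimaxm3_fp8_h200.sh: the helper `build_serve_args` and the `<model-path>` placeholder are hypothetical, the flags are copied from the list as written, and the TEP8 parallelism flags are left out because the message does not spell them out.

```python
import os
import shlex
from typing import List


def build_serve_args(model: str, total_cpu_dram_gb: int = 600) -> List[str]:
    """Build an argv list from the flags named in the commit message."""
    return [
        "vllm", "serve", model,
        "--block-size", "128",            # mandatory for MSA sparse attention
        "--language-model-only",          # text-only; frees VRAM for KV
        # No --kv-cache-dtype flag: the KV cache stays BF16.
        "--kv-offloading-backend", "native",
        "--kv-offloading-size", str(total_cpu_dram_gb),
        "--disable-hybrid-kv-cache-manager",
    ]


# TOTAL_CPU_DRAM_GB defaults to 600, as stated in the commit message.
dram_gb = int(os.environ.get("TOTAL_CPU_DRAM_GB", "600"))
print(shlex.join(build_serve_args("<model-path>", dram_gb)))
```

Keeping the arguments as a list avoids shell-quoting problems if the command is later passed to `subprocess.run`.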
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
Duplicate-work check
Searches for "MiniMax M3" and "minimax_m3" found no duplicates. Broader "M3 model" results were unrelated.
FIX #45360
Tests
cargo fmt --manifest-path rust/Cargo.toml --all -- --check
cargo test --manifest-path rust/Cargo.toml -p vllm-reasoning-parser -p vllm-tool-parser -p vllm-chat
Notes