From 791c877bf98efd37c01605417a0a76a168ecdce2 Mon Sep 17 00:00:00 2001
From: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Date: Tue, 7 Apr 2026 21:01:22 -0700
Subject: [PATCH] MiniMax-M2.5: update B200 FP8 serving config

Add benchmark-validated flags for B200 FP8 from SemiAnalysisAI/InferenceX#1010:
--enable-expert-parallel (tp:4/ep:4 validated, tp:2/ep:2 also supported),
--gpu-memory-utilization 0.90, --block-size 32, --kv-cache-dtype fp8,
--stream-interval 20, --no-enable-prefix-caching.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
---
 MiniMax/MiniMax-M2.5.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/MiniMax/MiniMax-M2.5.md b/MiniMax/MiniMax-M2.5.md
index 17edc5d8..79328eab 100644
--- a/MiniMax/MiniMax-M2.5.md
+++ b/MiniMax/MiniMax-M2.5.md
@@ -34,16 +34,21 @@ MiniMax-M2.5 can be run on different GPU configurations. The recommended setup u
 
 ### B200 (FP8)
 
+Recommended configuration uses 4 GPUs with tensor and expert parallelism. A 2-GPU configuration (`--tensor-parallel-size 2 --enable-expert-parallel`) is also supported.
+
 ```bash
 docker run --gpus all \
   -p 8000:8000 \
   --ipc=host \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
-  vllm/vllm-openai:nightly MiniMaxAI/MiniMax-M2.5 \
+  vllm/vllm-openai:latest MiniMaxAI/MiniMax-M2.5 \
   --tensor-parallel-size 4 \
-  --tool-call-parser minimax_m2 \
-  --reasoning-parser minimax_m2_append_think \
-  --enable-auto-tool-choice \
+  --enable-expert-parallel \
+  --gpu-memory-utilization 0.90 \
+  --block-size 32 \
+  --kv-cache-dtype fp8 \
+  --stream-interval 20 \
+  --no-enable-prefix-caching \
   --trust-remote-code
 ```
 
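The `docker run` command in this patch starts vLLM's OpenAI-compatible server on port 8000. As a minimal sketch of how a client would exercise it (the model name and port come from the command above; the prompt and `max_tokens` value are illustrative, not from the patch):

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint
# exposed by the container in the patched command. "stream": True pairs
# with the server-side --stream-interval 20 flag, which batches streamed
# token deltas.
payload = {
    "model": "MiniMaxAI/MiniMax-M2.5",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
    "stream": True,
}

body = json.dumps(payload)
# Once the server is up, POST `body` to
# http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or urllib.request).
print(body)
```

Note that the patch drops `--tool-call-parser`, `--reasoning-parser`, and `--enable-auto-tool-choice`, so requests relying on automatic tool-call parsing would need those flags restored.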