From 791c877bf98efd37c01605417a0a76a168ecdce2 Mon Sep 17 00:00:00 2001
From: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Date: Tue, 7 Apr 2026 21:01:22 -0700
Subject: [PATCH] MiniMax-M2.5: update B200 FP8 serving config

Add benchmark-validated flags for B200 FP8 from SemiAnalysisAI/InferenceX#1010:
--enable-expert-parallel (tp:4/ep:4 validated, tp:2/ep:2 also supported),
--gpu-memory-utilization 0.90, --block-size 32, --kv-cache-dtype fp8,
--stream-interval 20, --no-enable-prefix-caching.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
---
 MiniMax/MiniMax-M2.5.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/MiniMax/MiniMax-M2.5.md b/MiniMax/MiniMax-M2.5.md
index 17edc5d8..79328eab 100644
--- a/MiniMax/MiniMax-M2.5.md
+++ b/MiniMax/MiniMax-M2.5.md
@@ -34,16 +34,21 @@ MiniMax-M2.5 can be run on different GPU configurations. The recommended setup u
 
 ### B200 (FP8)
 
+Recommended configuration uses 4 GPUs with tensor and expert parallelism. A 2-GPU configuration (`--tensor-parallel-size 2 --enable-expert-parallel`) is also supported.
+
 ```bash
 docker run --gpus all \
   -p 8000:8000 \
   --ipc=host \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
-  vllm/vllm-openai:nightly MiniMaxAI/MiniMax-M2.5 \
+  vllm/vllm-openai:latest MiniMaxAI/MiniMax-M2.5 \
   --tensor-parallel-size 4 \
-  --tool-call-parser minimax_m2 \
-  --reasoning-parser minimax_m2_append_think \
-  --enable-auto-tool-choice \
+  --enable-expert-parallel \
+  --gpu-memory-utilization 0.90 \
+  --block-size 32 \
+  --kv-cache-dtype fp8 \
+  --stream-interval 20 \
+  --no-enable-prefix-caching \
   --trust-remote-code
 ```
 
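The `docker run` command in this patch starts vLLM's OpenAI-compatible server on port 8000. As a minimal sketch of how a client would exercise it (the model name and port come from the command above; the prompt and `max_tokens` value are illustrative, not from the patch):

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint
# exposed by the container in the patched command. "stream": True pairs
# with the server-side --stream-interval 20 flag, which batches streamed
# token deltas.
payload = {
    "model": "MiniMaxAI/MiniMax-M2.5",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
    "stream": True,
}

body = json.dumps(payload)
# Once the server is up, POST `body` to
# http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or urllib.request).
print(body)
```

Note that the patch drops `--tool-call-parser`, `--reasoning-parser`, and `--enable-auto-tool-choice`, so requests relying on automatic tool-call parsing would need those flags restored.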