Add B200 GPU configuration for MiniMax-M2.7 (#334)
Conversation
Code Review
This pull request updates the documentation to include support for NVIDIA B200 GPUs and provides a recommended configuration for running the MiniMax-M2.7 model on them. Feedback suggests including the --compilation-config flag to enable specific optimizations and adjusting the reasoning-parser value to maintain consistency with other examples in the guide.
VLLM_FLOAT32_MATMUL_PRECISION="high" vllm serve MiniMaxAI/MiniMax-M2.7 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
The B200 configuration is missing the --compilation-config flag, which enables the fuse_minimax_qk_norm optimization recommended for this model series (as noted in line 184). Additionally, the reasoning-parser value minimax_m2_append_think is inconsistent with all other examples in this guide (e.g., lines 70, 126, 153), which use minimax_m2. It is recommended to maintain consistency across the documentation unless this specific parser is required for B200.
Suggested change, from:

VLLM_FLOAT32_MATMUL_PRECISION="high" vllm serve MiniMaxAI/MiniMax-M2.7 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think

to:

VLLM_FLOAT32_MATMUL_PRECISION="high" vllm serve MiniMaxAI/MiniMax-M2.7 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --enable-auto-tool-choice \
  --trust-remote-code
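To make the delta explicit, here is a small illustrative sketch (not part of the PR) that compares the flag sets of the two serve commands above; the flag names and values are copied from this thread:

```python
# Flags of the original B200 command and the suggested replacement,
# as maps of flag name -> value (None for boolean flags).
old_flags = {
    "--trust-remote-code": None,
    "--tensor-parallel-size": "4",
    "--enable-auto-tool-choice": None,
    "--tool-call-parser": "minimax_m2",
    "--reasoning-parser": "minimax_m2_append_think",
}
new_flags = {
    "--tensor-parallel-size": "4",
    "--tool-call-parser": "minimax_m2",
    "--reasoning-parser": "minimax_m2",
    "--compilation-config": '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}',
    "--enable-auto-tool-choice": None,
    "--trust-remote-code": None,
}

# Flags present only in the new command, and flags whose value changed.
added = sorted(set(new_flags) - set(old_flags))
changed = sorted(k for k in old_flags if k in new_flags and old_flags[k] != new_flags[k])

print("added:", added)      # only --compilation-config is new
print("changed:", changed)  # only --reasoning-parser differs
```

This shows the two commands differ only in the added --compilation-config flag and the --reasoning-parser value, which is exactly the pair of points raised in the review comment.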
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
This is quite different from the previous configuration:
vllm serve MiniMaxAI/MiniMax-M2.7 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice \
--trust-remote-code
Have you tested that:
- we need minimax_m2_append_think?
- we cannot use --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' on Blackwell?
We can remove minimax_m2_append_think and add --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}'. Note, however, that fuse_minimax_qk_norm is not yet available in the latest vLLM release.
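Since the --compilation-config value is JSON embedded inside a shell command, quoting mistakes are easy to make. A minimal sketch that round-trips the exact string from this thread to confirm it parses as intended (key names are taken from this discussion, not from separate documentation):

```python
import json

# The exact string passed to --compilation-config in the commands above.
raw = '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}'

# json.loads raises ValueError if the quoting or JSON syntax is malformed.
cfg = json.loads(raw)
assert cfg["mode"] == 3
assert cfg["pass_config"]["fuse_minimax_qk_norm"] is True

# Re-serialize compactly to get a canonical single-quoted CLI value.
print(json.dumps(cfg, separators=(",", ":")))
```

Wrapping the JSON in single quotes on the shell side, as the examples above do, keeps the inner double quotes intact without escaping.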
My main concern is that we have different configurations for Hopper and Blackwell. I would like to educate users on the minimal difference between the two and explain the rationale.
Yes, we will keep them consistent. Thanks for flagging this.
Thanks. Please ping me on Slack when it's updated.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Kaihang Jiang <kaihangj@nvidia.com>
Force-pushed 53acbc2 to ec6b0f0.
Signed-off-by: Kaihang Jiang <kaihangj@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: haic0 <haichzha@amd.com>