
Add MoE config for Super B200 TP2 #33510

Merged
mgoin merged 1 commit into vllm-project:main from shaharmor98:feat/add-super-moe-config on Feb 1, 2026

Conversation

shaharmor98 (Contributor) commented Feb 1, 2026

When running Nemotron Super locally on a B200, the following warning appears:

Using default MoE config. Performance might be sub-optimal!
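
For context, vLLM picks a tuned fused-MoE kernel config by looking for a JSON file under vllm/model_executor/layers/fused_moe/configs/ whose name encodes the expert count, the per-shard intermediate size, and the GPU name (roughly E=<experts>,N=<shard size>,device_name=<GPU>.json, sometimes with a dtype suffix); when no matching file exists it falls back to a default config and prints the warning above. To check which device name the lookup will embed in the filename on a given machine, something like this works:

  # Print the GPU name as it appears in fused MoE config filenames
  # (spaces replaced with underscores, e.g. NVIDIA_B200).
  python -c "import torch; print(torch.cuda.get_device_name(0).replace(' ', '_'))"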

I used benchmarks/kernels/benchmark_moe.py to generate a tuned config JSON for this case:

python benchmarks/kernels/benchmark_moe.py \
  --model $MODEL_PATH \
  --trust-remote-code \
  --tp-size 2 \
  --tune \
  --batch-size 1 2 4 8 16 24 32 48 64 96 128 256 512 768 1024 1536 \
  --save-dir /.../vllm/model_executor/layers/fused_moe/configs/
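
The generated file follows the same layout as the other configs in that directory: each top-level key is a token batch size, mapping to the Triton tile parameters that won the tuning sweep for that size. A hypothetical, truncated excerpt (the actual values come from the tuning run, not from this sketch):

  {
    "1": {
      "BLOCK_SIZE_M": 16,
      "BLOCK_SIZE_N": 64,
      "BLOCK_SIZE_K": 128,
      "GROUP_SIZE_M": 1,
      "num_warps": 4,
      "num_stages": 3
    },
    "512": {
      "BLOCK_SIZE_M": 64,
      "BLOCK_SIZE_N": 128,
      "BLOCK_SIZE_K": 128,
      "GROUP_SIZE_M": 8,
      "num_warps": 8,
      "num_stages": 4
    }
  }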

Related PRs:
#27967

Test Plan

Compare performance (vllm bench serve) with various batch sizes, with and without the JSON file.

Performance should be equal or better when the JSON is available.
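
For reference, a hypothetical shape of the comparison run (dataset, sequence lengths, and concurrency are placeholders here, not the exact values used):

  vllm bench serve \
    --model $MODEL_PATH \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 128 \
    --max-concurrency 64 \
    --num-prompts 640

The idea is to run this once with the tuned JSON present in the configs directory and once without it, against otherwise identical server settings, and compare output tokens per second.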

Test Result

Absolute output tokens per second cannot be disclosed at this stage; instead, we report the relative gain.

Setup for all benchmarks: B200, TP2

Concurrency (batch size)    Throughput gain
16                          +1.6%
64                          +4.1%
128                         +9.3%
512                         +24.9%

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Shahar Mor <smor@nvidia.com>
dosubot Bot commented Feb 1, 2026

Related Documentation

No published documentation to review for changes on this repository.


gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new Mixture of Experts (MoE) configuration file for the NVIDIA B200 GPU with a tensor parallelism size of 2. The configuration is generated by the project's benchmarking script and aims to optimize performance for MoE models on this specific hardware. The provided performance metrics show a significant improvement with the new configuration. The change is straightforward and appears to be a valuable performance enhancement.

mgoin (Member) left a comment

LGTM, thanks for including benchmarks!

mgoin enabled auto-merge (squash) on February 1, 2026 at 15:42
github-actions (bot) added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Feb 1, 2026
mgoin merged commit 8869cd8 into vllm-project:main on Feb 1, 2026
50 checks passed
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
Signed-off-by: Pai <416932041@qq.com>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants