
Add Triton fused MoE config for B200 (Nemotron Nano) #32804

Merged
mgoin merged 1 commit into vllm-project:main from danisereb:tune_moe on Jan 29, 2026

Conversation

@danisereb (Contributor) commented on Jan 21, 2026

Purpose

When running Nemotron Nano on a B200, the following warning appears:

Using default MoE config. Performance might be sub-optimal!
Config file not found at .../vllm/model_executor/layers/fused_moe/configs/E=128,N=1856,device_name=NVIDIA_B200.json
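
For context, vLLM builds the config file name from the expert count (E), the intermediate size per partition (N), and the GPU device name, which is exactly the pattern visible in the warning. A minimal sketch of that lookup (the helper name here is hypothetical; only the filename pattern is taken from the warning):

import os

# Hypothetical helper; the filename pattern matches the warning above.
def moe_config_filename(E: int, N: int, device_name: str) -> str:
    return f"E={E},N={N},device_name={device_name}.json"

configs_dir = "vllm/model_executor/layers/fused_moe/configs"
config_path = os.path.join(configs_dir, moe_config_filename(128, 1856, "NVIDIA_B200"))
# When this file is absent, vLLM falls back to a default kernel config and
# emits the "Performance might be sub-optimal!" warning.
print(config_path, os.path.exists(config_path))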

I used benchmark_moe.py to create a JSON file for this use case:

export MODEL_PATH=/my_home/hf_models/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

python benchmarks/kernels/benchmark_moe.py \
  --model $MODEL_PATH \
  --trust-remote-code \
  --tp-size 1 \
  --tune \
  --batch-size 1 2 4 8 16 24 32 48 64 96 128 256 512 768 1024 1536 \
  --save-dir /.../vllm/model_executor/layers/fused_moe/configs/
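
The generated file follows the standard vLLM fused-MoE config layout: a JSON object keyed by batch size (M), where each entry holds the Triton tile and launch parameters chosen by the tuner. The parameter values below are illustrative only, not the numbers tuned in this PR:

{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 64,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 3
  },
  "16": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 16,
    "num_warps": 4,
    "num_stages": 4
  }
}

At run time the kernel picks the entry whose batch-size key is nearest to the actual token count, so tuning a spread of sizes (as in the command above) also covers intermediate batches.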

Related PRs:
#27967

Test Plan

Compare performance (vllm bench serve) with various batch sizes, with and without the JSON file.

Performance should be equal to or better when the JSON config is available.

Test Results

Setup for all benchmarks: B200, TP1

Command:

export MODEL_PATH=/my_home/hf_models/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

export ISL=1024
export OSL=1024

export BATCH_SIZE=16
export NUM_PROMPTS=$((BATCH_SIZE * 4))

vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name my_model \
  --model $MODEL_PATH \
  --trust-remote-code \
  --dataset-name random \
  --random-input-len $ISL \
  --random-output-len $OSL \
  --num-warmups 30 \
  --ignore-eos \
  --max-concurrency $BATCH_SIZE \
  --num-prompts $NUM_PROMPTS

Results:

Batch size    Output tok/s    Output tok/s with JSON    Perf gain
8             1357.75         1453.63                    7.06%
16            2131.23         2285.35                    7.23%
32            3156.24         3416.83                    8.26%
64            4962.93         5379.96                    8.40%
128           7681.77         8274.49                    7.72%
256           11127.22        12462.40                  12.00%
512           14958.57        15999.28                   6.96%
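
For reference, the gain column is the relative improvement of the tuned config over the default, e.g. for batch size 256:

# Sanity check of the "Perf Gain" column (values from the table above).
baseline, tuned = 11127.22, 12462.40   # output tok/s at batch size 256
gain = (tuned - baseline) / baseline * 100
print(f"{gain:.2f}%")                  # -> 12.00%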

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a fused MoE configuration for Nemotron Nano BF16 on B200 GPUs. This is a performance optimization that addresses a warning about a missing configuration file. The change is well-described and follows the established process of using benchmark_moe.py for generating these configurations. My review is positive, with one minor suggestion for file formatting to adhere to common best practices.

@danisereb danisereb force-pushed the tune_moe branch 2 times, most recently from 176453f to 3632ff4, on January 21, 2026 at 19:45
@gemini-code-assist (Bot) commented:

Thank you for your suggestion. My primary function is to provide code reviews and specific comments on the changes in this pull request. I cannot integrate attack models or perform direct actions on the system. If you have any questions or comments related to the code review or the changes proposed in this pull request, I will be happy to help.

@gemini-code-assist (Bot) commented:

I understand that other Gemini assistants may have different capabilities and functionality in your terminal. However, as gemini-code-assist[bot] on this pull request platform, my role is limited to providing code reviews and comments on the changes in this pull request. I do not have the ability to interact with your system or to carry out security implementations like the ones you mention.

If you have any questions or need further clarification about the code in this pull request or my review, I will be happy to help.

@danisereb danisereb changed the title Add fused MoE config for Nemotron Nano BF16 on B200 Add fused MoE config for Nemotron Nano on B200 Jan 29, 2026
@danisereb danisereb changed the title Add fused MoE config for Nemotron Nano on B200 Add config JSON for Triton fused MoE with B200 (Nemotron Nano) Jan 29, 2026
@danisereb danisereb changed the title Add config JSON for Triton fused MoE with B200 (Nemotron Nano) Add Triton fused MoE config for B200 (Nemotron Nano) Jan 29, 2026
@danisereb danisereb marked this pull request as ready for review January 29, 2026 14:22
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
@mgoin (Member) left a comment:

Nice!

@mgoin mgoin enabled auto-merge (squash) January 29, 2026 14:44
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 29, 2026
@mgoin mgoin merged commit 8e2a469 into vllm-project:main Jan 29, 2026
48 checks passed
apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed

2 participants