
Add Triton fused MoE config for B200 (Nemotron Nano) #32804

Merged
mgoin merged 1 commit into vllm-project:main from danisereb:tune_moe on Jan 29, 2026

Conversation

@danisereb (Contributor) commented on Jan 21, 2026

Purpose

When running Nemotron Nano on a B200, the following warning appears:

Using default MoE config. Performance might be sub-optimal!
Config file not found at .../vllm/model_executor/layers/fused_moe/configs/E=128,N=1856,device_name=NVIDIA_B200.json
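
For context, vLLM builds the config file name from the expert count (E), the intermediate size per partition (N), and the GPU device name, which is exactly the pattern visible in the warning. A minimal sketch of that lookup (the helper name here is hypothetical; only the filename pattern is taken from the warning):

import os

# Hypothetical helper; the filename pattern matches the warning above.
def moe_config_filename(E: int, N: int, device_name: str) -> str:
    return f"E={E},N={N},device_name={device_name}.json"

configs_dir = "vllm/model_executor/layers/fused_moe/configs"
config_path = os.path.join(configs_dir, moe_config_filename(128, 1856, "NVIDIA_B200"))
# When this file is absent, vLLM falls back to a default kernel config and
# emits the "Performance might be sub-optimal!" warning.
print(config_path, os.path.exists(config_path))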

I used benchmark_moe.py to create a JSON file for this use case:

export MODEL_PATH=/my_home/hf_models/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

python benchmarks/kernels/benchmark_moe.py \
  --model $MODEL_PATH \
  --trust-remote-code \
  --tp-size 1 \
  --tune \
  --batch-size 1 2 4 8 16 24 32 48 64 96 128 256 512 768 1024 1536 \
  --save-dir /.../vllm/model_executor/layers/fused_moe/configs/
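
The generated file follows the standard vLLM fused-MoE config layout: a JSON object keyed by batch size (M), where each entry holds the Triton tile and launch parameters chosen by the tuner. The parameter values below are illustrative only, not the numbers tuned in this PR:

{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 64,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 3
  },
  "16": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 16,
    "num_warps": 4,
    "num_stages": 4
  }
}

At run time the kernel picks the entry whose batch-size key is nearest to the actual token count, so tuning a spread of sizes (as in the command above) also covers intermediate batches.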

Related PRs:
#27967

Test Plan

Compare performance (vllm bench serve) with various batch sizes, with and without the JSON file.

Performance should be equal to or better when the JSON config is available.

Test Results

Setup for all benchmarks: B200, TP1

Command:

export MODEL_PATH=/my_home/hf_models/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

export ISL=1024
export OSL=1024

export BATCH_SIZE=16
export NUM_PROMPTS=$((BATCH_SIZE * 4))

vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name my_model \
  --model $MODEL_PATH \
  --trust-remote-code \
  --dataset-name random \
  --random-input-len $ISL \
  --random-output-len $OSL \
  --num-warmups 30 \
  --ignore-eos \
  --max-concurrency $BATCH_SIZE \
  --num-prompts $NUM_PROMPTS

Results:

Batch size    Output tok/s    Output tok/s with JSON    Perf gain
8             1357.75         1453.63                    7.06%
16            2131.23         2285.35                    7.23%
32            3156.24         3416.83                    8.26%
64            4962.93         5379.96                    8.40%
128           7681.77         8274.49                    7.72%
256           11127.22        12462.40                  12.00%
512           14958.57        15999.28                   6.96%
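
For reference, the gain column is the relative improvement of the tuned config over the default, e.g. for batch size 256:

# Sanity check of the "Perf Gain" column (values from the table above).
baseline, tuned = 11127.22, 12462.40   # output tok/s at batch size 256
gain = (tuned - baseline) / baseline * 100
print(f"{gain:.2f}%")                  # -> 12.00%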

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a fused MoE configuration for Nemotron Nano BF16 on B200 GPUs. This is a performance optimization that addresses a warning about a missing configuration file. The change is well-described and follows the established process of using benchmark_moe.py for generating these configurations. My review is positive, with one minor suggestion for file formatting to adhere to common best practices.

@danisereb danisereb force-pushed the tune_moe branch 2 times, most recently from 176453f to 3632ff4, on January 21, 2026 at 19:45
@gemini-code-assist (Bot) commented:

Thank you for your suggestion. My primary function is to provide code reviews and specific comments on the changes in this pull request. I cannot integrate attack models or perform direct actions on the system. If you have any questions or comments related to the code review or the changes proposed in this pull request, I will be happy to help.

@gemini-code-assist (Bot) commented:

I understand that other Gemini assistants may have different capabilities and functionality in your terminal. However, as gemini-code-assist[bot] on this pull request platform, my role is limited to providing code reviews and comments on the changes in this pull request. I do not have the ability to interact with your system or to carry out security implementations like the ones you mention.

If you have any questions or need further clarification about the code in this pull request or my review, I will be happy to help.

@danisereb danisereb changed the title Add fused MoE config for Nemotron Nano BF16 on B200 Add fused MoE config for Nemotron Nano on B200 Jan 29, 2026
@danisereb danisereb changed the title Add fused MoE config for Nemotron Nano on B200 Add config JSON for Triton fused MoE with B200 (Nemotron Nano) Jan 29, 2026
@danisereb danisereb changed the title Add config JSON for Triton fused MoE with B200 (Nemotron Nano) Add Triton fused MoE config for B200 (Nemotron Nano) Jan 29, 2026
@danisereb danisereb marked this pull request as ready for review January 29, 2026 14:22
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
@mgoin (Member) left a comment:

Nice!

@mgoin mgoin enabled auto-merge (squash) January 29, 2026 14:44
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 29, 2026
@mgoin mgoin merged commit 8e2a469 into vllm-project:main Jan 29, 2026
48 checks passed
apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed

2 participants