[NemotronH] Do not force router to run in fp32 #34582
vllm-bot merged 4 commits into vllm-project:main
Conversation
Signed-off-by: Roi Koren <roik@nvidia.com>
2fb9690 to 48ab68e
Code Review
This pull request introduces a valuable performance optimization by removing the forced casting of MoE router logits to float32. The changes in nemotron_h.py correctly implement this, and the special case for DeepSeekV3 is properly handled in flashinfer_trtllm_moe.py. I've found one minor issue: a leftover debug print statement that should be removed.
```python
from vllm.utils.flashinfer import flashinfer_trtllm_fp8_per_tensor_scale_moe

# The DeepSeekV3 routing method requires float32 router logits.
print(routing_method_type)
routing_logits = routing_logits.to(torch.float32)

if routing_bias is not None:
    routing_bias = routing_bias.to(hidden_states.dtype)
```
I remember something about it being important that the bias is in FP32. I understand that in this case we first cast the logits to FP32 (since we're using DeepSeek routing), so the bias effectively ends up in FP32, but wouldn't it make more sense to cast the logits to the bias dtype instead of the other way around?
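A minimal, illustrative sketch of the two casting orders discussed here (the tensor names mirror the diff above, but the values are made up). When the bias is FP32, both orders produce the same FP32 result; the question is only which dtype is treated as authoritative.

```python
import torch

# Illustrative only: router logits in the checkpoint dtype, bias in FP32.
logits = torch.randn(2, 8, dtype=torch.bfloat16)
bias = torch.randn(8, dtype=torch.float32)

# Order in the diff: upcast logits to float32 (DeepSeekV3 routing needs
# fp32 logits); the bias is then already fp32 when added.
a = logits.to(torch.float32) + bias

# Alternative raised here: cast logits to the bias dtype. Equivalent when
# the bias is fp32, but it keeps the bias precision authoritative.
b = logits.to(bias.dtype) + bias

print(a.dtype, b.dtype)  # both float32 in this setup
```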
mgoin left a comment
LGTM, nice find! Maybe we should add an assert on the bias, but this roughly matches other trtllm MoE impls. Do you have any perf results? You mentioned this takes 40% of the time, which doesn't make sense to me.
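A minimal sketch of the assert suggested here (the helper name is hypothetical, not vLLM's actual code): fail loudly if the routing bias dtype drifts from the hidden states dtype, instead of silently casting it.

```python
import torch

# Hypothetical helper illustrating the suggested assert; not vLLM code.
def check_routing_bias(routing_bias, hidden_states):
    if routing_bias is not None:
        assert routing_bias.dtype == hidden_states.dtype, (
            f"routing_bias dtype {routing_bias.dtype} != "
            f"hidden_states dtype {hidden_states.dtype}"
        )

hs = torch.randn(2, 8, dtype=torch.bfloat16)
check_routing_bias(None, hs)            # no bias: nothing to check
check_routing_bias(hs.new_ones(8), hs)  # matching dtype: passes
```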

Purpose
The current code forces the MoE router computation to run in FP32, even though checkpoints store it in bfloat16. Under normal workloads this takes up about 40% of the forward pass and does not provide an accuracy boost.
This PR removes this limitation.
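A minimal sketch of the change, under the assumption that the router is a plain linear gate (the function and argument names below are illustrative, not vLLM's API): the router gemm runs in the checkpoint dtype instead of being upcast to FP32 first.

```python
import torch

# Hypothetical sketch of the before/after behavior described above.
def route(hidden_states, gate_weight, force_fp32=False):
    if force_fp32:
        # Old behavior: upcast both operands, paying for an fp32 matmul.
        return hidden_states.float() @ gate_weight.float().t()
    # New behavior: stay in the checkpoint dtype (e.g. bfloat16).
    return hidden_states @ gate_weight.t()

x = torch.randn(4, 8, dtype=torch.bfloat16)
w = torch.randn(16, 8, dtype=torch.bfloat16)
print(route(x, w).dtype)                   # bfloat16 logits
print(route(x, w, force_fp32=True).dtype)  # float32 logits
```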
Test Plan
No additional tests; all existing tests pass and accuracy does not degrade.
Test Result
All tests pass and accuracy did not degrade.
Running GSM8K gave the following results.
PR:
Main: