[Model] Add MoE support for NemotronH #25863
Conversation
…edReLu activation - adapt the FusedMoE object to support is_act_and_mul=False Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…s an attribute in FusedMoE Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Code Review
This pull request adds support for a non-gated Squared ReLU MoE module in the NemotronH architecture, which is a valuable enhancement. The changes are mostly well-implemented across the fused MoE layers and model definition. However, I've identified a critical bug in the forward pass of the new NemotronHMoE module related to incorrect floating-point computation and a potential UnboundLocalError. I've provided a detailed comment with a suggested fix for this issue. Addressing this is crucial for the correctness of the model's output.
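For context, a minimal sketch of the failure mode being described, with hypothetical names standing in for the real NemotronHMoE submodules:

```python
def moe_forward_sketch(hidden_states, experts, gate, shared_experts=None):
    # Hypothetical illustration only, not the actual NemotronHMoE code.
    if shared_experts is not None:
        shared_output = shared_experts(hidden_states)
    routed_output = experts(hidden_states, gate(hidden_states))
    # Bug pattern: when shared_experts is None, shared_output was never
    # bound, so this line raises UnboundLocalError. The fix is to
    # initialize shared_output before the branch (e.g. to a zeros tensor).
    return routed_output + shared_output
```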
To the reviewer(s)
NemotronHForCausalLM now optionally has an MoE block. I was wondering if it should implement the MixtureOfExperts interface or not. Do you have any guidance?
We might need to do something similar to this PR #25311 (comment), where is_mixture_of_experts depends on an attribute of the model. I don't know all the cases where this is used, though.
Thanks for the input. Done
RE your comment in the other PR - I think that checking whether getattr(model, "num_moe_layers", 0) > 0 in is_mixture_of_experts makes sense, since all models implementing MixtureOfExperts are expected to initialize num_moe_layers as it is part of the interface. So I don't think it is too fragile.
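Concretely, the check described above amounts to this (a minimal sketch; the actual vLLM helper may differ):

```python
def is_mixture_of_experts(model) -> bool:
    # Models implementing the MixtureOfExperts interface are expected to
    # initialize num_moe_layers, so a model configured with zero MoE layers
    # (e.g. a dense NemotronH variant) is treated as non-MoE.
    return getattr(model, "num_moe_layers", 0) > 0
```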
…xperts Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
force-pushed from bf2285e to 7fff9a8
This pull request has merge conflicts that must be resolved before it can be merged.
```python
if not self.moe_config.is_act_and_mul:
    # Avoid circular import
    from vllm.model_executor.layers.quantization.modelopt import (
        ModelOptFp8MoEMethod,
    )

    if not isinstance(
        quant_method, (UnquantizedFusedMoEMethod, ModelOptFp8MoEMethod)
    ):
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for unquantized "
            "and ModelOpt FP8 moe for now"
        )
    if not current_platform.is_cuda():
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for CUDA for now"
        )
```
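For illustration, constructing a non-gated layer through FusedMoE would look roughly like this (the argument values and the activation name are assumptions; the real constructor takes many more parameters):

```python
# Rough sketch only: the actual FusedMoE signature in vLLM is larger, and
# the activation identifier for Squared ReLU is an assumption here.
experts = FusedMoE(
    num_experts=8,
    top_k=2,
    hidden_size=4096,
    intermediate_size=2048,
    activation="relu2",    # assumed name for Squared ReLU
    is_act_and_mul=False,  # single up projection, no gate branch
)
```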
What are the blockers for supporting is_act_and_mul = False more generally?
Creating the relevant kernels :) We plan to follow up with that
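To make the kernel gap concrete, here is a plain-PyTorch sketch (not the fused kernels) of what a single expert computes in each mode; the fused paths need a separate epilogue for the non-gated case:

```python
import torch
import torch.nn.functional as F

def gated_expert(x: torch.Tensor, w13: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    # is_act_and_mul=True: the first projection packs gate and up halves;
    # the fused-kernel epilogue computes act(gate) * up before the down proj.
    gate, up = (x @ w13).chunk(2, dim=-1)
    return (F.silu(gate) * up) @ w2

def non_gated_expert(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    # is_act_and_mul=False (NemotronH): a single up projection followed by
    # Squared ReLU, so the existing act-and-mul epilogues do not apply.
    return torch.relu(x @ w1).square() @ w2
```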
```python
if (
    envs.VLLM_USE_FLASHINFER_MOE_FP8
    and has_flashinfer_moe()
    and self.moe.is_act_and_mul
):
```
For NemotronH, self.flashinfer_moe_backend will end up being None. What implementation ends up getting used in this case?
Triton kernels. This is currently the only code path available with is_act_and_mul=False.
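Restated as a simplified selection (a sketch of the fallthrough, not the exact vLLM control flow):

```python
def select_moe_backend(is_act_and_mul: bool, flashinfer_fp8_enabled: bool,
                       flashinfer_available: bool) -> str:
    # FlashInfer FP8 MoE is only eligible for gated MoE; with
    # is_act_and_mul=False (e.g. NemotronH) the Triton fused-MoE
    # kernels are currently the only available path.
    if flashinfer_fp8_enabled and flashinfer_available and is_act_and_mul:
        return "flashinfer"
    return "triton"
```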
I suspect this is going to be very complicated to add to all the quant and kernel backends
Agreed. We can follow up on this discussion internally.
```python
            num_redundant_experts=self.n_redundant_experts,
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
```
For DP+TP cases, we should use the sequence parallel trick like in #24982 to avoid duplicate work in the expert layers
Done. Thanks for the pointer :)
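For reference, a hand-rolled sketch of the trick (assumes the token count splits evenly across the TP group; the actual implementation in #24982 differs):

```python
import torch
import torch.distributed as dist

def sequence_parallel_moe(hidden_states: torch.Tensor, moe, tp_group) -> torch.Tensor:
    # Under DP+TP each TP rank holds the same tokens, so running the expert
    # layer on all of them duplicates work. Instead: shard the sequence,
    # run the experts on one shard per rank, then all-gather the outputs.
    tp_size = dist.get_world_size(tp_group)
    rank = dist.get_rank(tp_group)
    shard = hidden_states.chunk(tp_size, dim=0)[rank]  # even split assumed
    out_shard = moe(shard)
    out = torch.empty_like(hidden_states)
    dist.all_gather_into_tensor(out, out_shard.contiguous(), group=tp_group)
    return out
```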
…nd returns True on is_mixture_of_experts(model) only if it actually has moe layers. Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…o step_forward * 'step_forward' of https://github.com/raindaywhu/vllm: (148 commits) [Model] Add MoE support for NemotronH (vllm-project#25863) …
Purpose
Add support for an MoE module in the NemotronH architecture.
This MoE module is relatively rare (to the best of my knowledge, comparable only to nomic-ai/nomic-embed-text-v2-moe), as it uses a non-gated Squared ReLU activation function.
In this PR:
- Add a `NemotronHMoE` module to the NemotronH modeling file
- Add support for non-gated MoE (`is_act_and_mul=False`) through the `FusedMoE` class (in addition to by calling the `fused_moe` function directly)
- Add support for non-gated MoE with the `ModelOptFp8MoEMethod` quant_method, currently only in the triton path