[Model] Add MoE support for NemotronH #25863
Conversation
…edReLu activation - adapt the FusedMoE object to support is_act_and_mul=False Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…s an attribute in FusedMoE Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Code Review
This pull request adds support for a non-gated Squared ReLU MoE module in the NemotronH architecture, which is a valuable enhancement. The changes are mostly well-implemented across the fused MoE layers and model definition. However, I've identified a critical bug in the forward pass of the new NemotronHMoE module related to incorrect floating-point computation and a potential UnboundLocalError. I've provided a detailed comment with a suggested fix for this issue. Addressing this is crucial for the correctness of the model's output.
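For context, a minimal sketch of the failure mode being described, with hypothetical names standing in for the real NemotronHMoE submodules:

```python
def moe_forward_sketch(hidden_states, experts, gate, shared_experts=None):
    # Hypothetical illustration only, not the actual NemotronHMoE code.
    if shared_experts is not None:
        shared_output = shared_experts(hidden_states)
    routed_output = experts(hidden_states, gate(hidden_states))
    # Bug pattern: when shared_experts is None, shared_output was never
    # bound, so this line raises UnboundLocalError. The fix is to
    # initialize shared_output before the branch (e.g. to a zeros tensor).
    return routed_output + shared_output
```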
To the reviewer(s)
NemotronHForCausalLM now optionally has an MoE block. I was wondering if it should implement the MixtureOfExperts interface or not. Do you have any guidance?
We might need to do something similar to this PR #25311 (comment), where is_mixture_of_experts depends on an attribute of the model. I don't know all the cases where this is used, though.
Thanks for the input. Done
RE your comment in the other PR - I think that checking whether getattr(model, "num_moe_layers", 0) > 0 in is_mixture_of_experts makes sense, since all models implementing MixtureOfExperts are expected to initialize num_moe_layers as it is part of the interface. So I don't think it is too fragile.
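Concretely, the check described above amounts to this (a minimal sketch; the actual vLLM helper may differ):

```python
def is_mixture_of_experts(model) -> bool:
    # Models implementing the MixtureOfExperts interface are expected to
    # initialize num_moe_layers, so a model configured with zero MoE layers
    # (e.g. a dense NemotronH variant) is treated as non-MoE.
    return getattr(model, "num_moe_layers", 0) > 0
```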
…xperts Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
force-pushed from bf2285e to 7fff9a8
This pull request has merge conflicts that must be resolved before it can be merged.
```python
if not self.moe_config.is_act_and_mul:
    # Avoid circular import
    from vllm.model_executor.layers.quantization.modelopt import (
        ModelOptFp8MoEMethod,
    )

    if not isinstance(
        quant_method, (UnquantizedFusedMoEMethod, ModelOptFp8MoEMethod)
    ):
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for unquantized "
            "and ModelOpt FP8 moe for now"
        )
    if not current_platform.is_cuda():
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for CUDA for now"
        )
```
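For illustration, constructing a non-gated layer through FusedMoE would look roughly like this (the argument values and the activation name are assumptions; the real constructor takes many more parameters):

```python
# Rough sketch only: the actual FusedMoE signature in vLLM is larger, and
# the activation identifier for Squared ReLU is an assumption here.
experts = FusedMoE(
    num_experts=8,
    top_k=2,
    hidden_size=4096,
    intermediate_size=2048,
    activation="relu2",    # assumed name for Squared ReLU
    is_act_and_mul=False,  # single up projection, no gate branch
)
```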
What are the blockers for supporting is_act_and_mul = False more generally?
Creating the relevant kernels :) We plan to follow up with that
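To make the kernel gap concrete, here is a plain-PyTorch sketch (not the fused kernels) of what a single expert computes in each mode; the fused paths need a separate epilogue for the non-gated case:

```python
import torch
import torch.nn.functional as F

def gated_expert(x: torch.Tensor, w13: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    # is_act_and_mul=True: the first projection packs gate and up halves;
    # the fused-kernel epilogue computes act(gate) * up before the down proj.
    gate, up = (x @ w13).chunk(2, dim=-1)
    return (F.silu(gate) * up) @ w2

def non_gated_expert(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    # is_act_and_mul=False (NemotronH): a single up projection followed by
    # Squared ReLU, so the existing act-and-mul epilogues do not apply.
    return torch.relu(x @ w1).square() @ w2
```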
```python
if (
    envs.VLLM_USE_FLASHINFER_MOE_FP8
    and has_flashinfer_moe()
    and self.moe.is_act_and_mul
):
```
For NemotronH, self.flashinfer_moe_backend will end up being None. What implementation ends up getting used in this case?
Triton kernels. This is currently the only code path available with is_act_and_mul=False.
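Restated as a simplified selection (a sketch of the fallthrough, not the exact vLLM control flow):

```python
def select_moe_backend(is_act_and_mul: bool, flashinfer_fp8_enabled: bool,
                       flashinfer_available: bool) -> str:
    # FlashInfer FP8 MoE is only eligible for gated MoE; with
    # is_act_and_mul=False (e.g. NemotronH) the Triton fused-MoE
    # kernels are currently the only available path.
    if flashinfer_fp8_enabled and flashinfer_available and is_act_and_mul:
        return "flashinfer"
    return "triton"
```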
I suspect this is going to be very complicated to add to all the quant and kernel backends
Agreed. We can follow up on this discussion internally.
```python
            num_redundant_experts=self.n_redundant_experts,
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
```
For DP+TP cases, we should use the sequence parallel trick like in #24982 to avoid duplicate work in the expert layers
Done. Thanks for the pointer :)
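For reference, a hand-rolled sketch of the trick (assumes the token count splits evenly across the TP group; the actual implementation in #24982 differs):

```python
import torch
import torch.distributed as dist

def sequence_parallel_moe(hidden_states: torch.Tensor, moe, tp_group) -> torch.Tensor:
    # Under DP+TP each TP rank holds the same tokens, so running the expert
    # layer on all of them duplicates work. Instead: shard the sequence,
    # run the experts on one shard per rank, then all-gather the outputs.
    tp_size = dist.get_world_size(tp_group)
    rank = dist.get_rank(tp_group)
    shard = hidden_states.chunk(tp_size, dim=0)[rank]  # even split assumed
    out_shard = moe(shard)
    out = torch.empty_like(hidden_states)
    dist.all_gather_into_tensor(out, out_shard.contiguous(), group=tp_group)
    return out
```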
…nd returns True on is_mixture_of_experts(model) only if it actually has moe layers. Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…o step_forward * 'step_forward' of https://github.com/raindaywhu/vllm: (148 commits) [Model] Add MoE support for NemotronH (vllm-project#25863) …
Purpose
Add support for an MoE module in the NemotronH architecture.
This MoE module is relatively rare (to the best of my knowledge, comparable only to nomic-ai/nomic-embed-text-v2-moe), as it uses a non-gated Squared ReLU activation function.
In this PR:
- Add a `NemotronHMoE` module to the NemotronH modeling file
- Add support for non-gated MoE (`is_act_and_mul=False`) through the `FusedMoE` class (in addition to by calling the `fused_moe` function directly)
- Add support for non-gated MoE with the `ModelOptFp8MoEMethod` quant_method, currently only in the triton path