Bugfix: Pass router logits dtype in nemotron shared experts #32669

Merged
tlrmchlsmth merged 2 commits into vllm-project:main from amirkl94:bugfix/nemotron-router-dtype
Jan 29, 2026

@amirkl94 (Contributor) commented Jan 20, 2026

Purpose

A change introduced in this PR requires passing router_logits_dtype to the MoE layer.

When running with dp > 1 and the flashinfer cutlass MoE kernel in nvfp4, the following assertion fails:

assert self.batched_router_logits.dtype == full_router_logits.dtype, (
ERROR 01-19 05:53:49 [multiproc_executor.py:839]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-19 05:53:49 [multiproc_executor.py:839] AssertionError: torch.bfloat16 == torch.float32
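
The failure mode can be sketched without vLLM or PyTorch. The toy below is an illustration under stated assumptions, not vLLM's code: `ToyTensor`, `ToyMoELayer`, and the string dtype stand-ins are hypothetical names, and only the shape of the assertion and the two dtypes are taken from the log above. It assumes the staging buffer falls back to the model dtype when no `router_logits_dtype` is passed, which is one plausible way to arrive at this mismatch.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in strings for torch dtypes so the sketch runs without PyTorch.
BFLOAT16 = "torch.bfloat16"
FLOAT32 = "torch.float32"


@dataclass
class ToyTensor:
    """Hypothetical tensor that carries only a dtype."""
    dtype: str


class ToyMoELayer:
    """Hypothetical MoE layer that stages router logits in a pre-allocated buffer."""

    def __init__(self, model_dtype: str, router_logits_dtype: Optional[str] = None):
        # Assumption: with no explicit dtype, the buffer uses the model dtype.
        buffer_dtype = router_logits_dtype if router_logits_dtype is not None else model_dtype
        self.batched_router_logits = ToyTensor(buffer_dtype)

    def forward(self, full_router_logits: ToyTensor) -> str:
        # Same comparison as the assertion in the log above.
        assert self.batched_router_logits.dtype == full_router_logits.dtype, (
            f"{self.batched_router_logits.dtype} == {full_router_logits.dtype}"
        )
        return "ok"


if __name__ == "__main__":
    # The gate emits float32 logits, but the layer was built for a bfloat16 model.
    layer = ToyMoELayer(model_dtype=BFLOAT16)
    try:
        layer.forward(ToyTensor(FLOAT32))
    except AssertionError as exc:
        print(f"AssertionError: {exc}")
```

In this toy, constructing the layer with `router_logits_dtype=FLOAT32` makes the buffer match the incoming logits and the assertion passes.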

Test Result

Verified that the error no longer occurs when running with dp > 1 and the flashinfer cutlass MoE kernel.

Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
@mergify mergify bot added the bug Something isn't working label Jan 20, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes a dtype mismatch assertion in the MoE layer when data parallelism is enabled. The change introduces a router_logits_dtype variable, consistently using torch.float32 for the gating layer and the MoE implementation. This resolves the bug by ensuring dtype consistency for router logits. The fix is minimal, targeted, and well-implemented.
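
The pattern described in this review, one dtype variable shared by the gate and the MoE layer, can be sketched as follows. This is a framework-free toy, not the merged diff: `ToyGate`, `ToyExperts`, `build_block`, and the string dtype stand-ins are hypothetical, and the sketch assumes the gate emits logits in its own parameter dtype.

```python
from dataclasses import dataclass
from typing import Tuple

# Stand-in strings for torch dtypes so the sketch runs without PyTorch.
BFLOAT16 = "torch.bfloat16"
FLOAT32 = "torch.float32"


@dataclass
class ToyGate:
    """Hypothetical router: assumed to emit logits in its parameter dtype."""
    params_dtype: str

    def logits_dtype(self) -> str:
        return self.params_dtype


@dataclass
class ToyExperts:
    """Hypothetical MoE layer that must know the router logits dtype up front."""
    router_logits_dtype: str

    def accepts(self, logits_dtype: str) -> bool:
        return self.router_logits_dtype == logits_dtype


def build_block(model_dtype: str) -> Tuple[ToyGate, ToyExperts]:
    # Single source of truth: both components read the same variable, so the
    # model dtype (e.g. bfloat16) cannot leak into only one of them.
    router_logits_dtype = FLOAT32
    gate = ToyGate(params_dtype=router_logits_dtype)
    experts = ToyExperts(router_logits_dtype=router_logits_dtype)
    return gate, experts


if __name__ == "__main__":
    gate, experts = build_block(model_dtype=BFLOAT16)
    print(experts.accepts(gate.logits_dtype()))
```

Deriving both arguments from one variable keeps the two dtypes in step if the router precision is ever changed.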

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 26, 2026
Member

@tlrmchlsmth tlrmchlsmth left a comment

Thanks for the fix!

@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) January 26, 2026 16:00
@robertgshaw2-redhat
Collaborator

rebased for CI

@tlrmchlsmth tlrmchlsmth merged commit e01ff5c into vllm-project:main Jan 29, 2026
53 of 54 checks passed
apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
…ject#32669)

Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…ject#32669)

Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
Signed-off-by: PiratePai <416932041@qq.com>
Signed-off-by: Pai <416932041@qq.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…ject#32669)

Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
Labels

bug: Something isn't working
ready: ONLY add when PR is ready to merge/full CI is needed

3 participants