
Tiny fix trtllm_fp8_per_tensor_scale_moe_wrapper router_logits dtype#22006

Merged
Qiaolin-Yu merged 2 commits into main from fix_per_tensor
Apr 6, 2026

Conversation

@Qiaolin-Yu (Collaborator)

Motivation

https://github.com/flashinfer-ai/flashinfer/blob/fe0539318dcc31c76a33a7ed2ab0ee3c94fe6bad/csrc/trtllm_fused_moe_kernel_launcher.cu#L1789

The dtype of router_logits should be float32 for the DeepSeek routing method.

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@Qiaolin-Yu (Collaborator, Author)

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Apr 3, 2026
@b8zhong (Collaborator)

b8zhong commented Apr 4, 2026

Btw, this bug has happened before (#14350; there may be one more instance, but I can't find it...)

@b8zhong b8zhong enabled auto-merge (squash) April 4, 2026 00:16
@Qiaolin-Yu Qiaolin-Yu disabled auto-merge April 6, 2026 04:11
@Qiaolin-Yu Qiaolin-Yu merged commit f407461 into main Apr 6, 2026
231 of 267 checks passed
@Qiaolin-Yu Qiaolin-Yu deleted the fix_per_tensor branch April 6, 2026 04:11
# during torch.compile for piecewise cuda graph.
# Use custom op wrapper for torch.compile compatibility.

# The DeepSeekV3 routing method requires float32 router logits.
Collaborator:
@leejnau @trevor-m is this true? If so, why didn't we run into issues before?

Collaborator:

Maybe will be fixed by flashinfer-ai/flashinfer#2993 ?

Collaborator:

The block-scale path already had this fix; I think we just never used per-tensor scaling before?

Collaborator:

Got it. We have never run DSV3/R1 with per-tensor FP8 before.
