
Support in-place update for trtllm_fp8_block_scale_moe #2739

Open
wzhao18 wants to merge 3 commits into flashinfer-ai:main from wzhao18:wzhao/fix-trtllm-fp8-moe

Conversation


@wzhao18 wzhao18 commented Mar 10, 2026

📌 Description

Fixes #2703. The trtllm_fp8_block_scale_routed_moe function accepts an output tensor as an argument, but did not write results into that buffer in place when one was provided. This PR fixes that.
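The fix follows a validate-or-allocate pattern. The sketch below is a minimal illustration of that pattern, assuming a check_shape_dtype_device helper like the one named in the review summaries; the function names and signatures here are illustrative stand-ins, not the actual flashinfer implementation.

```python
# Minimal sketch of the validate-or-allocate pattern this PR introduces.
# `check_shape_dtype_device` and `moe_output` are assumed/illustrative names.
import torch


def check_shape_dtype_device(t, shape, dtype, device, name):
    # Reject a caller-supplied buffer that the kernel could not safely write into.
    if tuple(t.shape) != tuple(shape):
        raise ValueError(f"{name} has shape {tuple(t.shape)}, expected {tuple(shape)}")
    if t.dtype != dtype:
        raise ValueError(f"{name} has dtype {t.dtype}, expected {dtype}")
    if t.device != torch.device(device):
        raise ValueError(f"{name} is on {t.device}, expected {device}")


def moe_output(hidden_states, output=None):
    num_tokens, hidden_size = hidden_states.shape
    if output is None:
        # No buffer supplied: allocate a BF16 result tensor on the input's device.
        output = torch.empty(
            num_tokens, hidden_size,
            dtype=torch.bfloat16, device=hidden_states.device,
        )
    else:
        # Buffer supplied: validate it, then let the kernel write into it in place.
        check_shape_dtype_device(
            output, (num_tokens, hidden_size),
            torch.bfloat16, hidden_states.device, "output",
        )
    return output
```

The key property is that a caller-supplied buffer is returned unchanged (same storage), so downstream code holding a reference to it sees the kernel's results.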

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features
    • MOE operations now accept optional pre-allocated output buffers; supplied buffers are validated (shape/dtype/device) and used in-place, otherwise a correctly-typed buffer is allocated automatically.
  • Tests
    • Tests updated to exercise in-place output buffering and subsequent post-processing of the buffer.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where the trtllm_fp8_block_scale_routed_moe function did not correctly perform in-place updates to its output buffer. The changes enable the function to accept and validate a pre-allocated output tensor, ensuring that computations are written directly to it. This enhancement improves memory management and correctness for scenarios requiring in-place operations within the fused MoE kernel.

Highlights

  • In-place Output Support: The trtllm_fp8_block_scale_routed_moe function now supports an optional output tensor for in-place updates, allowing the function to write results directly into a pre-allocated buffer.
  • Output Tensor Validation: Added validation for the provided output tensor's shape, data type, and device to ensure correctness when performing in-place updates.
  • Test Case Update: Modified the existing test case for trtllm_gen_fp8_routed_fused_moe to demonstrate and verify the new in-place update functionality by pre-allocating an output tensor.


Changelog
  • flashinfer/fused_moe/core.py
    • Modified the trtllm_fp8_block_scale_moe_op function to conditionally allocate the output tensor if not provided, or validate it if provided, enabling in-place updates.
  • tests/moe/test_trtllm_gen_routed_fused_moe.py
    • Updated the test_trtllm_gen_fp8_routed_fused_moe test to explicitly create an output tensor and pass it to the trtllm_fp8_block_scale_routed_moe function, verifying the in-place write behavior.


coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Updated trtllm_fp8_block_scale_moe_op to accept an optional output tensor and use it in-place after validating shape/dtype/device; when output is None the function allocates a BF16 output buffer. Tests adjusted to pass and validate a pre-allocated output buffer.

Changes

  • MoE Core Operator — flashinfer/fused_moe/core.py
    Signature changed: output: torch.Tensor → output: Optional[torch.Tensor]. If output is None, allocate a BF16 tensor of shape (num_tokens, hidden_size) on hidden_states.device; if provided, validate it via check_shape_dtype_device and use it in-place. Workspace and routing control flow unchanged.
  • MoE Tests — tests/moe/test_trtllm_gen_routed_fused_moe.py
    The test now pre-allocates a BF16 output buffer, passes it into the MoE call (in-place write), and converts the buffer to float for assertions instead of relying on the function to return a freshly allocated tensor.
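From the caller's side, the updated test pattern can be sketched as follows. Here fake_moe_op is a hypothetical stand-in for the real trtllm_fp8_block_scale_routed_moe call, used only to show the in-place contract being tested.

```python
# Sketch of the test-side usage: pre-allocate a BF16 buffer, pass it to the op,
# then assert the op wrote into that exact buffer. `fake_moe_op` is a stand-in.
import torch

num_tokens, hidden_size = 16, 32
hidden_states = torch.randn(num_tokens, hidden_size)

# The caller owns the buffer; the op must not replace it with a fresh allocation.
output = torch.empty(num_tokens, hidden_size, dtype=torch.bfloat16)


def fake_moe_op(hidden_states, output):
    # Stand-in for the real kernel: write results into the supplied buffer.
    output.copy_(hidden_states.to(torch.bfloat16))
    return output


result = fake_moe_op(hidden_states, output)

# Same storage: the result was written in place, not freshly allocated.
assert result.data_ptr() == output.data_ptr()

# As in the updated test, compare in float rather than BF16.
ref = hidden_states.to(torch.bfloat16).float()
torch.testing.assert_close(output.float(), ref)
```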

Sequence Diagram(s)

(Skipped — changes are a targeted I/O behavior fix and do not introduce new multi-component control flow requiring a sequence diagram.)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

run-ci, op: moe, op: moe-routing

Suggested reviewers

  • yzh119
  • bkryu
  • jimmyzho
  • nv-yunzheq

Poem

A rabbit hops where buffers grow, 🐇
I pass my output, not create it so.
In-place we write, no stray allocation,
BF16 lined up in tidy formation,
Hooray for less memory in the flow!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ — The title clearly and specifically describes the main change: adding support for in-place updates to the trtllm_fp8_block_scale_moe function, which matches the core objective of the PR.
  • Linked Issues check ✅ — The PR addresses issue #2703 by modifying trtllm_fp8_block_scale_moe_op to accept Optional[torch.Tensor] for output and to validate and use provided buffers instead of always allocating new ones.
  • Out of Scope Changes check ✅ — All changes are directly related to fixing the in-place output buffer issue: core.py modifies the signature and logic, and the test file validates the in-place behavior with an output buffer.
  • Description check ✅ — The PR description includes a clear issue reference (#2703), describes the fix applied, and confirms all pre-commit and test checklist items are complete.



@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly implements in-place updates for trtllm_fp8_block_scale_routed_moe by handling an optional output buffer. The changes ensure that if an output tensor is provided, it is used for the operation, and if not, a new one is allocated. The tests have also been updated to verify this new in-place functionality. I have one suggestion to improve type hint consistency for better code clarity.

@aleozlx aleozlx left a comment

lgtm


aleozlx commented Mar 19, 2026

/bot run

@flashinfer-bot

GitLab MR !433 has been created, and the CI pipeline #46551641 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot

[SUCCESS] Pipeline #46551641: 14/20 passed


wzhao18 commented Mar 20, 2026

@aleozlx CI seems to be passing?



Development

Successfully merging this pull request may close these issues.

[bug] trtllm_fp8_block_scale_moe_op output is not updated in-place

3 participants