
Support in-place update for trtllm_fp8_block_scale_moe #2739

Open
wzhao18 wants to merge 3 commits into flashinfer-ai:main from wzhao18:wzhao/fix-trtllm-fp8-moe

Conversation


@wzhao18 wzhao18 commented Mar 10, 2026

📌 Description

Fixes #2703. The trtllm_fp8_block_scale_routed_moe function accepts an output tensor as an argument, but did not write results into that buffer in place when one was provided. This PR fixes that.
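The fix follows a validate-or-allocate pattern. The sketch below is a minimal illustration of that pattern, assuming a check_shape_dtype_device helper like the one named in the review summaries; the function names and signatures here are illustrative stand-ins, not the actual flashinfer implementation.

```python
# Minimal sketch of the validate-or-allocate pattern this PR introduces.
# `check_shape_dtype_device` and `moe_output` are assumed/illustrative names.
import torch


def check_shape_dtype_device(t, shape, dtype, device, name):
    # Reject a caller-supplied buffer that the kernel could not safely write into.
    if tuple(t.shape) != tuple(shape):
        raise ValueError(f"{name} has shape {tuple(t.shape)}, expected {tuple(shape)}")
    if t.dtype != dtype:
        raise ValueError(f"{name} has dtype {t.dtype}, expected {dtype}")
    if t.device != torch.device(device):
        raise ValueError(f"{name} is on {t.device}, expected {device}")


def moe_output(hidden_states, output=None):
    num_tokens, hidden_size = hidden_states.shape
    if output is None:
        # No buffer supplied: allocate a BF16 result tensor on the input's device.
        output = torch.empty(
            num_tokens, hidden_size,
            dtype=torch.bfloat16, device=hidden_states.device,
        )
    else:
        # Buffer supplied: validate it, then let the kernel write into it in place.
        check_shape_dtype_device(
            output, (num_tokens, hidden_size),
            torch.bfloat16, hidden_states.device, "output",
        )
    return output
```

The key property is that a caller-supplied buffer is returned unchanged (same storage), so downstream code holding a reference to it sees the kernel's results.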

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features
    • MOE operations now accept optional pre-allocated output buffers; supplied buffers are validated (shape/dtype/device) and used in-place, otherwise a correctly-typed buffer is allocated automatically.
  • Tests
    • Tests updated to exercise in-place output buffering and subsequent post-processing of the buffer.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where the trtllm_fp8_block_scale_routed_moe function did not correctly perform in-place updates to its output buffer. The changes enable the function to accept and validate a pre-allocated output tensor, ensuring that computations are written directly to it. This enhancement improves memory management and correctness for scenarios requiring in-place operations within the fused MoE kernel.

Highlights

  • In-place Output Support: The trtllm_fp8_block_scale_routed_moe function now supports an optional output tensor for in-place updates, allowing the function to write results directly into a pre-allocated buffer.
  • Output Tensor Validation: Added validation for the provided output tensor's shape, data type, and device to ensure correctness when performing in-place updates.
  • Test Case Update: Modified the existing test case for trtllm_gen_fp8_routed_fused_moe to demonstrate and verify the new in-place update functionality by pre-allocating an output tensor.


Changelog
  • flashinfer/fused_moe/core.py
    • Modified the trtllm_fp8_block_scale_moe_op function to conditionally allocate the output tensor if not provided, or validate it if provided, enabling in-place updates.
  • tests/moe/test_trtllm_gen_routed_fused_moe.py
    • Updated the test_trtllm_gen_fp8_routed_fused_moe test to explicitly create an output tensor and pass it to the trtllm_fp8_block_scale_routed_moe function, verifying the in-place write behavior.


coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Updated trtllm_fp8_block_scale_moe_op to accept an optional output tensor and use it in-place after validating shape/dtype/device; when output is None the function allocates a BF16 output buffer. Tests adjusted to pass and validate a pre-allocated output buffer.

Changes

  • MoE Core Operator — flashinfer/fused_moe/core.py
    Signature changed: output: torch.Tensor → output: Optional[torch.Tensor]. If output is None, allocate a BF16 tensor of shape (num_tokens, hidden_size) on hidden_states.device; if provided, validate it via check_shape_dtype_device and use it in-place. Workspace and routing control flow unchanged.
  • MoE Tests — tests/moe/test_trtllm_gen_routed_fused_moe.py
    The test now pre-allocates a BF16 output buffer, passes it into the MoE call (in-place write), and converts the buffer to float for assertions instead of relying on the function to return a freshly allocated tensor.
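From the caller's side, the updated test pattern can be sketched as follows. Here fake_moe_op is a hypothetical stand-in for the real trtllm_fp8_block_scale_routed_moe call, used only to show the in-place contract being tested.

```python
# Sketch of the test-side usage: pre-allocate a BF16 buffer, pass it to the op,
# then assert the op wrote into that exact buffer. `fake_moe_op` is a stand-in.
import torch

num_tokens, hidden_size = 16, 32
hidden_states = torch.randn(num_tokens, hidden_size)

# The caller owns the buffer; the op must not replace it with a fresh allocation.
output = torch.empty(num_tokens, hidden_size, dtype=torch.bfloat16)


def fake_moe_op(hidden_states, output):
    # Stand-in for the real kernel: write results into the supplied buffer.
    output.copy_(hidden_states.to(torch.bfloat16))
    return output


result = fake_moe_op(hidden_states, output)

# Same storage: the result was written in place, not freshly allocated.
assert result.data_ptr() == output.data_ptr()

# As in the updated test, compare in float rather than BF16.
ref = hidden_states.to(torch.bfloat16).float()
torch.testing.assert_close(output.float(), ref)
```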

Sequence Diagram(s)

(Skipped — changes are a targeted I/O behavior fix and do not introduce new multi-component control flow requiring a sequence diagram.)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

run-ci, op: moe, op: moe-routing

Suggested reviewers

  • yzh119
  • bkryu
  • jimmyzho
  • nv-yunzheq

Poem

A rabbit hops where buffers grow, 🐇
I pass my output, not create it so.
In-place we write, no stray allocation,
BF16 lined up in tidy formation,
Hooray for less memory in the flow!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ — The title clearly and specifically describes the main change: adding support for in-place updates to the trtllm_fp8_block_scale_moe function, which matches the core objective of the PR.
  • Linked Issues check ✅ — The PR addresses issue #2703 by modifying trtllm_fp8_block_scale_moe_op to accept Optional[torch.Tensor] for output and to validate and use provided buffers instead of always allocating new ones.
  • Out of Scope Changes check ✅ — All changes are directly related to fixing the in-place output buffer issue: core.py modifies the signature and logic, and the test file validates the in-place behavior with an output buffer.
  • Description check ✅ — The PR description includes a clear issue reference (#2703), describes the fix applied, and confirms all pre-commit and test checklist items are complete.



@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly implements in-place updates for trtllm_fp8_block_scale_routed_moe by handling an optional output buffer. The changes ensure that if an output tensor is provided, it is used for the operation, and if not, a new one is allocated. The tests have also been updated to verify this new in-place functionality. I have one suggestion to improve type hint consistency for better code clarity.

@aleozlx aleozlx left a comment

lgtm


aleozlx commented Mar 19, 2026

/bot run

@flashinfer-bot

GitLab MR !433 has been created, and the CI pipeline #46551641 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot

[SUCCESS] Pipeline #46551641: 14/20 passed


wzhao18 commented Mar 20, 2026

@aleozlx CI seems to be passing?



Development

Successfully merging this pull request may close these issues.

[bug] trtllm_fp8_block_scale_moe_op output is not updated in-place

3 participants