feat: Support unpadded output hidden size for trtllm_fp4_block_scale_moe (#2217)
Conversation
**Walkthrough**

The changes modify the FP4 block-scale MoE kernel and Python wrapper to derive output dimensions directly from the actual output tensor shape rather than from internal assumptions. Validation checks are added to pre-allocated output tensors to enforce shape and dtype consistency.
Code Review
This pull request introduces support for unpadded output hidden sizes in `trtllm_fp4_block_scale_moe`, which is a valuable optimization to avoid unnecessary slicing operations. The changes made to the C++ kernel launcher and the Python operator are well-implemented and align with the stated goal. I've identified a minor issue in the corresponding "fake" operator implementation, which could result in incorrect behavior under specific conditions. My detailed feedback is provided in the review comments.
Actionable comments posted: 0
🧹 Nitpick comments (1)
flashinfer/fused_moe/core.py (1)
1756-1765: **Consider additional validation for edge cases.**

The validation logic correctly ensures the output tensor has the right dtype, device, and shape constraints. However, consider adding checks for two edge cases:

- **Dimension check:** verify `output.ndim == 2` before accessing `output.shape[0]` and `output.shape[1]` to provide a clearer error message if the wrong dimensionality is provided.
- **Minimum size check:** consider validating that `output.shape[1] > 0` to catch cases where a zero-sized output dimension is provided, which may not be meaningful for MoE operations.

Apply this diff to add the validation:

```diff
 else:
     check_shape_dtype_device(
         output, None, torch.bfloat16, hidden_states.device, "output"
     )
+    assert output.ndim == 2, (
+        f"output must be 2D, got {output.ndim}D"
+    )
     assert output.shape[0] == num_tokens, (
         f"output.shape[0]={output.shape[0]} must be equal to {num_tokens}"
     )
     assert output.shape[1] <= hidden_size, (
         f"output.shape[1]={output.shape[1]} must be less than or equal to {hidden_size}"
     )
+    assert output.shape[1] > 0, (
+        f"output.shape[1]={output.shape[1]} must be greater than 0"
+    )
```
📒 Files selected for processing (2)

- `csrc/trtllm_fused_moe_kernel_launcher.cu` (1 hunks)
- `flashinfer/fused_moe/core.py` (2 hunks)
🔇 Additional comments (2)
flashinfer/fused_moe/core.py (1)
1912-1912: **LGTM! Fake implementation now consistent with actual behavior.**

This change correctly derives `hidden_size` from the provided output tensor when available, ensuring the fake implementation (used for tracing/compilation) matches the actual kernel behavior. This maintains consistency with the C++ changes that use `output.size(1)` for `hidden_size_output`.

csrc/trtllm_fused_moe_kernel_launcher.cu (1)

1748-1748: **The kernel implementation correctly respects the `hidden_size_output` parameter.**

The finalize kernel uses `hiddenDim` (set from `hidden_size_output`) for output buffer calculations and loop bounds, separate from `hiddenDimPadded` (the internal padded dimension). Output writes are bounded by `numElemsInCol = params.hiddenDim / FINALIZE_ELEM_PER_THREAD`, ensuring no buffer overflow occurs when `output.size(1) < hidden_size`.
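As a rough mental model of that bounds behavior (a pure-Python sketch, not the real CUDA finalize kernel): the write loop is driven by the unpadded output width, so an output narrower than the internal padded dimension is never overrun.

```python
def finalize_rows(intermediate, hidden_size_output):
    """Copy only the first hidden_size_output columns of each padded row,
    mimicking how the finalize step bounds its writes by hiddenDim
    (the requested output width) rather than hiddenDimPadded."""
    return [row[:hidden_size_output] for row in intermediate]

padded = [[1, 2, 3, 0], [4, 5, 6, 0]]  # internal width (hiddenDimPadded) = 4
out = finalize_rows(padded, 3)         # requested width (hiddenDim) = 3
```

The padding columns simply never reach the output buffer, which is why no slicing is needed afterwards.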
/bot run
/bot run
```diff
 ):
     seq_len = hidden_states.shape[0]
-    hidden_size = hidden_states.shape[1]
+    hidden_size = hidden_states.shape[1] if output is None else output.shape[1]
```
Hi @elvischenv, just want to make sure my understanding is right: does this mean `hidden_states.shape[1]` is the effective hidden dimension, while `output.shape[1]` could be the padded hidden dimension?
I don't think so. `hidden_states.shape[1]` is already padded; the input should be padded before the MoE kernel. Before this PR, the API had no info about the original unpadded hidden dim, so the output had the same shape as `hidden_states`. We had to slice the output down to the original unpadded hidden size, which involves extra overhead.

This PR allows us to pass `output` with the unpadded hidden dim, and the kernel then writes the results in that dim by setting `args->hidden_size_output = output.size(1);`. Then we won't need the slice anymore.
Thank you for the explanation!
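To make the before/after concrete, here is a minimal pure-Python sketch of the two call patterns discussed above; `pad_cols` and `fake_moe` are stand-ins for illustration, not the flashinfer API.

```python
def pad_cols(rows, multiple):
    """Right-pad each row with zeros up to the next multiple of `multiple`."""
    width = len(rows[0])
    padded_w = -(-width // multiple) * multiple  # ceiling division
    return [row + [0.0] * (padded_w - width) for row in rows]

def fake_moe(x, out_width):
    """Stand-in for the MoE kernel: writes only the first out_width columns."""
    return [row[:out_width] for row in x]

hidden_states = pad_cols([[1.0, 2.0, 3.0]], 4)   # padded input, width 4

# Before this PR: output takes the padded width, so a slice is needed after.
out_old = fake_moe(hidden_states, 4)
out_old = [row[:3] for row in out_old]           # extra slicing step

# After this PR: the output is allocated with the unpadded width up front,
# and the kernel writes only output.shape[1] columns (hidden_size_output).
out_new = fake_moe(hidden_states, 3)
```

Both paths produce the same values; the second just skips the extra slice (and the copy it implies).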
```python
assert output.shape[0] == num_tokens, (
    f"output.shape[0]={output.shape[0]} must be equal to {num_tokens}"
)
assert output.shape[1] <= hidden_size, (
    f"output.shape[1]={output.shape[1]} must be less than or equal to {hidden_size}"
)
```
We do this assertion since the output could be unpadded, and therefore smaller than the (padded) hidden size.
[CANCELING] Pipeline #40207125: canceled
yzh119
left a comment
The failed UTs are not relevant; let's merge this first.
…moe (flashinfer-ai#2217)

## 📌 Description

Support unpadded output hidden size for `trtllm_fp4_block_scale_moe` by passing the `output` tensor with unpadded hidden size, so that we don't need to do slicing after MoE.

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Summary by CodeRabbit

* **Bug Fixes**
  * Corrected output dimension handling in MoE operations to use actual output tensor shapes rather than assumed values.
  * Enhanced validation for pre-allocated output tensors with stricter shape and dtype checks.
  * Improved consistency between different code execution paths for MoE operations.

Co-authored-by: yzh119 <zihaoy@nvidia.com>