fix: snap weight_scale_vec_size to handle block_scale_interleave padding for SM120 #2898
samuellees wants to merge 7 commits into flashinfer-ai:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request resolves a compatibility issue affecting NVIDIA SM120 GPUs when executing MXFP4 Mixture-of-Experts (MoE) CUTLASS kernels. The fix addresses an erroneous calculation of the weight_scale_vec_size.
📝 Walkthrough
Adjusted FP4/MXFP MoE scale-vector sizing in the TRT-LLM CUDA launcher to compute a raw vec size, snap it to 16 or 32, and validate against padded host scale tensors; added a CUDA-only regression test for an unaligned hidden size; and made FP4 scale-factor reshaping padding-aware.
Code Review
This pull request fixes a bug (issue #2847) where the weight_scale_vec_size was incorrectly calculated in fused MoE kernels when the hidden_size was not aligned to sf_block_size * 4. The fix in csrc/trtllm_fused_moe_kernel_launcher.cu introduces a more robust calculation for weight_scale_vec_size, snapping it to the nearest valid value (16 or 32) and adding a validation check for the scale tensor's numel(). To verify this, two new tests, test_moe_nvfp4_unaligned_hidden_size and test_moe_mxfp8_mxfp4_unaligned_hidden_size, have been added to tests/moe/test_trtllm_cutlass_fused_moe.py to specifically target and confirm the correction under these unaligned conditions. There are no review comments to provide feedback on.
/bot run
🧹 Nitpick comments (1)
tests/moe/test_trtllm_cutlass_fused_moe.py (1)
Line 1848: Optional: Replace lambda with def to satisfy Ruff E731. Static analysis flags the lambda assignment. While this is consistent with line 504 in the same file, consider using a local function for clarity.
Suggested fix:

```diff
- round_up = lambda x, y: (x + y - 1) // y * y
+ def round_up(x, y):
+     return (x + y - 1) // y * y
```
📒 Files selected for processing (2):
- csrc/trtllm_fused_moe_kernel_launcher.cu
- tests/moe/test_trtllm_cutlass_fused_moe.py
[FAILED] Pipeline #47049224: 8/20 passed

/bot run

[FAILED] Pipeline #47074816: 7/20 passed

/bot run

[FAILED] Pipeline #47192512: 8/20 passed
…ing (flashinfer-ai#2847)

block_scale_interleave pads scale columns to a multiple of 4 via round_up(cols, 4), which inflates gemm1_weights_scale.numel(). When hidden_size / sf_block_size is not a multiple of 4 (e.g. gpt-oss-120b with hidden_size=2880, sf_block_size=32 → 90 scale cols padded to 92), the reverse-computed weight_scale_vec_size becomes 31 instead of 32, failing the strict equality check and blocking SM120 MXFP4 MoE kernels.

Replace the hard-coded equality check with snap-to-nearest-valid (16 or 32) plus a round-trip validation ensuring the actual scale tensor is at least as large as the unpadded expectation. Add regression tests with hidden_size=2880 for both NVFP4 and MXFP8xMXFP4 MoE paths.

Closes flashinfer-ai#2847
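The snap logic described in the commit above can be sketched numerically. This is a minimal illustration using the gpt-oss-120b numbers from the commit message; `snap_vec_size` and the variable names are invented for the sketch, not the kernel's actual identifiers:

```python
def round_up(x, y):
    return (x + y - 1) // y * y

def snap_vec_size(raw):
    # Valid scale-vector sizes are 16 and 32; pick the nearest one.
    return min((16, 32), key=lambda v: abs(v - raw))

# gpt-oss-120b numbers from the commit message:
hidden_size, sf_block_size = 2880, 32
cols = hidden_size // sf_block_size   # 90 scale columns
padded_cols = round_up(cols, 4)       # padded to 92 by block_scale_interleave
raw = hidden_size // padded_cols      # 2880 // 92 = 31: fails a strict == 32 check
vec = snap_vec_size(raw)              # snaps back to 32

# Round-trip validation: the actual (padded) column count must be at least
# as large as the unpadded expectation for the snapped vec size.
assert padded_cols >= hidden_size // vec  # 92 >= 90
```

With aligned shapes the padding is a no-op and `raw` already equals a valid value, so the snap changes nothing there.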
…e padding

block_scale_interleave pads scale columns to a multiple of 4. For hidden_size=2880 (sf_cols=90, padded to 92), the kernel returns sf.numel() = rows * 92, and sf.reshape((-1, 90)) fails because 92 * rows is not divisible by 90.

Fix: swap to sf.reshape((input.shape[-2], -1)) so the row count is fixed and the column count is inferred from the actual (possibly padded) tensor size.

AI-assisted
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
block_scale_interleave pads in both dimensions:
- K/sf_vec_size is rounded up to the next multiple of 4 (scale cols)
- M is rounded up to the next multiple of 128 (scale rows)

The previous fix sf.reshape((input.shape[-2], -1)) handles padded columns but fails when M is not a multiple of 128 (e.g. M=48 → padded to 128 by kernel, but 1024/48 is not an integer).

Fix: use sf_cols = round_up(K // sf_vec_size, 4) as the fixed column dimension and let -1 infer the (possibly padded) row count. This handles both unaligned K (e.g. hidden_size=2880: 90 → 92 cols) and unaligned M (e.g. 48 rows → 128 in sf).

AI-assisted
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
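The reshape behavior above can be demonstrated with plain tensors. This is a sketch under the shapes named in the commit message (M=48, K=2880, sf_vec_size=32); the zero-filled tensor stands in for the kernel's real scale output:

```python
import torch

def round_up(x, y):
    return (x + y - 1) // y * y

M, K, sf_vec_size = 48, 2880, 32
sf_cols = round_up(K // sf_vec_size, 4)  # 90 -> 92 (column padding)
sf_rows = round_up(M, 128)               # 48 -> 128 (row padding)
sf = torch.zeros(sf_rows * sf_cols, dtype=torch.uint8)  # stand-in for kernel output

# Fixing the row count fails: 128 * 92 elements are not divisible by 48.
try:
    sf.reshape((M, -1))
except RuntimeError:
    pass  # expected for this shape

# Fixing the padded column count and letting -1 infer the rows works:
view = sf.reshape((-1, sf_cols))
assert view.shape == (sf_rows, sf_cols)  # (128, 92)
```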
Force-pushed 06de617 to 2ed47f5 (compare)
🧹 Nitpick comments (1)
tests/moe/test_trtllm_cutlass_fused_moe.py (1)
Line 1972: Consider using def instead of lambda for consistency with linting rules. Static analysis flags this lambda assignment. However, this pattern is already used elsewhere in the file (line 505), so keeping it consistent with the existing codebase is also reasonable.
♻️ Optional: Replace lambda with def

```diff
- round_up = lambda x, y: (x + y - 1) // y * y
+ def round_up(x, y):
+     return (x + y - 1) // y * y
```
📒 Files selected for processing (3):
- csrc/trtllm_fused_moe_kernel_launcher.cu
- flashinfer/quantization/fp4_quantization.py
- tests/moe/test_trtllm_cutlass_fused_moe.py

🚧 Files skipped from review as they are similar to previous changes (1):
- flashinfer/quantization/fp4_quantization.py
/bot run

[FAILED] Pipeline #47256853: 7/20 passed
…ble MXFP4 test

The sf reshape fix must be conditional on is_sf_swizzled_layout:
- Swizzled: sf includes row/column padding from block_scale_interleave, so use round_up(K/sf_vec_size, 4) as column count
- Non-swizzled: sf has exactly M * (K/sf_vec_size) elements, no padding, so use K/sf_vec_size directly (original behavior)

Remove test_moe_mxfp8_mxfp4_unaligned_hidden_size because the MXFP4 MoE kernel requires hidden_size % 128 == 0, and when that constraint holds, hidden_size/32 is always a multiple of 4, making the column-padding scenario impossible.

AI-assisted
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
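The conditional described in this commit could look roughly like the following. This is a sketch, not the library's actual code; `reshape_sf` and its signature are invented for illustration:

```python
import torch

def round_up(x, y):
    return (x + y - 1) // y * y

def reshape_sf(sf, m, k, sf_vec_size, is_sf_swizzled_layout):
    """Padding-aware view of a scale-factor tensor (illustrative sketch)."""
    if is_sf_swizzled_layout:
        # Swizzled: block_scale_interleave pads cols to a multiple of 4 and
        # rows to a multiple of 128, so fix the padded column count and let
        # -1 absorb the (possibly padded) row count.
        sf_cols = round_up(k // sf_vec_size, 4)
        return sf.reshape((-1, sf_cols))
    # Non-swizzled: exactly m * (k // sf_vec_size) elements, no padding.
    return sf.reshape((m, k // sf_vec_size))

# Swizzled case: 48 rows padded to 128, 90 cols padded to 92.
swizzled = torch.zeros(128 * 92, dtype=torch.uint8)
assert reshape_sf(swizzled, 48, 2880, 32, True).shape == (128, 92)

# Non-swizzled case: exactly 48 * 90 elements, original behavior.
plain = torch.zeros(48 * 90, dtype=torch.uint8)
assert reshape_sf(plain, 48, 2880, 32, False).shape == (48, 90)
```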
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/moe/test_trtllm_cutlass_fused_moe.py (1)
Lines 2084-2087: Avoid ambiguous Unicode multiplication sign in comments. Replace × with x to satisfy RUF003 and avoid encoding/search inconsistencies in tooling.
🤖 Review comments

In tests/moe/test_trtllm_cutlass_fused_moe.py:

- Lines 1944-1947: Replace the direct torch.cuda.get_device_capability() check in the pytest.mark.skipif with the flashinfer.utils capability helpers (get_compute_capability and/or is_sm100a_supported) to determine NVFP4 support; update the skip reason text if needed and keep the skip guard consistent with other tests.
- Line 1972: Replace the lambda assignment round_up = lambda x, y: (x + y - 1) // y * y with a named function (def round_up(x, y): return (x + y - 1) // y * y) to satisfy Ruff E731, updating usages in the same scope.
📒 Files selected for processing (2):
- flashinfer/quantization/fp4_quantization.py
- tests/moe/test_trtllm_cutlass_fused_moe.py

🚧 Files skipped from review as they are similar to previous changes (1):
- flashinfer/quantization/fp4_quantization.py
/bot run
- Use flashinfer.utils helpers (is_sm100a_supported, is_sm12x_supported) instead of torch.cuda.get_device_capability() in skipif
- Replace lambda with def for round_up to satisfy Ruff E731
- Replace Unicode multiplication sign with ASCII 'x' (RUF003)

AI-assisted
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Review comments

In tests/moe/test_trtllm_cutlass_fused_moe.py:

- Lines 1971-1975: Pre-commit formatting failed for this file; run `pre-commit run --all-files` (or `ruff format`), which will reformat the block containing torch.manual_seed(42), quant_blocksize = 16, the round_up(...) function, and e = num_experts, then commit the formatting-only diff.
- Line 1972: The test uses quant_blocksize = 16, which with hidden_size = 2880 yields k // quant_blocksize = 180 and no padding, so the padded-scale-columns path is never exercised. Change quant_blocksize (or hidden_size) to force a non-multiple-of-4 column count (e.g. quant_blocksize = 32, so k // quant_blocksize becomes 90 and round_up(90, 4) → 92) to hit the padding codepath referenced in the docstring; update the related variables and assertions so the test actually validates the padded-columns behavior.
📒 Files selected for processing (1):
- tests/moe/test_trtllm_cutlass_fused_moe.py
```python
torch.manual_seed(42)
quant_blocksize = 16

def round_up(x, y):
    return (x + y - 1) // y * y

e = num_experts
```
Pre-commit formatting is still failing on this file. CI reports ruff-format changed it; please run `pre-commit run --all-files` (or `ruff format`) and commit the formatting-only delta.
[FAILED] Pipeline #47349540: 6/20 passed
hidden_size=2880 with quant_blocksize=16 gives 2880/16 = 180 scale cols, which is already divisible by 4 -- no padding occurs, so the test did not exercise the weight_scale_vec_size snap fix. Change to hidden_size=288: 288/16 = 18 cols, round_up(18, 4) = 20, which triggers the block_scale_interleave padding path.

Also fix ruff-format: add blank lines around nested def.

AI-assisted
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
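The arithmetic behind this commit is easy to verify with the same round_up helper the tests use:

```python
def round_up(x, y):
    return (x + y - 1) // y * y

# hidden_size=2880, quant_blocksize=16: 180 scale cols, already a multiple
# of 4, so block_scale_interleave adds no padding and the snap path is idle.
assert 2880 // 16 == 180
assert round_up(180, 4) == 180

# hidden_size=288, quant_blocksize=16: 18 cols, padded to 20, which
# does exercise the weight_scale_vec_size snap fix.
assert 288 // 16 == 18
assert round_up(18, 4) == 20
```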
/bot run

[FAILED] Pipeline #47443267: 5/20 passed
Fixes #2847

SM120 GPUs (RTX PRO 6000, RTX 5090) cannot use MXFP4 MoE CUTLASS kernels because the weight_scale_vec_size check in trtllm_fp4_block_scale_moe rejects valid inputs.

Root cause: block_scale_interleave pads scale columns via round_up(cols, 4), inflating gemm1_weights_scale.numel(). When hidden_size / sf_block_size is not a multiple of 4 (e.g. gpt-oss-120b: hidden_size=2880, sf_block_size=32 → 90 scale cols padded to 92), the reverse-computed weight_scale_vec_size becomes 31 instead of 32, failing the strict equality check.

Fix: Replace the hard-coded equality check with snap-to-nearest-valid (16 or 32) plus a round-trip validation ensuring the actual scale tensor numel is at least as large as the unpadded expectation.

Changes
- csrc/trtllm_fused_moe_kernel_launcher.cu: Replace the weight_scale_vec_size check logic in trtllm_fp4_block_scale_moe
- tests/moe/test_trtllm_cutlass_fused_moe.py: Add test_moe_nvfp4_unaligned_hidden_size and test_moe_mxfp8_mxfp4_unaligned_hidden_size with hidden_size=2880 to cover the padding scenario

Test plan
pytest tests/moe/test_trtllm_cutlass_fused_moe.py -x