Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
📝 Walkthrough: Replaced manual padding with a consistent round-up alignment across the swizzled and non-swizzled FP4 scale-factor paths.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive).
Force-pushed from c23f79d to ce48d4f.
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical padding issue within the FP4 quantization path.
Code Review
This pull request aims to add necessary padding for linear scale factors in FP4 quantization, a change required for certain hardware paths and crucial for preventing potential buffer overflows in the underlying CUDA kernels. However, the current implementation introduces a critical NameError where the padding variable out_sf_size_padded is not defined in all code paths, specifically when using the default swizzled layout. This flaw will cause the function to crash in most scenarios, leading to a Denial of Service. Please address this NameError to ensure the stability and security of the FP4 quantization process.
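The flaw Gemini describes can be illustrated with a minimal, hypothetical sketch — a padded size assigned in only one branch, then used unconditionally. Here `round_up`, `buggy_alloc_sf`, and the alignment constant are illustrative stand-ins, not the actual flashinfer code:

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple

def buggy_alloc_sf(out_sf_size: int, is_sf_swizzled_layout: bool) -> bytearray:
    # Bug pattern: out_sf_size_padded is only assigned on the linear path,
    # so the default swizzled path raises at the allocation below.
    if not is_sf_swizzled_layout:
        out_sf_size_padded = round_up(out_sf_size, 16)
    return bytearray(out_sf_size_padded)

# The linear layout works; the default swizzled layout crashes.
buggy_alloc_sf(100, is_sf_swizzled_layout=False)  # ok
try:
    buggy_alloc_sf(100, is_sf_swizzled_layout=True)
except NameError:  # UnboundLocalError is a subclass of NameError
    print("NameError on the swizzled path")
```

The fix is simply to assign the padded size on every path (or only at the single allocation site) before it is used.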
🧹 Nitpick comments (1)
flashinfer/fp4_quantization.py (1)
215-230: Add a regression test for the linear-layout padding path.
This fix changes only the backing allocation while still returning the logical out_sf view, so it can regress silently. The existing FP4 tests shown here cover the swizzled reshape/unswizzle flow, not fp4_quantize(..., is_sf_swizzled_layout=False). Please add a case with a non-16-aligned m that is large enough to take the TMA path (per the PR notes, m >= 1024 on SM100) so this padding requirement stays locked in.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fp4_quantization.py` around lines 215 - 230, The tests miss exercising the linear-layout padding path in fp4_quantize, so add a regression test that calls module.fp4_quantize (or the public wrapper that returns out_val, out_sf) with is_sf_swizzled_layout=False and a non-16-aligned m large enough to trigger the TMA path (e.g., m >= 1024 on SM100) to ensure out_sf is a view of a larger out_sf backing allocation (created via out_sf_size_padded) but returned truncated to out_sf_size; the test should verify correct values and that out_sf.shape equals the logical size (out_sf_size) and that no indexing/overflow occurs when m is not 16-aligned.
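The allocate-padded-then-trim pattern this comment wants locked in can be modeled in plain Python. This is a sketch under assumptions: the 128-row/4-column alignment is borrowed from the NVFP4 sizing discussed elsewhere in this thread, and `alloc_linear_sf` is a made-up name, not the `fp4_quantize` implementation:

```python
def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple

def alloc_linear_sf(m: int, k: int, sf_vec_size: int = 16):
    """Allocate a padded backing buffer but return only the logical view.

    The kernel may write into the padded tail, so the backing store must be
    larger than the logical m * (k // sf_vec_size) scale-factor count.
    """
    scale_k = k // sf_vec_size
    out_sf_size = m * scale_k
    # Sketch: pad rows to 128 and scale columns to 4, mirroring the
    # swizzled-layout sizing, so out-of-bounds kernel writes stay in bounds.
    out_sf_size_padded = round_up(m, 128) * round_up(scale_k, 4)
    backing = bytearray(out_sf_size_padded)
    view = memoryview(backing)[:out_sf_size]  # logical, trimmed result
    return backing, view

backing, view = alloc_linear_sf(m=1000, k=7168)
```

A regression test would assert exactly this relation: the returned view has the logical size, while the backing allocation is strictly larger when m is not aligned.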
ℹ️ Review info
⚙️ Configuration used: defaults · Review profile: CHILL · Plan: Pro · Run ID: 1568cdf0-020d-4ee9-8873-becf76a93b93
📒 Files selected for processing (1): flashinfer/fp4_quantization.py
@rainj-me The unit test will be much more reliable when run with …
Probably reduce from 8192 to 4096 or something. But please add it to the unit tests, in case there is a further break.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/fp4_quantization.py (1)
233-245: ⚠️ Potential issue | 🔴 Critical — Fix fake op signature and output dtypes to match the real op.
The fake op _fake_fp4_quantize_sm100 has multiple critical issues that break torch.compile shape/dtype inference:
- Missing parameters: The real op has is_sf_8x4_layout and enable_pdl parameters that the fake op lacks. Signatures must mirror each other exactly.
- Wrong return dtypes: The real op returns (torch.uint8, torch.uint8), but the fake op returns (torch.int64, torch.int32). torch.compile depends on fake ops for dtype inference.
- Incorrect scale factor shape: When is_sf_swizzled_layout=True (the default), the real op uses _compute_swizzled_layout_sf_size(), which pads the output significantly. The fake op ignores this and always returns [m * k // sf_vec_size], causing shape mismatches during tracing.
Add the missing parameters to the fake op signature, correct the dtypes, and compute the proper swizzled layout size when needed.
🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fp4_quantization.py` around lines 233 - 245, The fake op _fake_fp4_quantize_sm100 must match the real op signature and outputs: add the missing parameters is_sf_8x4_layout and enable_pdl to the function signature, change the returned dtypes to torch.uint8 for both tensors, and compute the scale-factor length using _compute_swizzled_layout_sf_size(m, k, sf_vec_size) when is_sf_swizzled_layout is True (otherwise use m * k // sf_vec_size). Ensure the first return has shape [m, k // 2] dtype=torch.uint8 and the second return uses the computed swizzled/un-swizzled length dtype=torch.uint8.
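The shape and dtype bookkeeping a corrected fake op needs can be sketched without torch. `fake_fp4_quantize_shapes` is a hypothetical helper; the 128-row/4-column padding mirrors the round_up-based sizing quoted later in this thread, and the "uint8" strings stand in for torch.uint8:

```python
def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple

def fake_fp4_quantize_shapes(m, k, sf_vec_size, is_sf_swizzled_layout=True):
    """Return ((value_shape, dtype), (sf_shape, dtype)) for the fake op.

    Two FP4 values pack into one uint8, so the value tensor is [m, k // 2].
    The swizzled layout pads rows to 128 and scale columns to 4; the linear
    layout keeps the logical m * (k // sf_vec_size) length.
    """
    out = ((m, k // 2), "uint8")
    scale_k = k // sf_vec_size
    if is_sf_swizzled_layout:
        sf_len = round_up(m, 128) * round_up(scale_k, 4)
    else:
        sf_len = m * scale_k
    return out, ((sf_len,), "uint8")
```

A fake op built on this arithmetic returns metadata that agrees with the real kernel for both layouts, which is what torch.compile's tracing requires.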
ℹ️ Review info
⚙️ Configuration used: defaults · Review profile: CHILL · Plan: Pro · Run ID: 20d8cdf7-6ae7-427e-bbcd-2e54ad16bdf5
📒 Files selected for processing (1): flashinfer/fp4_quantization.py
/bot run
[FAILED] Pipeline #45817934: 7/20 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/utils/test_fp4_quantize_padding.py`:
- Around line 52-59: Remove the global torch.set_default_device(device) call to
avoid leaking global state; instead, create tensors with explicit device
arguments (e.g., change x = torch.randn((m, n), dtype=dtype) to pass
device=device). Ensure any other tensor factories in this test use device=device
as well; keep torch.manual_seed(seed) as is. This targets the
torch.set_default_device symbol and the torch.randn creation in the test.
- Around line 1-5: The module-level os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"]
assignment should be removed (set this in the test command/CI instead), and the
test's mutation of global device state via torch.set_default_device(device) must
be wrapped so the original default device is restored; capture the current
default device before calling torch.set_default_device(device) and use a
try/finally to call torch.set_default_device(original_device) in the finally
block (follow the pattern used in tests/utils/test_sampling.py), ensuring no
global torch state is left changed after the test.
ℹ️ Review info
⚙️ Configuration used: defaults · Review profile: CHILL · Plan: Pro · Run ID: 0bc45642-c173-4005-a510-34954c3fa288
📒 Files selected for processing (1): tests/utils/test_fp4_quantize_padding.py
/bot run
[CANCELING] Pipeline #45892595: canceled
Known issue. @nvjullin Please rebase this PR. Thanks.
FYI, must be rebased on top of #2737. |
Force-pushed from 2b58537 to d7033a2.
🧹 Nitpick comments (1)
flashinfer/fp4_quantization.py (1)
440-447: Consider extracting the repeated NVFP4 scale-buffer sizing.
The scale_k → padded_k → padded_m sequence now appears in four real/fake paths. Pulling it into a helper would make future padding changes much harder to miss in one branch.
Also applies to: 476-483, 524-531, 564-571
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fp4_quantization.py` around lines 440 - 447, Extract the repeated NVFP4 scale-buffer sizing logic (scale_k = k // sf_vec_size; padded_k = round_up(scale_k, 4); padded_k_int32 = padded_k // 4; padded_m = round_up(m, 128)) into a small helper function (e.g., compute_nvfp4_scale_dims or nvfp4_scale_sizes) and return the computed values (scale_k, padded_k, padded_k_int32, padded_m); then replace the duplicated blocks in the branches that build output/output_scales (the sites using scale_k, padded_k, padded_k_int32, padded_m such as where output = torch.empty(...) and output_scales = torch.empty(...)) to call the helper and use its results, keeping existing variable names (scale_k, padded_k_int32, padded_m) to minimize downstream changes.
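The helper the prompt describes is fully specified by the sizing expressions it quotes, so a sketch is straightforward (`compute_nvfp4_scale_dims` is the prompt's suggested name; `round_up` is assumed to be the usual round-up-to-multiple helper):

```python
def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple

def compute_nvfp4_scale_dims(m: int, k: int, sf_vec_size: int):
    """Centralize the NVFP4 scale-buffer sizing shared by real/fake paths."""
    scale_k = k // sf_vec_size          # scale factors per row
    padded_k = round_up(scale_k, 4)     # pad columns to a multiple of 4
    padded_k_int32 = padded_k // 4      # four uint8 scales per int32 word
    padded_m = round_up(m, 128)         # pad rows to the 128-row swizzle tile
    return scale_k, padded_k, padded_k_int32, padded_m
```

Each of the four real/fake branches would then unpack these values instead of repeating the arithmetic, so a future padding change lands in exactly one place.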
ℹ️ Review info
⚙️ Configuration used: defaults · Review profile: CHILL · Plan: Pro · Run ID: c0c2bf85-ad67-4de2-ad0c-4ec241ae6056
📒 Files selected for processing (2): flashinfer/fp4_quantization.py, tests/utils/test_fp4_quantize_padding.py
/bot run
[FAILED] Pipeline #45942317: 1/20 passed
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
📌 Description
Linear sf is missing padding (or the kernel shouldn't try to write to non-existent padding), which is required when quantize_with_block_size_tma is called. There are a few requirements to hit this code path, some notable ones being m >= 1024, n % 512 == 0, and SM100.
The offending code is csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh, lines 454 to 457 (at commit bcdf8d8).
Solves #2704.
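The dispatch conditions listed above can be collected into a tiny predicate, which is handy when parameterizing tests — illustrative only, since the real dispatch in quantization.cuh checks more than these three conditions:

```python
def may_hit_tma_quant_path(m: int, n: int, sm_arch: int) -> bool:
    """Rough guard for the quantize_with_block_size_tma path (per the notes
    above: m >= 1024, n divisible by 512, and SM100)."""
    return m >= 1024 and n % 512 == 0 and sm_arch == 100

assert may_hit_tma_quant_path(1024, 512, 100)
assert not may_hit_tma_quant_path(1000, 512, 100)  # m too small
```

A regression test can use this to pick shapes that actually exercise the TMA kernel rather than the fallback path.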
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [ ] I have installed pre-commit by running pip install pre-commit (or used your preferred method).
- [ ] I have installed the hooks with pre-commit install.
- [ ] I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.
🧪 Tests
- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (unittest, etc.).
Reviewer Notes
Summary by CodeRabbit
Bug Fixes
- Unified padding behavior for FP4 quantization by switching to a consistent round-up alignment across swizzled and non-swizzled paths, preventing misaligned allocations.
- Internal buffers now use padded allocations and are trimmed or reshaped before return, preserving public interfaces and avoiding layout/memory issues.
Tests
- Added tests that validate quantization padding, output shapes, and numerical correctness for unaligned inputs on CUDA across dtypes, with caching bypass and architecture/version guards.