[Fmha] revert blackwell ultra optimization that causes deadlocks.#2956
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📥 CommitsReviewing files that changed from the base of the PR and between 49f6f6732fd0a248ceef397741d513b4a8412d85 and 4885d5a. 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughUpdated TRTLLM_GEN_FMHA artifact repository hash and checksum in artifacts.py. Extended KernelParams TMA layout logic to compute and apply a conditional reshape factor for K/V and FP4 scaling-factor tensors to target ~128B-wide TMA tile boxes. Changes
Sequence Diagram(s)(Skipped — changes are internal algorithmic updates and an artifact metadata update; no multi-component sequential flow requiring visualization.) Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ❌ 3❌ Failed checks (3 warnings)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates artifact paths and introduces TMA box widening logic for K/V and Scale Factor (SF) tensors to optimize memory access. Critical issues were identified in the reshaping logic: the current implementation fails to update the underlying tensor shapes and strides to match the widened tiles, which will cause runtime errors. Furthermore, the Scale Factor reshaping calculation lacks robustness and could exceed the 128-byte TMA box width limit, leading to hardware-level failures.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@include/flashinfer/trtllm/fmha/kernelParams.h`:
- Around line 755-763: The reshapeFactorKvSf calculation can exceed the SF
descriptor’s pre-reshape column count causing tileShapeKvSf[1] to become zero;
compute the original SF column count (origSfCols = numKeysPerTile * maxHeadDimKv
/ 256, i.e., the unreshaped dim1) and cap reshapeFactorKvSf by that value and
then reduce it until it divides origSfCols cleanly (e.g., while(origSfCols %
reshapeFactorKvSf != 0) reshapeFactorKvSf /= 2 or choose the largest divisor <=
cap). Update the code around reshapeFactorKvSf and tileShapeKvSf so
reshapeFactorKvSf never exceeds origSfCols and always divides origSfCols, using
the existing symbols reshapeFactorKvSf, tileShapeKvSf, numKeysPerTile,
maxHeadDimKv, and NumEltsPerSf.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bd74778e-999b-441f-9172-9a9bbc63fa0c
📥 Commits
Reviewing files that changed from the base of the PR and between 637209a and 49f6f6732fd0a248ceef397741d513b4a8412d85.
📒 Files selected for processing (2)
flashinfer/artifacts.pyinclude/flashinfer/trtllm/fmha/kernelParams.h
|
/bot run |
|
[CANCELING] Pipeline #47554542: canceled |
|
/bot run |
|
[FAILED] Pipeline #47568921: 5/20 passed |
49f6f67 to
4885d5a
Compare
|
/bot run |
|
[FAILED] Pipeline #47598168: 11/20 passed |
bkryu
left a comment
There was a problem hiding this comment.
CI results look good to me.
I can also confirm that locally on a B300 that test_trtllm_gen_attention.py that used to hang now passes with
40328 passed, 54352 skipped in 2540.62s (0:42:20)
…evert blackwell ultra optimization that causes deadlocks); bump version to 0.6.7.post2 fix: [Fmha] revert blackwell ultra optimization that causes deadlocks. (#2956) <!-- .github/pull_request_template.md --> <!-- What does this PR do? Briefly describe the changes and why they’re needed. --> <!-- Link any related issues here --> Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete. - [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method). - [x] I have installed the hooks with `pre-commit install`. - [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues. > If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/). - [x] Tests have been added or updated as needed. - [x] All tests are passing (`unittest`, etc.). <!-- Optional: anything you'd like reviewers to focus on, concerns, etc. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> * **Chores** * Updated TRTLLM GEN FMHA artifact references and associated checksums used for download and verification. * **Refactor** * Improved kernel tile-shape handling for paged K/V cache and refined scaling-factor tensor layout to optimize TMA transfers and memory access patterns. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
📌 Description
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
pre-commitby runningpip install pre-commit(or used your preferred method).pre-commit install.pre-commit run --all-filesand fixed any reported issues.🧪 Tests
unittest, etc.).Reviewer Notes
Summary by CodeRabbit
Release Notes
Chores
Refactor