[Fmha] revert blackwell ultra optimization that causes deadlocks. #2956

Merged
bkryu merged 1 commit into flashinfer-ai:main from PerkzZheng:user/perkzz/fix-b300
Apr 3, 2026

@PerkzZheng
Contributor

@PerkzZheng PerkzZheng commented Apr 2, 2026

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

Release Notes

  • Chores

    • Updated TRTLLM GEN FMHA artifact references and associated checksums used for download and verification.
  • Refactor

    • Improved kernel tile-shape handling for paged K/V cache and refined scaling-factor tensor layout to optimize TMA transfers and memory access patterns.
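The checksum update in the release notes above concerns verifying downloaded artifacts. As a minimal sketch of that kind of check, assuming a `checksums.txt` whose lines have the form `<sha256-hex>  <path>` (the file format and both helper names are assumptions for illustration, not FlashInfer's actual API):

```python
import hashlib


def parse_checksums(text: str) -> dict[str, str]:
    """Parse lines of the ASSUMED form '<sha256-hex>  <path>' into {path: digest}."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, path = parts
            table[path] = digest.lower()
    return table


def verify_artifact(data: bytes, path: str, checksums: dict[str, str]) -> bool:
    """Return True only if `path` is listed and its SHA256 digest matches `data`."""
    expected = checksums.get(path)
    if expected is None:
        return False  # unknown artifact: treat as unverified
    return hashlib.sha256(data).hexdigest() == expected
```

Under these assumptions, changing the artifact hash in a repository path would also require updating the expected digest, since the old digest no longer matches the new files.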

@coderabbitai
Contributor

coderabbitai Bot commented Apr 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e4a92f75-7840-4599-bfa0-615dbbd17d89

📥 Commits

Reviewing files that changed from the base of the PR and between 49f6f6732fd0a248ceef397741d513b4a8412d85 and 4885d5a.

📒 Files selected for processing (2)
  • flashinfer/artifacts.py
  • include/flashinfer/trtllm/fmha/kernelParams.h
🚧 Files skipped from review as they are similar to previous changes (2)
  • flashinfer/artifacts.py
  • include/flashinfer/trtllm/fmha/kernelParams.h

📝 Walkthrough

Updated TRTLLM_GEN_FMHA artifact repository hash and checksum in artifacts.py. Extended KernelParams TMA layout logic to compute and apply a conditional reshape factor for K/V and FP4 scaling-factor tensors to target ~128B-wide TMA tile boxes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Artifact Configuration (`flashinfer/artifacts.py`) | Updated ArtifactPath.TRTLLM_GEN_FMHA repository subdirectory hash and CheckSumHash.TRTLLM_GEN_FMHA SHA256 checksum, changing the remote checksums.txt lookup and expected artifact checksums. |
| Kernel Parameters / TMA reshaping (`include/flashinfer/trtllm/fmha/kernelParams.h`) | Added reshapeFactor parameter to TMA shape/stride helpers and logic in setKernelParams to compute canReshapeTmaKv, derive reshapeFactorKv (and reshapeFactorKvSf for FP4 SF), and reshape K/V and SF TMA shapes/strides to target ~128B-wide tile geometry, adjusting tileShapeKv accordingly. |
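To illustrate the idea of reshaping toward a ~128B-wide tile, here is a rough sketch. It assumes the reshape folds several consecutive, contiguous rows of a 2D K/V tile into one wider row; the real conditions behind canReshapeTmaKv are not shown in this thread, and the function names and the `elt_bits` parameter are hypothetical.

```python
TARGET_BOX_BYTES = 128  # target inner box width named in the walkthrough (~128B)


def reshape_factor(head_dim: int, elt_bits: int, rows_per_tile: int, contiguous: bool) -> int:
    """Return how many consecutive rows to fold into one wider row (1 = no reshape).

    ASSUMPTION: folding is only valid when rows are contiguous in memory and the
    row width evenly divides the target box width.
    """
    row_bytes = head_dim * elt_bits // 8
    if not contiguous or row_bytes <= 0:
        return 1
    if row_bytes >= TARGET_BOX_BYTES or TARGET_BOX_BYTES % row_bytes != 0:
        return 1
    factor = TARGET_BOX_BYTES // row_bytes
    # Halve until the factor divides the rows in a tile, so no row group is split.
    while factor > 1 and rows_per_tile % factor != 0:
        factor //= 2
    return factor


def reshape_tile(rows: int, cols: int, factor: int) -> tuple[int, int]:
    """Fold `factor` rows into one: (rows, cols) -> (rows / factor, cols * factor)."""
    return rows // factor, cols * factor
```

For example, under these assumptions a tile of 128 rows with a 64-element FP8 head dimension has 64-byte rows, so two rows fold into one 128-byte row and the tile becomes 64 x 128.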

Sequence Diagram(s)

(Skipped — changes are internal algorithmic updates and an artifact metadata update; no multi-component sequential flow requiring visualization.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

op: attention

Suggested reviewers

  • sricketts
  • bkryu
  • cyx-6
  • yzh119
  • yongwww
  • samuellees
  • saltyminty
  • nv-yunzheq
  • kahyunnam

Poem

🐰 In tiles of bytes I nudge and play,
I stretch K/V to fit the TMA way,
Hashes hop to a brand-new lane,
SFs shrink and join the frame,
A rabbit cheers for shapes made sane.

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (3 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title references reverting a 'blackwell ultra optimization that causes deadlocks,' but the actual changes update FMHA artifact hashes and extend kernel parameters with reshape factors for TMA tensors, which appears unrelated to reverting or fixing deadlock issues. | Clarify the title to accurately describe the actual changes: either update to reflect the artifact hash updates and kernel parameter extensions, or provide context explaining how these changes revert the problematic optimization. |
| Description check | ⚠️ Warning | The PR description contains only the repository's template with pre-commit and test checkboxes marked as complete, but lacks any substantive explanation of the changes, rationale for the revert, or how the modifications address the deadlock issue. | Add a detailed description explaining why the optimization causes deadlocks, what the changes accomplish, and how they resolve the issue, particularly clarifying the relationship between artifact updates and kernel parameter modifications. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates artifact paths and introduces TMA box widening logic for K/V and Scale Factor (SF) tensors to optimize memory access. Critical issues were identified in the reshaping logic: the current implementation fails to update the underlying tensor shapes and strides to match the widened tiles, which will cause runtime errors. Furthermore, the Scale Factor reshaping calculation lacks robustness and could exceed the 128-byte TMA box width limit, leading to hardware-level failures.

Comment thread include/flashinfer/trtllm/fmha/kernelParams.h Outdated
Comment thread include/flashinfer/trtllm/fmha/kernelParams.h
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@include/flashinfer/trtllm/fmha/kernelParams.h`:
- Around line 755-763: The reshapeFactorKvSf calculation can exceed the SF
descriptor’s pre-reshape column count causing tileShapeKvSf[1] to become zero;
compute the original SF column count (origSfCols = numKeysPerTile * maxHeadDimKv
/ 256, i.e., the unreshaped dim1) and cap reshapeFactorKvSf by that value and
then reduce it until it divides origSfCols cleanly (e.g., while(origSfCols %
reshapeFactorKvSf != 0) reshapeFactorKvSf /= 2 or choose the largest divisor <=
cap). Update the code around reshapeFactorKvSf and tileShapeKvSf so
reshapeFactorKvSf never exceeds origSfCols and always divides origSfCols, using
the existing symbols reshapeFactorKvSf, tileShapeKvSf, numKeysPerTile,
maxHeadDimKv, and NumEltsPerSf.
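The capping rule suggested in the inline comment above can be sketched as follows. This is only an illustration of the reviewer's proposal, not the code that was merged; the function name is hypothetical, and the `/ 256` column count is taken directly from the comment.

```python
def cap_sf_reshape_factor(requested: int, num_keys_per_tile: int, max_head_dim_kv: int) -> int:
    """Clamp a requested SF reshape factor so it never exceeds, and always divides,
    the original (pre-reshape) SF column count."""
    # Column count formula as quoted in the review comment (assumed, not verified).
    orig_sf_cols = num_keys_per_tile * max_head_dim_kv // 256
    # Never exceed the original column count, and never drop below 1.
    factor = max(1, min(requested, orig_sf_cols))
    # Halve until the factor divides the column count cleanly.
    while factor > 1 and orig_sf_cols % factor != 0:
        factor //= 2
    return factor
```

With this clamp, dividing the original column count by the factor can no longer yield zero, which is the failure mode the comment describes for tileShapeKvSf[1].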

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bd74778e-999b-441f-9172-9a9bbc63fa0c

📥 Commits

Reviewing files that changed from the base of the PR and between 637209a and 49f6f6732fd0a248ceef397741d513b4a8412d85.

📒 Files selected for processing (2)
  • flashinfer/artifacts.py
  • include/flashinfer/trtllm/fmha/kernelParams.h

Comment thread include/flashinfer/trtllm/fmha/kernelParams.h Outdated
@bkryu bkryu added the run-ci label Apr 2, 2026
@bkryu
Collaborator

bkryu commented Apr 2, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !492 has been created, and the CI pipeline #47554542 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #47554542: canceled

@bkryu
Collaborator

bkryu commented Apr 2, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !492 has been created, and the CI pipeline #47568921 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47568921: 5/20 passed

@PerkzZheng PerkzZheng force-pushed the user/perkzz/fix-b300 branch from 49f6f67 to 4885d5a Compare April 3, 2026 03:08
@PerkzZheng
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !492 has been updated with latest changes, and the CI pipeline #47598168 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47598168: 11/20 passed

Collaborator

@bkryu bkryu left a comment

CI results look good to me.

I can also confirm that, locally on a B300, test_trtllm_gen_attention.py, which used to hang, now passes with:

40328 passed, 54352 skipped in 2540.62s (0:42:20)

@bkryu bkryu merged commit 5758837 into flashinfer-ai:main Apr 3, 2026
30 of 32 checks passed
aleozlx added a commit that referenced this pull request Apr 3, 2026
…evert blackwell ultra optimization that causes deadlocks); bump version to 0.6.7.post2

fix: [Fmha] revert blackwell ultra optimization that causes deadlocks. (#2956)

<!-- .github/pull_request_template.md -->

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

<!-- Link any related issues here -->

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

* **Chores**
* Updated TRTLLM GEN FMHA artifact references and associated checksums
used for download and verification.

* **Refactor**
* Improved kernel tile-shape handling for paged K/V cache and refined
scaling-factor tensor layout to optimize TMA transfers and memory access
patterns.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@aleozlx aleozlx linked an issue Apr 7, 2026 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

[Bug] TRTLLM attention hangs on GB300 (SM103) with FlashInfer 0.6.7

4 participants