[Fmha] revert blackwell ultra optimization that causes deadlocks. by PerkzZheng · Pull Request #2956 · flashinfer-ai/flashinfer

PerkzZheng · 2026-04-02T09:30:07Z

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes

Summary by CodeRabbit

Release Notes

Chores
- Updated TRTLLM GEN FMHA artifact references and associated checksums used for download and verification.
Refactor
- Improved kernel tile-shape handling for paged K/V cache and refined scaling-factor tensor layout to optimize TMA transfers and memory access patterns.

coderabbitai · 2026-04-02T09:30:21Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e4a92f75-7840-4599-bfa0-615dbbd17d89

📥 Commits

Reviewing files that changed from the base of the PR and between 49f6f6732fd0a248ceef397741d513b4a8412d85 and 4885d5a.

📒 Files selected for processing (2)

flashinfer/artifacts.py
include/flashinfer/trtllm/fmha/kernelParams.h

🚧 Files skipped from review as they are similar to previous changes (2)

flashinfer/artifacts.py
include/flashinfer/trtllm/fmha/kernelParams.h

📝 Walkthrough

Walkthrough

Updated TRTLLM_GEN_FMHA artifact repository hash and checksum in artifacts.py. Extended KernelParams TMA layout logic to compute and apply a conditional reshape factor for K/V and FP4 scaling-factor tensors to target ~128B-wide TMA tile boxes.
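To illustrate what the checksum half of this change affects, here is a minimal sketch of verifying downloaded bytes against an expected SHA256 digest. The helper name `verify_sha256` and the sample payload are hypothetical; this is not the actual lookup code in `flashinfer/artifacts.py`, which is not shown on this page.

```python
import hashlib


def verify_sha256(data: bytes, expected_hex: str) -> bool:
    """Return True if the SHA256 digest of `data` matches `expected_hex`."""
    actual = hashlib.sha256(data).hexdigest()
    return actual == expected_hex.lower()


# Hypothetical payload standing in for a downloaded artifact file.
payload = b"example artifact contents"
expected = hashlib.sha256(payload).hexdigest()

ok = verify_sha256(payload, expected)               # digest matches
tampered = verify_sha256(payload + b"x", expected)  # digest no longer matches
```

Under this scheme, whenever the artifact directory hash changes, the expected digest must change with it, which is presumably why both constants are updated together.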

Changes

| Cohort / File(s) | Summary |
|---|---|
| Artifact Configuration `flashinfer/artifacts.py` | Updated `ArtifactPath.TRTLLM_GEN_FMHA` repository subdirectory hash and `CheckSumHash.TRTLLM_GEN_FMHA` SHA256 checksum, changing the remote `checksums.txt` lookup and expected artifact checksums. |
| Kernel Parameters / TMA reshaping `include/flashinfer/trtllm/fmha/kernelParams.h` | Added `reshapeFactor` parameter to TMA shape/stride helpers and logic in `setKernelParams` to compute `canReshapeTmaKv`, derive `reshapeFactorKv` (and `reshapeFactorKvSf` for FP4 SF), and reshape K/V and SF TMA shapes/strides to target ~128B-wide tile geometry, adjusting `tileShapeKv` accordingly. |
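The reshape idea in the summary above can be sketched as follows. This is an illustration under assumptions, not the logic in `kernelParams.h`: the function names, the exact eligibility conditions, and the (columns, rows) tile convention are all hypothetical. The only detail taken from the summary is the roughly 128-byte target box width.

```python
TARGET_BOX_BITS = 128 * 8  # ~128-byte box width named in the summary


def reshape_factor(head_dim: int, elt_bits: int, num_keys_per_tile: int) -> int:
    """Number of rows to fold into one so a row spans the target width.

    Assumed conditions (hypothetical): the row must be narrower than the
    target, divide it evenly, and the tile must hold whole groups of rows.
    """
    row_bits = head_dim * elt_bits
    if row_bits >= TARGET_BOX_BITS or TARGET_BOX_BITS % row_bits != 0:
        return 1
    factor = TARGET_BOX_BITS // row_bits
    return factor if num_keys_per_tile % factor == 0 else 1


def reshape_tile(tile_cols: int, tile_rows: int, factor: int) -> tuple[int, int]:
    """Widen the tile by `factor` columns and shrink its row count to match."""
    return tile_cols * factor, tile_rows // factor


# 64 one-byte elements per row is 64 bytes, so two rows fold into one.
factor_fp8 = reshape_factor(head_dim=64, elt_bits=8, num_keys_per_tile=128)
tile_fp8 = reshape_tile(64, 128, factor_fp8)
```

The element count in the tile is unchanged by the reshape; only its aspect ratio moves toward the target width.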

Sequence Diagram(s)

(Skipped — changes are internal algorithmic updates and an artifact metadata update; no multi-component sequential flow requiring visualization.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

read real strides for kv and block scale #2844 — Touches makeTmaShapeStrideKv / makeTmaShapeStrideKvSf and related TMA stride handling (closely related changes to reshape logic).
[Fmha] Sparse MLA decode kernel selection heuristics #2836 — Updates the same TRTLLM_GEN_FMHA artifact entry and checksum in artifacts.py.
misc: Update artifacts docstring and MetaInfoHash #1967 — Modifies the ArtifactPath/CheckSumHash symbols for TRTLLM_GEN_FMHA used by checksum lookup.

Suggested labels

op: attention

Suggested reviewers

sricketts
bkryu
cyx-6
yzh119
yongwww
samuellees
saltyminty
nv-yunzheq
kahyunnam

Poem

🐰 In tiles of bytes I nudge and play,
I stretch K/V to fit the TMA way,
Hashes hop to a brand-new lane,
SFs shrink and join the frame,
A rabbit cheers for shapes made sane.

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (3 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title references reverting a 'blackwell ultra optimization that causes deadlocks,' but the actual changes update FMHA artifact hashes and extend kernel parameters with reshape factors for TMA tensors, which appears unrelated to reverting or fixing deadlock issues. | Clarify the title to accurately describe the actual changes: either update to reflect the artifact hash updates and kernel parameter extensions, or provide context explaining how these changes revert the problematic optimization. |
| Description check | ⚠️ Warning | The PR description contains only the repository's template with pre-commit and test checkboxes marked as complete, but lacks any substantive explanation of the changes, rationale for the revert, or how the modifications address the deadlock issue. | Add a detailed description explaining why the optimization causes deadlocks, what the changes accomplish, and how they resolve the issue, particularly clarifying the relationship between artifact updates and kernel parameter modifications. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |


gemini-code-assist

Code Review

This pull request updates artifact paths and introduces TMA box widening logic for K/V and Scale Factor (SF) tensors to optimize memory access. Critical issues were identified in the reshaping logic: the current implementation fails to update the underlying tensor shapes and strides to match the widened tiles, which will cause runtime errors. Furthermore, the Scale Factor reshaping calculation lacks robustness and could exceed the 128-byte TMA box width limit, leading to hardware-level failures.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@include/flashinfer/trtllm/fmha/kernelParams.h`:
- Around line 755-763: The reshapeFactorKvSf calculation can exceed the SF
descriptor’s pre-reshape column count causing tileShapeKvSf[1] to become zero;
compute the original SF column count (origSfCols = numKeysPerTile * maxHeadDimKv
/ 256, i.e., the unreshaped dim1) and cap reshapeFactorKvSf by that value and
then reduce it until it divides origSfCols cleanly (e.g., while(origSfCols %
reshapeFactorKvSf != 0) reshapeFactorKvSf /= 2 or choose the largest divisor <=
cap). Update the code around reshapeFactorKvSf and tileShapeKvSf so
reshapeFactorKvSf never exceeds origSfCols and always divides origSfCols, using
the existing symbols reshapeFactorKvSf, tileShapeKvSf, numKeysPerTile,
maxHeadDimKv, and NumEltsPerSf.
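The capping rule suggested in this comment can be sketched as below. This is a minimal illustration of the reviewer's proposal only; the function name `cap_reshape_factor` is hypothetical, and the page does not show whether or how the merged code adopted the suggestion.

```python
def cap_reshape_factor(reshape_factor: int, orig_sf_cols: int) -> int:
    """Clamp the factor to the column count, then halve until it divides it."""
    if reshape_factor < 1 or orig_sf_cols < 1:
        raise ValueError("reshape_factor and orig_sf_cols must be positive")
    factor = min(reshape_factor, orig_sf_cols)
    # Terminates: the factor eventually reaches 1, which divides everything.
    while orig_sf_cols % factor != 0:
        factor //= 2
    return factor


# Column count as given in the comment: numKeysPerTile * maxHeadDimKv / 256.
num_keys_per_tile, max_head_dim_kv = 128, 64
orig_sf_cols = num_keys_per_tile * max_head_dim_kv // 256

capped = cap_reshape_factor(64, orig_sf_cols)
```

With the factor guaranteed to divide the original column count, dividing a tile dimension by it can no longer round down to zero, which is the failure the comment describes for `tileShapeKvSf[1]`.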


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bd74778e-999b-441f-9172-9a9bbc63fa0c

📥 Commits

Reviewing files that changed from the base of the PR and between 637209a and 49f6f6732fd0a248ceef397741d513b4a8412d85.

📒 Files selected for processing (2)

flashinfer/artifacts.py
include/flashinfer/trtllm/fmha/kernelParams.h

bkryu · 2026-04-02T16:45:35Z

/bot run

flashinfer-bot · 2026-04-02T16:46:20Z

GitLab MR !492 has been created, and the CI pipeline #47554542 is currently running. I'll report back once the pipeline job completes.

flashinfer-bot · 2026-04-02T18:56:20Z

[CANCELING] Pipeline #47554542: canceled

bkryu · 2026-04-02T19:52:39Z

/bot run

flashinfer-bot · 2026-04-02T19:53:17Z

GitLab MR !492 has been created, and the CI pipeline #47568921 is currently running. I'll report back once the pipeline job completes.

flashinfer-bot · 2026-04-03T01:10:52Z

[FAILED] Pipeline #47568921: 5/20 passed

PerkzZheng · 2026-04-03T03:09:47Z

/bot run

flashinfer-bot · 2026-04-03T03:10:26Z

GitLab MR !492 has been updated with latest changes, and the CI pipeline #47598168 is currently running. I'll report back once the pipeline job completes.

flashinfer-bot · 2026-04-03T13:11:58Z

[FAILED] Pipeline #47598168: 11/20 passed

bkryu

CI results look good to me.

I can also confirm locally on a B300 that test_trtllm_gen_attention.py, which used to hang, now passes with:

40328 passed, 54352 skipped in 2540.62s (0:42:20)

…evert blackwell ultra optimization that causes deadlocks); bump version to 0.6.7.post2

fix: [Fmha] revert blackwell ultra optimization that causes deadlocks. (#2956)

PerkzZheng requested review from aleozlx, bkryu, cyx-6, jimmyzho, kahyunnam, nv-yunzheq, saltyminty, samuellees, sricketts, yongwww, yyihuang and yzh119 as code owners April 2, 2026 09:30

gemini-code-assist Bot reviewed Apr 2, 2026

Comment thread include/flashinfer/trtllm/fmha/kernelParams.h Outdated

Comment thread include/flashinfer/trtllm/fmha/kernelParams.h

coderabbitai Bot reviewed Apr 2, 2026

Comment thread include/flashinfer/trtllm/fmha/kernelParams.h Outdated

bkryu added the run-ci label Apr 2, 2026

Fridge003 mentioned this pull request Apr 2, 2026

[Bug] Any model hangs at high concurrency on (G)B300 (SM103) with TRTLLM attention sgl-project/sglang#21904

Closed

perf: [Fmha] Reshape TMA box for K, V and SFs so the width is 128B

4885d5a

PerkzZheng force-pushed the user/perkzz/fix-b300 branch from 49f6f67 to 4885d5a Compare April 3, 2026 03:08

bkryu mentioned this pull request Apr 3, 2026

[Bug] TRTLLM attention hangs on GB300 (SM103) with FlashInfer 0.6.7 #2939

Closed

yzh119 approved these changes Apr 3, 2026

bkryu approved these changes Apr 3, 2026

bkryu merged commit 5758837 into flashinfer-ai:main Apr 3, 2026
30 of 32 checks passed

coderabbitai Bot mentioned this pull request Apr 4, 2026

[Fmha] Add head_dim=512 support for trtllm attention kernels #2959

Merged

cjackal mentioned this pull request Apr 4, 2026

[NVIDIA] Update FlashInfer to version 0.6.7.post3. Avoid re-downloading BMM export headers when flashinfer-cubin is installed vllm-project/vllm#38913

Closed

aleozlx linked an issue Apr 7, 2026 that may be closed by this pull request

[Bug] TRTLLM attention hangs on GB300 (SM103) with FlashInfer 0.6.7 #2939

Closed

bai mentioned this pull request Apr 16, 2026

Update flashinfer to 0.6.8 vllm-project/vllm#39959

Merged

coderabbitai Bot mentioned this pull request May 7, 2026

Add dynamic tokens-per-page TRTLLM-GEN GQA kernels #3259

Merged

wzhao18 mentioned this pull request May 8, 2026

Trtllm-gen FP8 Sparse Attention Kernel has unusually bad performance at TP=2 #2797

Closed

coderabbitai Bot mentioned this pull request May 14, 2026

Update trtllm FMHA cubins #3317

Merged

4 participants
