Update trtllm FMHA cubins by djmmoss · Pull Request #3317 · flashinfer-ai/flashinfer

djmmoss · 2026-05-14T00:13:27Z

📌 Description

Updates the trtllm FMHA artifact path and checksum to the newer cubins. Aligns the FMHA parameter ABI expected by those cubins and uses dense mask selection for MLA decode generation kernels.

🔍 Related Issues

None.

🧪 Tests

pre-commit run --all-files
pytest tests/attention/test_cute_dsl_mla_decode.py::test_cute_dsl_vs_trtllm_gen[True-128-1] tests/attention/test_trtllm_ragged_kv_stride.py -q -ra --tb=short
Every fifth collected item from the tracked tests/attention/*.py files, after filtering with -k trtllm: 18314 passed, 22979 skipped, 342803 deselected, 686 warnings
H64 BF16 Q + FP8 KV ULP sweep: 7/7 PASS, all mismatches 0

Reviewer Notes

None.

coderabbitai · 2026-05-14T00:13:40Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 83c82cde-4a6c-4c46-a415-63e89f224208

📥 Commits

Reviewing files that changed from the base of the PR and between 9113c91 and a6b9087.

📒 Files selected for processing (3)

csrc/trtllm_fmha_kernel_launcher.cu
flashinfer/artifacts.py
include/flashinfer/trtllm/fmha/kernelParams.h

✅ Files skipped from review due to trivial changes (1)

flashinfer/artifacts.py

🚧 Files skipped from review as they are similar to previous changes (2)

csrc/trtllm_fmha_kernel_launcher.cu
include/flashinfer/trtllm/fmha/kernelParams.h

📝 Walkthrough

This PR: (1) uses an is_mla_decode flag to gate sparse-MLA and selects Dense vs Causal mask for MLA-decode in generation, (2) adjusts KernelParams TMA descriptor/ABI and adds K/V TMA reshape validation + persisted reshape factor, and (3) updates TRTLLM-FMHA artifact path and checksum.

Changes

TRTLLM FMHA MLA Decode and TMA Reshape Updates

| Layer / File(s) | Summary |
| --- | --- |
| MLA-decode detection and generation mask selection (`csrc/trtllm_fmha_kernel_launcher.cu`) | Introduced `is_mla_decode` for head-dim pairs (576/512 or 320/256); sparse-MLA gating now requires `is_mla_decode`; generation-mode `runner_params.mMaskType` is `Dense` when `is_mla_decode`, otherwise `Causal`. |
| KernelParams TMA descriptor & ABI layout changes (`include/flashinfer/trtllm/fmha/kernelParams.h`) | Removed sliding-window KV pool map, added `tmaOSf_`, added `mReservedAttentionWindowState[2]`, added `mReshapeFactorKv`, and removed `ptrSparseMlaTopKLens` field. |
| K/V TMA reshape validation and persisted reshape factor (`include/flashinfer/trtllm/fmha/kernelParams.h`) | Added templated helper `canUseTmaKvReshape(...)` that checks descriptor stride vs physical head-dim under element packing; `canReshapeTmaKv` now requires this check for K and V; computed `reshapeFactorKv` is stored in `params.mReshapeFactorKv`. |
| Artifact path and checksum update (`flashinfer/artifacts.py`) | Updated `ArtifactPath.TRTLLM_GEN_FMHA` artifactory subpath hash and `CheckSumHash.TRTLLM_GEN_FMHA` checksum to new values. |
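The first row of the summary can be illustrated with a minimal sketch. This is not the launcher's actual code: the enum `MaskType` and the helpers `isMlaDecode`, `selectGenerationMask`, and `allowSparseMla` are hypothetical names, and reading the pairs as (QK head dim, V head dim) is an assumption based on the wording "576/512 or 320/256".

```cpp
#include <cassert>

// Hypothetical stand-in for the launcher's mask enum; only the two values
// named in the walkthrough are modeled here.
enum class MaskType { Dense, Causal };

// Assumption: the pairs in the summary are (QK head dim, V head dim).
constexpr bool isMlaDecode(int headDimQk, int headDimV) {
  return (headDimQk == 576 && headDimV == 512) ||
         (headDimQk == 320 && headDimV == 256);
}

// Generation-mode mask choice as described: Dense for MLA decode,
// Causal for every other shape.
constexpr MaskType selectGenerationMask(int headDimQk, int headDimV) {
  return isMlaDecode(headDimQk, headDimV) ? MaskType::Dense : MaskType::Causal;
}

// Sparse MLA is gated on the shape being an MLA-decode shape.
constexpr bool allowSparseMla(bool sparseRequested, int headDimQk, int headDimV) {
  return sparseRequested && isMlaDecode(headDimQk, headDimV);
}
```

Keeping the shape test in one helper means the mask selection and the sparse-MLA gate cannot drift apart, which is also what the Gemini review's suggestion to deduplicate the MLA detection logic is aiming at.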

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

flashinfer-ai/flashinfer#2956: Modifies K/V TMA reshape logic and FMHA GEN artifact hashes (closely related kernelParams.h changes).
flashinfer-ai/flashinfer#3259: Also updates TRTLLM-GEN FMHA artifact constants and touches FMHA header/layout integration.
flashinfer-ai/flashinfer#2265: Changes generation branch mask selection in the TRTLLM FMHA launcher (related to mMaskType handling).

Suggested labels

run-ci, op: attention

Suggested reviewers

sricketts
aleozlx
yzh119
cyx-6
yongwww

Poem

🐇 A rabbit hops through kernel rows with glee,
It marks the MLA pairs and sets the mask to be,
Reshapes KV safely, stores the factor near,
Updates artifact hashes — the new builds cheer,
Hooray for tiny hops that keep inference clear.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change—updating TRTLLM FMHA cubins—which is the primary objective of the PR. |
| Description check | ✅ Passed | The PR description covers the main changes and includes a testing section with specific test results, but the optional pre-commit and unit test checklist items are not explicitly marked. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gemini-code-assist

Code Review

This pull request updates the TRT-LLM FMHA kernel launcher and parameters to align with newer cubin ABIs, including support for output scaling factors and refined TMA box reshaping logic. Key changes involve adding tmaOSf_ and mReshapeFactorKv to the KernelParams struct and implementing conditional mask type selection for MLA decode. Review feedback identifies a critical omission where the new tmaOSf_ descriptor is not initialized in setKernelParams, which could lead to undefined behavior. Additionally, it is recommended to refactor duplicated MLA detection logic for better maintainability and to use specific K/V data types when checking for TMA reshape compatibility to improve robustness.
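Why adding or removing a member matters for a precompiled cubin can be shown with a small layout sketch. The structs below are hypothetical and far smaller than the real `KernelParams`; `FakeTensorMap` is a stand-in blob (128 bytes, 64-byte aligned) and `mNumHeads` is an invented trailing field used only to show how offsets shift.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical opaque descriptor blob; not the CUDA driver type.
struct alignas(64) FakeTensorMap {
  unsigned char bytes[128];
};

// Layout a kernel built against the "old" ABI would read.
struct ParamsOld {
  FakeTensorMap tmaV;
  int32_t mNumHeads;  // invented field, for illustration only
};

// Layout after inserting a descriptor and a scalar ahead of that field.
struct ParamsNew {
  FakeTensorMap tmaV;
  FakeTensorMap tmaOSf;
  int32_t mReshapeFactorKv;
  int32_t mNumHeads;  // same field, now at a different byte offset
};

// The kernel reads raw bytes at fixed offsets, so host and device must agree.
static_assert(offsetof(ParamsOld, mNumHeads) == 128, "old offset");
static_assert(offsetof(ParamsNew, mNumHeads) == 260, "new offset");
```

A host struct that does not match the layout the cubin was compiled against would make the kernel read unrelated bytes for every field after the first mismatch, which is why the header and the artifact hash are updated in the same PR.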

gemini-code-assist · 2026-05-14T00:15:04Z

+  // TMA descriptor for V.
+  CUtensorMap tmaV_;
+  // TMA descriptor for output scaling factor.
+  CUtensorMap tmaOSf_;


The new tmaOSf_ member is added to the KernelParams struct to align with the newer cubin ABI, but it is not initialized in the setKernelParams function. If the newer cubins expect a valid TMA descriptor for output scaling factors (e.g., when performing FP4 quantization on output), this will lead to undefined behavior or crashes as the descriptor will be all zeros. Please add the necessary logic in setKernelParams to build the TMA descriptor for tmaOSf_ when options.oSfPtr is provided, similar to how tmaKSf_ and tmaVSf_ are handled.
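The pattern the comment asks for, populating the descriptor only when a scale-factor pointer is supplied and otherwise leaving it in a known zeroed state, might look roughly like the sketch below. Everything here is a stand-in: `FakeTensorMap`, `FakeOptions`, `FakeParams`, `encodeFakeDescriptor`, and `setOutputSfDescriptor` are hypothetical, and the real code would build the descriptor through the CUDA driver API rather than by copying a pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for the descriptor, options, and params types.
struct FakeTensorMap { unsigned char bytes[128]; };
struct FakeOptions { const void* oSfPtr; };
struct FakeParams { FakeTensorMap tmaOSf_; };

// Returns true when every byte of the descriptor is zero.
inline bool isAllZero(const FakeTensorMap& m) {
  for (std::size_t i = 0; i < sizeof(m.bytes); ++i) {
    if (m.bytes[i] != 0) return false;
  }
  return true;
}

// Placeholder for real descriptor encoding: sets a marker byte and stores
// the base pointer so the result is distinguishable from "unset".
inline FakeTensorMap encodeFakeDescriptor(const void* base) {
  FakeTensorMap m;
  std::memset(m.bytes, 0, sizeof(m.bytes));
  m.bytes[0] = 1;
  std::memcpy(m.bytes + 8, &base, sizeof(base));
  return m;
}

// Always zero the member first so it is never left indeterminate, then
// fill it in only when an output scale-factor pointer was provided.
inline bool setOutputSfDescriptor(const FakeOptions& options, FakeParams& params) {
  std::memset(&params.tmaOSf_, 0, sizeof(params.tmaOSf_));
  if (options.oSfPtr == nullptr) return false;
  params.tmaOSf_ = encodeFakeDescriptor(options.oSfPtr);
  return true;
}
```

Whether the newer cubins actually read this descriptor is not established in the thread, so the sketch only shows the defensive initialization, not a claim about kernel behavior.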

PerkzZheng · 2026-05-20

@djmmoss I am not quite sure why those changes were made. Please add me as a reviewer next time. Thanks!
No worries about that, though. I will revert them in my MR.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@include/flashinfer/trtllm/fmha/kernelParams.h`:
- Around line 703-706: The code block defining the canReshapeTmaKv boolean is
misformatted; run clang-format (e.g., pre-commit run clang-format --all-files)
to reformat this declaration so spacing and line breaks follow project style for
the canReshapeTmaKv initializer that references isPagedKv(options.mQkvLayout),
options.mHeadDimQk, swizzleKv, canUseTmaKvReshape(options,
kernelMeta.mDataTypeKv, /*isK*/ true/false), and ensure the file
include/flashinfer/trtllm/fmha/kernelParams.h is committed after formatting.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 12a74a50-4c9b-43b3-90f4-2ca34f9b744f

📥 Commits

Reviewing files that changed from the base of the PR and between f6d6bd4 and ac48b6e.

📒 Files selected for processing (3)

csrc/trtllm_fmha_kernel_launcher.cu
flashinfer/artifacts.py
include/flashinfer/trtllm/fmha/kernelParams.h

djmmoss · 2026-05-15T15:30:54Z

/bot run

flashinfer-bot · 2026-05-15T15:32:04Z

GitLab MR !677 has been created, and the CI pipeline #51413113 is currently running. I'll report back once the pipeline job completes.

djmmoss · 2026-05-16T01:12:13Z

All tests are green; can we get this one in? @yzh119 @aleozlx

jimmyzho

LGTM

djmmoss requested review from aleozlx, bkryu, cyx-6, jimmyzho, kahyunnam, nv-yunzheq, saltyminty, samuellees, sricketts, yongwww, yyihuang and yzh119 as code owners May 14, 2026 00:13

gemini-code-assist Bot reviewed May 14, 2026

coderabbitai Bot reviewed May 14, 2026

Comment thread include/flashinfer/trtllm/fmha/kernelParams.h Outdated

djmmoss force-pushed the dmoss/fmha-trtllm-public-cubins branch from ac48b6e to 9113c91 Compare May 14, 2026 00:27

djmmoss changed the title ~~Update TRTLLM FMHA public cubins~~ Update trtllm FMHA public cubins May 14, 2026

djmmoss changed the title ~~Update trtllm FMHA public cubins~~ Update trtllm FMHA cubins May 14, 2026

Update trtllm FMHA public cubins

a6b9087

Point FMHA at the newer public trtllm cubin publish, align the FMHA parameter ABI, and use dense mask selection for MLA decode kernels.

djmmoss force-pushed the dmoss/fmha-trtllm-public-cubins branch from 9113c91 to a6b9087 Compare May 14, 2026 23:55

djmmoss requested a review from dhiraj113 as a code owner May 14, 2026 23:55

jimmyzho approved these changes May 19, 2026

jimmyzho merged commit 9035311 into flashinfer-ai:main May 19, 2026
31 checks passed

coderabbitai Bot mentioned this pull request May 21, 2026

KV Split Oversubscription for Mixed Sequence Lengths #3379

Open

5 tasks
