[Fmha] support nvfp4 output keepsMmaAb generation kernels #2988

Merged
bkryu merged 3 commits into flashinfer-ai:main from PerkzZheng:user/perkzz/add-e2m1-output
Apr 8, 2026

Conversation

@PerkzZheng
Contributor

@PerkzZheng PerkzZheng commented Apr 6, 2026

Same as #2795; recreated because the original branch was force-pushed.

  • Update cubin artifact path/checksum to new build with nvfp4 output support
  • Fix kernel selection: remove E2M1 output dtype condition from mixed-precision path, allowing nvfp4 output to use GQA generation kernel selection heuristics
  • Always invoke selectTileSizeQForGqaGeneration (not just for maxSeqLenQ > 1)
  • Add mUsesSharedPagedKvIdx field to KernelParams for vLLM/FlashInfer paged KV index
  • Remove speculative-decode skip for nvfp4 output in tests
  • Expand test coverage: head_dim [64, 128, 256], additional batch configs
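The kernel-selection change in the second bullet can be sketched as follows. This is a hedged illustration only: the real logic is C++ in fmhaKernels.cuh, and the function and dtype names here are hypothetical stand-ins for the conditions the PR describes.

```python
def takes_gqa_generation_path(dtype_q: str, dtype_kv: str, dtype_out: str) -> bool:
    """Sketch of the relaxed gate: does a config use the GQA generation
    kernel selection heuristics? (Illustrative names, not FlashInfer API.)"""
    mixed_precision_inputs = dtype_q != dtype_kv
    # Old gate (per the PR description) additionally checked
    # dtype_out == "e2m1", so nvfp4 output was diverted away from the
    # GQA heuristics. The fix drops that clause:
    return not mixed_precision_inputs

# nvfp4 (e2m1) output alone no longer excludes the GQA generation path.
print(takes_gqa_generation_path("e4m3", "e4m3", "e2m1"))  # True
```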

AI-assisted

📌 Description

Qwen3-480B (num_qo_heads=96, num_kv_heads=8, head_dim_qk=128, head_dim_vo=128)

Speedup (baseline / opt)

| s_qo | bs=8  | bs=16 | bs=32 | bs=40 | bs=64 |
|------|-------|-------|-------|-------|-------|
| 2    | 1.23x | 1.32x | 1.26x | 1.15x | 1.28x |
| 4    | 2.21x | 2.49x | 2.16x | 1.89x | 1.93x |
| 8    | 3.41x | 3.32x | 3.00x | 2.81x | 2.94x |

GPT-OSS (num_qo_heads=64, num_kv_heads=8, head_dim_qk=64, head_dim_vo=64)

Speedup (baseline / opt)

| s_qo | bs=8  | bs=16 | bs=32 | bs=40 | bs=64 |
|------|-------|-------|-------|-------|-------|
| 2    | 1.65x | 1.78x | 1.77x | 1.52x | 1.62x |
| 4    | 2.50x | 2.65x | 2.41x | 2.06x | 2.41x |
| 8    | 4.79x | 5.05x | 5.12x | 4.45x | 4.98x |

Summary

Speedup scales strongly with s_qo (speculative decode query length):

  • At s_qo=2: 1.1–1.8x speedup across both models
  • At s_qo=4: 1.9–2.6x speedup
  • At s_qo=8: 2.8–5.1x speedup (peak 5.12x on GPT-OSS, bs=32)

🔍 Related Issues

#2632

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Prevented a division-by-zero edge case in grouped-query attention token grouping by enforcing a minimum group size.
    • Adjusted kernel selection logic so mixed-precision cases no longer force alternative kernel/tile choices.
  • Tests

    • Expanded generation-attention tests to cover additional head dimensions and speculative decode scenarios.
    • Made output data type explicit in test planning and refined backend-specific test skips.

- Add o_data_type to plan_params in _test_trtllm_batch_prefill and
  _test_trtllm_batch_decode to properly test output dtype selection
- Remove head_dim=64 from test_trtllm_batch_decode and
  test_trtllm_batch_decode_spec parametrize lists due to buggy Sm100f
  SwapsAbForGen cubins in the current artifact store
- Fix std::max(1, ...) guard in fmhaKernels.cuh to avoid
  numTokensPerCtaQ=0 when mStepQ < mNumHeadsQPerKv
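The std::max(1, ...) guard above can be sketched in Python, assuming the unclamped value is the integer division of mStepQ by mNumHeadsQPerKv (the real code is C++ in fmhaKernels.cuh; names here mirror the fields mentioned but the exact formula is an assumption):

```python
def ceil_div(a: int, b: int) -> int:
    """Round-up integer division, as used downstream of the clamped value."""
    return (a + b - 1) // b

def num_tokens_per_cta_q(step_q: int, num_heads_q_per_kv: int) -> int:
    # Without the clamp, step_q < num_heads_q_per_kv yields 0, and any
    # later ceil_div by this value would divide by zero.
    return max(1, step_q // num_heads_q_per_kv)

# e.g. step_q=2, num_heads_q_per_kv=8 would be 0 without the guard.
print(num_tokens_per_cta_q(2, 8))  # 1
```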
@coderabbitai
Contributor

coderabbitai bot commented Apr 6, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Clamps CTA token grouping to avoid zero-valued division and relaxes a dtype gate during GQA kernel selection; expands trtllm generation tests to explicitly set output dtype and add head_dim=256 coverage for decode/speculative cases.

Changes

| Cohort / File(s) | Summary |
|---|---|
| FMHA Kernel Configuration: include/flashinfer/trtllm/fmha/fmhaKernels.cuh | Clamp numTokensPerCtaQ to at least 1 in computeCtaAndClusterConfig to avoid zero/ceil_div issues; remove mDtypeOut == DATA_TYPE_E2M1 from the selectGqGenerationKernel early-return gating so DATA_TYPE_E2M1 no longer forces alternate kernel/tileSizeQ selection. |
| TRT-LLM Generation Tests: tests/attention/test_trtllm_gen_attention.py | Set plan_params["o_data_type"] explicitly in the prefill/decode wrappers, remove the previous nvfp4 skip for longer decode cases, and extend parametrization to include head_dim=256 across decode and speculative-decode test matrices (with an xqa/backend-specific skip). |
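The backend-specific skip mentioned above can be sketched as a small predicate over the test matrix. This is an illustration only; the names and the exact skip condition (XQA apparently lacking head_dim=256 support, per the review thread) are assumptions, not the actual FlashInfer test code.

```python
def should_skip(backend: str, head_dim: int) -> bool:
    # Per the review thread, XQA appears unsupported for head_dim=256
    # (internal CI failures on SM120), so that combination is skipped.
    return backend == "xqa" and head_dim == 256

# Hypothetical parametrization grid mirroring the expanded coverage:
matrix = [(b, d) for b in ("trtllm-gen", "xqa") for d in (64, 128, 256)]
runnable = [(b, d) for b, d in matrix if not should_skip(b, d)]
print(runnable)  # 5 of 6 combinations survive the skip
```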


Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • saltyminty
  • yyihuang
  • aleozlx
  • yzh119
  • cyx-6
  • sricketts
  • bkryu
  • nvmbreughe

Poem

🐇 A tiny clamp, a subtle change,
No zero quirks will roam the range.
Kernels pick their paths anew,
Tests now check head_dim two-fifty-six too.
Hop, review, and let builds sing!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 25.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅: the title clearly identifies the main change (adding nvfp4 output support to keepsMmaAb generation kernels), which aligns with the code modifications and test expansions in the changeset.
  • Description check ✅: the description includes all required template sections (Description, Related Issues, Pre-commit Checks, Tests) with implementation details, performance benchmarks, and test coverage information. All checklist items are marked complete.


@PerkzZheng
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !509 has been created, and the CI pipeline #47805888 is currently running. I'll report back once the pipeline job completes.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the FMHA kernels to ensure at least one token per CTA during calculation and relaxes constraints on mixed precision kernels by removing the specific check for E2M1 output data types. The test suite is updated to include output data type configuration in planning parameters, expand coverage for head dimensions of 256, and enable nvfp4 support for speculative decoding by removing previous skips. I have no further feedback to provide as the review comments were purely evaluative.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47805888: 9/20 passed

@PerkzZheng
Contributor Author

@saltyminty sorry for pinging you again. It seems that all related tests have passed (https://gitlab-master.nvidia.com/dl/flashinfer/flashinfer-ci/-/jobs/292875212). Feel free to merge if everything looks good. Thank you!

@johnnynunez
Contributor

@aleozlx please take a look; this is a request from the Red Hat team (vLLM).

@bkryu bkryu added the run-ci label Apr 7, 2026
@bkryu
Collaborator

bkryu commented Apr 7, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !509 has been updated with latest changes, and the CI pipeline #47955317 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47955317: 8/20 passed

Collaborator

@bkryu bkryu left a comment


@PerkzZheng, please check the internal CI failures on SM120 cards on tests/attention/test_trtllm_gen_attention.py

I suspect it has to do with XQA on head size 256. If XQA is not supported, we need to fix on SM120

@PerkzZheng
Contributor Author

PerkzZheng commented Apr 8, 2026

> @PerkzZheng, please check the internal CI failures on SM120 cards on tests/attention/test_trtllm_gen_attention.py
>
> I suspect it has to do with XQA on head size 256. If XQA is not supported, we need to fix on SM120

Thanks. Let me skip XQA for headDim 256. cc @qsang-nv for visibility.

@PerkzZheng PerkzZheng force-pushed the user/perkzz/add-e2m1-output branch from 7fa1088 to e82d9ad Compare April 8, 2026 03:26
@PerkzZheng
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !509 has been updated with latest changes, and the CI pipeline #47983474 is currently running. I'll report back once the pipeline job completes.

Comment thread tests/attention/test_trtllm_gen_attention.py
Collaborator

@bkryu bkryu left a comment


CI now LGTM.
