[Fmha] support nvfp4 output keepsMmaAb generation kernels #2988

Merged
bkryu merged 3 commits into flashinfer-ai:main from PerkzZheng:user/perkzz/add-e2m1-output
Apr 8, 2026

Conversation

@PerkzZheng
Contributor

@PerkzZheng PerkzZheng commented Apr 6, 2026

Same as #2795; recreated because the original branch was force-pushed.

  • Update cubin artifact path/checksum to new build with nvfp4 output support
  • Fix kernel selection: remove E2M1 output dtype condition from mixed-precision path, allowing nvfp4 output to use GQA generation kernel selection heuristics
  • Always invoke selectTileSizeQForGqaGeneration (not just for maxSeqLenQ > 1)
  • Add mUsesSharedPagedKvIdx field to KernelParams for vLLM/FlashInfer paged KV index
  • Remove speculative-decode skip for nvfp4 output in tests
  • Expand test coverage: head_dim [64, 128, 256], additional batch configs
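The kernel-selection change in the second bullet can be sketched as follows. This is a hedged illustration only: the real logic is C++ in fmhaKernels.cuh, and the function and dtype names here are hypothetical stand-ins for the conditions the PR describes.

```python
def takes_gqa_generation_path(dtype_q: str, dtype_kv: str, dtype_out: str) -> bool:
    """Sketch of the relaxed gate: does a config use the GQA generation
    kernel selection heuristics? (Illustrative names, not FlashInfer API.)"""
    mixed_precision_inputs = dtype_q != dtype_kv
    # Old gate (per the PR description) additionally checked
    # dtype_out == "e2m1", so nvfp4 output was diverted away from the
    # GQA heuristics. The fix drops that clause:
    return not mixed_precision_inputs

# nvfp4 (e2m1) output alone no longer excludes the GQA generation path.
print(takes_gqa_generation_path("e4m3", "e4m3", "e2m1"))  # True
```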

AI-assisted

📌 Description

Qwen3-480B (num_qo_heads=96, num_kv_heads=8, head_dim_qk=128, head_dim_vo=128)

Speedup (baseline / opt)

| s_qo | bs=8  | bs=16 | bs=32 | bs=40 | bs=64 |
|------|-------|-------|-------|-------|-------|
| 2    | 1.23x | 1.32x | 1.26x | 1.15x | 1.28x |
| 4    | 2.21x | 2.49x | 2.16x | 1.89x | 1.93x |
| 8    | 3.41x | 3.32x | 3.00x | 2.81x | 2.94x |

GPT-OSS (num_qo_heads=64, num_kv_heads=8, head_dim_qk=64, head_dim_vo=64)

Speedup (baseline / opt)

| s_qo | bs=8  | bs=16 | bs=32 | bs=40 | bs=64 |
|------|-------|-------|-------|-------|-------|
| 2    | 1.65x | 1.78x | 1.77x | 1.52x | 1.62x |
| 4    | 2.50x | 2.65x | 2.41x | 2.06x | 2.41x |
| 8    | 4.79x | 5.05x | 5.12x | 4.45x | 4.98x |

Summary

Speedup scales strongly with s_qo (speculative decode query length):

  • At s_qo=2: 1.1–1.8x speedup across both models
  • At s_qo=4: 1.9–2.6x speedup
  • At s_qo=8: 2.8–5.1x speedup (peak 5.12x on GPT-OSS, bs=32)

🔍 Related Issues

#2632

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Prevented a division-by-zero edge case in grouped-query attention token grouping by enforcing a minimum group size.
    • Adjusted kernel selection logic so mixed-precision cases no longer force alternative kernel/tile choices.
  • Tests

    • Expanded generation-attention tests to cover additional head dimensions and speculative decode scenarios.
    • Made output data type explicit in test planning and refined backend-specific test skips.

- Add o_data_type to plan_params in _test_trtllm_batch_prefill and
  _test_trtllm_batch_decode to properly test output dtype selection
- Remove head_dim=64 from test_trtllm_batch_decode and
  test_trtllm_batch_decode_spec parametrize lists due to buggy Sm100f
  SwapsAbForGen cubins in the current artifact store
- Fix std::max(1, ...) guard in fmhaKernels.cuh to avoid
  numTokensPerCtaQ=0 when mStepQ < mNumHeadsQPerKv
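The std::max(1, ...) guard above can be sketched in Python, assuming the unclamped value is the integer division of mStepQ by mNumHeadsQPerKv (the real code is C++ in fmhaKernels.cuh; names here mirror the fields mentioned but the exact formula is an assumption):

```python
def ceil_div(a: int, b: int) -> int:
    """Round-up integer division, as used downstream of the clamped value."""
    return (a + b - 1) // b

def num_tokens_per_cta_q(step_q: int, num_heads_q_per_kv: int) -> int:
    # Without the clamp, step_q < num_heads_q_per_kv yields 0, and any
    # later ceil_div by this value would divide by zero.
    return max(1, step_q // num_heads_q_per_kv)

# e.g. step_q=2, num_heads_q_per_kv=8 would be 0 without the guard.
print(num_tokens_per_cta_q(2, 8))  # 1
```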
@coderabbitai
Contributor

coderabbitai bot commented Apr 6, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Clamps CTA token grouping to avoid zero-valued division and relaxes a dtype gate during GQA kernel selection; expands trtllm generation tests to explicitly set output dtype and add head_dim=256 coverage for decode/speculative cases.

Changes

| Cohort / File(s) | Summary |
|---|---|
| FMHA Kernel Configuration: include/flashinfer/trtllm/fmha/fmhaKernels.cuh | Clamp numTokensPerCtaQ to at least 1 in computeCtaAndClusterConfig to avoid zero/ceil_div issues; remove mDtypeOut == DATA_TYPE_E2M1 from the selectGqGenerationKernel early-return gating so DATA_TYPE_E2M1 no longer forces alternate kernel/tileSizeQ selection. |
| TRT-LLM Generation Tests: tests/attention/test_trtllm_gen_attention.py | Set plan_params["o_data_type"] explicitly in the prefill/decode wrappers, remove the previous nvfp4 skip for longer decode cases, and extend parametrization to include head_dim=256 across decode and speculative-decode test matrices (with an xqa/backend-specific skip). |
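The backend-specific skip mentioned above can be sketched as a small predicate over the test matrix. This is an illustration only; the names and the exact skip condition (XQA apparently lacking head_dim=256 support, per the review thread) are assumptions, not the actual FlashInfer test code.

```python
def should_skip(backend: str, head_dim: int) -> bool:
    # Per the review thread, XQA appears unsupported for head_dim=256
    # (internal CI failures on SM120), so that combination is skipped.
    return backend == "xqa" and head_dim == 256

# Hypothetical parametrization grid mirroring the expanded coverage:
matrix = [(b, d) for b in ("trtllm-gen", "xqa") for d in (64, 128, 256)]
runnable = [(b, d) for b, d in matrix if not should_skip(b, d)]
print(runnable)  # 5 of 6 combinations survive the skip
```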


Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • saltyminty
  • yyihuang
  • aleozlx
  • yzh119
  • cyx-6
  • sricketts
  • bkryu
  • nvmbreughe

Poem

🐇 A tiny clamp, a subtle change,
No zero quirks will roam the range.
Kernels pick their paths anew,
Tests now check head_dim two-fifty-six too.
Hop, review, and let builds sing!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 25.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅: the title clearly identifies the main change (adding nvfp4 output support to keepsMmaAb generation kernels), which aligns with the code modifications and test expansions in the changeset.
  • Description check ✅: the description includes all required template sections (Description, Related Issues, Pre-commit Checks, Tests) with implementation details, performance benchmarks, and test coverage information. All checklist items are marked complete.


@PerkzZheng
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !509 has been created, and the CI pipeline #47805888 is currently running. I'll report back once the pipeline job completes.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the FMHA kernels to ensure at least one token per CTA during calculation and relaxes constraints on mixed precision kernels by removing the specific check for E2M1 output data types. The test suite is updated to include output data type configuration in planning parameters, expand coverage for head dimensions of 256, and enable nvfp4 support for speculative decoding by removing previous skips. I have no further feedback to provide as the review comments were purely evaluative.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47805888: 9/20 passed

@PerkzZheng
Contributor Author

@saltyminty sorry for pinging you again. It seems that all related tests have passed (https://gitlab-master.nvidia.com/dl/flashinfer/flashinfer-ci/-/jobs/292875212). Feel free to merge if everything looks good. Thank you!

@johnnynunez
Contributor

@aleozlx please take a look; this is a request from the Red Hat team (vLLM).

@bkryu bkryu added the run-ci label Apr 7, 2026
@bkryu
Collaborator

bkryu commented Apr 7, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !509 has been updated with latest changes, and the CI pipeline #47955317 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47955317: 8/20 passed

Collaborator

@bkryu bkryu left a comment


@PerkzZheng, please check the internal CI failures on SM120 cards on tests/attention/test_trtllm_gen_attention.py

I suspect it has to do with XQA on head size 256. If XQA is not supported, we need to fix on SM120

@PerkzZheng
Contributor Author

PerkzZheng commented Apr 8, 2026

> @PerkzZheng, please check the internal CI failures on SM120 cards on tests/attention/test_trtllm_gen_attention.py
>
> I suspect it has to do with XQA on head size 256. If XQA is not supported, we need to fix on SM120

Thanks. Let me skip XQA for headDim 256. cc @qsang-nv for visibility.

@PerkzZheng PerkzZheng force-pushed the user/perkzz/add-e2m1-output branch from 7fa1088 to e82d9ad Compare April 8, 2026 03:26
@PerkzZheng
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !509 has been updated with latest changes, and the CI pipeline #47983474 is currently running. I'll report back once the pipeline job completes.

Comment thread tests/attention/test_trtllm_gen_attention.py
Collaborator

@bkryu bkryu left a comment


CI now LGTM.
