fix: ensure each CTA processes full numHeadsQPerKv for trtllm decode kernel by dongjiyingdjy · Pull Request #2380 · flashinfer-ai/flashinfer

dongjiyingdjy · 2026-01-20T08:46:02Z

📌 Description

Skip candidates where `kernelMeta.mStepQ < params.mNumHeadsQPerKv` during GQA tile selection. Such candidates yield `numTokensPerCtaQ = 0`, which leads to a divide-by-zero crash.
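The failure mode can be sketched as follows. This is a minimal illustration, not the code in `fmhaKernels.cuh`: it assumes that `numTokensPerCtaQ` comes from integer division of the tile's step size by the number of Q heads per KV head, and that the result is later used as a divisor. The helper names `isValidGqaCandidate` and `numCtasQ` are hypothetical.

```cpp
#include <cassert>

// Assumption: tokens covered by one CTA = stepQ / numHeadsQPerKv (integer division).
// When stepQ < numHeadsQPerKv this evaluates to 0.
inline int numTokensPerCtaQ(int stepQ, int numHeadsQPerKv) {
  assert(stepQ > 0 && numHeadsQPerKv > 0);
  return stepQ / numHeadsQPerKv;
}

// Hypothetical guard mirroring the skip condition in the description.
inline bool isValidGqaCandidate(int stepQ, int numHeadsQPerKv) {
  return stepQ >= numHeadsQPerKv;
}

// Hypothetical consumer: CTAs needed along Q for numTokensQ tokens.
// Returns -1 for an invalid candidate instead of dividing by zero.
inline int numCtasQ(int numTokensQ, int stepQ, int numHeadsQPerKv) {
  if (!isValidGqaCandidate(stepQ, numHeadsQPerKv)) {
    return -1;  // candidate skipped: tokens per CTA would be 0
  }
  const int tokensPerCta = numTokensPerCtaQ(stepQ, numHeadsQPerKv);
  return (numTokensQ + tokensPerCta - 1) / tokensPerCta;  // ceiling division
}
```

With 16 Q heads per KV head, a candidate with step 8 gives zero tokens per CTA and is rejected, while a step of 16 or more is accepted.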

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

Performance
- Refined GPU tile-size selection to enforce a lower bound on candidates, producing more consistent and often improved modeling time and resource utilization during GPU-accelerated operations.


coderabbitai · 2026-01-20T08:46:17Z

📝 Walkthrough

A filtering condition in the GQA tile-size heuristic is tightened: candidate tileSizeQ values must satisfy both tileSizeQ ≤ defaultTileSizeQ and tileSizeQ ≥ mNumHeadsQPerKv, changing which candidates are evaluated during tile-size selection.
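The tightened condition described above can be sketched as a standalone filter. This is only an illustration under stated assumptions: the function name `filterTileSizeQ` and the candidate values are hypothetical, and the real heuristic goes on to compare modeled times for the surviving candidates, which is omitted here.

```cpp
#include <cassert>
#include <vector>

// Keep only candidates that satisfy both bounds from the walkthrough:
//   tileSizeQ <= defaultTileSizeQ   (existing upper bound)
//   tileSizeQ >= numHeadsQPerKv     (lower bound added by this PR)
inline std::vector<int> filterTileSizeQ(const std::vector<int>& candidates,
                                        int defaultTileSizeQ,
                                        int numHeadsQPerKv) {
  assert(defaultTileSizeQ > 0 && numHeadsQPerKv > 0);
  std::vector<int> kept;
  for (int tileSizeQ : candidates) {
    if (tileSizeQ <= defaultTileSizeQ && tileSizeQ >= numHeadsQPerKv) {
      kept.push_back(tileSizeQ);
    }
  }
  return kept;
}
```

For example, with hypothetical candidates {8, 16, 32, 64, 128}, a default tile size of 64, and 16 Q heads per KV head, only 16, 32, and 64 remain. If the lower bound exceeds the upper bound, the result is empty and the caller has to fall back to the default.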

Changes

| Cohort / File(s) | Summary |
|---|---|
| GQA Tile-Size Heuristic Constraint `include/flashinfer/trtllm/fmha/fmhaKernels.cuh` | Added lower-bound check `tileSizeQ >= mNumHeadsQPerKv` to the candidate filtering condition alongside the existing upper-bound check, narrowing candidate tileSizeQ set and affecting modeling time evaluation and final tile selection. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I nibble code beneath the moonlight,

tiles trimmed to fit just right,
bounds aligned in tidy rows—
a quieter kernel softly grows.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The title accurately describes the main change: ensuring each CTA processes the full numHeadsQPerKv, which aligns with the filtering condition change that now requires tileSizeQ >= mNumHeadsQPerKv. |
| Description check | ✅ Passed | The pull request description provides a clear, focused explanation of the bug fix (avoiding divide-by-zero crash by skipping invalid candidates), with the required Description section completed and Related Issues/Reviewer Notes sections appropriately marked as empty. |

yzh119 · 2026-01-20T09:27:57Z

Hi @dongjiyingdjy, would you mind explaining the context of this pull request in the PR description?

coderabbitai · 2026-01-20T09:43:30Z

Caution

Docstrings generation - FAILED

No docstrings were generated.

yzh119 · 2026-01-20T18:40:54Z

/bot run

flashinfer-bot · 2026-01-20T18:41:23Z

GitLab MR !250 has been created, and the CI pipeline #42126090 is currently running. I'll report back once the pipeline job completes.

dongjiyingdjy · 2026-01-21T03:39:43Z

> Hi @dongjiyingdjy, would you mind explaining the context of this pull request in the PR description?

This PR ensures that all Q heads within the same group are in the same CTA. The previous tile select strategy did not account for this, which could cause Q heads from a single group to be split across multiple CTAs, leading to incorrect results.
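A rough way to see when a head group gets split is to count how many CTAs one group spans. The sketch below assumes the Q heads of a group are laid out contiguously along the Q tile dimension, so the count is a ceiling division. Both helper names are hypothetical and do not come from the FlashInfer sources.

```cpp
#include <cassert>

// Hypothetical: number of CTAs spanned by the Q heads that share one KV head,
// assuming those heads are contiguous along the Q tile dimension.
inline int ctasPerHeadGroup(int tileSizeQ, int numHeadsQPerKv) {
  assert(tileSizeQ > 0 && numHeadsQPerKv > 0);
  return (numHeadsQPerKv + tileSizeQ - 1) / tileSizeQ;  // ceiling division
}

// A candidate keeps the whole group in one CTA only when the count is 1,
// which is equivalent to tileSizeQ >= numHeadsQPerKv.
inline bool groupFitsInOneCta(int tileSizeQ, int numHeadsQPerKv) {
  return ctasPerHeadGroup(tileSizeQ, numHeadsQPerKv) == 1;
}
```

Under this assumption, a tile size of 8 with 16 Q heads per KV head spreads the group over two CTAs, which is the case the new lower bound rules out.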

dongjiyingdjy requested review from IwakuraRein, aleozlx, jiahanc, joker-eph and yzh119 as code owners January 20, 2026 08:46

fix: ensure each CTA processes full numHeadsQPerKv for trtllm decode kernel

728252e

dongjiyingdjy force-pushed the fix_trtllm_decode branch from e4e6234 to 728252e Compare January 20, 2026 08:50

PerkzZheng approved these changes Jan 20, 2026

yzh119 changed the title ~~fix: ensure each CTA processes full numHeadsQPerKv for trtllm decode …~~ fix: ensure each CTA processes full numHeadsQPerKv for trtllm decode kernel Jan 20, 2026

yzh119 mentioned this pull request Jan 20, 2026

Fix: Skip invalid GQA tile candidates with stepQ < headsQPerKv. #2383

Closed

5 tasks

yzh119 approved these changes Jan 20, 2026

yzh119 added the v0.6.2 label Jan 20, 2026

yzh119 merged commit 6e93b67 into flashinfer-ai:main Jan 21, 2026
14 of 23 checks passed

This was referenced Mar 16, 2026

[Fmha] support nvfp4 output keepsMmaAb generation kernels #2795

Open

[Fmha] Sparse MLA decode kernel selection heuristics #2836

Merged

coderabbitai bot mentioned this pull request Mar 25, 2026

Allow BatchDecodeWithPagedKVCacheWrapper for GQA ratio 16 and 32 #2895

Open

5 tasks

yzh119 merged 1 commit into flashinfer-ai:main from dongjiyingdjy:fix_trtllm_decode