Skip to content

fix: ensure each CTA processes full numHeadsQPerKv for trtllm decode kernel#2380

Merged
yzh119 merged 1 commit intoflashinfer-ai:mainfrom
dongjiyingdjy:fix_trtllm_decode
Jan 21, 2026
Merged

fix: ensure each CTA processes full numHeadsQPerKv for trtllm decode kernel#2380
yzh119 merged 1 commit intoflashinfer-ai:mainfrom
dongjiyingdjy:fix_trtllm_decode

Conversation

@dongjiyingdjy
Copy link
Copy Markdown
Contributor

@dongjiyingdjy dongjiyingdjy commented Jan 20, 2026

📌 Description

Skip candidates where kernelMeta.mStepQ < params.mNumHeadsQPerKv in GQA tile selection to avoid numTokensPerCtaQ=0, resulting divide-by-zero crash.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Performance
    • Refined GPU tile-size selection to enforce a lower bound on candidates, producing more consistent and often improved modeling time and resource utilization during GPU-accelerated operations.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Walkthrough

A filtering condition in the GQA tile-size heuristic is tightened: candidate tileSizeQ values must satisfy both tileSizeQ ≤ defaultTileSizeQ and tileSizeQ ≥ mNumHeadsQPerKv, changing which candidates are evaluated during tile-size selection.

Changes

Cohort / File(s) Summary
GQA Tile-Size Heuristic Constraint
include/flashinfer/trtllm/fmha/fmhaKernels.cuh
Added lower-bound check tileSizeQ >= mNumHeadsQPerKv to the candidate filtering condition alongside the existing upper-bound check, narrowing candidate tileSizeQ set and affecting modeling time evaluation and final tile selection.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I nibble code beneath the moonlight,

tiles trimmed to fit just right,
bounds aligned in tidy rows—
a quieter kernel softly grows.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check ✅ Passed The title accurately describes the main change: ensuring each CTA processes the full numHeadsQPerKv, which aligns with the filtering condition change that now requires tileSizeQ >= mNumHeadsQPerKv.
Description check ✅ Passed The pull request description provides a clear, focused explanation of the bug fix (avoiding divide-by-zero crash by skipping invalid candidates), with the required Description section completed and Related Issues/Reviewer Notes sections appropriately marked as empty.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@yzh119 yzh119 changed the title fix: ensure each CTA processes full numHeadsQPerKv for trtllm decode … fix: ensure each CTA processes full numHeadsQPerKv for trtllm decode kernel Jan 20, 2026
@yzh119
Copy link
Copy Markdown
Collaborator

yzh119 commented Jan 20, 2026

Hi @dongjiyingdjy would you mind explaning the context of this pull requests in the PR description?

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Jan 20, 2026

Caution

Docstrings generation - FAILED

No docstrings were generated.

@yzh119
Copy link
Copy Markdown
Collaborator

yzh119 commented Jan 20, 2026

/bot run

@flashinfer-bot
Copy link
Copy Markdown
Collaborator

GitLab MR !250 has been created, and the CI pipeline #42126090 is currently running. I'll report back once the pipeline job completes.

@yzh119 yzh119 added the v0.6.2 label Jan 20, 2026
@dongjiyingdjy
Copy link
Copy Markdown
Contributor Author

Hi @dongjiyingdjy would you mind explaning the context of this pull requests in the PR description?

This PR ensures that all Q heads within the same group are in the same CTA. The previous tile select strategy did not account for this, which could cause Q heads from a single group to be split across multiple CTAs, leading to incorrect results.

@yzh119 yzh119 merged commit 6e93b67 into flashinfer-ai:main Jan 21, 2026
14 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants