[Minor] Reduce num blocks of qknorm in small batch size #2264
yzh119 merged 1 commit into flashinfer-ai:main
Conversation
📝 Walkthrough

This change optimizes kernel grid launching in a normalization operation by capping the block count at the computed minimum needed (based on batch size and number of heads) rather than always using the maximum available blocks, reducing over-subscription while keeping per-block computation unchanged.
Summary of Changes

Hello @DarkSharpness, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a performance optimization to the QKNorm kernel by limiting the number of CUDA blocks launched. By calculating the minimum required blocks based on the workload and capping the launch configuration, it significantly reduces launch overhead, especially for the small batch sizes frequently encountered during the decode phase of inference. This change aims to improve the overall efficiency and speed of operations within FlashInfer.
/bot run
Code Review

This pull request introduces a valuable optimization by reducing the number of blocks launched for the QKNorm kernel with small batch sizes, which demonstrably improves performance in the decode stage. My review identifies a potential integer overflow in the calculation of needed_blocks: with a large batch_size or num_heads, the product could wrap, leading to incorrect behavior or performance degradation. I have provided a code suggestion to use 64-bit integers for this calculation to enhance robustness. Additionally, please note the typo in the pull request title ('Redeuce' should be 'Reduce').
[SUCCESS] Pipeline #40736015: 12/20 passed
📌 Description
In the QKNorm kernel with a small batch size, we can reduce the number of blocks launched. This cuts block-launch overhead, especially in the decode stage.
An example result on B200 with (batch_size, num_heads, head_dim) = (128, 8, 128), a common shape in the Qwen3 model decode stage:
Before this PR: 2.448us
After this PR: 1.584us
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks

- Installed pre-commit by running pip install pre-commit (or used your preferred method).
- Installed the hooks with pre-commit install.
- Ran pre-commit run --all-files and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed (unittest, etc.).

Reviewer Notes