[GQA] Add regional atomic add to slightly boost performance #1093
LeiWang1999 merged 6 commits into tile-ai:main
Walkthrough

The backward-pass kernel in a flash attention example replaces per-element atomic update loops with vectorized, slice-based atomic operations for the dQ, dV, and dK tensors. Control flow is unchanged; only the accumulation steps are restructured to add contiguous slices instead of issuing one atomic addition per element.
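The difference between the two accumulation styles can be sketched in plain Python. This is only an illustrative model, not the kernel from this PR: nested lists stand in for GPU buffers, a lock stands in for hardware atomicity, and `atomic_add_element`, `atomic_add_region`, and `accumulate_per_element` are hypothetical helpers rather than TileLang APIs. It shows that both styles produce the same accumulated result while the regional form issues one call per tile instead of one per element.

```python
import threading

# Stand-in for hardware atomicity; a real GPU would use atomic instructions.
_lock = threading.Lock()


def atomic_add_element(dst, i, j, value):
    """Hypothetical helper: atomically add one value to dst[i][j]."""
    with _lock:
        dst[i][j] += value


def atomic_add_region(dst, row_start, tile):
    """Hypothetical helper: add a contiguous tile into dst in one call."""
    with _lock:
        for r, row in enumerate(tile):
            dst_row = dst[row_start + r]
            for c, value in enumerate(row):
                dst_row[c] += value


def accumulate_per_element(dst, row_start, tile):
    """Old style: one atomic call for every element of the tile."""
    for r, row in enumerate(tile):
        for c, value in enumerate(row):
            atomic_add_element(dst, row_start + r, c, value)


def make_buffer(rows, cols):
    """Zero-initialized 2D buffer, playing the role of a gradient tensor."""
    return [[0.0] * cols for _ in range(rows)]


tile = [[1.0, 2.0], [3.0, 4.0]]      # a partial gradient tile
dq_elementwise = make_buffer(4, 2)   # accumulated element by element
dq_regional = make_buffer(4, 2)      # accumulated one slice at a time

for _ in range(3):                   # three partial contributions
    accumulate_per_element(dq_elementwise, 1, tile)
    atomic_add_region(dq_regional, 1, tile)
```

In this model the lock only makes each call indivisible; it says nothing about how an actual backend vectorizes or orders the underlying atomic instructions.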
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

The changes are a targeted refactoring of atomic operations and tensor slicing patterns within a single example file. Review requires understanding atomic semantics, vectorized memory operations, and tensor indexing correctness, but the change is confined to one optimization path and introduces no new branching.
…1093)
* [Lint]
* [BugFix] Freeze the memory order of all atomic_add operations
* [Lint]
* [Atomic] Move on to regional atomic add
* [Lint]