[bugfix] fix smem alloc for single warp reduce #1641
botbw wants to merge 0 commits into tile-ai:main from
Conversation
📝 Walkthrough

A single threshold in the thread-reduction path of src/op/reduce.cc was changed from `>= 32` to `> 32`.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/op/reduce.cc (1)
174-176: Update the docstring to reflect the new threshold.

The docstring states workspace is allocated for `>= 32` threads, but the implementation now uses `> 32` (line 325). Please update the documentation to match the new behavior.

📝 Proposed documentation fix
```diff
  * - Detects parallel thread splitting from the normalized iterator sum and
  *   emits a call to a templated `tl::AllReduce<...>::run` (or `run_hopper`)
  *   via `builtin::call_extern`. For sufficiently large reducing thread counts
- *   (>= 32) a workspace is allocated via T.AddWorkspace and passed to the
+ *   (> 32) a workspace is allocated via T.AddWorkspace and passed to the
  *   AllReduce call.
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/op/reduce.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (1)
src/op/reduce.cc (1)
325-329: LGTM! Correct fix for single-warp reduction.

The change correctly prevents workspace allocation for exactly 32 threads (a single warp). Single-warp reductions can use warp-level primitives without requiring a shared memory workspace, making this allocation unnecessary and potentially wasteful.
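The reviewer's point — that a single warp can reduce without touching shared memory — can be sketched with a host-side model of the warp shuffle butterfly. This is illustrative only, not TileLang or CUDA source; the lane array stands in for per-lane registers exchanged by `__shfl_xor_sync`.

```cpp
#include <array>

// Illustrative model of why a reduction over exactly 32 threads (one CUDA
// warp) needs no shared-memory workspace: every butterfly step exchanges
// values directly between lanes, modeled here as reads from a per-lane
// register array.
constexpr int kWarpSize = 32;

int WarpAllReduceSum(std::array<int, kWarpSize> lanes) {
  // Mirrors the XOR-butterfly pattern used by warp shuffle reductions.
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
    std::array<int, kWarpSize> next = lanes;
    for (int lane = 0; lane < kWarpSize; ++lane) {
      next[lane] = lanes[lane] + lanes[lane ^ offset];  // peer-lane exchange
    }
    lanes = next;  // no shared memory touched at any step
  }
  return lanes[0];  // after log2(32) = 5 steps, every lane holds the full sum
}
```

Cross-warp reductions, by contrast, have no lane-to-lane path between warps, which is why the workspace is still needed once the thread count exceeds 32.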
Pull request overview
This pull request fixes a bug in the shared memory workspace allocation logic for reduction operations. The fix changes the threshold for allocating workspace from >= 32 to > 32 threads, correctly handling the edge case where exactly 32 threads (a single warp on CUDA) are used for reduction.
Key Changes:
- Updated the workspace allocation threshold to only allocate when thread count exceeds 32, not when it equals 32
- Fixed both the code logic and accompanying documentation comment
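Assuming the CUDA warp size of 32, the corrected threshold can be sketched as a standalone predicate. `NeedsWorkspace` and `kWarpSize` are hypothetical names for illustration, not TileLang's actual API in src/op/reduce.cc.

```cpp
#include <cstddef>

// Hypothetical sketch of the fixed allocation rule, not the actual
// TileLang implementation.
constexpr std::size_t kWarpSize = 32;  // CUDA warp size

bool NeedsWorkspace(std::size_t reducing_threads) {
  // One warp (or less) reduces with warp-level primitives alone; only
  // cross-warp reductions need a shared-memory workspace.
  return reducing_threads > kWarpSize;  // was: >= kWarpSize before the fix
}
```

With the old `>=` comparison, a reduction over exactly 32 threads would have allocated a workspace it never needed; the strict `>` reserves it for multi-warp reductions only.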
As title