[bugfix] fix smem alloc for single warp reduce#1643
[bugfix] fix smem alloc for single warp reduce#1643LeiWang1999 merged 2 commits intotile-ai:mainfrom
Conversation
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
📝 WalkthroughWalkthroughA condition threshold adjustment in the ReduceOp code generation and lowering path changes the workspace allocation trigger for AllReduce operations. The condition that decides to allocate workspace transitions from Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
🔇 Additional comments (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/op/reduce.cc (1)
325-329: Condition threshold is incorrect; workspace is not needed untilreducing_threads >= 64.The condition
reducing_threads > 32allocates workspace unnecessarily for thread counts between 33 and 63. The AllReduce recursion only requires shared memory (whenoffset >= 32) starting atreducing_threads = 64. For example, with 33 threads, the first offset is 16, which uses only warp shuffle operations and does not need workspace. The correct condition should bereducing_threads >= 64.Additionally, this creates an inconsistency:
finalize_reducer.ccuses>= 32(line 113) whilereduce.ccuses> 32(line 325), and the documentation states>= 32. All three should use>= 64for correctness.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/op/reduce.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (1)
src/op/reduce.cc (1)
175-176: LGTM! Documentation properly updated.The comment accurately reflects the new threshold for workspace allocation.
As title. Previous PR #1641 was accidentally closed by my PR bot :(
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.