[bugfix] fix smem alloc for single warp reduce#1643

Merged
LeiWang1999 merged 2 commits into tile-ai:main from botbw:bugfix/reduce
Jan 9, 2026

Conversation

@botbw
Contributor

@botbw botbw commented Jan 9, 2026

As title. Previous PR #1641 was accidentally closed by my PR bot :(

Summary by CodeRabbit

  • Bug Fixes
    • Reduce operations no longer allocate a shared-memory workspace when exactly 32 threads (a single warp) take part in the reduction.

@github-actions

github-actions bot commented Jan 9, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Jan 9, 2026

Walkthrough

This PR changes one condition in the ReduceOp code generation and lowering path: the check that decides whether to allocate a workspace for AllReduce goes from reducing_threads >= 32 to reducing_threads > 32. A reduction over exactly 32 threads therefore no longer allocates a workspace.

Changes

| Cohort / File(s) | Summary |
|---|---|
| ReduceOp Codegen: src/op/reduce.cc | Modified workspace allocation condition for AllReduce from reducing_threads >= 32 to reducing_threads > 32, affecting cases where reducing thread count equals exactly 32. |
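
To make the effect of the threshold change concrete, here is a minimal Python sketch that compares the two predicates quoted above. It is only a model of the condition, not the code in src/op/reduce.cc; the function names and the WARP_SIZE constant are hypothetical and chosen for illustration.

```python
WARP_SIZE = 32  # assumption: the literal 32 in the condition stands for one warp


def needs_workspace_old(reducing_threads: int) -> bool:
    """Condition before this PR: reducing_threads >= 32."""
    return reducing_threads >= WARP_SIZE


def needs_workspace_new(reducing_threads: int) -> bool:
    """Condition after this PR: reducing_threads > 32."""
    return reducing_threads > WARP_SIZE


def changed_thread_counts(candidates):
    """Return the thread counts for which the two conditions disagree."""
    return [n for n in candidates
            if needs_workspace_old(n) != needs_workspace_new(n)]


if __name__ == "__main__":
    # Only the single-warp case is expected to change behaviour.
    print(changed_thread_counts(range(1, 1025)))
```

Under this model the two conditions differ only at reducing_threads = 32, which matches the summary's statement that the change affects the exactly-32 case.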

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

  • LeiWang1999

Poem

🐰 A threshold shifted, oh what delight,
Thirty-two threads now skip through the night,
The workspace decides with precision so tight,
Greater than, not equal—a boundary just right! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main change: fixing shared memory allocation for single warp reduce operations by changing the workspace allocation threshold from >= 32 to > 32 threads. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ce68b51 and b299cfc.

📒 Files selected for processing (1)
  • src/op/reduce.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (1)
src/op/reduce.cc (1)

175-176: LGTM! Documentation properly updated.

The comment accurately reflects the new threshold for workspace allocation.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/op/reduce.cc (1)

325-329: Condition threshold is incorrect; workspace is not needed until reducing_threads >= 64.

The condition reducing_threads > 32 still allocates workspace for thread counts from 33 to 63, where it is never used. The AllReduce recursion only touches shared memory when offset >= 32, which first happens at reducing_threads = 64. For example, with 33 threads the first offset is 16, so every step uses warp shuffle operations and no workspace is needed. The correct condition should be reducing_threads >= 64.

This also leaves the code inconsistent: finalize_reducer.cc uses >= 32 (line 113), reduce.cc uses > 32 (line 325), and the documentation states >= 32. All three should use >= 64 for correctness.
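
The reasoning in this comment can be checked with a small Python model. This is a sketch under stated assumptions, not the actual AllReduce template: it assumes each step uses offset = threads // 2 (inferred from the "33 threads, first offset is 16" example) and then recurses on that offset until only one thread remains. All function names are hypothetical.

```python
def all_reduce_offsets(reducing_threads: int) -> list[int]:
    """Offsets visited by the modelled recursion (assumed halving rule)."""
    offsets = []
    threads = reducing_threads
    while threads > 1:
        offset = threads // 2  # assumption: integer halving at every step
        offsets.append(offset)
        threads = offset
    return offsets


def uses_shared_memory(reducing_threads: int) -> bool:
    """True if any step has offset >= 32, the case said to need workspace."""
    return any(offset >= 32 for offset in all_reduce_offsets(reducing_threads))


def smallest_count_using_shared_memory(limit: int = 1024):
    """Smallest thread count up to `limit` that reaches the shared-memory path."""
    for n in range(1, limit + 1):
        if uses_shared_memory(n):
            return n
    return None


if __name__ == "__main__":
    print(all_reduce_offsets(33))                # shuffle-only offsets
    print(smallest_count_using_shared_memory())  # first count needing workspace
```

If the halving assumption holds, no thread count below 64 reaches an offset of 32 or more, which is the basis for the suggested >= 64 condition. Whether the real recursion behaves this way for non-power-of-two thread counts would need to be confirmed against the source.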

@LeiWang1999 LeiWang1999 merged commit b246c39 into tile-ai:main Jan 9, 2026
15 of 17 checks passed