[bugfix] fix smem alloc for single warp reduce #1641

Closed

botbw wants to merge 0 commits into tile-ai:main from botbw:main

Conversation

@botbw
Contributor

@botbw botbw commented Jan 8, 2026

As title

Summary by CodeRabbit

  • Bug Fixes
  • Adjusted the workspace allocation threshold used during reduction operations to correct an edge case, resolving rare incorrect behavior and making reduction performance more stable and predictable.


@github-actions

github-actions bot commented Jan 8, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

Walkthrough

A single threshold in the thread-reduction path of src/op/reduce.cc was changed from `reducing_threads >= 32` to `reducing_threads > 32`, narrowing when a workspace is allocated and passed into the AllReduce call.

Changes

Cohort / File(s) | Summary
Thread-reduction threshold adjustment (src/op/reduce.cc) | Changed the workspace allocation condition from `reducing_threads >= 32` to `reducing_threads > 32` and updated the corresponding comment.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A tiny hop, a nitty bit tweak,
One sign trimmed down, now the threshold is sleek.
Threads counted sharper, a whisper of change,
Small but deliberate within the range.
I nibble the byte and then hop away—huzzah!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

Check name | Status | Explanation
Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The title clearly describes the main change: fixing shared memory allocation for single-warp reduce by adjusting the workspace allocation threshold from >= 32 to > 32 threads.
Docstring coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 581d51f and b299cfc.

📒 Files selected for processing (1)
  • src/op/reduce.cc
🔇 Additional comments (2)
src/op/reduce.cc (2)

175-176: Documentation correctly updated to reflect the new threshold.

The comment now accurately describes that workspace is allocated for > 32 threads instead of >= 32, matching the implementation change.


325-329: Correct fix for single-warp reduction.

The change correctly skips workspace allocation for exactly 32 threads (one warp). Single-warp reductions can use warp-level primitives without shared memory, while multi-warp reductions (> 32 threads) require workspace for inter-warp communication.

The AllReduce::run() and AllReduce::run_hopper() methods both accept an optional red_buf parameter (defaults to nullptr), allowing them to handle both call patterns correctly:

  • Two arguments (no workspace) for reducing_threads <= 32
  • Three arguments (with workspace) for reducing_threads > 32


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/op/reduce.cc (1)

174-176: Update the docstring to reflect the new threshold.

The docstring states workspace is allocated for >= 32 threads, but the implementation now uses > 32 (line 325). Please update the documentation to match the new behavior.

📝 Proposed documentation fix
  * - Detects parallel thread splitting from the normalized iterator sum and
  *   emits a call to a templated `tl::AllReduce<...>::run` (or `run_hopper`)
  *   via `builtin::call_extern`. For sufficiently large reducing thread counts
- *   (>= 32) a workspace is allocated via T.AddWorkspace and passed to the
+ *   (> 32) a workspace is allocated via T.AddWorkspace and passed to the
  *   AllReduce call.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a56212d and 581d51f.

📒 Files selected for processing (1)
  • src/op/reduce.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (1)
src/op/reduce.cc (1)

325-329: LGTM! Correct fix for single-warp reduction.

The change correctly prevents workspace allocation for exactly 32 threads (single warp). Single-warp reductions can use warp-level primitives without requiring shared memory workspace, making this allocation unnecessary and potentially wasteful.

LeiWang1999 previously approved these changes Jan 8, 2026
Contributor

Copilot AI left a comment


Pull request overview

This pull request fixes a bug in the shared memory workspace allocation logic for reduction operations. The fix changes the threshold for allocating workspace from >= 32 to > 32 threads, correctly handling the edge case where exactly 32 threads (a single warp on CUDA) are used for reduction.

Key Changes:

  • Updated the workspace allocation threshold to only allocate when thread count exceeds 32, not when it equals 32
  • Fixed both the code logic and accompanying documentation comment


