[bugfix] fix smem alloc for single warp reduce #1641

Closed

botbw wants to merge 0 commits into tile-ai:main from botbw:main

Conversation

@botbw
Contributor

@botbw botbw commented Jan 8, 2026

As title

Summary by CodeRabbit

  • Bug Fixes
  • Adjusted the workspace allocation threshold used during reduction operations to correct an edge case, resolving rare incorrect behavior and making reduction performance more stable and predictable.


@github-actions

github-actions bot commented Jan 8, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

Walkthrough

A single threshold in the thread-reduction path of src/op/reduce.cc was changed from `reducing_threads >= 32` to `reducing_threads > 32`, narrowing when a workspace is allocated and passed into the AllReduce call.

Changes

Cohort / File(s) | Summary
Thread-reduction threshold adjustment (src/op/reduce.cc) | Changed the workspace allocation condition from `reducing_threads >= 32` to `reducing_threads > 32` and updated the corresponding comment.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A tiny hop, a nitty bit tweak,
One sign trimmed down, now the threshold is sleek.
Threads counted sharper, a whisper of change,
Small but deliberate within the range.
I nibble the byte and then hop away—huzzah!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

Check name | Status | Explanation
Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The title clearly describes the main change: fixing shared memory allocation for single-warp reduce by adjusting the workspace allocation threshold from >= 32 to > 32 threads.
Docstring coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 581d51f and b299cfc.

📒 Files selected for processing (1)
  • src/op/reduce.cc
🔇 Additional comments (2)
src/op/reduce.cc (2)

175-176: Documentation correctly updated to reflect the new threshold.

The comment now accurately describes that workspace is allocated for > 32 threads instead of >= 32, matching the implementation change.


325-329: Correct fix for single-warp reduction.

The change correctly skips workspace allocation for exactly 32 threads (one warp). Single-warp reductions can use warp-level primitives without shared memory, while multi-warp reductions (> 32 threads) require workspace for inter-warp communication.

The AllReduce::run() and AllReduce::run_hopper() methods both accept an optional red_buf parameter (defaults to nullptr), allowing them to handle both call patterns correctly:

  • Two arguments (no workspace) for reducing_threads <= 32
  • Three arguments (with workspace) for reducing_threads > 32


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/op/reduce.cc (1)

174-176: Update the docstring to reflect the new threshold.

The docstring states workspace is allocated for >= 32 threads, but the implementation now uses > 32 (line 325). Please update the documentation to match the new behavior.

📝 Proposed documentation fix
  * - Detects parallel thread splitting from the normalized iterator sum and
  *   emits a call to a templated `tl::AllReduce<...>::run` (or `run_hopper`)
  *   via `builtin::call_extern`. For sufficiently large reducing thread counts
- *   (>= 32) a workspace is allocated via T.AddWorkspace and passed to the
+ *   (> 32) a workspace is allocated via T.AddWorkspace and passed to the
  *   AllReduce call.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a56212d and 581d51f.

📒 Files selected for processing (1)
  • src/op/reduce.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (1)
src/op/reduce.cc (1)

325-329: LGTM! Correct fix for single-warp reduction.

The change correctly prevents workspace allocation for exactly 32 threads (single warp). Single-warp reductions can use warp-level primitives without requiring shared memory workspace, making this allocation unnecessary and potentially wasteful.

LeiWang1999 previously approved these changes Jan 8, 2026
Contributor

Copilot AI left a comment


Pull request overview

This pull request fixes a bug in the shared memory workspace allocation logic for reduction operations. The fix changes the threshold for allocating workspace from >= 32 to > 32 threads, correctly handling the edge case where exactly 32 threads (a single warp on CUDA) are used for reduction.

Key Changes:

  • Updated the workspace allocation threshold to only allocate when thread count exceeds 32, not when it equals 32
  • Fixed both the code logic and accompanying documentation comment


