[Bugfix] Enable code lowering with producer‑copy‑only program #1168

LeiWang1999 · 2025-10-31T10:17:52Z

This pull request improves the robustness and correctness of warp group register allocation and TMA barrier injection transformations in the codebase. The changes mainly address edge cases in kernel patterns, enhance handling of barrier creation and referencing, and prevent crashes or incorrect code generation in degenerate scenarios.

Robustness improvements for degenerate patterns

In SetMaxNRegInjector, the code now gracefully handles cases where the consumer body is absent in warp-specialized patterns, preventing crashes when only the producer branch exists. [1] [2]

Barrier creation and reference correctness

In BarrierCreationRewriter, the logic is enhanced to ensure the minimum number of barriers matches the maximum index referenced in the kernel, and to append default barriers as needed. This prevents out-of-bounds and missing barrier issues. [1] [2]
The code now computes the minimum required barrier count dynamically by analyzing the kernel body for referenced barrier indices, ensuring that all barriers used are properly created.

Improved handling of TMA operations

In TmaBarrierRewriter, the code now promotes raw integer barrier IDs to get_mbarrier(id) for 1D TMA loads in producer-only kernels, ensuring correct code generation even when no arrive is present and mapping is empty.
The visitor and mutator logic for TMA barrier operations now checks for the presence of mapping before accessing it, preventing crashes in degenerate cases and handling missing mappings gracefully. [1] [2]

Summary by CodeRabbit

Bug Fixes
- Safely handle degenerate warp patterns where an optional consumer may be absent; preserve original control flow when no consumer.
- Guard barrier-related lookups and skip expectations when mappings are missing; promote certain loads to use barriers when needed.
Refactor
- Make barrier creation configurable: ensure a minimum barrier count, fill or preserve entries, and use a default thread count when appending.

…ies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns.

…ability. Adjusted formatting and spacing for clarity in barrier handling logic.

github-actions · 2025-10-31T10:18:01Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2025-10-31T10:18:02Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Handles degenerate warp patterns by making consumer blocks optional in IfThenElse construction, and enhances TMA barrier injection to guard missing mappings, compute a minimum required barrier count, and append/restore barrier entries accordingly.

Changes

Cohort / File(s)	Summary
Degenerate warp pattern handling `src/transform/annotate_warp_group_reg_alloc.cc`	Replace hard ICHECK on optional consumer with conditional flow; construct IfThenElse with only producer branch when consumer is absent, otherwise produce producer+consumer branches; preserve producer-side register injections and final attributes.
TMA barrier lookup safety & promotion `src/transform/inject_tma_barrier.cc`	Guard access to `tma_op_to_barrier_id_` (skip when no mapping); promote unmapped 1D TMA loads to `get_mbarrier(id)` and skip `mbarrier_expect_tx` handling when mapping absent.
Barrier creation parameter expansion `src/transform/inject_tma_barrier.cc`	Extend `BarrierCreationRewriter` constructor with `ensure_min_count` and `default_barrier_thread_count`; compute needed barrier count, build replace mask, preserve/override existing entries, and append extra entries using default thread count.
Barrier count discovery `src/transform/inject_tma_barrier.cc`	Add `GetMbarrierMaxIdxCollector` helper; scan function body to derive minimum required barrier count (ensure_min_count) before creating `BarrierCreationRewriter`.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Annotator as annotate_warp_group_reg_alloc
    participant Producer as ProducerBranch
    participant Consumer as ConsumerBranch
    Note over Annotator: Old flow (ICHECK)
    Annotator->>Consumer: assume consumer exists (ICHECK)
    Consumer-->>Annotator: body
    Note over Annotator,Producer: New flow (conditional)
    Annotator->>Producer: always construct producer branch
    alt consumer present
        Annotator->>Consumer: construct consumer branch
    else consumer absent
        Note right of Producer: Construct IfThenElse with only producer branch
    end

sequenceDiagram
    autonumber
    participant Rewriter as TmaBarrierRewriter
    participant Collector as GetMbarrierMaxIdxCollector
    participant Creator as BarrierCreationRewriter
    participant Map as tma_op_to_barrier_id_
    Note over Rewriter: Scan function for get_mbarrier calls
    Rewriter->>Collector: collect max barrier idx -> ensure_min_count
    Rewriter->>Map: lookup(call_ref)
    alt mapping exists
        Rewriter->>Creator: instantiate with restore ids and ensure_min_count
        Creator->>Creator: build/replace/append barrier entries (use existing ids)
    else no mapping
        Rewriter->>Rewriter: promote 1D TMA load -> get_mbarrier(id)
        Rewriter->>Creator: instantiate with ensure_min_count and default thread count
        Creator->>Creator: append barriers for needed slots (use default thread count)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pay special attention to src/transform/inject_tma_barrier.cc for correct handling of replace masks, indexing, and edge cases when appending barrier entries.
Verify src/transform/annotate_warp_group_reg_alloc.cc paths produce valid IR both when consumer is present and absent.
Check interactions between computed ensure_min_count and the BarrierCreationRewriter constructor parameters.

Possibly related PRs

[Refactor] Refactor barrier management #744: Overlaps in TMA/barrier lowering and get_mbarrier/barrier creation logic; likely touches similar code paths in inject_tma_barrier.cc.

Poem

🐰
I hop through IR, with careful cheer,
If no consumer, I keep the path clear,
Barriers counted, mapped or not,
Threads appended in the needed slot,
A tiny rabbit's tidy compiler plot.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The pull request title "[Bugfix] Enable code lowering with producer‑copy‑only program" directly relates to the main objective of the changeset. The summary of changes addresses handling degenerate kernel patterns where the consumer body is absent (producer-only kernels), spanning warp group register allocation, barrier creation, and TMA barrier handling. The title accurately reflects that this PR fixes code generation/lowering issues for producer-only program patterns, making it concise, clear, and specific enough for a reviewer to understand the primary change without reading the full diff.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

LeiWang1999 · 2025-10-31T10:20:39Z

for issue #1166

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10911e2 and 1cb0c43.

📒 Files selected for processing (2)

src/transform/annotate_warp_group_reg_alloc.cc (2 hunks)
src/transform/inject_tma_barrier.cc (5 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-09-12T09:47:46.474Z

Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

src/transform/annotate_warp_group_reg_alloc.cc
src/transform/inject_tma_barrier.cc

🧬 Code graph analysis (1)

src/transform/inject_tma_barrier.cc (2)

tilelang/language/builtin.py (2)

create_list_of_mbarrier (29-58)

get_mbarrier (61-70)

src/target/codegen_cuda.cc (16)

VisitExpr_ (887-1134)

VisitExpr_ (887-887)

VisitExpr_ (1136-1157)

VisitExpr_ (1136-1136)

VisitExpr_ (1159-1180)

VisitExpr_ (1159-1159)

VisitExpr_ (1385-2317)

VisitExpr_ (1385-1385)

VisitExpr_ (2460-2474)

VisitExpr_ (2460-2460)

VisitExpr_ (2476-2544)

VisitExpr_ (2476-2477)

VisitExpr_ (2546-2694)

VisitExpr_ (2546-2547)

VisitExpr_ (2760-2763)

VisitExpr_ (2760-2761)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
GitHub Check: Test for Python 3.12 with Metal (on macos-latest)

🔇 Additional comments (1)

src/transform/annotate_warp_group_reg_alloc.cc (1)

155-166: Consumer-optional branch logic looks good

The conditional construction of the IfThenElse keeps register annotations intact while avoiding crashes on producer-only kernels. Nice resilience tweak.

src/transform/inject_tma_barrier.cc

LeiWang1999 · 2025-10-31T10:34:05Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2025-10-31T10:36:58Z

src/transform/inject_tma_barrier.cc

  PrimExpr VisitExpr_(const CallNode *op) {
    if (op->op.same_as(create_list_of_mbarrier())) {
-      std::vector<bool> tmp_(op->args.size(), false);
-      Array<PrimExpr> new_args;
+      size_t cur_n = op->args.size();
+      size_t need_n =
+          std::max<size_t>(cur_n, static_cast<size_t>(ensure_min_count_));
+
+      std::vector<bool> replace(cur_n, false);
      for (auto &id : restore_barrier_ids_) {
-        tmp_[id] = true;
+        if (id >= 0 && static_cast<size_t>(id) < replace.size()) {
+          replace[id] = true;
+        }
      }

-      for (size_t i{0}; i < op->args.size(); ++i) {
-        if (tmp_[i]) {
+      Array<PrimExpr> new_args;
+      new_args.reserve(need_n);
+
+      // Preserve/override existing entries
+      for (size_t i{0}; i < cur_n; ++i) {
+        if (replace[i]) {
          new_args.push_back(producer_thread_extent_);
        } else {
          new_args.push_back(op->args[i]);
        }
      }
+      // Append additional barriers if required
+      for (size_t i = cur_n; i < need_n; ++i) {
+        new_args.push_back(default_barrier_thread_count_);
+      }


Restore counts for new barriers beyond original list

When BarrierCreationRewriter grows create_list_of_mbarrier to ensure_min_count, replace is sized with cur_n and only covers the original arguments. Any restore_barrier_ids_ greater than or equal to the original length are ignored, so the subsequently appended entries (lines appending default_barrier_thread_count_) are always initialized with the default count 1 even if they were marked for restoration. This means a barrier that was referenced in the kernel but not originally created will be created expecting a single arriving thread instead of producer_thread_extent_, leading to premature releases or deadlock when multiple threads participate. Consider sizing replace to need_n and using it when appending so restored barriers receive the correct arrival count.

Useful? React with 👍 / 👎.

… appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1cb0c43 and 0e5f014.

📒 Files selected for processing (1)

src/transform/inject_tma_barrier.cc (5 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-09-12T09:47:46.474Z

Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

src/transform/inject_tma_barrier.cc

🧬 Code graph analysis (1)

src/transform/inject_tma_barrier.cc (1)

tilelang/language/builtin.py (2)

create_list_of_mbarrier (29-58)

get_mbarrier (61-70)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Quick Lint

🔇 Additional comments (5)

src/transform/inject_tma_barrier.cc (5)

296-306: LGTM! Proper guard for missing mappings.

The added check at line 299 prevents accessing tma_op_to_barrier_id_ when the mapping doesn't exist, handling producer-only kernels gracefully where no ptx_arrive_barrier creates the mapping.

340-348: LGTM! Constructor parameters enable minimum barrier count enforcement.

The new parameters ensure_min_count and default_barrier_thread_count provide the necessary control to ensure referenced barriers are created, addressing degenerate producer-only kernels.

350-389: LGTM! Previous critical issue with restored barrier counts is fixed.

The implementation now correctly handles restored barrier IDs beyond the original list:

replace vector is sized to need_n (line 358), covering both existing and newly appended barriers

The append loop (lines 377-383) checks replace[i] for each new barrier

Restored barriers in the extended range now receive producer_thread_extent_ instead of the default count

This addresses the critical concern from the previous review.

508-540: LGTM! Proper handling of unmapped 1D TMA loads.

The logic correctly handles producer-only kernels where no arrive creates a mapping:

Guards against missing mappings (line 510)

Promotes constant integer barrier IDs to get_mbarrier(id) calls for 1D TMA loads (lines 518-526)

Falls back safely for non-constant barrier IDs

Note: This only handles IntImmNode barrier IDs, consistent with the GetMbarrierMaxIdxCollector design.

541-545: LGTM! Consistent guard for mbarrier_expect_tx.

The added check prevents accessing tma_op_to_barrier_id_ when the mapping doesn't exist, consistent with the handling for tma_load and necessary for producer-only kernels.

coderabbitai · 2025-10-31T12:13:14Z

src/transform/inject_tma_barrier.cc

+    // Compute the minimum number of barriers actually referenced in the body
+    // after TMA barrier rewrites (e.g., get_mbarrier(0) inserted for TMA).
+    struct GetMbarrierMaxIdxCollector : public StmtExprVisitor {
+      int max_idx{-1};
+      void VisitExpr_(const CallNode *op) final {
+        if (op->op.same_as(get_mbarrier())) {
+          if (op->args.size() == 1) {
+            if (const auto *imm = op->args[0].as<IntImmNode>()) {
+              max_idx = std::max(max_idx, static_cast<int>(imm->value));
+            }
+          }
+        }
+        StmtExprVisitor::VisitExpr_(op);
+      }
+    };
+
+    GetMbarrierMaxIdxCollector max_idx_collector;
+    max_idx_collector(f->body);
+    int ensure_min_count = max_idx_collector.max_idx + 1; // 0-based -> count


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that only constant barrier IDs are used in practice.

The GetMbarrierMaxIdxCollector only captures constant integer barrier IDs (line 439). If the codebase uses dynamic barrier IDs—computed from variables or loop indices—those won't be counted, potentially leading to insufficient barrier creation.

Run the following script to check for non-constant barrier ID usage:

🏁 Script executed:

#!/bin/bash # Description: Search for get_mbarrier calls with non-constant arguments # to verify the assumption that only constant IDs are used. # Search for get_mbarrier calls and show context rg -nP 'get_mbarrier\(' --type=cc -A 2 -B 2 # Also check test files for dynamic usage patterns rg -nP 'get_mbarrier\(' --type=py -A 2 -B 2 -g '**/*test*.py'

Length of output: 9092

The script output reveals critical findings: dynamic barrier IDs ARE used in practice.

🏁 Script executed:

#!/bin/bash # Find pass registration in inject_tma_barrier.cc fd -name "inject_tma_barrier.cc" --type f -exec grep -l "TVM_REGISTER\|CreatePrimFuncPass" {} \; # Check for registration details rg -n "TVM_REGISTER.*inject_tma_barrier\|CreatePrimFuncPass" -A 5 -B 5 2>/dev/null || echo "Not found with grep, trying broader search..." # Look for the full registration rg -nP "InjectTMABarrier|inject_tma_barrier" --type=cc -A 3 -B 3 -g "*inject_tma_barrier*"

Length of output: 303

🏁 Script executed:

#!/bin/bash # Find inject_tma_barrier.cc file fd inject_tma_barrier.cc # Search for pass registration in that file rg "TVM_REGISTER|CreatePrimFuncPass" -A 5 -B 5 inject_tma_barrier.cc 2>/dev/null || true # Look for how the pass handles expressions rg -nP "GetMbarrierMaxIdxCollector|VisitExpr" inject_tma_barrier.cc -A 2 -B 2 2>/dev/null || true # Check the end of the file for pass registration tail -50 $(fd inject_tma_barrier.cc)

Length of output: 1847

🏁 Script executed:

#!/bin/bash # Find where ensure_min_count is used rg -n "ensure_min_count" src/transform/inject_tma_barrier.cc -B 2 -A 5 # Find BarrierCreationRewriter to understand impact rg -n "BarrierCreationRewriter" src/transform/inject_tma_barrier.cc -B 3 -A 10

Length of output: 3343

Fix GetMbarrierMaxIdxCollector to handle dynamic barrier ID expressions.

The collector currently only recognizes constant integer barrier IDs via as<IntImmNode>() at line 439. However, test files demonstrate widespread use of dynamic barrier IDs like k % 3 and k % 3 + 3 within loops (see test_tilelang_transform_warp_specialized.py lines 83, 85, 89, etc.). These dynamic expressions are not IntImmNode and are therefore not counted, causing ensure_min_count to be underestimated. This results in create_list_of_mbarrier() allocating fewer barriers than actually needed, leading to runtime access violations.

You must enhance the collector to compute bounds on dynamic expressions (using arith::Analyzer which is already available as analyzer) to determine the maximum possible barrier ID, ensuring sufficient barrier allocation.

🤖 Prompt for AI Agents

In src/transform/inject_tma_barrier.cc around lines 432-450, the GetMbarrierMaxIdxCollector only handles IntImmNode constants; update VisitExpr_ so when get_mbarrier() args[0] is not an IntImmNode it uses the existing arith::Analyzer (named analyzer) to compute an upper bound for the expression (e.g., via analyzer->int_set(expr) or const-int-bound helper) and updates max_idx with that bound; if the analyzer reports a finite maximum use that value, and if the bound is unbounded/unknown set max_idx to a conservative safe upper value (e.g., 1024) to avoid under-allocation, then continue traversal as before.

…ai#1168) * bugfix * lint fix * Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns. * Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic. * Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.

* [Test] Add cp async to avoid register spill * [BugFix] GQA fwd and bwd - Fix the undefined behavior of -inf in acc_s - Fix the causal loop range in varlen scenario * [TMA] Move on to TMA and locate the register spill issue * [Debug] Not the reason of zero-assignment. Probably the combination of Parallel op & conditional qkT * [Debug] The SIMT copy in producer occupies too many registers * [BugFix] Use 3D lse and delta to avoid illegal instruction * [Perf] Relaxed order for dQ and SIMT store for dKdV * [Feat] For atomic add version * [Lint] * [Bugfix] Enable code lowering with producer‑copy‑only program (#1168) * bugfix * lint fix * Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns. * Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic. * Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions. * [Bugfix] Support 16bits shfl_sync (#1169) * Add type-safe warp shuffle helpers for 16-bit float types in common.h - Introduced generic passthrough functions for warp shuffle operations: `shfl_xor_sync`, `shfl_down_sync`, `shfl_up_sync`, and `shfl_sync`. - Added specializations for `cutlass::half_t` and `cutlass::bfloat16_t` to ensure type safety during shuffle operations. - Updated `reduce.h` to utilize the new shuffle functions, enhancing code clarity and maintainability. * lint fix * [Testing] Move TMA 1D and test for its functionality (#1167) * [Testing] Move TMA 1D and test for its functionality * [Lint] * [Refactor]: Change the params in pytest to avoid oom error during ci (#1170) * [Refactor]: Change the params in pytest to avoid oom error during ci * format * fix * Update test_example_cast.py * Update parameters in test_example_cast * Update test_example_flash_attention.py * update * format * fix * fix * format * [Bugfix] Fix tvm import path for editable build (#1172) * [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) * remove debug print * pipeline fix * use the correct buffer access scope * rs support * warp warpgroup_fence_operand * fix * fp8 dtype ptx enhance * mma fix * TCGEN05 Interface * tcgen05 support * rebase * update * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors. * lint fix * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module. * wgmma fix --------- Co-authored-by: Zhiwen Mo <[email protected]> * [Language] Add Correctness and performance check scripts for V2 (#1174) * fix * lint fix * fix * lint fix * fix * upd * [Bugfix] Legalize Datatype for mma intrinisc codegen (#1179) * fix * lint fix * Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands. * [Perf] Add layout and use_tma to boost performance * [Lint] * [Note] --------- Co-authored-by: Lei Wang <[email protected]> Co-authored-by: Yuqi Dong <[email protected]> Co-authored-by: Zhiwen Mo <[email protected]>

…ai#1168) * bugfix * lint fix * Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns. * Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic. * Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.

* [Test] Add cp async to avoid register spill * [BugFix] GQA fwd and bwd - Fix the undefined behavior of -inf in acc_s - Fix the causal loop range in varlen scenario * [TMA] Move on to TMA and locate the register spill issue * [Debug] Not the reason of zero-assignment. Probably the combination of Parallel op & conditional qkT * [Debug] The SIMT copy in producer occupies too many registers * [BugFix] Use 3D lse and delta to avoid illegal instruction * [Perf] Relaxed order for dQ and SIMT store for dKdV * [Feat] For atomic add version * [Lint] * [Bugfix] Enable code lowering with producer‑copy‑only program (tile-ai#1168) * bugfix * lint fix * Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns. * Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic. * Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions. * [Bugfix] Support 16bits shfl_sync (tile-ai#1169) * Add type-safe warp shuffle helpers for 16-bit float types in common.h - Introduced generic passthrough functions for warp shuffle operations: `shfl_xor_sync`, `shfl_down_sync`, `shfl_up_sync`, and `shfl_sync`. - Added specializations for `cutlass::half_t` and `cutlass::bfloat16_t` to ensure type safety during shuffle operations. - Updated `reduce.h` to utilize the new shuffle functions, enhancing code clarity and maintainability. * lint fix * [Testing] Move TMA 1D and test for its functionality (tile-ai#1167) * [Testing] Move TMA 1D and test for its functionality * [Lint] * [Refactor]: Change the params in pytest to avoid oom error during ci (tile-ai#1170) * [Refactor]: Change the params in pytest to avoid oom error during ci * format * fix * Update test_example_cast.py * Update parameters in test_example_cast * Update test_example_flash_attention.py * update * format * fix * fix * format * [Bugfix] Fix tvm import path for editable build (tile-ai#1172) * [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (tile-ai#986) * remove debug print * pipeline fix * use the correct buffer access scope * rs support * warp warpgroup_fence_operand * fix * fp8 dtype ptx enhance * mma fix * TCGEN05 Interface * tcgen05 support * rebase * update * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors. * lint fix * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module. * wgmma fix --------- Co-authored-by: Zhiwen Mo <[email protected]> * [Language] Add Correctness and performance check scripts for V2 (tile-ai#1174) * fix * lint fix * fix * lint fix * fix * upd * [Bugfix] Legalize Datatype for mma intrinisc codegen (tile-ai#1179) * fix * lint fix * Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands. * [Perf] Add layout and use_tma to boost performance * [Lint] * [Note] --------- Co-authored-by: Lei Wang <[email protected]> Co-authored-by: Yuqi Dong <[email protected]> Co-authored-by: Zhiwen Mo <[email protected]>

LeiWang1999 added 4 commits October 31, 2025 16:27

bugfix

3717ec3

lint fix

9d10de9

Enhance warp group register allocation to handle missing consumer bod…

c3212aa

…ies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns.

Refactor VisitExpr_ method in inject_tma_barrier.cc for improved read…

1cb0c43

…ability. Adjusted formatting and spacing for clarity in barrier handling logic.

coderabbitai bot reviewed Oct 31, 2025

View reviewed changes

src/transform/inject_tma_barrier.cc Show resolved Hide resolved

chatgpt-codex-connector bot reviewed Oct 31, 2025

View reviewed changes

Update barrier handling in inject_tma_barrier.cc to accommodate newly…

0e5f014

… appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.

coderabbitai bot reviewed Oct 31, 2025

View reviewed changes

LeiWang1999 merged commit 7a80b6d into tile-ai:main Oct 31, 2025
4 of 6 checks passed

kurisu6912 mentioned this pull request Nov 3, 2025

[Fix] fix type imcompatible error in #1115 #1180

Merged

This was referenced Nov 4, 2025

[Feat] Add swap like grammar in tuple assignment #1185

Merged

[Fix] Remove unsupported type params #1186

Merged

[Feat] Add support for T.serial with step and negative step #1188

Merged

[Feat] Add A Pass to Handle Negative Index #1192

Merged

This was referenced Nov 10, 2025

[Fix] Fix buffer re-import typo in tilelang.languge #1214

Merged

[Fix] Fix a type that make wrong T.macro backtrace #1234

Merged

[Language] Add type stubs for tir op #1239

Merged

This was referenced Nov 21, 2025

[Feat] Add missing support for uint32x2, add unsigned implicit cast in bitwise op, add T.Ref as macro annotation #1302

Closed

[Fix] Remove unused let_bindings_ in CodeGenC to fix #1300 #1305

Merged

[Fix] Fix frame scope error in T.macro #1308

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix] Enable code lowering with producer‑copy‑only program #1168

[Bugfix] Enable code lowering with producer‑copy‑only program #1168

Uh oh!

LeiWang1999 commented Oct 31, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

github-actions bot commented Oct 31, 2025

Uh oh!

coderabbitai bot commented Oct 31, 2025 •

edited

Loading

Other AI code review bot(s) detected

Uh oh!

LeiWang1999 commented Oct 31, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

LeiWang1999 commented Oct 31, 2025

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Oct 31, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Oct 31, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

[Bugfix] Enable code lowering with producer‑copy‑only program #1168

[Bugfix] Enable code lowering with producer‑copy‑only program #1168

Uh oh!

Conversation

LeiWang1999 commented Oct 31, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Robustness improvements for degenerate patterns

Barrier creation and reference correctness

Improved handling of TMA operations

Summary by CodeRabbit

Uh oh!

github-actions bot commented Oct 31, 2025

Uh oh!

coderabbitai bot commented Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Pre-merge checks and finishing touches

Uh oh!

LeiWang1999 commented Oct 31, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

LeiWang1999 commented Oct 31, 2025

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LeiWang1999 commented Oct 31, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 31, 2025 •

edited

Loading