@tzj-fxz tzj-fxz commented Aug 26, 2025

[Feature] Add 1D TMA support

  • Check the contiguity conditions of the 1D TMA copy
  • Add a new interface and parameter order for the tma_load and tma_store calls
  • Add a 1D tma_store interface to the SM90 template
  • Add an elementwise kernel as a 1D TMA example

Summary by CodeRabbit

  • New Features

    • New GPU example demonstrating a tiled elementwise addition with CLI, validation, and float32 support.
    • New sized TMA store API for direct sized global writes.
  • Improvements

    • Added a 1D TMA transfer path and refined barrier handling for 1D TMA loads to improve correctness/performance.
    • Updated third-party submodule.
  • Tests

    • Added tests for the elementwise TMA example and a BF16 MXFP4 Hopper dequant GEMM example.
  • Documentation

    • Clarified GDN example README and added acknowledgements.

coderabbitai bot commented Aug 26, 2025


Walkthrough

Adds a TileLang elementwise-add example and test; implements a 1D TMA copy fast-path and sized tma_store overload for CUDA SM90; adjusts barrier insertion and byte-accounting to support 1D TMA loads; threads GEMM buffer variable collection into lowering via a new LowerArgs field; updates cutlass submodule and docs.

Changes

Cohort / File(s) Summary of Changes
New example & test
examples/elementwise/example_elementwise_add_tma_1d.py, examples/elementwise/test_example_elementwise.py
Adds a tiled 2D elementwise-add TileLang example with host runner/CLI and a unit test that invokes its main.
Copy op: 1D TMA path & API change
src/op/copy.cc
Adds a 1D TMA copy fast-path when global/shared buffers are contiguous and in-bounds; computes offsets/strides/byte sizes, emits sized tma_load/tma_store with eviction_policy, gates with barrier checks, and updates Copy op registration inputs from 4→5.
CUDA SM90 TMA helper
src/tl_templates/cuda/copy_sm90.h
Adds templated overload tma_store(void *gmem_ptr, void *smem_ptr, uint32_t size) to perform sized cp.async.bulk.global.shared::cta.bulk_group stores with cache_hint; preserves existing descriptor APIs.
TMA barrier injection & rewrite
src/transform/inject_tma_barrier.cc, src/transform/warp_specialized_rewriter.cc
Detects 1D tma_load calls (shared tvm_access_ptr in args[0]) and for them adjusts bulk_copy_bytes accounting and shifts barrier insertion index to 2; retains previous behavior for multi-D loads.
Lowering: GEMM buffer collection & LowerArgs change
src/transform/lower_tile_op.cc, src/op/op.h
Adds BufferGemmCollector to collect GEMM-related buffer vars, threads them into LowerTileOpPass, and extends LowerArgs with new Array<Var> buffer_var_gemm passed into tile op lowering.
Submodule & docs
3rdparty/cutlass (submodule), examples/gdn/README.md
Updates cutlass submodule pointer; edits README wording, requirements list, and acknowledgements.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Host
  participant Kernel as GPU Kernel
  participant Copy as Copy::LowerBulkCopy
  participant TMA as TMA Engine
  participant Barrier as mbarrier

  Host->>Kernel: launch tile grid
  Kernel->>Copy: request bulk copy (gmem <-> smem)
  alt Contiguous 1D & in-bounds
    Note right of Copy #bfe0c9: detect 1D TMA (args[0] is shared ptr)
    Copy->>Barrier: ensure/insert barrier (arg idx = 2)
    Copy->>TMA: tma_load / tma_store (1D, sized, eviction_policy)
    TMA-->>Copy: transfer completed
  else Fallback
    Copy->>Copy: descriptor-based bulk-copy lowering (multi-D)
  end
  Kernel->>Kernel: compute tile (C = A + B)
  Kernel-->>Host: write back results / complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

I hop through tiles with tiny paws,
I count the bytes and mind the laws.
A 1D TMA clears the trail,
Barriers shifted, transfers sail.
I twitch my nose and cry, “All passed!” 🐇✨


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @tzj-fxz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for 1D Tensor Memory Accelerator (TMA) operations, specifically optimizing contiguous memory copies between global and shared memory. The changes enable more efficient data transfers for element-wise operations by leveraging 1D TMA capabilities, improving performance on compatible hardware.

Highlights

  • 1D TMA Copy Implementation: Added core logic in src/op/copy.cc to perform 1D TMA copies when both global and shared memory regions are contiguous, ensuring efficient bulk data transfers.
  • TMA Interface Enhancements: Updated the tma_load and tma_store interfaces to accommodate 1D TMA operations, including a new tma_store overload in src/tl_templates/cuda/copy_sm90.h for direct memory pointer usage.
  • Contiguity and Bounds Checks: Implemented robust checks within the LowerBulkCopy function to verify the contiguity of memory regions and prevent out-of-bounds access for 1D TMA operations.
  • TMA Barrier Injection Logic Updates: Modified the TMA barrier injection and warp specialization rewriters to correctly handle the new 1D TMA load call signatures, ensuring proper synchronization.
  • New Example and Test Case: Introduced a new Python example (examples/elementwise/example_elementwise_add_tma_1d.py) demonstrating an element-wise addition kernel utilizing 1D TMA, along with an updated test suite.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for 1D TMA copies, including contiguity checks, new tma_load and tma_store interfaces, and an example kernel. The implementation is functionally correct, but there are several areas with significant code duplication. I've pointed out opportunities to refactor repeated logic for contiguity checks, out-of-bounds checks, element counting, and 1D TMA load detection into helper functions. These changes would improve code maintainability and readability.

Comment on lines 806 to 894
{
  // Currently we check the shared tensor before layout remapping, so we skip
  // the layout definition check
  bool shared_is_contiguous = true;
  bool shared_not_full_dim_encounter = false;
  for (ssize_t i = shared_range.size() - 1; i >= 0; --i) {
    if (!shared_not_full_dim_encounter) {
      if (!analyzer->CanProve(shared_range[i]->extent ==
                                  shared_tensor_before_remap->shape[i] &&
                              shared_range[i]->min == 0)) {
        shared_not_full_dim_encounter = true;
      }
    } else {
      if (!analyzer->CanProve(shared_range[i]->extent == 1)) {
        shared_is_contiguous = false;
        break;
      }
    }
  }
  // Currently we do not check the empty stride of global tensor
  bool global_is_contiguous = true;
  bool global_not_full_dim_encounter = false;
  for (ssize_t i = global_range.size() - 1; i >= 0; --i) {
    if (!global_not_full_dim_encounter) {
      if (!analyzer->CanProve(global_range[i]->extent ==
                                  global_tensor->shape[i] &&
                              global_range[i]->min == 0)) {
        global_not_full_dim_encounter = true;
      }
    } else {
      if (!analyzer->CanProve(global_range[i]->extent == 1)) {
        global_is_contiguous = false;
        break;
      }
    }
  }
  // Ensure there is element match and no OOB
  PrimExpr shared_elements = 1;
  for (size_t i = 0; i < shared_range.size(); i++) {
    shared_elements *= shared_range[i]->extent;
  }
  PrimExpr global_elements = 1;
  for (size_t i = 0; i < global_range.size(); i++) {
    global_elements *= global_range[i]->extent;
  }
  bool element_match =
      analyzer->CanProveEqual(shared_elements, global_elements);
  bool no_oob = true;
  for (size_t i = 0; i < shared_range.size(); i++) {
    if (!analyzer->CanProve(shared_range[i]->min + shared_range[i]->extent <=
                            shared_tensor_before_remap->shape[i])) {
      no_oob = false;
      break;
    }
  }
  for (size_t i = 0; i < global_range.size(); i++) {
    if (!analyzer->CanProve(global_range[i]->min + global_range[i]->extent <=
                            global_tensor->shape[i])) {
      no_oob = false;
      break;
    }
  }
  // Add 1D TMA copy
  if (shared_is_contiguous && global_is_contiguous && element_match &&
      no_oob) {
    PrimExpr elements = analyzer->Simplify(shared_elements);
    PrimExpr shared_addr = shared_tensor_before_remap.access_ptr(
        is_load ? 2 : 1, DataType::Handle(), 1, offset, elements);
    PrimExpr global_addr = global_tensor.access_ptr(
        is_load ? 1 : 2, DataType::Handle(), 1, global_offset, elements);
    Stmt tma_copy;
    if (is_load) {
      // the zero is a placeholder for mbarrier id
      tma_copy =
          Evaluate(Call(DataType::Handle(), tma_load(),
                        {shared_addr, global_addr, 0,
                         elements * shared_tensor_before_remap->dtype.bytes(),
                         this->eviction_policy}));
    } else {
      tma_copy =
          Evaluate(Call(DataType::Handle(), tma_store(),
                        {global_addr, shared_addr,
                         elements * shared_tensor_before_remap->dtype.bytes(),
                         this->eviction_policy}));
    }
    tma_copy = IfThenElse(EQ(T.thread_var, T.thread_bounds->min), tma_copy);
    return tma_copy;
  }
}
Severity: medium

This new block for handling 1D TMA copy is quite large and contains duplicated logic. To improve readability and maintainability, consider refactoring parts of it into helper functions.
Specifically:

  • The contiguity checks for shared and global memory (lines 809-824 and 826-841) are identical and can be extracted into a function like bool IsContiguous(...).
  • The element count calculations (lines 843-850) are also duplicated and could be a function PrimExpr CalculateElements(...).
  • The out-of-bounds checks (lines 854-867) are another candidate for a helper function bool CheckOOB(...).
    Extracting these would make the LowerBulkCopy function much cleaner and easier to follow.
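The decomposition suggested above can be illustrated with a plain-Python model. The real pass in src/op/copy.cc operates on TVM PrimExpr through an arith::Analyzer, so the helper names here (is_contiguous, count_elements, in_bounds) are hypothetical and ranges are simplified to (min, extent) integer tuples:

```python
from math import prod

def is_contiguous(ranges, shape):
    """Model of the contiguity scan above: walking from the innermost
    dimension outward, every dim before the first partial dim must be
    full (min == 0 and extent == shape); every dim after it must have
    extent 1."""
    partial_seen = False
    for (mn, ext), dim in zip(reversed(ranges), reversed(shape)):
        if not partial_seen:
            if not (mn == 0 and ext == dim):
                partial_seen = True
        elif ext != 1:
            return False
    return True

def count_elements(ranges):
    # Product of extents, mirroring the shared/global element counts.
    return prod(ext for _, ext in ranges)

def in_bounds(ranges, shape):
    # Mirrors the OOB loops: min + extent must stay within the shape.
    return all(mn >= 0 and mn + ext <= dim
               for (mn, ext), dim in zip(ranges, shape))

# A full row slice of a (128, 256) row-major buffer is 1D-contiguous...
row = [(5, 1), (0, 256)]
assert is_contiguous(row, [128, 256])
# ...but a column slice is not (stride 256 between elements).
col = [(0, 128), (7, 1)]
assert not is_contiguous(col, [128, 256])
```

The 1D fast-path would then fire only when both regions pass is_contiguous, the element counts match, and in_bounds holds, exactly as the combined condition in the C++ block does.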

Comment on lines +458 to +465
auto arg0 = op->args[0].as<Call>();
auto is_1d_tma_load =
    arg0 && !arg0.value()->op.same_as(create_tma_descriptor());
if (is_1d_tma_load) {
  new_args.Set(2, barrier_id);
} else {
  new_args.Set(1, barrier_id);
}
Severity: medium

The logic to detect a 1D TMA load is duplicated across multiple places in this file (e.g., TmaExpectTxRewriter::VisitExpr_) and also in warp_specialized_rewriter.cc. Consider creating a helper function to encapsulate this check, for example:

// In a common header or utility file
inline bool Is1DTmaLoad(const CallNode* call) {
  if (!call->op.same_as(tma_load())) return false;
  auto arg0 = call->args[0].as<Call>();
  return arg0 && !arg0.value()->op.same_as(create_tma_descriptor());
}

This would improve code reuse and make the logic easier to maintain.

Comment on lines +325 to +336
auto arg0 = call->args[0].as<Call>();
// Check if this is a 1D TMA load
auto is_1d_tma_load =
    arg0 && !arg0.value()->op.same_as(create_tma_descriptor()) &&
    call->op.same_as(tma_load());
if (is_1d_tma_load) {
  call.CopyOnWrite()->args.Set(2, mbar);
} else {
  Call access_ptr = Downcast<Call>(call->args[2]);
  ICHECK(access_ptr->op.same_as(builtin::tvm_access_ptr()));
  call.CopyOnWrite()->args.Set(1, mbar);
}
Severity: medium

This logic for detecting a 1D TMA load is duplicated in src/transform/inject_tma_barrier.cc. It would be beneficial to refactor this into a shared helper function to avoid code repetition and improve maintainability. Also, the call->op.same_as(tma_load()) check is redundant here as it's already covered by the outer if condition.

@tzj-fxz tzj-fxz requested a review from LeiWang1999 August 26, 2025 11:05
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
examples/elementwise/test_example_elementwise.py (1)

10-12: Gate the 1D TMA example on CUDA/arch availability or reduce problem size.

Running example main() inside tests can be heavy and arch-dependent (SM90 path). If CI lacks Hopper, this may flake. Consider skipping when CUDA is unavailable or arch < 90, or reduce sizes to be fast.

I can add a small guard using torch.cuda.is_available() and sm>=90 detection, falling back to a tiny tensor or skipping.

src/transform/warp_specialized_rewriter.cc (1)

324-336: Harden 1D TMA barrier rewrite with bounds check; minor nit on redundant check.

  • Before setting args[2] for 1D TMA, assert the arg list is long enough to avoid out-of-bounds in malformed IR.
  • Minor: the extra call->op.same_as(tma_load()) in is_1d_tma_load is redundant in this branch.

Suggested diff:

-      auto is_1d_tma_load =
-          arg0 && !arg0.value()->op.same_as(create_tma_descriptor()) &&
-          call->op.same_as(tma_load());
+      auto is_1d_tma_load =
+          arg0 && !arg0.value()->op.same_as(create_tma_descriptor());
       if (is_1d_tma_load) {
-        call.CopyOnWrite()->args.Set(2, mbar);
+        ICHECK_GE(static_cast<int>(call->args.size()), 3)
+            << "1D tma_load expects at least 3 arguments";
+        call.CopyOnWrite()->args.Set(2, mbar);
       } else {
         Call access_ptr = Downcast<Call>(call->args[2]);
         ICHECK(access_ptr->op.same_as(builtin::tvm_access_ptr()));
         call.CopyOnWrite()->args.Set(1, mbar);
       }
src/transform/inject_tma_barrier.cc (1)

165-175: Small cleanup: redundant op check and add arg-size assertion for safety.

  • Redundant op->op.same_as(tma_load()) inside a tma_load branch.
  • Consider asserting arg count before Set(2, …) to keep diagnostics crisp if IR is malformed.

Proposed tweak:

-      bool is_1d_tma_load =
-          arg0 && !arg0.value()->op.same_as(create_tma_descriptor()) &&
-          op->op.same_as(tma_load());
+      bool is_1d_tma_load =
+          arg0 && !arg0.value()->op.same_as(create_tma_descriptor());
       visited_tma_load_ = true;
       Array<PrimExpr> new_args = op->args;
-      new_args.Set(is_1d_tma_load ? 2 : 1,
+      if (is_1d_tma_load) {
+        ICHECK_GE(static_cast<int>(new_args.size()), 3);
+      }
+      new_args.Set(is_1d_tma_load ? 2 : 1,
                    Call(DataType::Handle(), get_mbarrier(),
                         {IntImm(DataType::Int(32), 0)}));
src/op/copy.cc (2)

872-876: Potential mismatch when shared tensor is remapped; 1D path uses pre-remap buffer for access_ptr.

shared_addr is built from shared_tensor_before_remap, while elsewhere (e.g., Conv2DIm2ColOp) you honor T.buffer_remap for address materialization. If a layout remap exists (even “linear” layout), mixing pre/post-remap buffers can cause address divergence or inhibit downstream assumptions.

Two safe options:

  • Gate the 1D fast-path on “no remap” (i.e., !shared_layout.defined()), falling back otherwise; or
  • Materialize the address from the remapped buffer and prove the offset equivalence.

If choosing Option B, minimally switch to the remapped buffer for address materialization while keeping the contiguity checks on the pre-remap shape:

-      PrimExpr shared_addr = shared_tensor_before_remap.access_ptr(
-          is_load ? 2 : 1, DataType::Handle(), 1, offset, elements);
+      auto smem_buf =
+          T.buffer_remap.count(shared_tensor_before_remap)
+              ? T.buffer_remap[shared_tensor_before_remap]
+              : shared_tensor_before_remap;
+      PrimExpr shared_addr = smem_buf.access_ptr(
+          is_load ? 2 : 1, DataType::Handle(), 1, offset, elements);

Please confirm whether T.layout_map commonly annotates a linear/swizzled layout for shared in your 1D use cases. If so, we should either enforce the “no remap” guard or add a formal proof (or unit test) that the computed offset remains consistent across the remap.


1332-1337: ✅ tl.copy API Change Verified: All Call Sites Updated

  • Ran a project-wide search for Op.get("tl.copy"); the only hit is in tilelang/language/copy.py, where tir.call_intrin("handle", Op.get("tl.copy"), src, dst, coalesced_width, disable_tma, eviction_policy) correctly passes 5 inputs.
  • No other occurrences of 4-argument tl.copy calls were found in the C++ or Python codebase.

No breaking-callers remain.

Optional: Add a brief upgrade note under docs/changelog.md (or equivalent) to flag that tl.copy now requires a fifth eviction_policy argument, helping third-party codegen/transform authors adapt.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e0cf5fe and 31c66b8.

📒 Files selected for processing (6)
  • examples/elementwise/example_elementwise_add_tma_1d.py (1 hunks)
  • examples/elementwise/test_example_elementwise.py (1 hunks)
  • src/op/copy.cc (2 hunks)
  • src/tl_templates/cuda/copy_sm90.h (1 hunks)
  • src/transform/inject_tma_barrier.cc (3 hunks)
  • src/transform/warp_specialized_rewriter.cc (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
src/transform/warp_specialized_rewriter.cc (3)
src/transform/inject_tma_barrier.cc (2)
  • call (63-78)
  • call (63-63)
tilelang/language/builtin.py (2)
  • create_tma_descriptor (55-64)
  • tma_load (67-76)
tilelang/language/tir/op.py (1)
  • tvm_access_ptr (650-675)
examples/elementwise/test_example_elementwise.py (2)
examples/elementwise/example_elementwise_add_tma_1d.py (1)
  • main (35-51)
examples/elementwise/example_elementwise_add.py (1)
  • main (62-82)
src/op/copy.cc (3)
src/tl_templates/cuda/copy_sm90.h (6)
  • tma_load (18-27)
  • tma_load (43-60)
  • tma_load (64-82)
  • tma_load (86-104)
  • tma_load (107-126)
  • tma_load (130-150)
src/op/copy.h (1)
  • Copy (86-283)
tilelang/language/copy.py (1)
  • copy (84-152)
examples/elementwise/example_elementwise_add_tma_1d.py (5)
tilelang/env.py (1)
  • disable_cache (232-233)
tilelang/jit/__init__.py (1)
  • jit (232-305)
tilelang/language/allocate.py (2)
  • alloc_shared (20-35)
  • alloc_fragment (52-63)
tilelang/language/copy.py (1)
  • copy (84-152)
tilelang/language/parallel.py (1)
  • Parallel (8-28)
src/transform/inject_tma_barrier.cc (4)
src/transform/lower_hopper_intrin.cc (2)
  • call (102-132)
  • call (102-102)
src/tl_templates/cuda/copy_sm90.h (6)
  • tma_load (18-27)
  • tma_load (43-60)
  • tma_load (64-82)
  • tma_load (86-104)
  • tma_load (107-126)
  • tma_load (130-150)
tilelang/language/builtin.py (3)
  • tma_load (67-76)
  • create_tma_descriptor (55-64)
  • get_mbarrier (43-52)
tilelang/language/tir/op.py (1)
  • tvm_access_ptr (650-675)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-amd
  • GitHub Check: build-test-nvidia
🔇 Additional comments (4)
examples/elementwise/example_elementwise_add_tma_1d.py (2)

6-6: Check that tilelang.disable_cache() is available at module scope.

In the current API, disable_cache is shown as an Env method. If tilelang.disable_cache() is a convenience wrapper, great; otherwise this may raise at import. If you’ve seen this pattern elsewhere in the repo it’s fine; just calling it out.


13-32: Enforce 1D TMA by using a 1×N tile

To trigger the new 1D contiguous TMA lowering (instead of the default 2D path), change the shared/fragment allocations, the parallel loop, and the default block_M to 1. Here’s a minimal patch:

• File: examples/elementwise/example_elementwise_add_tma_1d.py
• Lines ~27–30: switch (block_M, block_N) → (1, block_N) in T.alloc_shared/T.alloc_fragment
• Line ~31: change T.Parallel(block_M, block_N) → T.Parallel(1, block_N) and index with [0, local_x]
• Line ~47: update default config from {"block_M": 128, "block_N": 128, "threads": 128} to {"block_M": 1, "block_N": 128, "threads": 128}

@@ examples/elementwise/example_elementwise_add_tma_1d.py:27
-            A_shared = T.alloc_shared((block_M, block_N), in_dtype)
-            B_shared = T.alloc_shared((block_M, block_N), in_dtype)
-            C_local  = T.alloc_fragment((block_M, block_N), out_dtype)
-            C_shared = T.alloc_shared((block_M, block_N), out_dtype)
+            A_shared = T.alloc_shared((1, block_N), in_dtype)
+            B_shared = T.alloc_shared((1, block_N), in_dtype)
+            C_local  = T.alloc_fragment((1, block_N), out_dtype)
+            C_shared = T.alloc_shared((1, block_N), out_dtype)

@@ examples/elementwise/example_elementwise_add_tma_1d.py:31
-            for (local_y, local_x) in T.Parallel(block_M, block_N):
-                C_local[local_y, local_x] = A_shared[local_y, local_x] + B_shared[local_y, local_x]
+            for (_, local_x) in T.Parallel(1, block_N):
+                C_local[0, local_x] = A_shared[0, local_x] + B_shared[0, local_x]

@@ examples/elementwise/example_elementwise_add_tma_1d.py:47
-    config = {"block_M": 128, "block_N": 128, "threads": 128}
+    config = {"block_M": 1,   "block_N": 128, "threads": 128}

Optionally, we can add a --use_1d flag in main() to toggle between 2D and 1D tiles, and insert an assertion after JIT that the generated IR contains cp.async.bulk.tensor.1d, guaranteeing the 1D TMA path is exercised. Let me know if you’d like that wiring and the IR check.

src/transform/inject_tma_barrier.cc (1)

458-465: LGTM: Barrier index swap for 1D loads matches the new calling convention.

The arg-index logic (2 for 1D; 1 for descriptor-based) is consistent with TmaExpectTxRewriter and the new 1D calling convention. Good coordination across passes.

src/op/copy.cc (1)

805-895: Tighten 1D TMA safety checks and confirm ABI assumptions

Please address the following in the 1D TMA‐raw‐bytes path (src/op/copy.cc:805–895):

• Bounds completeness

  • In the OOB loops, also assert min >= 0 for each dimension of both shared and global ranges to prevent negative offsets.
    • Zero‐length copies
  • If shared_elements (after Simplify) is provably zero, early‐return to avoid emitting a zero‐size cp.async.
    • ABI verification
  • Confirm the TIR builtins tma_load/tma_store actually accept five arguments in the raw-bytes form (smem, gmem, barrier_id, size, eviction_policy).
  • Ensure the mbarrier index for 1D loads is indeed at argument position 2 (0-based) as the inject_tma_barrier pass rewrites it.
    • Size‐operand width
  • Hopper’s cp.async.bulk expects a 32-bit size; wrap the byte count in cast(DataType::UInt(32), …).

Apply this sketch in the existing 1D block:

@@ src/op/copy.cc:820
-    for (size_t i = 0; i < shared_range.size(); i++) {
+    for (size_t i = 0; i < shared_range.size(); i++) {
       if (!analyzer->CanProve(shared_range[i]->min + shared_range[i]->extent <=
                               shared_tensor_before_remap->shape[i])) {
+      if (!analyzer->CanProve(shared_range[i]->min >= 0)) {
         no_oob = false;
         break;
       }
@@ src/op/copy.cc:830
-    for (size_t i = 0; i < global_range.size(); i++) {
+    for (size_t i = 0; i < global_range.size(); i++) {
       if (!analyzer->CanProve(global_range[i]->min + global_range[i]->extent <=
                               global_tensor->shape[i])) {
+      if (!analyzer->CanProve(global_range[i]->min >= 0)) {
         no_oob = false;
         break;
       }
@@ src/op/copy.cc:850
-      PrimExpr elements = analyzer->Simplify(shared_elements);
+      PrimExpr elements = analyzer->Simplify(shared_elements);
+      // Drop zero‐length copies
+      if (analyzer->CanProveEqual(elements, 0)) {
+        return Evaluate(0);
+      }
@@ src/op/copy.cc:860
-            {shared_addr, global_addr, 0,
-             elements * shared_tensor_before_remap->dtype.bytes(),
-             this->eviction_policy});
+            {shared_addr,
+             global_addr,
+             /* barrier_id */ 0,
+             cast(DataType::UInt(32),
+                  elements * shared_tensor_before_remap->dtype.bytes()),
+             this->eviction_policy});
@@ src/op/copy.cc:867
-            {global_addr, shared_addr,
-             elements * shared_tensor_before_remap->dtype.bytes(),
-             this->eviction_policy});
+            {global_addr,
+             shared_addr,
+             cast(DataType::UInt(32),
+                  elements * shared_tensor_before_remap->dtype.bytes()),
+             this->eviction_policy});

After making these changes, please verify:

  1. The TIR builtin signatures for tma_load/tma_store include the eviction_policy parameter and use a 32-bit size argument.
  2. The barrier index is indeed argument 2 in inject_tma_barrier.cc.
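The zero-length and 32-bit-size checks proposed above can be sketched as arithmetic in isolation; the real emission happens in TIR, so plan_1d_copy is a hypothetical helper name used only to illustrate the guard logic:

```python
def plan_1d_copy(num_elements, dtype_bytes):
    """Return the byte count to emit for a 1D bulk copy, or None when
    the copy should be dropped entirely."""
    nbytes = num_elements * dtype_bytes
    if nbytes == 0:
        # Drop zero-length copies instead of emitting a zero-size cp.async.
        return None
    # cp.async.bulk takes a 32-bit size operand, so the byte count must
    # fit in uint32 (the TIR fix casts with cast(DataType::UInt(32), ...)).
    assert nbytes < 2**32, "byte count exceeds the 32-bit size operand"
    return nbytes

assert plan_1d_copy(0, 4) is None
assert plan_1d_copy(1024, 2) == 2048
```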

Comment on lines +775 to +786
Array<PrimExpr> global_indices;
for (auto r : global_range) {
  global_indices.push_back(r->min);
}
std::vector<PrimExpr> global_strides;
PrimExpr global_stride = 1;
for (size_t i = 0; i < global_tensor->shape.size(); i++) {
  auto s = global_tensor->shape[global_tensor->shape.size() - i - 1];
  global_strides.insert(global_strides.begin(), global_stride);
  global_stride *= s;
}


🛠️ Refactor suggestion

Don’t assume compact layout for address math; use buffer strides when available.

The 1D path computes global_strides (and earlier, shared strides) from shapes, implicitly assuming a compact/row-major layout. If Buffer.strides is present (non-compact tensors, slices, or external buffers), the current offset calculations for global_offset/offset will be wrong.

  • Use global_tensor->strides when available; otherwise, fall back to the current shape-derived logic.
  • Consider mirroring this for shared_tensor_before_remap as well to future-proof shared slices with explicit strides.

Apply this diff within the changed hunk to honor global_tensor->strides:

-  std::vector<PrimExpr> global_strides;
-  PrimExpr global_stride = 1;
-  for (size_t i = 0; i < global_tensor->shape.size(); i++) {
-    auto s = global_tensor->shape[global_tensor->shape.size() - i - 1];
-    global_strides.insert(global_strides.begin(), global_stride);
-    global_stride *= s;
-  }
+  std::vector<PrimExpr> global_strides;
+  if (!global_tensor->strides.empty()) {
+    // Respect explicit (possibly non-compact) strides
+    global_strides = std::vector<PrimExpr>(global_tensor->strides.begin(),
+                                           global_tensor->strides.end());
+  } else {
+    // Derive element strides from shape (compact layout)
+    PrimExpr global_stride = 1;
+    for (size_t i = 0; i < global_tensor->shape.size(); i++) {
+      auto s = global_tensor->shape[global_tensor->shape.size() - i - 1];
+      global_strides.insert(global_strides.begin(), global_stride);
+      global_stride *= s;
+    }
+  }

Additionally (outside the changed hunk), consider the analogous update for the shared side:

// If shared_tensor_before_remap->strides not empty, form `strides` from it;
// otherwise keep the existing shape-derived logic.

If you’d like, I can draft the corresponding update for the shared strides block and thread through unit tests that cover non-compact cases (e.g., strided views and sliced buffers).

Also applies to: 794-799

🤖 Prompt for AI Agents
In src/op/copy.cc around lines 775-786 (and similarly 794-799), the code
currently derives global_strides from shapes assuming compact layout; change it
to check if global_tensor->strides is non-empty and, if so, build global_strides
from that vector (preserving order consistent with existing index arithmetic),
otherwise fall back to the current shape-derived logic; also mirror the same
pattern for the shared_tensor_before_remap strides block elsewhere so shared
buffers with explicit strides are honored; ensure the rest of the offset math
uses these stride vectors and that no assumptions of row-major compactness
remain.
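The stride fallback proposed in the diff above can be sketched in plain Python (hypothetical helper name; the real pass builds TIR `PrimExpr`s rather than integers):

```python
def element_strides(shape, strides=None):
    """Return per-dimension element strides.

    If explicit strides are given (possibly non-compact, e.g. a sliced
    view), honor them; otherwise derive compact row-major strides from
    the shape, innermost dimension first.
    """
    if strides:
        return list(strides)
    result = []
    stride = 1
    for extent in reversed(shape):
        result.insert(0, stride)
        stride *= extent
    return result

# Compact layout: strides follow from the shape alone.
print(element_strides([4, 8, 16]))        # [128, 16, 1]
# Sliced/strided view: shape-derived math would be wrong here.
print(element_strides([4, 8], [256, 2]))  # [256, 2]
```

The point of the review comment is the second case: once a buffer carries explicit strides, any offset computed from shape-derived strides addresses the wrong memory.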

Comment on lines +174 to +182
template <CacheHintSm90 cache_hint = CacheHintSm90::EVICT_NORMAL>
TL_DEVICE void tma_store(void *gmem_ptr, void *smem_ptr, uint32_t size) {
uint32_t smem_int_ptr = smem_ptr_to_uint(smem_ptr);
asm volatile("cp.async.bulk.global.shared::cta.bulk_group"
".L2::cache_hint [%0], [%1], %2, %3;"
:
: "l"(gmem_ptr), "r"(smem_int_ptr), "r"(size), "l"(cache_hint)
:);
}

💡 Verification agent

🧩 Analysis chain

Add memory clobber and confirm cp.async.bulk syntax (consider adding ::bytes).

  • The inline asm for the new size-based tma_store lacks a "memory" clobber. Given it performs a GMEM write, compilers may reorder surrounding memory ops. The descriptor-based tma_store overloads use "memory"; mirror that here.
  • Also, please double-check the PTX mnemonic. The non-descriptor tma_load variant explicitly includes "::bytes" in the opcode. For symmetry and to match Hopper docs, the non-descriptor store path typically uses the "::bytes" variant as well. If required, add it to prevent PTX parsing issues.

Suggested diff:

 TL_DEVICE void tma_store(void *gmem_ptr, void *smem_ptr, uint32_t size) {
   uint32_t smem_int_ptr = smem_ptr_to_uint(smem_ptr);
-  asm volatile("cp.async.bulk.global.shared::cta.bulk_group"
-               ".L2::cache_hint [%0], [%1], %2, %3;"
-               :
-               : "l"(gmem_ptr), "r"(smem_int_ptr), "r"(size), "l"(cache_hint)
-               :);
+  asm volatile("cp.async.bulk.global.shared::cta.bulk_group"
+               "::bytes.L2::cache_hint [%0], [%1], %2, %3;"
+               :
+               : "l"(gmem_ptr), "r"(smem_int_ptr), "r"(size), "l"(cache_hint)
+               : "memory");
 }

If your PTX toolchain already accepts the current mnemonic, at minimum add the "memory" clobber. I can help run a quick compile check to validate the opcode variant you intend to use.


🏁 Script executed:

#!/bin/bash
set -e

# Locate the copy_sm90.h file in the repository
file=$(fd copy_sm90.h | head -n1)
echo "Processing file: $file"

# 1. List all tma_store overloads in this file
echo -e "\n=== tma_store overloads ==="
rg -nP '\btemplate.*tma_store' -n "$file"

# 2. Show all inline asm usages of cp.async in this file
echo -e "\n=== cp.async usages ==="
rg -n 'cp.async' -n "$file"

# 3. Check for any descriptor-based tma_store overload (e.g., taking a TmaDescriptor)
echo -e "\n=== Descriptor-based tma_store (TmaDescriptor) ==="
rg -n 'TmaDescriptor' -n .

# 4. For each tma_store overload, inspect whether the asm has a "memory" clobber
echo -e "\n=== Check for memory clobber in tma_store asm ==="
rg -nP 'asm volatile\("cp.async.*\:.*\:.*"[^"]*\)\s*;\s*:' -n "$file" -A2 | grep -C2 clobber || echo "No explicit memory clobber found in above asm blocks"

Length of output: 331


🏁 Script executed:

#!/bin/bash
set -e

# File path
file="src/tl_templates/cuda/copy_sm90.h"
echo "1) tma_store occurrences in $file"
grep -n "tma_store" "$file" || echo "No tma_store found"

echo -e "\n2) TmaDescriptor usage in repository"
grep -R -n "TmaDescriptor" . || echo "No TmaDescriptor references"

echo -e "\n3) cp.async usages in $file with context"
grep -n -C2 "cp.async" "$file" || echo "No cp.async found"

echo -e "\n4) Snippet around lines 160-182 in $file"
sed -n '160,182p' "$file"

Length of output: 6357


🏁 Script executed:

#!/bin/bash
set -e

# Display the first descriptor-based tma_store overload to verify clobber and opcode syntax
echo "=== Descriptor-based tma_store (1d) snippet ==="
sed -n '185,200p' src/tl_templates/cuda/copy_sm90.h

Length of output: 938


Add missing “memory” clobber (and optionally ::bytes) to the size‐based tma_store

The size-based store at src/tl_templates/cuda/copy_sm90.h (around lines 175–182) currently omits a clobber list and does not annotate the opcode with ::bytes. All descriptor-based tma_store overloads include a "memory" clobber to prevent the compiler from reordering surrounding memory operations. At minimum, mirror their clobber list here. If you also want symmetry with the load path and Hopper documentation, insert ::bytes between bulk_group and .L2.

Suggested diff:

 template <CacheHintSm90 cache_hint = CacheHintSm90::EVICT_NORMAL>
 TL_DEVICE void tma_store(void *gmem_ptr, void *smem_ptr, uint32_t size) {
   uint32_t smem_int_ptr = smem_ptr_to_uint(smem_ptr);
-  asm volatile("cp.async.bulk.global.shared::cta.bulk_group"
-               ".L2::cache_hint [%0], [%1], %2, %3;"
-               :
-               : "l"(gmem_ptr), "r"(smem_int_ptr), "r"(size), "l"(cache_hint)
-               :);
+  asm volatile("cp.async.bulk.global.shared::cta.bulk_group"
+               "::bytes.L2::cache_hint [%0], [%1], %2, %3;"
+               :
+               : "l"(gmem_ptr), "r"(smem_int_ptr), "r"(size), "l"(cache_hint)
+               : "memory");
 }

"memory" clobber: required to ensure correct ordering of adjacent loads/stores.
::bytes: optional if your PTX assembler already accepts the unannotated mnemonic; recommended for consistency with the load path.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
template <CacheHintSm90 cache_hint = CacheHintSm90::EVICT_NORMAL>
TL_DEVICE void tma_store(void *gmem_ptr, void *smem_ptr, uint32_t size) {
uint32_t smem_int_ptr = smem_ptr_to_uint(smem_ptr);
asm volatile("cp.async.bulk.global.shared::cta.bulk_group"
".L2::cache_hint [%0], [%1], %2, %3;"
:
: "l"(gmem_ptr), "r"(smem_int_ptr), "r"(size), "l"(cache_hint)
:);
}
template <CacheHintSm90 cache_hint = CacheHintSm90::EVICT_NORMAL>
TL_DEVICE void tma_store(void *gmem_ptr, void *smem_ptr, uint32_t size) {
uint32_t smem_int_ptr = smem_ptr_to_uint(smem_ptr);
asm volatile("cp.async.bulk.global.shared::cta.bulk_group"
"::bytes.L2::cache_hint [%0], [%1], %2, %3;"
:
: "l"(gmem_ptr), "r"(smem_int_ptr), "r"(size), "l"(cache_hint)
: "memory");
}
🤖 Prompt for AI Agents
In src/tl_templates/cuda/copy_sm90.h around lines 174–182, the size-based
tma_store inline asm omits the required "memory" clobber (and lacks the optional
::bytes opcode annotation), which can allow the compiler to reorder surrounding
memory operations; update the asm template to include "memory" in the clobber
list (matching the descriptor-based overloads) and, for symmetry with the load
path and Hopper docs, insert ::bytes between bulk_group and .L2 in the opcode
string if your assembler accepts it.

Comment on lines +65 to +75
auto arg0 = call->args[0].as<Call>();
if (call->op.same_as(tma_load()) && arg0 &&
!arg0.value()->op.same_as(create_tma_descriptor())) {
// 1D TMA load has tvm_access_ptr of shared tensor in its args[0]
bulk_copy_bytes = call->args[3] * loop_extents;
} else {
Call access_ptr = Downcast<Call>(call->args[2]);
ICHECK(access_ptr->op.same_as(builtin::tvm_access_ptr()));
int type_bytes = access_ptr->args[0]->dtype.bytes();
bulk_copy_bytes += access_ptr->args[3] * loop_extents * type_bytes;
}

⚠️ Potential issue

Bug: bulk_copy_bytes overwritten for 1D loads; should accumulate (+=), not assign (=).

For multiple 1D tma_load ops inside the same block/then-case, using '=' drops previously accounted bytes, leading to under-issuing mbarrier_expect_tx. Match the 2D path’s accumulation.

Apply this fix:

-        // 1D TMA load has tvm_access_ptr of shared tensor in its args[0]
-        bulk_copy_bytes = call->args[3] * loop_extents;
+        // 1D TMA load has tvm_access_ptr of shared tensor in its args[0]
+        bulk_copy_bytes += call->args[3] * loop_extents;
📝 Committable suggestion


Suggested change
auto arg0 = call->args[0].as<Call>();
if (call->op.same_as(tma_load()) && arg0 &&
!arg0.value()->op.same_as(create_tma_descriptor())) {
// 1D TMA load has tvm_access_ptr of shared tensor in its args[0]
bulk_copy_bytes = call->args[3] * loop_extents;
} else {
Call access_ptr = Downcast<Call>(call->args[2]);
ICHECK(access_ptr->op.same_as(builtin::tvm_access_ptr()));
int type_bytes = access_ptr->args[0]->dtype.bytes();
bulk_copy_bytes += access_ptr->args[3] * loop_extents * type_bytes;
}
auto arg0 = call->args[0].as<Call>();
if (call->op.same_as(tma_load()) && arg0 &&
!arg0.value()->op.same_as(create_tma_descriptor())) {
// 1D TMA load has tvm_access_ptr of shared tensor in its args[0]
bulk_copy_bytes += call->args[3] * loop_extents;
} else {
Call access_ptr = Downcast<Call>(call->args[2]);
ICHECK(access_ptr->op.same_as(builtin::tvm_access_ptr()));
int type_bytes = access_ptr->args[0]->dtype.bytes();
bulk_copy_bytes += access_ptr->args[3] * loop_extents * type_bytes;
}
🤖 Prompt for AI Agents
In src/transform/inject_tma_barrier.cc around lines 65 to 75, the 1D tma_load
branch uses assignment to set bulk_copy_bytes which overwrites prior counts;
change the logic to accumulate into bulk_copy_bytes (use += semantics) like the
2D path so multiple 1D loads add their byte counts instead of replacing them,
ensuring you multiply by loop_extents as before and keep types/expressions
identical except using addition.
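The effect of the `=`/`+=` mix-up is easy to see with a toy accumulator (a hypothetical Python model of the byte accounting, not the TIR expressions the pass actually emits):

```python
def expected_tx_bytes(copies, buggy=False):
    """Sum the bytes each TMA copy in a block contributes.

    `copies` is a list of (elements, loop_extent, type_bytes) tuples.
    With buggy=True the 1D branch assigns instead of accumulating,
    dropping earlier copies, which mirrors the reported bug.
    """
    total = 0
    for elements, loop_extent, type_bytes in copies:
        contribution = elements * loop_extent * type_bytes
        if buggy:
            total = contribution   # '=' overwrites prior copies
        else:
            total += contribution  # '+=' matches the 2D path
    return total

copies = [(1024, 1, 4), (2048, 1, 4)]
print(expected_tx_bytes(copies))              # 12288
print(expected_tx_bytes(copies, buggy=True))  # 8192, under-counts
```

An under-counted total means `mbarrier_expect_tx` is issued for fewer bytes than actually arrive, so the barrier can trip early.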

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
src/op/copy.cc (2)

805-905: Reduce duplication with small helpers (contiguity, element count, OOB)

The shared/global contiguity checks, element count, and OOB logic are duplicated in this block. Extract to local statics or lambda helpers to simplify and reduce risk of drift.


775-786: Use actual buffer strides when present; current global_offset math is wrong for non-compact tensors

Deriving strides from shape breaks for sliced/strided buffers and can copy wrong memory.

Apply:

   Array<PrimExpr> global_indices;
   for (auto r : global_range) {
     global_indices.push_back(r->min);
   }
-  std::vector<PrimExpr> global_strides;
-  PrimExpr global_stride = 1;
-  for (size_t i = 0; i < global_tensor->shape.size(); i++) {
-    auto s = global_tensor->shape[global_tensor->shape.size() - i - 1];
-    global_strides.insert(global_strides.begin(), global_stride);
-    global_stride *= s;
-  }
+  std::vector<PrimExpr> global_strides;
+  if (!global_tensor->strides.empty()) {
+    global_strides.assign(global_tensor->strides.begin(),
+                          global_tensor->strides.end());
+  } else {
+    PrimExpr global_stride = 1;
+    for (size_t i = 0; i < global_tensor->shape.size(); i++) {
+      auto s = global_tensor->shape[global_tensor->shape.size() - i - 1];
+      global_strides.insert(global_strides.begin(), global_stride);
+      global_stride *= s;
+    }
+  }

Additionally (outside this hunk), mirror the same logic for the shared side when building strides so shared views with explicit strides are honored.

🧹 Nitpick comments (5)
src/op/op.h (1)

52-53: LowerArgs: clarify and de-dup GEMM buffers

  • Prefer set semantics (or de-dupe at the producer) to avoid duplicates propagating through T.buffer_var_gemm.
  • Add a short comment documenting intended use (gating swizzle/1D-TMA decisions).
src/transform/lower_tile_op.cc (2)

15-16: Include seems unnecessary

#include "../op/gemm.h" isn’t used directly in this TU. Consider removing to keep dependencies lean.


222-227: OK to collect early; consider moving de-dupe here if not done in collector

If you prefer to keep the collector simple, de-dup substituter.buffer_var_gemm_ after assignment.

src/op/copy.cc (2)

848-873: Bounds check misses lower bound

You check upper bounds only; add min >= 0 to avoid negative indexing.

Apply:

-    for (size_t i = 0; i < shared_range.size(); i++) {
-      if (!analyzer->CanProve(shared_range[i]->min + shared_range[i]->extent <=
-                              shared_tensor_before_remap->shape[i])) {
+    for (size_t i = 0; i < shared_range.size(); i++) {
+      if (!analyzer->CanProve(shared_range[i]->min >= 0) ||
+          !analyzer->CanProve(shared_range[i]->min + shared_range[i]->extent <=
+                              shared_tensor_before_remap->shape[i])) {
         no_oob = false;
         break;
       }
     }
-    for (size_t i = 0; i < global_range.size(); i++) {
-      if (!analyzer->CanProve(global_range[i]->min + global_range[i]->extent <=
-                              global_tensor->shape[i])) {
+    for (size_t i = 0; i < global_range.size(); i++) {
+      if (!analyzer->CanProve(global_range[i]->min >= 0) ||
+          !analyzer->CanProve(global_range[i]->min + global_range[i]->extent <=
+                              global_tensor->shape[i])) {
         no_oob = false;
         break;
       }
     }

883-901: Size argument width

The sm90 tma_{load,store} 1D helpers take a 32-bit size. Cast explicitly to avoid width mismatches when elements is 64-bit.

Apply:

-                           elements * shared_tensor_before_remap->dtype.bytes(),
+                           cast(DataType::UInt(32),
+                                elements * shared_tensor_before_remap->dtype.bytes()),

…and likewise for the store call.
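The symmetric bounds check suggested in the 848-873 nitpick above amounts to the following, with plain integers standing in for the `PrimExpr`s the analyzer proves:

```python
def ranges_in_bounds(ranges, shape):
    """Check every (min, extent) range against [0, dim) of `shape`.

    The original code only proved min + extent <= dim; the suggested
    fix also requires min >= 0 so negative starts are rejected.
    """
    for (lo, extent), dim in zip(ranges, shape):
        if lo < 0 or lo + extent > dim:
            return False
    return True

print(ranges_in_bounds([(0, 64), (32, 32)], [64, 64]))  # True
print(ranges_in_bounds([(-1, 8)], [64]))                # False: negative min
print(ranges_in_bounds([(60, 8)], [64]))                # False: overruns dim
```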

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 31c66b8 and 4eff7bd.

📒 Files selected for processing (3)
  • src/op/copy.cc (2 hunks)
  • src/op/op.h (1 hunks)
  • src/transform/lower_tile_op.cc (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/transform/lower_tile_op.cc (2)
src/transform/warp_specialized_rewriter.cc (14)
  • op (50-55)
  • op (50-50)
  • op (85-95)
  • op (85-85)
  • op (97-105)
  • op (97-97)
  • op (107-112)
  • op (107-107)
  • op (114-122)
  • op (114-114)
  • call (43-48)
  • call (43-43)
  • f (1170-1189)
  • f (1170-1171)
tilelang/language/tir/op.py (1)
  • tvm_access_ptr (650-675)
src/op/copy.cc (2)
src/op/copy.h (1)
  • Copy (86-283)
tilelang/language/copy.py (1)
  • copy (84-152)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-test-amd
🔇 Additional comments (3)
src/transform/lower_tile_op.cc (2)

502-506: LowerArgs propagation looks correct

New field threaded through Lower() call consistently.


539-540: Private member naming consistency

Align with existing members: buffer_var_gemm_ already follows the trailing-underscore style; no change needed.

src/op/copy.cc (1)

1343-1348: API alignment: set_num_inputs(5) matches Python wrapper

Registration updated to 5 inputs and matches tilelang/language/copy.py (src, dst, coalesced_width, disable_tma, eviction_policy).

Comment on lines +805 to +816
// Add 1D TMA copy when the global and shared memory is contiguous
{
// Check if shared_tensor->name is present in T.buffer_var_gemm
// (Array<PrimExpr>) to avoid use 1D TMA copy for swizzled layout
bool shared_is_contiguous = true;
for (const auto &v : T.buffer_var_gemm) {
if (v->name_hint == shared_tensor->name) {
shared_is_contiguous = false;
break;
}
}
bool shared_not_full_dim_encounter = false;

⚠️ Potential issue

Avoid name-based GEMM detection; use Var identity and skip 1D path when shared has a layout map

  • Comparing v->name_hint to shared_tensor->name can miss due to buffer remap; compare the Var to shared_tensor_before_remap->data.
  • If T.layout_map contains the shared buffer, 1D TMA path will produce a tvm_access_ptr that later remaps only the Var (not the offset), yielding wrong addressing.

Apply:

-  // Add 1D TMA copy when the global and shared memory is contiguous
-  {
+  // Add 1D TMA copy when the global and shared memory is contiguous
+  // Only when shared has no remapped/swizzled layout.
+  if (!T.layout_map.count(shared_tensor_before_remap)) {
     // Check if shared_tensor->name is present in T.buffer_var_gemm
     // (Array<PrimExpr>) to avoid use 1D TMA copy for swizzled layout
-    bool shared_is_contiguous = true;
-    for (const auto &v : T.buffer_var_gemm) {
-      if (v->name_hint == shared_tensor->name) {
-        shared_is_contiguous = false;
-        break;
-      }
-    }
+    bool shared_is_contiguous = true;
+    for (const auto &v : T.buffer_var_gemm) {
+      if (v.same_as(shared_tensor_before_remap->data)) {
+        shared_is_contiguous = false;
+        break;
+      }
+    }
📝 Committable suggestion


Suggested change
// Add 1D TMA copy when the global and shared memory is contiguous
{
// Check if shared_tensor->name is present in T.buffer_var_gemm
// (Array<PrimExpr>) to avoid use 1D TMA copy for swizzled layout
bool shared_is_contiguous = true;
for (const auto &v : T.buffer_var_gemm) {
if (v->name_hint == shared_tensor->name) {
shared_is_contiguous = false;
break;
}
}
bool shared_not_full_dim_encounter = false;
// Add 1D TMA copy when the global and shared memory is contiguous
// Only when shared has no remapped/swizzled layout.
if (!T.layout_map.count(shared_tensor_before_remap)) {
// Check if shared_tensor->name is present in T.buffer_var_gemm
// (Array<PrimExpr>) to avoid use 1D TMA copy for swizzled layout
bool shared_is_contiguous = true;
for (const auto &v : T.buffer_var_gemm) {
if (v.same_as(shared_tensor_before_remap->data)) {
shared_is_contiguous = false;
break;
}
}
bool shared_not_full_dim_encounter = false;
}

Comment on lines +832 to +847
bool global_is_contiguous = !global_tensor->strides.empty();
bool global_not_full_dim_encounter = false;
for (ssize_t i = global_range.size() - 1; i >= 0; --i) {
if (!global_not_full_dim_encounter) {
if (!analyzer->CanProve(global_range[i]->extent ==
global_tensor->shape[i] &&
global_range[i]->min == 0)) {
global_not_full_dim_encounter = true;
}
} else {
if (!analyzer->CanProve(global_range[i]->extent == 1)) {
global_is_contiguous = false;
break;
}
}
}

⚠️ Potential issue

global_is_contiguous initialization bug

bool global_is_contiguous = !global_tensor->strides.empty(); incorrectly marks arbitrary strided tensors as contiguous. Initialize to true and let the checks invalidate it.

Apply:

-    // Currently we check the empty stride of global tensor
-    bool global_is_contiguous = !global_tensor->strides.empty();
+    // Start optimistic; invalidate below if pattern breaks contiguity
+    bool global_is_contiguous = true;

Optionally, add an extra guard to verify explicit strides match compact layout when global_tensor->strides is provided.

📝 Committable suggestion


Suggested change
bool global_is_contiguous = !global_tensor->strides.empty();
bool global_not_full_dim_encounter = false;
for (ssize_t i = global_range.size() - 1; i >= 0; --i) {
if (!global_not_full_dim_encounter) {
if (!analyzer->CanProve(global_range[i]->extent ==
global_tensor->shape[i] &&
global_range[i]->min == 0)) {
global_not_full_dim_encounter = true;
}
} else {
if (!analyzer->CanProve(global_range[i]->extent == 1)) {
global_is_contiguous = false;
break;
}
}
}
// Start optimistic; invalidate below if pattern breaks contiguity
bool global_is_contiguous = true;
bool global_not_full_dim_encounter = false;
for (ssize_t i = global_range.size() - 1; i >= 0; --i) {
if (!global_not_full_dim_encounter) {
if (!analyzer->CanProve(global_range[i]->extent ==
global_tensor->shape[i] &&
global_range[i]->min == 0)) {
global_not_full_dim_encounter = true;
}
} else {
if (!analyzer->CanProve(global_range[i]->extent == 1)) {
global_is_contiguous = false;
break;
}
}
}
🤖 Prompt for AI Agents
In src/op/copy.cc around lines 832 to 847, the variable global_is_contiguous is
incorrectly initialized from !global_tensor->strides.empty(), which treats
tensors with arbitrary strides as contiguous; change its initialization to true
and let the subsequent loop set it false when non-contiguous conditions are
found. Also (optionally) add a guard: when global_tensor->strides is non-empty,
verify the explicit strides correspond to a compact (row-major) layout for the
given shape and set global_is_contiguous=false if they don’t match.
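The intended scan (start optimistic, invalidate on the first violation) can be sketched as follows; the real code proves these equalities with the arithmetic analyzer rather than comparing integers:

```python
def is_contiguous_slice(ranges, shape):
    """Walk dimensions innermost-first. Dims before the first partial
    dim must cover their full extent starting at 0; every dim after it
    must have extent 1. Start from True and only invalidate; seeding
    the flag from whether strides are non-empty was the reported bug.
    """
    contiguous = True
    partial_seen = False
    for (lo, extent), dim in zip(reversed(ranges), reversed(shape)):
        if not partial_seen:
            if not (lo == 0 and extent == dim):
                partial_seen = True
        elif extent != 1:
            contiguous = False
            break
    return contiguous

# One row of a [4, 8, 16] tensor: full inner dims, extent-1 outer dim.
print(is_contiguous_slice([(2, 1), (0, 8), (0, 16)], [4, 8, 16]))  # True
# Partial innermost dim followed by a multi-element dim: not contiguous.
print(is_contiguous_slice([(0, 2), (0, 8), (4, 8)], [4, 8, 16]))   # False
```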

Comment on lines 75 to 121
class BufferGemmCollector : public StmtExprVisitor {
public:
BufferGemmCollector() { Clear(); }

void Clear() { buffer_var_gemm_.clear(); }

void Collect(Stmt stmt) { VisitStmt(stmt); }

Array<Var> GetBufferVarGemm() { return buffer_var_gemm_; }

private:
void VisitStmt_(const EvaluateNode *op) {
auto call = Downcast<Call>(op->value);
if (call->op.same_as(Op::Get("tl.gemm"))) {
auto srcA_buffer_access_ptr = Downcast<Call>(call->args[0]);
ICHECK(srcA_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto srcA_buffer_var = Downcast<Var>(srcA_buffer_access_ptr->args[1]);
auto srcB_buffer_access_ptr = Downcast<Call>(call->args[1]);
ICHECK(srcB_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto srcB_buffer_var = Downcast<Var>(srcB_buffer_access_ptr->args[1]);
auto dst_buffer_access_ptr = Downcast<Call>(call->args[2]);
ICHECK(dst_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto dst_buffer_var = Downcast<Var>(dst_buffer_access_ptr->args[1]);
buffer_var_gemm_.push_back(srcA_buffer_var);
buffer_var_gemm_.push_back(srcB_buffer_var);
buffer_var_gemm_.push_back(dst_buffer_var);
// LOG(INFO) << "buffer_var_gemm_: " << buffer_var_gemm_;
} else if (call->op.same_as(Op::Get("tl.gemm_sp"))) {
auto srcA_buffer_access_ptr = Downcast<Call>(call->args[0]);
ICHECK(srcA_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto srcA_buffer_var = Downcast<Var>(srcA_buffer_access_ptr->args[1]);
auto srcB_buffer_access_ptr = Downcast<Call>(call->args[1]);
ICHECK(srcB_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto srcB_buffer_var = Downcast<Var>(srcB_buffer_access_ptr->args[1]);
auto dst_buffer_access_ptr = Downcast<Call>(call->args[2]);
ICHECK(dst_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto dst_buffer_var = Downcast<Var>(dst_buffer_access_ptr->args[1]);
buffer_var_gemm_.push_back(srcA_buffer_var);
buffer_var_gemm_.push_back(srcB_buffer_var);
buffer_var_gemm_.push_back(dst_buffer_var);
// LOG(INFO) << "buffer_var_gemm_: " << buffer_var_gemm_;
}
}

Array<Var> buffer_var_gemm_;
};


🛠️ Refactor suggestion

Harden BufferGemmCollector against non-Call evaluates; avoid crashes and duplicates

  • Downcasting Evaluate.value to Call without a guard can assert at runtime.
  • Collect may see repeated GEMM calls; de-dup vars to keep T.buffer_var_gemm small.

Apply:

 class BufferGemmCollector : public StmtExprVisitor {
 public:
   BufferGemmCollector() { Clear(); }

   void Clear() { buffer_var_gemm_.clear(); }
   void Collect(Stmt stmt) { VisitStmt(stmt); }
   Array<Var> GetBufferVarGemm() { return buffer_var_gemm_; }

 private:
-  void VisitStmt_(const EvaluateNode *op) {
-    auto call = Downcast<Call>(op->value);
+  void VisitStmt_(const EvaluateNode *op) {
+    const CallNode* call_node = op->value.as<CallNode>();
+    if (!call_node) return;
+    Call call = GetRef<Call>(call_node);
     if (call->op.same_as(Op::Get("tl.gemm"))) {
       auto srcA_buffer_access_ptr = Downcast<Call>(call->args[0]);
       ICHECK(srcA_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto srcA_buffer_var = Downcast<Var>(srcA_buffer_access_ptr->args[1]);
       auto srcB_buffer_access_ptr = Downcast<Call>(call->args[1]);
       ICHECK(srcB_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto srcB_buffer_var = Downcast<Var>(srcB_buffer_access_ptr->args[1]);
       auto dst_buffer_access_ptr = Downcast<Call>(call->args[2]);
       ICHECK(dst_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto dst_buffer_var = Downcast<Var>(dst_buffer_access_ptr->args[1]);
-      buffer_var_gemm_.push_back(srcA_buffer_var);
-      buffer_var_gemm_.push_back(srcB_buffer_var);
-      buffer_var_gemm_.push_back(dst_buffer_var);
+      PushUnique(srcA_buffer_var);
+      PushUnique(srcB_buffer_var);
+      PushUnique(dst_buffer_var);
     } else if (call->op.same_as(Op::Get("tl.gemm_sp"))) {
       auto srcA_buffer_access_ptr = Downcast<Call>(call->args[0]);
       ICHECK(srcA_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto srcA_buffer_var = Downcast<Var>(srcA_buffer_access_ptr->args[1]);
       auto srcB_buffer_access_ptr = Downcast<Call>(call->args[1]);
       ICHECK(srcB_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto srcB_buffer_var = Downcast<Var>(srcB_buffer_access_ptr->args[1]);
       auto dst_buffer_access_ptr = Downcast<Call>(call->args[2]);
       ICHECK(dst_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto dst_buffer_var = Downcast<Var>(dst_buffer_access_ptr->args[1]);
-      buffer_var_gemm_.push_back(srcA_buffer_var);
-      buffer_var_gemm_.push_back(srcB_buffer_var);
-      buffer_var_gemm_.push_back(dst_buffer_var);
+      PushUnique(srcA_buffer_var);
+      PushUnique(srcB_buffer_var);
+      PushUnique(dst_buffer_var);
     }
   }
 
-  Array<Var> buffer_var_gemm_;
+  void PushUnique(const Var& v) {
+    for (const auto& u : buffer_var_gemm_) if (u.same_as(v)) return;
+    buffer_var_gemm_.push_back(v);
+  }
+  Array<Var> buffer_var_gemm_;
 };
📝 Committable suggestion


Suggested change
class BufferGemmCollector : public StmtExprVisitor {
public:
BufferGemmCollector() { Clear(); }
void Clear() { buffer_var_gemm_.clear(); }
void Collect(Stmt stmt) { VisitStmt(stmt); }
Array<Var> GetBufferVarGemm() { return buffer_var_gemm_; }
private:
void VisitStmt_(const EvaluateNode *op) {
auto call = Downcast<Call>(op->value);
if (call->op.same_as(Op::Get("tl.gemm"))) {
auto srcA_buffer_access_ptr = Downcast<Call>(call->args[0]);
ICHECK(srcA_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto srcA_buffer_var = Downcast<Var>(srcA_buffer_access_ptr->args[1]);
auto srcB_buffer_access_ptr = Downcast<Call>(call->args[1]);
ICHECK(srcB_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto srcB_buffer_var = Downcast<Var>(srcB_buffer_access_ptr->args[1]);
auto dst_buffer_access_ptr = Downcast<Call>(call->args[2]);
ICHECK(dst_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto dst_buffer_var = Downcast<Var>(dst_buffer_access_ptr->args[1]);
buffer_var_gemm_.push_back(srcA_buffer_var);
buffer_var_gemm_.push_back(srcB_buffer_var);
buffer_var_gemm_.push_back(dst_buffer_var);
// LOG(INFO) << "buffer_var_gemm_: " << buffer_var_gemm_;
} else if (call->op.same_as(Op::Get("tl.gemm_sp"))) {
auto srcA_buffer_access_ptr = Downcast<Call>(call->args[0]);
ICHECK(srcA_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto srcA_buffer_var = Downcast<Var>(srcA_buffer_access_ptr->args[1]);
auto srcB_buffer_access_ptr = Downcast<Call>(call->args[1]);
ICHECK(srcB_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto srcB_buffer_var = Downcast<Var>(srcB_buffer_access_ptr->args[1]);
auto dst_buffer_access_ptr = Downcast<Call>(call->args[2]);
ICHECK(dst_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
auto dst_buffer_var = Downcast<Var>(dst_buffer_access_ptr->args[1]);
buffer_var_gemm_.push_back(srcA_buffer_var);
buffer_var_gemm_.push_back(srcB_buffer_var);
buffer_var_gemm_.push_back(dst_buffer_var);
// LOG(INFO) << "buffer_var_gemm_: " << buffer_var_gemm_;
}
}
Array<Var> buffer_var_gemm_;
};
class BufferGemmCollector : public StmtExprVisitor {
public:
  BufferGemmCollector() { Clear(); }

  void Clear() { buffer_var_gemm_.clear(); }

  void Collect(Stmt stmt) { VisitStmt(stmt); }

  Array<Var> GetBufferVarGemm() { return buffer_var_gemm_; }

private:
  void VisitStmt_(const EvaluateNode *op) {
    const CallNode *call_node = op->value.as<CallNode>();
    if (!call_node) return;
    Call call = GetRef<Call>(call_node);
    if (call->op.same_as(Op::Get("tl.gemm"))) {
      auto srcA_buffer_access_ptr = Downcast<Call>(call->args[0]);
      ICHECK(srcA_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
      auto srcA_buffer_var = Downcast<Var>(srcA_buffer_access_ptr->args[1]);
      auto srcB_buffer_access_ptr = Downcast<Call>(call->args[1]);
      ICHECK(srcB_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
      auto srcB_buffer_var = Downcast<Var>(srcB_buffer_access_ptr->args[1]);
      auto dst_buffer_access_ptr = Downcast<Call>(call->args[2]);
      ICHECK(dst_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
      auto dst_buffer_var = Downcast<Var>(dst_buffer_access_ptr->args[1]);
      PushUnique(srcA_buffer_var);
      PushUnique(srcB_buffer_var);
      PushUnique(dst_buffer_var);
      // LOG(INFO) << "buffer_var_gemm_: " << buffer_var_gemm_;
    } else if (call->op.same_as(Op::Get("tl.gemm_sp"))) {
      auto srcA_buffer_access_ptr = Downcast<Call>(call->args[0]);
      ICHECK(srcA_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
      auto srcA_buffer_var = Downcast<Var>(srcA_buffer_access_ptr->args[1]);
      auto srcB_buffer_access_ptr = Downcast<Call>(call->args[1]);
      ICHECK(srcB_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
      auto srcB_buffer_var = Downcast<Var>(srcB_buffer_access_ptr->args[1]);
      auto dst_buffer_access_ptr = Downcast<Call>(call->args[2]);
      ICHECK(dst_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
      auto dst_buffer_var = Downcast<Var>(dst_buffer_access_ptr->args[1]);
      PushUnique(srcA_buffer_var);
      PushUnique(srcB_buffer_var);
      PushUnique(dst_buffer_var);
      // LOG(INFO) << "buffer_var_gemm_: " << buffer_var_gemm_;
    }
  }

  void PushUnique(const Var &v) {
    for (const auto &u : buffer_var_gemm_)
      if (u.same_as(v)) return;
    buffer_var_gemm_.push_back(v);
  }

  Array<Var> buffer_var_gemm_;
};
🤖 Prompt for AI Agents
In src/transform/lower_tile_op.cc around lines 75-121, the visitor
unconditionally Downcasts EvaluateNode->value to Call and unconditionally
Downcasts call->args to Call/Var, which can assert for non-Call evaluates or
malformed args, and it always pushes vars allowing duplicates; update VisitStmt_
to first check that op->value is a Call (use op->value.as<CallNode>() or
IsInstance<CallNode>() and return early if not), then validate each expected
argument is a Call and that access_ptr->op matches tvm_access_ptr() before
extracting the Var (safely handling missing args), and when adding srcA/srcB/dst
vars insert them only if not already present (e.g., check buffer_var_gemm_ for
existence or maintain a small set of Var identifiers) so repeated GEMM calls
don’t create duplicates; keep rest of logic and logging unchanged.

tzj-fxz and others added 3 commits August 27, 2025 11:01
…-ai#758)

- Improved formatting and clarity of the GDN kernel implementation description.
- Updated requirement section to list dependencies in a clearer format.
- Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/transform/lower_tile_op.cc (1)

75-118: Harden BufferGemmCollector: guard Downcast and de-dup buffer vars to avoid crashes and bloat.

Unconditional Downcast on Evaluate.value and args can assert for non-Call evaluates or malformed args; repeated GEMMs will duplicate entries. Add a CallNode guard and push-unique.

Apply:

 class BufferGemmCollector : public StmtExprVisitor {
 public:
   BufferGemmCollector() { Clear(); }

   void Clear() { buffer_var_gemm_.clear(); }

   void Collect(Stmt stmt) { VisitStmt(stmt); }

   Array<Var> GetBufferVarGemm() { return buffer_var_gemm_; }

 private:
-  void VisitStmt_(const EvaluateNode *op) {
-    auto call = Downcast<Call>(op->value);
+  void VisitStmt_(const EvaluateNode *op) {
+    const CallNode* call_node = op->value.as<CallNode>();
+    if (!call_node) return;
+    Call call = GetRef<Call>(call_node);
     if (call->op.same_as(Op::Get("tl.gemm"))) {
       auto srcA_buffer_access_ptr = Downcast<Call>(call->args[0]);
       ICHECK(srcA_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto srcA_buffer_var = Downcast<Var>(srcA_buffer_access_ptr->args[1]);
       auto srcB_buffer_access_ptr = Downcast<Call>(call->args[1]);
       ICHECK(srcB_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto srcB_buffer_var = Downcast<Var>(srcB_buffer_access_ptr->args[1]);
       auto dst_buffer_access_ptr = Downcast<Call>(call->args[2]);
       ICHECK(dst_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto dst_buffer_var = Downcast<Var>(dst_buffer_access_ptr->args[1]);
-      buffer_var_gemm_.push_back(srcA_buffer_var);
-      buffer_var_gemm_.push_back(srcB_buffer_var);
-      buffer_var_gemm_.push_back(dst_buffer_var);
+      PushUnique(srcA_buffer_var);
+      PushUnique(srcB_buffer_var);
+      PushUnique(dst_buffer_var);
     } else if (call->op.same_as(Op::Get("tl.gemm_sp"))) {
       auto srcA_buffer_access_ptr = Downcast<Call>(call->args[0]);
       ICHECK(srcA_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto srcA_buffer_var = Downcast<Var>(srcA_buffer_access_ptr->args[1]);
       auto srcB_buffer_access_ptr = Downcast<Call>(call->args[1]);
       ICHECK(srcB_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto srcB_buffer_var = Downcast<Var>(srcB_buffer_access_ptr->args[1]);
       auto dst_buffer_access_ptr = Downcast<Call>(call->args[2]);
       ICHECK(dst_buffer_access_ptr->op.same_as(builtin::tvm_access_ptr()));
       auto dst_buffer_var = Downcast<Var>(dst_buffer_access_ptr->args[1]);
-      buffer_var_gemm_.push_back(srcA_buffer_var);
-      buffer_var_gemm_.push_back(srcB_buffer_var);
-      buffer_var_gemm_.push_back(dst_buffer_var);
+      PushUnique(srcA_buffer_var);
+      PushUnique(srcB_buffer_var);
+      PushUnique(dst_buffer_var);
     }
   }
 
+  void PushUnique(const Var& v) {
+    for (const auto& u : buffer_var_gemm_) if (u.same_as(v)) return;
+    buffer_var_gemm_.push_back(v);
+  }
   Array<Var> buffer_var_gemm_;
 };

Optionally, replace ICHECKs with early returns to skip unexpected patterns instead of aborting the pass. I can provide that variant if desired.
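As a standalone illustration of the push-unique behavior the review asks for: TVM's `same_as` compares handle identity, not name equality, so dedup must key on the underlying node. This is a minimal sketch; the `Var` class and `same_as` below only mimic those semantics and are not the real TVM types.

```python
class Var:
    """Stand-in for a TVM Var handle: `same_as` compares node identity,
    not name equality (two distinct buffers may share a name hint)."""

    def __init__(self, name_hint):
        self.name_hint = name_hint

    def same_as(self, other):
        return self is other


def push_unique(vars_, v):
    """Append v only if no collected entry refers to the same node."""
    for u in vars_:
        if u.same_as(v):
            return
    vars_.append(v)


a = Var("A_shared")
b = Var("B_shared")
a_lookalike = Var("A_shared")  # equal name hint, distinct node

collected = []
# Two GEMM calls that reuse buffer `a` must not duplicate it.
for v in (a, b, a, a_lookalike):
    push_unique(collected, v)

print([v.name_hint for v in collected])  # → ['A_shared', 'B_shared', 'A_shared']
```

Note the last entry survives: identity-based dedup keeps lookalike nodes that merely share a name, which is the correct behavior for buffer vars.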

🧹 Nitpick comments (2)
src/transform/lower_tile_op.cc (2)

469-472: Remove commented debugging or gate behind VLOG.

Keep source clean or use a verbose logging macro.


537-537: New member buffer_var_gemm_ LGTM.

Consistent with usage; consider comment documenting semantics (holds buffer->data Vars used by GEMM).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4eff7bd and 87a1de6.

📒 Files selected for processing (2)
  • src/op/copy.cc (2 hunks)
  • src/transform/lower_tile_op.cc (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/op/copy.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/lower_tile_op.cc (1)
tilelang/language/tir/op.py (1)
  • tvm_access_ptr (650-675)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-amd
  • GitHub Check: build-test-nvidia
🔇 Additional comments (3)
src/transform/lower_tile_op.cc (3)

15-15: Include gemm.h looks correct.

Necessary for Op::Get("tl.gemm"/"tl.gemm_sp") resolution.


500-504: LowerArgs instantiation matches definition – no action needed

The LowerArgs aggregate in src/op/op.h is defined with fields in this exact order:

  1. Target target
  2. Range thread_bounds
  3. Var thread_var
  4. AddWorkspaceCallback AddWorkspace
  5. LayoutMap layout_map
  6. Map<Buffer, Buffer> buffer_remap
  7. Array buffer_var_gemm

In src/transform/lower_tile_op.cc the call

LowerArgs{target_, thread_bounds, thread_var_->var, callback,
          layout_map_, buffer_remap_, buffer_var_gemm_}

passes arguments in the same sequence, so there is no mismatch between declaration and use. Everything lines up correctly.


220-225: No cross-pass surprises for LowerArgs arity

  • Repository-wide search confirms the only LowerArgs{…} initializer is in src/transform/lower_tile_op.cc (line 501) and the only tile_op->Lower(…) call is immediately above it (line 500).
  • No other callers of LowerArgs or tile_op->Lower exist in the codebase, so the added buffer_var_gemm_ argument cannot break downstream passes.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
examples/gdn/README.md (1)

1-1: Title phrasing: consider "in TileLang" for style consistency.

"Implementation in TileLang" reads a bit cleaner than "with TileLang." Optional.

Apply within this hunk:

-# Gated Delta Net (GDN) kernel implementation with TileLang
+# Gated Delta Net (GDN) kernel implementation in TileLang
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 87a1de6 and a924ccf.

📒 Files selected for processing (2)
  • 3rdparty/cutlass (1 hunks)
  • examples/gdn/README.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/gdn/README.md

[grammar] ~5-~5: There might be a mistake here.
Context: ...h TileLang ## Requirement - TileLang: 0.1.5+17fafc1b3026d910a83eb8052fdf811ba56be0b1 - Triton: 3.3.0 (used for comparison) - ...

(QB_NEW_EN)


[grammar] ~6-~6: There might be a mistake here.
Context: ... - Triton: 3.3.0 (used for comparison) - FLA: commit f03cb3ae (used for compari...

(QB_NEW_EN)

🔇 Additional comments (3)
3rdparty/cutlass (2)

1-1: LGTM: submodule pointer bump only

This file just updates the CUTLASS submodule commit. No issues here.


1-1: Ensure all GitHub Actions workflows fetch submodules with full history

The .gitmodules entry and Gitlink for 3rdparty/cutlass are correctly configured, but none of our workflows enable submodule checkout. Some also lack fetch-depth: 0. To avoid intermittent build/checkout failures, update every actions/checkout step to include both:

  • fetch-depth: 0
  • submodules: true

Workflows needing updates:

.github/workflows/amd_ci.yml
.github/workflows/bot.yml
.github/workflows/ci.yml
.github/workflows/dependabot.yml
.github/workflows/publish_docs.yml
.github/workflows/reminder.yml

Example patch for each checkout step:

 steps:
   - name: Checkout repository
     uses: actions/checkout@v4
     with:
-      fetch-depth: 0
+      fetch-depth: 0
+      submodules: true

Apply the same change for every occurrence of actions/checkout (adjust version as needed). This will ensure CI always pulls the full CUTLASS history and avoids checkout/build flakes.

Likely an incorrect or invalid review comment.

examples/gdn/README.md (1)

13-14: Acknowledgements read well.

Wording is clear and appropriately credits contributors.

Comment on lines +5 to +7
- TileLang: `0.1.5+17fafc1b3026d910a83eb8052fdf811ba56be0b1`
- Triton: `3.3.0` (used for comparison)
- FLA: commit `f03cb3ae` (used for comparison)
Contributor

⚠️ Potential issue

Requirements: tighten wording, expand acronym, and standardize pin format.

  • Prefer "baseline" over repeated "used for comparison".
  • Expand "FLA" on first mention.
  • Keep the commit hash but format it as “version (commit `<hash>`)” for readability.

Apply within this hunk:

-- TileLang: `0.1.5+17fafc1b3026d910a83eb8052fdf811ba56be0b1`
-- Triton: `3.3.0` (used for comparison)
-- FLA: commit `f03cb3ae` (used for comparison)
+- TileLang: `0.1.5` (commit `17fafc1b`)  
+- Triton: `3.3.0` (baseline)  
+- Flash… (FLA): commit `f03cb3ae` (baseline)

Additionally change the section header outside this hunk for correctness (plural):

-## Requirement
+## Requirements


🧰 Tools
🪛 LanguageTool

[grammar] ~5-~5: There might be a mistake here.
Context: ...h TileLang ## Requirement - TileLang: 0.1.5+17fafc1b3026d910a83eb8052fdf811ba56be0b1 - Triton: 3.3.0 (used for comparison) - ...

(QB_NEW_EN)


[grammar] ~6-~6: There might be a mistake here.
Context: ... - Triton: 3.3.0 (used for comparison) - FLA: commit f03cb3ae (used for compari...

(QB_NEW_EN)

🤖 Prompt for AI Agents
In examples/gdn/README.md around lines 5 to 7, tighten the tooling pins and wording:
replace the repetitive "used for comparison" with "baseline", expand "FLA" to its
full project name on first mention, and reformat each pin as "version (commit
<short-hash>)", for example "TileLang: 0.1.5 (commit 17fafc1b)". Also update the
nearby section header (outside this hunk) to the plural "Requirements", and keep
wording concise and consistent across the three lines.

Comment on lines +11 to +12
The [chunk_delta_h](common/chunk_delta_h.py) implements the most critical forward kernel of GDN. It's a good start to understand the GDN logic and the TileLang optimization.

Contributor

💡 Verification agent

❓ Verification inconclusive

Grammar and clarity: reference the file and tighten the sentence.

Use “file”/“script”, “starting point”, and plural “optimizations”.

Apply within this hunk:

- The [chunk_delta_h](common/chunk_delta_h.py) implements the most critical forward kernel of GDN. It's a good start to understand the GDN logic and the TileLang optimization.
+ The file [common/chunk_delta_h.py](common/chunk_delta_h.py) implements the most critical forward kernel of GDN. It's a good starting point to understand the GDN logic and TileLang-specific optimizations.

🏁 Script executed:

#!/bin/bash
# Verify the referenced path exists and the link will work
set -euo pipefail
if [ -f "examples/gdn/common/chunk_delta_h.py" ]; then
  echo "OK: examples/gdn/common/chunk_delta_h.py exists"
else
  # Fall back to repo-root relative path check
  if [ -f "common/chunk_delta_h.py" ]; then
    echo "OK: common/chunk_delta_h.py exists at repo root; update link if needed"
  else
    echo "Missing: chunk_delta_h.py not found; fix README link" >&2
    exit 1
  fi
fi

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Find the location of chunk_delta_h.py in the repository
set -euo pipefail
find . -type f -name "chunk_delta_h.py"

Length of output: 59


Fix broken link and tighten README wording

It looks like the README refers to [common/chunk_delta_h.py], but no such file exists in the repository. Please confirm the correct filename and relative path, update the link, and apply the grammar tweaks below.

File: examples/gdn/README.md lines 11–12

  • Verify that chunk_delta_h.py exists and update the Markdown link to the correct path.
  • Tighten the sentence: use “file”, “starting point”, and plural “optimizations”.

Suggested diff (replace <correct-path> with the actual relative path):

- The [chunk_delta_h](common/chunk_delta_h.py) implements the most critical forward kernel of GDN. It's a good start to understand the GDN logic and the TileLang optimization.
+ The file [common/chunk_delta_h.py](<correct-path>) implements the most critical forward kernel of GDN. It's a good starting point to understand the GDN logic and TileLang-specific optimizations.


🤖 Prompt for AI Agents
In examples/gdn/README.md around lines 11 to 12, the Markdown link points to a
non-existent file name common/chunk_delta_h.py and the sentence needs
tightening; confirm the correct filename and relative path to the actual
implementation (e.g., common/chunk_delta_h.c or scripts/chunk_delta_h.py),
update the Markdown link to that correct relative path, and reword the sentence
to say something like: "The <correct-path> file implements the most critical
forward kernel of GDN and is a good starting point to understand the GDN logic
and TileLang optimizations."

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/elementwise/example_elementwise_add_tma_1d.py (2)

22-27: Make the 1D TMA path more explicit (optional).

If the intent is to exercise the 1D TMA fast-path, hint the pass by setting coalesced_width to the full tile size on copies. This keeps behavior identical but nudges lowering toward a single contiguous transaction when conditions permit.

-            T.copy(A[by * block_M, bx * block_N], A_shared)
-            T.copy(B[by * block_M, bx * block_N], B_shared)
+            T.copy(A[by * block_M, bx * block_N], A_shared, coalesced_width=block_M * block_N)
+            T.copy(B[by * block_M, bx * block_N], B_shared, coalesced_width=block_M * block_N)
@@
-            T.copy(C_local, C_shared)
-            T.copy(C_shared, C[by * block_M, bx * block_N])
+            T.copy(C_local, C_shared, coalesced_width=block_M * block_N)
+            T.copy(C_shared, C[by * block_M, bx * block_N], coalesced_width=block_M * block_N)

32-45: Guard against partial tiles (optional).

As written, arbitrary M/N may lead to edge copies that exceed bounds unless the lowering inserts masking. Since this is an example, a simple precondition assert is fine.

     M, N = args.m, args.n
@@
-    config = {"block_M": 128, "block_N": 128, "threads": 128}
+    config = {"block_M": 128, "block_N": 128, "threads": 128}
+    assert M % config["block_M"] == 0 and N % config["block_N"] == 0, (
+        "For this example, require M and N to be multiples of block sizes "
+        f'({config["block_M"]}, {config["block_N"]}).'
+    )

If masking is already handled in lowering, feel free to skip this assert.
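The divisibility precondition can be factored into a tiny helper that also computes the launch grid. This is a sketch; `grid_dims` and its parameter names are illustrative, not part of the TileLang API.

```python
def ceildiv(a, b):
    return -(-a // b)


def grid_dims(M, N, block_M, block_N, require_exact=True):
    """Launch-grid size for an M x N problem tiled by block_M x block_N.

    With require_exact=True, shapes that would produce partial edge tiles
    are rejected up front, since the example's copies do no masking."""
    if require_exact and (M % block_M or N % block_N):
        raise ValueError("M and N must be multiples of the block sizes")
    return ceildiv(N, block_N), ceildiv(M, block_M)


print(grid_dims(1024, 1024, 128, 128))  # → (8, 8)
```

With `require_exact=False` the same helper covers the masked case, where `ceildiv` rounds the edge tiles up.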

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between a924ccf and 50df72b.

📒 Files selected for processing (1)
  • examples/elementwise/example_elementwise_add_tma_1d.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/elementwise/example_elementwise_add_tma_1d.py (4)
tilelang/jit/__init__.py (1)
  • jit (232-305)
tilelang/language/allocate.py (2)
  • alloc_shared (20-35)
  • alloc_fragment (52-63)
tilelang/language/copy.py (1)
  • copy (84-152)
tilelang/language/parallel.py (1)
  • Parallel (8-28)
🪛 Ruff (0.12.2)
examples/elementwise/example_elementwise_add_tma_1d.py

39-39: Undefined name torch

(F821)


39-39: Undefined name torch

(F821)


40-40: Undefined name torch

(F821)


40-40: Undefined name torch

(F821)


47-47: Undefined name torch

(F821)

🪛 GitHub Actions: CI Test on AMD
examples/elementwise/example_elementwise_add_tma_1d.py

[error] 39-39: F821 Undefined name 'torch'.


[error] 39-39: F821 Undefined name 'torch'.


[error] 40-40: F821 Undefined name 'torch'.


[error] 40-40: F821 Undefined name 'torch'.


[error] 47-47: F821 Undefined name 'torch'.

🔇 Additional comments (2)
examples/elementwise/example_elementwise_add_tma_1d.py (2)

10-12: JIT wrapper usage looks correct.

Using @tilelang.jit(out_idx=[-1]) to auto-allocate/return C matches the PrimFunc signature (A, B, C).


17-21: Good staging design for TMA store.

Using a shared buffer (C_shared) as the staging location for the final store aligns with TMA store requirements and avoids per-thread scattered writes.
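A NumPy reference for the tiling the example performs: each (by, bx) block reads one tile of A and B, adds them, and writes one tile of C. This sketch only mirrors the data movement, not the shared-memory staging or TMA mechanics.

```python
import numpy as np


def tiled_add(A, B, block_M=128, block_N=128):
    """Block-tiled elementwise add; a stand-in for the kernel's
    load tile -> add -> stage -> store tile pipeline."""
    M, N = A.shape
    C = np.empty_like(A)
    for by in range(0, M, block_M):
        for bx in range(0, N, block_N):
            C[by:by + block_M, bx:bx + block_N] = (
                A[by:by + block_M, bx:bx + block_N] +
                B[by:by + block_M, bx:bx + block_N])
    return C


rng = np.random.default_rng(0)
A = rng.random((256, 256), dtype=np.float32)
B = rng.random((256, 256), dtype=np.float32)
assert np.allclose(tiled_add(A, B), A + B)
```

The per-tile stores here correspond to the `C_shared` -> global copy in the kernel; each is one contiguous region when block_M == 1 or the row tile spans full rows.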

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
examples/elementwise/example_elementwise_add_tma_1d.py (1)

4-4: Resolved: missing torch import.

Matches the prior CI comment; F821 should be gone.

🧹 Nitpick comments (4)
examples/elementwise/example_elementwise_add_tma_1d.py (4)

43-46: Make the example actually exercise 1D TMA.

Current defaults (block_M=128, block_N=128) will typically trigger 2D TMA. Use block_M=1 so copies are contiguous 1D along N, matching the PR’s 1D TMA intent. Also hint coalescing.

Apply:

-    # Default config
-    config = {"block_M": 128, "block_N": 128, "threads": 128}
+    # Default config for 1D TMA (contiguous along N)
+    config = {"block_M": 1, "block_N": 128, "threads": 128}

And at the parallel loop (see separate comment) set coalesced_width=block_N. Please confirm this shape selects the 1D TMA path in codegen.


25-26: Hint coalesced access to help 1D path selection.

Annotate the inner parallelism with coalesced_width to better guide lowering.

-            for (local_y, local_x) in T.Parallel(block_M, block_N):
+            for (local_y, local_x) in T.Parallel(block_M, block_N, coalesced_width=block_N):

44-46: Guard against partial tiles (OOB risk).

Without tail handling, non-multiple M/N will read/write past bounds. For a minimal example, assert divisibility.

     # Default config
-    config = {"block_M": 128, "block_N": 128, "threads": 128}
+    config = {"block_M": 1, "block_N": 128, "threads": 128}
+    # Minimal example assumes perfect tiling; add tail handling if needed.
+    assert M % config["block_M"] == 0, "M must be a multiple of block_M"
+    assert N % config["block_N"] == 0, "N must be a multiple of block_N"

27-28: Clarify intentional shared-memory staging for TMA store.

Add a short comment so readers know the extra hop is deliberate to exercise TMA store.

-            T.copy(C_local, C_shared)
+            # Stage via shared to exercise 1D TMA store
+            T.copy(C_local, C_shared)
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 50df72b and 616e8e9.

📒 Files selected for processing (1)
  • examples/elementwise/example_elementwise_add_tma_1d.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/elementwise/example_elementwise_add_tma_1d.py (4)
tilelang/jit/__init__.py (1)
  • jit (232-305)
tilelang/language/allocate.py (2)
  • alloc_shared (20-35)
  • alloc_fragment (52-63)
tilelang/language/copy.py (1)
  • copy (84-152)
tilelang/language/parallel.py (1)
  • Parallel (8-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-test-amd
🔇 Additional comments (1)
examples/elementwise/example_elementwise_add_tma_1d.py (1)

11-13: JIT wrapper usage looks good.

Returning the prim_func via @tilelang.jit(out_idx=[-1]) aligns with existing examples.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/dequantize_gemm/test_example_dequantize_gemm.py (1)

19-23: Hopper gating is correct; consider shrinking problem size to speed up CI

The example’s main() benchmarks with 500 warmups. Calling it with smaller sizes keeps this test snappy on H100 while preserving coverage.

Apply this diff:

-    example_dequant_gemm_bf16_mxfp4_hopper.main()
+    # Smaller problem to keep CI runtime low; functionality unchanged.
+    example_dequant_gemm_bf16_mxfp4_hopper.main(m=128, n=128, k=128, scale_size=32, fast_dequant=True, with_bias=False, tune=False)
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 616e8e9 and f611c9f.

📒 Files selected for processing (1)
  • examples/dequantize_gemm/test_example_dequantize_gemm.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/dequantize_gemm/test_example_dequantize_gemm.py (2)
tilelang/testing/__init__.py (1)
  • requires_cuda_compute_version_ge (95-96)
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (2)
  • main (314-371)
  • main (481-546)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-amd
  • GitHub Check: bot-task
🔇 Additional comments (1)
examples/dequantize_gemm/test_example_dequantize_gemm.py (1)

5-5: Import of BF16 MXFP4 Hopper example — LGTM

Consistent with existing example imports; no issues spotted.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f611c9f and fa4fad3.

📒 Files selected for processing (1)
  • examples/gdn/example_wy_fast_bwd_split.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/gdn/example_wy_fast_bwd_split.py (1)
examples/gdn/utils.py (1)
  • assert_similar (18-40)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-nvidia
  • GitHub Check: build-test-amd

dg_tilelang = dg_tilelang + dg_tilelang_A_positive.sum(dim=-1) - dg_tilelang_A_negative.sum(dim=-1)

from utils import assert_similar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Make utils import robust across run modes (script vs. module).

from utils import assert_similar will fail when running from repo root (utils.py lives under examples/gdn/). Keep the lazy import, but add a safe fallback to handle both python -m and direct script execution.

Apply this diff:

-    from utils import assert_similar
+    try:
+        from .utils import assert_similar  # when run as a module (python -m)
+    except Exception:
+        import os, sys
+        sys.path.insert(0, os.path.dirname(__file__))
+        from utils import assert_similar

To verify, check package layout (no execution):


🏁 Script executed:

#!/bin/bash
# Expect: utils exists under examples/gdn; __init__.py may or may not exist.
fd -a 'utils.py' -x echo
fd -a '__init__.py' examples examples/gdn -x echo || true
rg -n '^from utils import assert_similar\b' -n examples/gdn/example_wy_fast_bwd_split.py

Length of output: 675


🏁 Script executed:

#!/bin/bash
# Verify the import line with correct indentation
rg -nP -C2 '^\s*from\s+utils\s+import\s+assert_similar' examples/gdn/example_wy_fast_bwd_split.py

# Confirm utils.py lives alongside the example
ls -la examples/gdn

# Show the context around the import at lines ~506
sed -n '490,520p' examples/gdn/example_wy_fast_bwd_split.py

Length of output: 2747


Ensure assert_similar import works in both script and module modes

The plain from utils import assert_similar import only succeeds when you run the file directly (since Python sets sys.path[0] to the script’s directory), but it will fail under module mode (e.g. python -m examples.gdn.example_wy_fast_bwd_split) because there is no top-level utils.py at the project root. To handle both cases, try a relative import first, and fall back to adjusting sys.path for standalone script execution.

• Confirmed presence of examples/gdn/utils.py
• No __init__.py in examples/gdn (uses PEP 420 namespace package)
• Relative import (from .utils) works under module mode; raises ImportError when run as __main__

Apply this diff at line 506 of examples/gdn/example_wy_fast_bwd_split.py:

-    from utils import assert_similar
+    try:
+        from .utils import assert_similar       # works in module mode
+    except ImportError:
+        import os, sys
+        sys.path.insert(0, os.path.dirname(__file__))
+        from utils import assert_similar         # fallback for direct script run
🤖 Prompt for AI Agents
In examples/gdn/example_wy_fast_bwd_split.py around line 506, the current plain
import "from utils import assert_similar" fails in module mode; change it to
first attempt a relative import (from .utils import assert_similar) inside a
try/except ImportError, and in the except block modify sys.path to include the
script's parent package directory (or use importlib to load examples.gdn.utils)
so that running the file as a script still finds examples/gdn/utils.py; ensure
you import sys and os as needed and keep the fallback only for script execution
to avoid breaking package imports.
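The fallback pattern in the prompt can be exercised in isolation. This is a sketch; the `gdn_utils_demo` module name and its contents are invented for the demonstration and stand in for `examples/gdn/utils.py`.

```python
import os
import sys
import tempfile

# Fabricate a directory that plays the role of examples/gdn/, holding a
# helper module next to the "script".
pkg_dir = tempfile.mkdtemp()
with open(os.path.join(pkg_dir, "gdn_utils_demo.py"), "w") as f:
    f.write("def assert_similar(a, b, tol=1e-6):\n    return abs(a - b) < tol\n")

# Script-mode fallback: the plain import fails because the directory is not
# on sys.path yet; the except branch adds it, mirroring the suggested fix.
try:
    import gdn_utils_demo
except ImportError:
    sys.path.insert(0, pkg_dir)
    import gdn_utils_demo

print(gdn_utils_demo.assert_similar(1.0, 1.0))  # → True
```

In the real example the try branch would first attempt the relative `from .utils import assert_similar`, which succeeds under `python -m`; the sys.path fallback only fires when the file is run directly.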

@coderabbitai coderabbitai bot mentioned this pull request Sep 23, 2025
chengyupku added a commit to tile-ai/tilescale that referenced this pull request Oct 24, 2025
* [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling

- Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
- Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
- Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
- Removed obsolete code and improved overall code clarity and maintainability.

* lint fix

* Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls

- Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
- Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.

* test fix

* Enhance global read detection in pipeline planning

- Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
- Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
- Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.

* Add IndexLegalizer to enforce int64 for out-of-bound indices

- Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds.
- Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability.
- Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations.

* [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced (#718)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Rename build-test job to build-test-nvidia and specify nvidia as a runner label in CI workflow.

* Update CI workflow to specify 'nvidia' as an additional runner label for the format-check job.

* fix: NVRTC backend (#717)

* fix: NVRTC backend

* fix: CI

---------

Co-authored-by: LeiWang1999 <[email protected]>

* [CUDA] Init support for sm_120 (#716)

* Init support for sm120

* fmt

* resolve comments

* unify mma gemm

* fmt

---------

Co-authored-by: LeiWang1999 <[email protected]>

* [CI] fix docs ci (#720)

* [Chore] fix typos (#719)

* chore: fix typos

* chore: fix ruff

* chore: fix clang-format

* [CI][AMD] Add AMD GPU CI and fix some related bugs (#694)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Update AMD FlashAttention example and TVM submodule

- Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
- Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
- Updated the TVM submodule to the latest commit for improved functionality.
- Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.

* Add CI workflow for automated format checking and testing

- Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
- The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
- Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.

* Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.

* Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.

* Update AMD CI workflow to install pytest directly instead of using requirements-test.txt

* Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt

* Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation

* Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.

* Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.

* Add installation of ROCm in AMD CI workflow

- Included a step to execute the `install_rocm.sh` script for improved setup.
- Removed unnecessary blank line for better readability in the workflow script.

* Remove installation step for ROCm in AMD CI workflow to simplify the setup process.

* Update AMD CI workflow to run specific test file with verbose output instead of all tests.

* Add new tilelang built-in operations for AMD architecture

- Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
- Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.

* Enhance autotuner configurations and GEMM operations in AMD example

- Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
- Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.

* Update autotuner configurations in AMD example for enhanced performance

- Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
- Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.

* Enhance autotuner configurations and memory handling in AMD example

- Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
- Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.

* Refine autotuner configurations and memory usage in AMD example

- Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
- Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.

* Update autotuner configurations in AMD example for enhanced performance

- Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
- Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.

* Enhance autotuner configurations and GEMM operations in AMD example

- Expanded thread counts in `get_configs` to include higher values for improved autotuning.
- Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.

* Update AMD CI workflow and remove obsolete test script

- Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
- Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.

* Remove TVM subproject from 3rdparty directory

* Refactor configuration generation and accumulation logic in AMD example

- Reformatted the `get_configs` function for improved readability by aligning parameters.
- Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.

* Enhance AMD CI workflow with additional logging and setup steps

- Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
- Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.

* Comment out package copying in AMD CI workflow to prevent potential issues during environment setup

* Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps

* Enhance BuildTileLangHIP function by adding whitespace for improved readability

* Refactor kTVMGridConstant definition for clarity and remove unnecessary comment

* Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* lint fix

* Update AMD CI workflow to use requirements-rocm.txt for dependency installation

* fix ci

* Remove dependency on format-check from AMD CI workflow

* fix ci

* fix ci

* fix ci

* Remove format-check job from AMD CI workflow

* Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow

* Add dependency on format-check job in AMD CI workflow

* Add format-check job to AMD CI workflow

* Update format-check job in AMD CI workflow to run on self-hosted environment

* Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes

* Update amd_ci.yml

---------

Co-authored-by: xinxyxiao <[email protected]>
Co-authored-by: Lei Wang <[email protected]>
Co-authored-by: LeiWang1999 <[email protected]>

* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724)

* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy

* [Typo] Correct architecture selection for CUDA and CDNA

* [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Refactor CUDA code generation to simplify eviction policy handling

- Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity.
- Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations.

* lint fix

* [Language] Introduce `StridedTensor` to support non contigious torch inputs (#722)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Support strided tensors

* Refactor target attribute helper functions for improved clarity

* No code changes made in proxy.py and setup.py

* lint fix

* lint fix via gemini

* lint fix

* test fix

* test fix

* lint fix

* Update wrapper.py

* test fix

* Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.

* lint fix

---------

Co-authored-by: Chenggang Zhao <[email protected]>

* [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712)

* bug fix and support gemm_sr fallback for hopper

* Update gemm.cc

---------

Co-authored-by: Lei Wang <[email protected]>
Co-authored-by: LeiWang1999 <[email protected]>

* 📝 Add docstrings to `fix` (#726)

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851

The following files were modified:

* `src/op/gemm.cc`
* `src/tl_templates/cuda/gemm_sm90.h`
* `src/transform/warp_specialized_rewriter.cc`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [CI] Fix AMD CI (#729)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

---------

Co-authored-by: xinxyxiao <[email protected]>
Co-authored-by: Lei Wang <[email protected]>

* [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725)

* [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16

* [Dequant] Add extern call and serial dequantization

* [Dequant] Parallel Dequant wait for fence debug.

* [Scale] Add scale matrix to mxfp4 gemm

* [Remove] Remove fence-buggy example and some generated source cuda code

* [MXFP4] Update initial version of MXFP4 GEMM

* [Scale] Add scale to latest mxfp4 gemm

* [Lint]

* [BugFix] Load Scale, disabe TMA to recover performance

* [Lint]

* [Lint]

* [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance

* [Lint]

* Update example_dequant_gemm_bf16_fp4_hopper_serial.py

* Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.

* Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.

* Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.

* lint fix

* ci fix

* Remove non-existent example

* [BugFix] Add smem swizzle to recover performance of TMA

* [BugFix] Enough reg for producer when threads=512

---------

Co-authored-by: Lei Wang <[email protected]>
Co-authored-by: LeiWang1999 <[email protected]>

* 📝 Add docstrings to `mxfp4` (#732)

* 📝 Add docstrings to `mxfp4`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561

The following files were modified:

* `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`
* `examples/dequantize_gemm/utils.py`
* `examples/gemm/example_gemm_autotune.py`
* `tilelang/intrinsics/utils.py`
* `tilelang/language/__init__.py`
* `tilelang/language/utils.py`
* `tilelang/quantize/mxfp.py`
* `tilelang/quantize/quantization.py`

* [Lint] More accurate docstring

* [Lint]

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: tzj-fxz <[email protected]>

* [Refactor] Refactor env into a more flexible version (#740)

* Fix environment variable name for compilation print setting in `env.py`

* Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.

* lint fix

* Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py`

* [Enhancement] Add stride index validation in CythonKernelWrapper (#743)

* Introduced an assertion to ensure that the stride index is within the valid range of tensor dimensions in `cython_wrapper.pyx`.
* This change prevents potential out-of-bounds errors when accessing tensor dimensions, enhancing the robustness of the code.

* [Bugfix]:Fix atomic add auto vectorize memory access out of bound error (#742)

* [Bugfix]:Fix atomic add auto vectorize memory access out of bound error

* Update atomicadd_vectorize.cc

* format

* 📝 Add docstrings to PR #744 (#745)

* 📝 Add docstrings to `main`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/742#issuecomment-3205103559

The following files were modified:

* `src/transform/atomicadd_vectorize.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <[email protected]>

* [Refactor] Refactor barrier management (#744)

* Introduce Barrier

* Enhance CUDA kernel with new barrier management and post-processing support

- Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers.
- Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure.
- Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency.
- Introduced additional print statements for debugging in the lowering phase of the TileLang engine.
- Enhanced the overall structure and readability of the codebase.

* Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic.

* Enhance barrier management in TileLang

- Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework.
- Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory.
- Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code.
- Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine.
- Removed deprecated memory scope handling code to enhance clarity and maintainability.

* lint fix

* lint fix

* Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability.

* Refactor logging in JITKernel to improve kernel compilation tracking

- Removed unused import of `torch.backends` in the example file.
- Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging.
- Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function.

* Refactor dequantization tests and update barrier function

- Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite.
- Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management.

* Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed.

* Fix typos in rasterization parameters and update import path for cached module

- Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage.
- Updated the import statement for the `cached` module to reflect the new path in the cache submodule.
- Added `StridedTensor` import in the language module for enhanced tensor functionality.

* Update ci.yml

* [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746)

* [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy

* Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
* Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
* Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.

* lint fix

* Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.

* remove bulk copy

* Refactor copy and atomic add operations to support TMA lower configuration

- Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
- Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
- Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
- Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
- Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.

* Enhance TMA bulk copy logic in `LowerBulkCopy` method

- Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
- Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.

* lint fix

* Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.

* Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions

- Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
- Added TODO comments to indicate the need for further improvements in shared memory handling.

* Update `native_sparse_attention` function to include TMA configuration options

- Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
- Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.

* Refactor JIT decorator formatting in `native_sparse_attention` function

- Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
- No functional changes were made; this update focuses on code clarity and maintainability.

* Enhance thread management and logging in TileLang compilation

- Added a method to check if printing is enabled during compilation, improving control over logging behavior.
- Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
- Added comments to clarify the purpose of changes and improve code readability.

* Add warp specialization scope and refactor register management in TileLang

- Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
- Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
- Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
- Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.

* Refactor test for InjectSetMaxNReg pass in TileLang

- Improved readability by restructuring conditional checks and assertions in the test cases.
- Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
- Ensured consistent formatting and spacing throughout the test functions for better maintainability.

* Enhance bulk copy and store checks in `Copy` class

- Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
- Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
- Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.

* lint fix

* [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741)

* Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.

* Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.

* Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.

* Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.

* typo fix

* Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.

* lint fix

* test fix

* Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.

* use ninja instead of make

* Add CMake configuration step for Ninja build system in setup.py

* Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.

* Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.

* Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.

* Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.

* Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.

* Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.

* bugfix

* Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.

* Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.

* Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.

* [Enhancement] Optimize loop body handling in IR (#749)

- Updated the loop body construction in `ir.cc` to conditionally include an output statement based on the analyzable condition of the `waves` variable.
- This change enhances performance by avoiding unnecessary statement wrapping when the condition is met, improving the efficiency of loop execution.

Co-authored-by: LeiWang1999 <[email protected]>

* [MXFP4] Fix bugs and optimize exponential operation (#750)

* [MXFP4] Fix bugs
- Optimize exp2 with shift operation to boost performance
- Fix bug of simple dequantization function call
- Fix bug of scaling factor with bias
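The shift-based exp2 trick relies on the IEEE-754 layout: for an integer exponent `e` in the normal range, `2**e` has a zero sign bit, a zero mantissa, and a biased exponent of `e + 127`, so the value can be assembled with one shift instead of a transcendental call. The kernel applies this on-device; a float32 sketch of the same idea in Python:

```python
import struct

def exp2_via_shift(e: int) -> float:
    """Compute 2**e for integer e in the normal float32 range [-126, 127]
    by writing the biased exponent directly into the bit pattern."""
    assert -126 <= e <= 127
    bits = (e + 127) << 23  # sign=0, exponent=e+127, mantissa=0
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

This is exact for in-range integer exponents, which is all an MXFP4 power-of-two scale factor needs.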

* [Lint]

---------

Co-authored-by: LeiWang1999 <[email protected]>

* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h (#751)

- Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.

* [Enhancement] Add shape checking for reduce options (#748)

* Add shape checking for reduce options

* lint fix

* Handle special case reducing into shape-1 tensor

Allow reducing [X, d, Y] into [X, Y] or [X, 1, Y]
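The accepted-shape rule can be checked directly: reducing dimension `d` of the source shape must yield either the squeezed form (dim removed) or the keepdim form (dim set to 1). A minimal sketch, with hypothetical names:

```python
def check_reduce_shapes(src_shape, dst_shape, dim) -> bool:
    """Validate that reducing `src_shape` over axis `dim` can produce
    `dst_shape`; accepts both [X, Y] (squeezed) and [X, 1, Y] (keepdim)."""
    squeezed = tuple(src_shape[:dim]) + tuple(src_shape[dim + 1:])
    keepdim = tuple(src_shape[:dim]) + (1,) + tuple(src_shape[dim + 1:])
    return tuple(dst_shape) in (squeezed, keepdim)
```

So reducing `[X, d, Y]` into `[X, Y]` or `[X, 1, Y]` passes, while a mismatched destination shape is rejected.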

---------

Co-authored-by: LeiWang1999 <[email protected]>

* [Bugfix] Add missing FP8 header include (#752)

* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h

- Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.

Co-authored-by: LeiWang1999 <[email protected]>

* [Enhancement] Include cuda_fp8.h in gemm_sm90.h

- Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types.

Co-authored-by: LeiWang1999 <[email protected]>

* lint fix

* [Refactor] Remove unused tl_shuffle_elect and related functions from common.h

- Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase.
- Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations.
- Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability.

* lint fix

* [Refactor] Update header inclusions in common.h and gemm_sm90.h

- Removed the inclusion of "intrin.h" from common.h to streamline dependencies.
- Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability.

* bug fix

* [MXFP4] Add bias to MXFP4 GEMM kernel (#753)

* [MXFP4] Add bias to gemm kernel

* [Lint]

* [Lint] Rename "bias" to "Bias"

* [Bugfix][WS] Consider loop min extent when computing phase id (#754)

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.
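The bug this fixes is in linearizing nested loop iterations when a loop does not start at zero: the contribution of each loop variable must be `var - min`, not `var`. A hedged Python sketch of the corrected linearization (the real pass works on TIR loop nests; names here are illustrative):

```python
def linear_iteration_index(loops, values):
    """Linearize nested-loop iteration counts. Each loop is a
    (min, extent) pair and `values` holds the current loop-var values.
    Subtracting each loop's min is the fix: a loop over
    [min, min + extent) contributes (var - min) to the linear index."""
    idx = 0
    for (lo, extent), v in zip(loops, values):
        idx = idx * extent + (v - lo)
    return idx
```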

* [Typo] Remove `disable_cache` in some tests (#755)

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.

* Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability.

* [README] Update GDN README for clarity and add acknowledgements (#758)

- Improved formatting and clarity of the GDN kernel implementation description.
- Updated requirement section to list dependencies in a clearer format.
- Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.

* cutlass v4.2.0 supporting cuda 13 (#760)

* [Feature] Add 1D TMA support (#761)

* [Feature] Add 1D TMA support
- Check the contiguous conditions of 1D TMA copy
- Add new interface and params order of `tma_load` and `tma_store` call
- Add 1D `tma_store` interface in sm90 template
- Add elementwise kernel for 1D TMA example
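The contiguity condition for collapsing a multi-dimensional copy into a single 1D TMA transfer can be sketched as follows: in a row-major buffer, the copied region must form one contiguous run, which holds when every axis after the first axis with extent greater than one is copied in full. This is an illustrative Python model of that check, not the C++ code in `copy.cc`:

```python
def can_collapse_to_1d(shape, region):
    """Return True if copying `region` (per-axis extents) out of a
    row-major buffer of `shape` touches one contiguous run of elements:
    once an axis with extent > 1 is seen, every later axis must be
    copied in full."""
    partial = False  # have we seen an axis with extent > 1 yet?
    for extent, full in zip(region, shape):
        if partial and extent != full:
            return False
        if extent > 1:
            partial = True
    return True
```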

* [Lint]

* [BugFix] Add conditions for 1D TMA copy on non-swizzle shared tensors

* [Lint]

* [BugFix] 1D TMA load

* [README] Update GDN README for clarity and add acknowledgements (#758)

- Improved formatting and clarity of the GDN kernel implementation description.
- Updated requirement section to list dependencies in a clearer format.
- Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.

* cutlass v4.2.0 supporting cuda 13 (#760)

* [Lint]

* [Lint]

* [MXFP4] Add test for bf16&mxfp4 gemm

* [BugFix]

* [Lint]

---------

Co-authored-by: Yu Cheng <[email protected]>
Co-authored-by: Johnny <[email protected]>

* [Example] Add vertical slash sparse attention pattern (#762)

* upd sparse attn

* lint

* rename

* update test file

* update benchmark

* lint

* update benchmark

* [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass (#767)

* fix ci and pass bug

* fix

* try

* lint

* [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM (#766)

* [TMA] Add 1D TMA copy for Scale tensor

* [Lint]

* [Test] Add test for kernel

* [BugFix]

* hot fix blackwell (#768)

* [Refactor] Refactor `Operator` into `TileOperator` and with tvm reflection (#763)

* Refactor operator classes to inherit from TileOperator and update layout inference methods

- Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations.
- Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency.
- Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization.
- Added missing layout inference implementations for Fill and Conv2DIm2ColOp.
- Removed deprecated op.cc and op.h files to streamline the codebase.

* lint fix

* Refactor operator classes to use Node pattern and improve memory management

- Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation.
- Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access.
- Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design.
- Refactored InferLayout and Lower methods to ensure consistency across operator implementations.
- Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase.

* Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning

- Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects.
- Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations.
- Made minor adjustments in layout inference and other related methods for consistency and clarity.

* Refactor FillNode::Lower method to remove unused global function call

- Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity.
- Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies.

* [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757)

* [Enhancement] Introduce finalize_reducer operator and layout reducer support

- Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
- Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
- Updated `setup.py` to include logging for build directory paths, improving build process visibility.
- Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
- Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
- Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.

* Refactor code formatting and improve readability in multiple files

- Cleaned up whitespace in `setup.py` to enhance logging clarity.
- Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
- Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
- Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.

* Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability.

* [Enhancement] Disable reuse of small arrays in shared memory allocation

- Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.
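The filter described above can be modeled simply: allocations of 32 bits or less are excluded from the shared-memory reuse plan so LLVM can keep them in registers. A minimal sketch with an assumed `(name, num_elements, bits_per_element)` tuple shape:

```python
def reuse_candidates(allocs):
    """Keep only allocations larger than 32 bits as shared-memory
    reuse candidates; smaller ones are left to be lowered to
    registers by the backend."""
    return [name for name, n, bits in allocs if n * bits > 32]
```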

* Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.

* Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.

* bug fix

* Add thread checks workaround for replicated cases

* Remove the is_one check

* fix lint error

* lint fix

* Update autotune tests to use smaller matrix sizes for improved performance and reliability

* [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods

- Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
- Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
- Adjusted header inclusions for improved organization and clarity across multiple files.

* [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py

- Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.

* [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution

- Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
- Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.

---------

Co-authored-by: Huanqi Cao <[email protected]>
Co-authored-by: Freebase6912 <[email protected]>

* 📝 Add docstrings to `pytile_0826` (#770)

* 📝 Add docstrings to `pytile_0826`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/763#issuecomment-3224197814

The following files were modified:

* `src/op/atomic_add.cc`
* `src/op/atomic_add.h`
* `src/op/copy.cc`
* `src/op/copy.h`
* `src/op/elem.cc`
* `src/op/elem.h`
* `src/op/gemm.cc`
* `src/op/gemm.h`
* `src/op/gemm_sp.cc`
* `src/op/gemm_sp.h`
* `src/op/operator.cc`
* `src/op/operator.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/op/reduce.h`
* `src/op/region.cc`
* `src/op/region.h`
* `src/transform/layout_inference.cc`
* `src/transform/lower_tile_op.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <[email protected]>

* [Bugfix]:Fix atomic add auto vectorize negative optimization (#765)

* [Bugfix]:Fix atomic add auto vectorize negative optimization

* fixbug

* format

* fix bug

* 📝 Add docstrings to `reducer_0825` (#772)

* 📝 Add docstrings to `reducer_0825`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118

The following files were modified:

* `setup.py`
* `src/op/builtin.h`
* `src/op/finalize_reducer.cc`
* `src/op/finalize_reducer.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/target/codegen_cuda.cc`
* `src/tl_templates/cuda/common.h`
* `src/transform/layout_inference.cc`
* `src/transform/layout_reducer.cc`
* `src/transform/layout_reducer.h`
* `src/transform/merge_shared_memory_allocations.cc`
* `src/transform/storage_access.cc`
* `src/transform/warp_specialized_rewriter.cc`
* `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
* `tilelang/engine/phase.py`
* `tilelang/language/customize.py`
* `tilelang/language/reduce.py`
* `tilelang/transform/__init__.py`

* lint fix

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <[email protected]>

* Allow fill global buffer (#774)

* Allow fill global buffer

* fix lint error

* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match (#771)

* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match

* [Lint]

* add bf16 exp fallback (#776)

* [Lint] Introduce clang-tidy into format.sh (#777)

* [Refactor] Update Clang-Tidy Checks and Improve Code Consistency

- Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization.
- Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity.
- Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions.
- Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code.
- General code cleanup and adherence to best practices for better maintainability.

* [Refactor] Enhance Code Consistency and Clang-Tidy Configuration

- Updated .clang-tidy configuration to include additional checks for improved code quality and performance.
- Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity.
- Replaced size checks with `empty()` method calls in various locations for clearer intent.
- Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable.
- General code cleanup to adhere to best practices and improve maintainability.

* [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency

- Added clang-tidy checks to the format script for improved code quality assurance.
- Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity.
- Updated the requirements-lint.txt file to include clang-tidy as a dependency.
- General code cleanup to adhere to best practices and improve maintainability.

* [CI] Update AMD CI Workflow to Include Build Directory Creation

- Added steps to create a build directory and configure CMake with ROCm support during the format check process.
- Ensured cleanup of the build directory after the format check to maintain a clean workspace.

* [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode

- Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members.
- This change enhances code clarity and maintainability by focusing on relevant attributes for each class.

* [Refactor] Update Clang-Tidy Integration and Code Improvements

- Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes.
- Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures.
- Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class.
- General code cleanup to adhere to best practices and improve maintainability.

* [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize

- Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity.
- Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency.
- General code cleanup to maintain consistency and adhere to best practices.

* [CI] Add Git Submodule Initialization to CI Workflow

- Included a step to initialize and update git submodules recursively in the CI workflow.
- This change ensures that all necessary submodules are available during the format check process, improving build reliability.

* [CI] Add Git Submodule Update Step to Format Check

- Included a command to initialize and update git submodules recursively in the CI workflow during the format check process.
- This enhancement ensures that all required submodules are available, contributing to improved build reliability.

* [Refactor] Update Function Signatures in AtomicAddVectorize

- Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity.
- This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase.

* [Cache] Introduce detailed target information for the disk kernel cache (#780)

* Fix type hint for target_host parameter in compile function to allow None value

* Refactor target handling in compile function to utilize determine_target for improved clarity and consistency

* Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.

* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.

* Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.

* Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.

* [Example]Adds example for top-k operation (#775)

* [Example]Adds example for top-k operation

Adds an example demonstrating the top-k operation using tilelang
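A plain-NumPy reference for the top-k operation is useful as the correctness check against such a kernel; this is a sketch mirroring `torch.topk` semantics (values and indices of the k largest entries per row, in descending order), not the tilelang example itself:

```python
import numpy as np

def topk_ref(x, k):
    """Return (values, indices) of the k largest entries along the
    last axis, sorted in descending order."""
    idx = np.argsort(-x, axis=-1)[..., :k]  # descending sort via negation
    vals = np.take_along_axis(x, idx, axis=-1)
    return vals, idx
```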

* format

* Adds topk tilelang example test

* fix lint

* [Math] Dispatch `T.rsqrt(x)` into cuda intrin instead of `1 / T.sqrt(x)` (#781)

* Fix type hint for target_host parameter in compile function to allow None value

* Refactor target handling in compile function to utilize determine_target for improved clarity and consistency

* Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.

* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.

* Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.

* Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.

* Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h
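The dispatch idea can be sketched as a lookup table from (op, dtype) to a dedicated device intrinsic, with the composed `1 / sqrt(x)` form only as a fallback. The table entries below are illustrative (CUDA does provide `__frsqrt_rn` for float and `hrsqrt` for half); the real rules live in the C++ intrin_rule sources:

```python
INTRIN_TABLE = {
    # (op, dtype) -> device intrinsic name; illustrative entries
    ("rsqrt", "float32"): "__frsqrt_rn",
    ("rsqrt", "float16"): "hrsqrt",
}

def dispatch_rsqrt(dtype):
    """Prefer a dedicated rsqrt intrinsic; otherwise fall back to
    the composed reciprocal-sqrt expression."""
    intrin = INTRIN_TABLE.get(("rsqrt", dtype))
    if intrin:
        return f"{intrin}(x)"
    return "1.0 / sqrt(x)"
```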

* lint fix

* remove the cmake dependency from pyproject, as it may resolve to different cmake paths across build stages

* lint fix

* Add cmake dependency to pyproject.toml and improve build logging in setup.py

* [CI] Adds pytest-durations for test timing (#782)

* [Ci] Adds pytest-durations for test timing

Adds `pytest-durations` to the test requirements and configures pytest to display test durations.

This helps in identifying slow-running tests and optimizing the test suite for faster feedback.

* add amd ci durations

* Removes flash_attn installation from CI

* [Refactor] Support python reflection for tile operators (#783)

* Implement Fill operator and related reflection methods in TileLang

- Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers.
- Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities.
- Updated relevant files to register reflection methods and ensure proper initialization in static blocks.
- Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability.
- Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly.

* Refactor operator reflection methods and improve code clarity

- Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks.
- Consolidated static initialization blocks for various operators to a single line for improved consistency.
- Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability.
- Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports.

* Refactor GEMM and AtomicAdd operations for improved clarity

- Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety.
- Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method.
- Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity.
- Added import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports.

* Remove deprecated operator files from the tilelang IR module

- Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase.
- This cleanup enhances maintainability by removing unused code and improving overall organization of the module.

* Refactor imports in tilelang IR module for improved organization

- Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase.

* lint fix

* Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability

- Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability.
- Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code.
- Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity.
- Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities.
- Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.
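What a warp policy computes can be sketched as a split of the available warps into an (m_warps, n_warps) grid. The policy names below follow the FullRow/FullCol/Square policies the framework exposes, but the balancing logic here is a simplified assumption, not the actual `GemmWarpPolicyNode` implementation:

```python
import math

def partition_warps(num_warps, policy):
    """Split num_warps into an (m_warps, n_warps) grid.
    'FullRow' puts all warps on M, 'FullCol' on N, and 'Square'
    picks the most balanced factorization (M gets the larger factor)."""
    if policy == "FullRow":
        return num_warps, 1
    if policy == "FullCol":
        return 1, num_warps
    m = int(math.isqrt(num_warps))
    while num_warps % m:
        m -= 1
    other = num_warps // m
    return max(m, other), min(m, other)
```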

* Refactor Reduce operation to utilize ReduceType class for improved clarity and maintainability

- Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure.
- Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization.
- Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability.
- Added Python bindings for the ReduceType class in tilelang IR module to ensure proper registration and usability.

* comment

* Refactor operator header files for improved readability

- Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity.
- Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions.

* Refactor MakeReduce method in ReduceOpNode for clarity

- Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability.
- This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation.

* Update Reduce operation type checks for consistency

- Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
- This adjustment enhances the clarity and consistency of the reduce type handling in the codebase.

* [AMD] Fix amd tir&add examples (#784)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.
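The overlap test that drives this conflict detection is a per-dimension interval intersection: two buffer regions conflict only if they overlap in every dimension. A minimal sketch with regions given as lists of (start, extent) pairs:

```python
def ranges_overlap(a, b):
    """Two half-open regions overlap iff they intersect in every
    dimension: start1 < start2 + extent2 and start2 < start1 + extent1."""
    return all(s1 < s2 + e2 and s2 < s1 + e1
               for (s1, e1), (s2, e2) in zip(a, b))
```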

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

* Add new AMD FlashAttention example and test script

- Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
- Added `test.sh` script to facilitate running the new example with specified parameters.
- Enhanced the overall structure and organization of the example for better clarity and usability.

* Update configurations in `example_amd_flash_attn_fwd.py` for autotuner

- Reduced the number of threads and `num_split_q` options for improved performance.
- Adjusted `panel_size` options to streamline configuration settings.

* Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

* Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

* Add example for AMD Flash Attention backward pass implementation

- Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
- Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
- Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
- Included reference implementation for validation against PyTorch's attention mechanism.

This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

* Enhance AMD Flash Attention example with additional testing capabilities

- Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
- Improved the main function to allow for better parameter configuration and benchmarking.
- Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

* Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* Refactor HIP intrinsic rules to CUDA

- Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
- Adjusted include paths for better organization and clarity in the code structure.

* Update AMD CI workflow to uninstall specific PyTorch packages before installation

- Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
- Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

* Remove unused shared memory allocations in AMD Flash Attention backward example

- Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
- This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

* Remove unnecessary pip uninstall command from AMD CI workflow

- Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
- This change simplifies the CI process and reduces potential overhead during package management.

* Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules

- Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
- Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

* Refactor formatting of HIP intrinsic rule registrations

- Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
- No functional changes were made; this update focuses on code style improvements to enhance maintainability.

* Update file na…