
[Refactor] Move AtomicAdd Vectorization to VectorizeLoop Pass #1677

Merged
LeiWang1999 merged 6 commits into tile-ai:main from LeiWang1999:atomic_0116
Jan 16, 2026

Conversation

@LeiWang1999 (Member) commented Jan 15, 2026

Summary

  • Move atomic add vectorization logic from atomicadd_vectorize.cc to the unified vectorize_loop.cc pass
  • This consolidates vectorization transformations into a single pass for better maintainability
  • The atomicadd_vectorize.cc now focuses only on atomic-specific rewrites while general vectorization is handled by vectorize_loop.cc

Test plan

  • Run existing atomic add tests
  • Verify vectorization behavior remains unchanged
  • Run regression tests

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added vectorized atomic-add support with broader type flexibility and vector-return variants (2×/4× components), improving GPU kernel expressiveness.
  • Refactor

    • Generalized atomic-add APIs to accept more input/output types and unified handling across architectures.
    • Simplified parallel-loop vectorization by removing an automatic post-vectorization atomic-add pass while keeping predicate wrapping.
  • Tests

    • Added tests covering auto-vectorized and complex parallel atomic-add scenarios.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot (Contributor) commented Jan 15, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

The PR generalizes CUDA atomic-add APIs with templated, vectorized overloads (x2/x4/Ret variants and ToFloat helpers) and moves atomic-add vectorization into the TLVectorizer by adding helpers and expr handlers, while removing the separate atomicadd_vectorize pass and its invocation from loop partitioning.
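
For orientation, the sketch below shows the rough shape of the pre-SM90 fallback path these overloads take, assembled from the excerpts quoted in the reviews further down rather than copied from atomic.h; the TL_DEVICE expansion, the plain-int memory_order default, and the Sketch suffixes are assumptions.

#include <cuda_runtime.h>

#ifndef TL_DEVICE
#define TL_DEVICE __forceinline__ __device__ // assumed expansion of TileLang's macro
#endif

// Pre-SM90 fallback: decompose a 4-wide atomic add into four scalar atomicAdd
// calls. On SM90+, the real header instead emits a single
// atom.<order>.gpu.global.add.v4.f32 PTX instruction, and the x2/Ret/half/
// bfloat16 variants follow the same pattern.
template <typename src_dtype, typename dst_dtype>
TL_DEVICE void AtomicAddx4Sketch(dst_dtype *ref, src_dtype *val,
                                 int memory_order = 0) {
  (void)memory_order; // the scalar fallback ignores the requested ordering
  float4 add_val = *reinterpret_cast<float4 *>(val);
  float *dst = reinterpret_cast<float *>(ref); // only float32 destinations reach
                                               // this path per GetMaxAtomicVectorSize
  atomicAdd(dst + 0, add_val.x);
  atomicAdd(dst + 1, add_val.y);
  atomicAdd(dst + 2, add_val.z);
  atomicAdd(dst + 3, add_val.w);
}

// Call shape the vectorizer emits: one call per contiguous 4-float chunk.
__global__ void accumulate(float *C, float *partial) {
  int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
  AtomicAddx4Sketch(&C[i], &partial[i]);
}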

Changes

  • CUDA atomic header (src/tl_templates/cuda/atomic.h): Added templated overloads for AtomicAddx2/AtomicAddx2Ret and AtomicAddx4/AtomicAddx4Ret (templated src_type/dst_dtype/ValType), ToFloat2/ToFloat4 helpers, vector-return variants (float2/float4), and non-relaxed/PTX fallback paths under arch guards.
  • Vectorization pass (src/transform/vectorize_loop.cc): Added helpers ExtractBufferLoadForAtomic, GetVectorizedAtomicOp, and GetMaxAtomicVectorSize; extended TLVectorizer with MutateAddressOfCall_ and MutateAtomicAddExpr_; integrated address_of and atomic_add_elem_op handling to emit vectorized atomic-add calls.
  • Removed vectorize module (src/transform/atomicadd_vectorize.cc, src/transform/atomicadd_vectorize.h): Deleted the separate atomic-add vectorization implementation and its header (rewriter, ExtractBufferLoad, GetVectorizedAtomicOp, VectorizeAtomicAdd entry).
  • Loop partitioning pass (src/transform/loop_partition.cc): Removed the VectorizeAtomicAdd(result_loop) invocation from LowerParallelLoop, keeping predicate wrapping and returning the transformed loop.
  • Atomic add op include (src/op/atomic_add.cc): Removed the include of atomicadd_vectorize.h (header deleted).
  • Tests (testing/python/language/test_tilelang_language_atomic.py): Added tests and runners for an auto-vectorized atomic-add and a more complex parallel atomic-add program, asserting that the generated kernel contains float4/AtomicAddx4.

Sequence Diagram(s)

sequenceDiagram
    participant LoopPartition as LoopPartition
    participant TLVectorizer as TLVectorizer
    participant CUDAAtomics as CUDA_Atomic_Templates

    LoopPartition->>TLVectorizer: invoke vectorization on loop (no separate VectorizeAtomicAdd)
    TLVectorizer->>TLVectorizer: detect address_of / atomic_add_elem_op in exprs
    TLVectorizer->>CUDAAtomics: emit vectorized AtomicAddx2/AtomicAddx4 (templated ValType/dst)
    CUDAAtomics-->>TLVectorizer: vectorized atomic call form (codegen-ready)
    TLVectorizer-->>LoopPartition: return transformed loop containing atomic vector calls

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • Rachmanino

Poem

🐇
"I hop through code in morning light,
templates stitched and vectors bright,
atom by atom I rearrange,
loops now sing in vector range,
a tiny rabbit cheers the change!"

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Docstring Coverage: ⚠️ Warning. Docstring coverage is 32.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Title check: ✅ Passed. The pull request title accurately summarizes the main refactoring change: moving atomic add vectorization logic from a separate pass (atomicadd_vectorize.cc) into the unified vectorize_loop.cc pass.
Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.



coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/transform/loop_partition.cc (1)

300-305: Remove unused atomicadd_vectorize.h include on line 32.

The removal of the VectorizeAtomicAdd call correctly aligns with the PR objective of consolidating vectorization into the unified vectorize_loop.cc pass. However, the corresponding include for atomicadd_vectorize.h on line 32 is now unused and should be removed—no symbols from that header are referenced elsewhere in this file.

🤖 Fix all issues with AI agents
In `@src/transform/vectorize_loop.cc`:
- Around line 562-569: The code calls ExtractBufferLoadForAtomic(dst) and
immediately uses dst_buffer_load.value(), which can be empty; add a null check
(e.g., test dst_buffer_load.has_value() or equivalent) before accessing
.value(), and if it's empty follow the existing fallback (return
tvm::ffi::GetRef<PrimExpr>(op)); update the block around dst_buffer_load, Target
target, and max_vec_size to bail out early when no BufferLoad was extracted to
avoid undefined behavior.
🧹 Nitpick comments (1)
src/tl_templates/cuda/atomic.h (1)

746-765: Type safety relies on caller constraints.

The fallback implementations assume dst_dtype is float-compatible since atomicAdd is called with float values from float4. This works correctly because GetMaxAtomicVectorSize only enables vector size 4 for float32 types. Consider adding a static_assert for extra safety.

💡 Optional: Add compile-time type check
 template <typename src_dtype, typename dst_dtype>
 TL_DEVICE void AtomicAddx4(dst_dtype *ref, src_dtype *val,
                            int memory_order = int(cuda::memory_order_relaxed)) {
+  static_assert(std::is_same_v<dst_dtype, float>,
+                "AtomicAddx4 fallback only supports float destination type");
   (void)memory_order;
   float4 add_val = *reinterpret_cast<float4 *>(val);
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b27fb92 and 639e8e4.

📒 Files selected for processing (3)
  • src/tl_templates/cuda/atomic.h
  • src/transform/loop_partition.cc
  • src/transform/vectorize_loop.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/vectorize_loop.cc
🧬 Code graph analysis (1)
src/transform/vectorize_loop.cc (1)
src/transform/atomicadd_vectorize.cc (4)
  • dtype (64-73)
  • dtype (64-64)
  • GetVectorizedAtomicOp (39-48)
  • GetVectorizedAtomicOp (39-39)
🔇 Additional comments (6)
src/transform/vectorize_loop.cc (4)

40-44: LGTM: Appropriate includes added for atomic vectorization support.

The new includes are correctly added to support the atomic vectorization functionality being moved into this file.


124-168: LGTM: Helper functions correctly mirror the original atomicadd_vectorize.cc implementations.

The logic for GetVectorizedAtomicOp and GetMaxAtomicVectorSize aligns with the existing implementations in atomicadd_vectorize.cc, ensuring consistent behavior as the vectorization is consolidated into this pass.
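
As a rough illustration of the dtype/architecture gating involved (the atomic.h comments below confirm only that width 4 is limited to float32 on SM >= 90), one plausible shape of the size-selection helper; the width-2 condition for half/bfloat16 is an assumption, and the real function works on TVM DataType and Target objects rather than these flags.

// Illustrative only; not the actual GetMaxAtomicVectorSize in vectorize_loop.cc.
int GetMaxAtomicVectorSizeSketch(bool is_float32, bool is_half_or_bf16, int sm_arch) {
  if (is_float32 && sm_arch >= 90) return 4; // AtomicAddx4 / float4 path
  if (is_half_or_bf16) return 2;             // AtomicAddx2 path (assumed condition)
  return 1;                                  // otherwise keep scalar atomic adds
}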


501-526: LGTM: Address normalization logic is correct.

The substitution of var_ with 0 correctly removes the vectorized index component to yield the base address for atomic operations.
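
A small standalone illustration of that substitution, with made-up buffer and index names:

#include <cuda_runtime.h>

// v is the vectorized lane index (0..3) that the pass substitutes with 0.
__device__ void address_normalization_example(float *C, int base, int v) {
  float *lane_addr = &C[base + v];                       // per-lane address in the loop body
  float *vec_addr = &C[base + 0];                        // after substituting v -> 0: the base address
  float4 *as_vec = reinterpret_cast<float4 *>(vec_addr); // what the vectorized atomic call receives
  (void)lane_addr;
  (void)as_vec;
}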


595-599: LGTM: Expression visitor correctly routes atomic operations.

The new dispatch cases properly integrate atomic add vectorization into the existing expression visitor pattern.

src/tl_templates/cuda/atomic.h (2)

567-608: Template generalization looks correct given usage constraints.

The template parameters src_dtype and dst_dtype are added for API flexibility, but the implementation still assumes float-compatible types via reinterpret_cast<float4*>. This is safe because GetMaxAtomicVectorSize only returns 4 for float32 types on SM >= 90, ensuring these functions are only called with float pointers.


610-646: New float4 value overload is a useful addition.

This overload simplifies call sites that already have the value in float4 form, avoiding an unnecessary address-of operation.


Comment on lines +562 to +569
    auto dst_buffer_load = ExtractBufferLoadForAtomic(dst);
    Target target = Target::Current(false);
    int max_vec_size =
        GetMaxAtomicVectorSize(dst_buffer_load.value()->buffer->dtype, target);
    if (vector_size > max_vec_size) {
      // Vector size not supported for this dtype, cannot vectorize
      return tvm::ffi::GetRef<PrimExpr>(op);
    }
coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🔴 Critical

Missing null check on Optional<BufferLoad> before calling .value().

ExtractBufferLoadForAtomic(dst) returns an Optional<BufferLoad>. If dst is neither a BufferLoad nor an address_of wrapping a BufferLoad, the returned Optional will be empty, and calling .value() on line 565 will cause undefined behavior.

🐛 Proposed fix to add null check
     auto dst = VisitExpr(op->args[0]);
     auto src = VisitExpr(op->args[1]);
     // Check if dtype supports this vector size
     auto dst_buffer_load = ExtractBufferLoadForAtomic(dst);
+    if (!dst_buffer_load.defined()) {
+      // Cannot extract buffer load, fall back to non-vectorized path
+      return tvm::ffi::GetRef<PrimExpr>(op);
+    }
     Target target = Target::Current(false);
+    if (!target.defined()) {
+      // No target available, fall back to non-vectorized path
+      return tvm::ffi::GetRef<PrimExpr>(op);
+    }
     int max_vec_size =
         GetMaxAtomicVectorSize(dst_buffer_load.value()->buffer->dtype, target);

Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors atomic add vectorization by moving the transformation logic from the separate atomicadd_vectorize.cc pass into the unified vectorize_loop.cc pass. The refactor consolidates vectorization transformations for better maintainability while preserving the same functionality.

Changes:

  • Atomic add vectorization logic integrated into TLVectorizer class in vectorize_loop.cc
  • Removed separate VectorizeAtomicAdd call from loop partitioning pipeline
  • Added template overloads in atomic.h to support float4 value parameter

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • src/transform/vectorize_loop.cc: Adds atomic add vectorization methods (MutateAtomicAddExpr_, MutateAddressOfCall_) and helper functions to the TLVectorizer class.
  • src/transform/loop_partition.cc: Removes the separate VectorizeAtomicAdd call and the associated include, as vectorization is now handled by VectorizeLoop.
  • src/tl_templates/cuda/atomic.h: Adds template overloads for AtomicAddx4 and AtomicAddx4Ret accepting float4 values directly.
Comments suppressed due to low confidence (1)

src/transform/loop_partition.cc:32

  • The include for "atomicadd_vectorize.h" is still present but no longer used since VectorizeAtomicAdd has been removed from this file. This include should be removed.
#include "atomicadd_vectorize.h"


    if (memory_order == int(cuda::memory_order_release) ||
        memory_order == int(cuda::memory_order_consume)) {
      asm volatile(
          "atom.global.gpu.release.add.v4.f32 {%0,%1,%2,%3}, [%4], "
Copilot AI commented Jan 15, 2026

The PTX instruction format is incorrect. The order should be "atom.release.gpu.global.add.v4.f32" not "atom.global.gpu.release.add.v4.f32". The memory order (release) should come before the scope (global), matching the format used in lines 624, 631, and 639 for the similar overload.

          : "memory");
    } else if (memory_order == int(cuda::memory_order_acquire)) {
      asm volatile(
          "atom.global.gpu.acquire.add.v4.f32 {%0,%1,%2,%3}, [%4], "
Copilot AI commented Jan 15, 2026

The PTX instruction format is incorrect. The order should be "atom.acquire.gpu.global.add.v4.f32" not "atom.global.gpu.acquire.add.v4.f32". The memory order (acquire) should come before the scope (global), matching the format used in line 630 for the similar overload.

    } else if (memory_order == int(cuda::memory_order_acq_rel) ||
               memory_order == int(cuda::memory_order_seq_cst)) {
      asm volatile(
          "atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
Copilot AI commented Jan 15, 2026

The PTX instruction format is incorrect. The order should be "atom.acq_rel.gpu.global.add.v4.f32" not "atom.global.gpu.acq_rel.add.v4.f32". The memory order (acq_rel) should come before the scope (global), matching the format used in lines 638-639 for the similar overload.

Suggested change:
-          "atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+          "atom.acq_rel.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "

    auto dst = VisitExpr(op->args[0]);
    auto src = VisitExpr(op->args[1]);
    // Check if dtype supports this vector size
    auto dst_buffer_load = ExtractBufferLoadForAtomic(dst);
Copilot AI commented Jan 15, 2026

Potential crash if dst_buffer_load is not defined. The code calls .value() without checking if the Optional has a value. Add a check: if (!dst_buffer_load.defined()) { return tvm::ffi::GetRef<PrimExpr>(op); } before line 563.

Suggested change:
-    auto dst_buffer_load = ExtractBufferLoadForAtomic(dst);
+    auto dst_buffer_load = ExtractBufferLoadForAtomic(dst);
+    if (!dst_buffer_load.defined()) {
+      return tvm::ffi::GetRef<PrimExpr>(op);
+    }

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/tl_templates/cuda/atomic.h (1)

712-752: Incorrect PTX instruction format in AtomicAddx4Ret.

The PTX instructions at lines 725, 733, and 742 have incorrect ordering. The correct PTX syntax is atom.sem.scope.space.op.type, but these instructions have atom.space.scope.sem.op.type.

Compare with the correct format used in AtomicAddx4 at lines 646, 654, 663: atom.release.gpu.global.add.v4.f32

Proposed fix
     if (memory_order == int(cuda::memory_order_release) ||
         memory_order == int(cuda::memory_order_consume)) {
-      asm volatile("atom.global.gpu.release.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                    "{%5,%6,%7,%8};"
                    : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
                      "=f"(ret_val.w)
                    : "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
                      "f"(add_val.z), "f"(add_val.w)
                    : "memory");
     } else if (memory_order == int(cuda::memory_order_acquire)) {
-      asm volatile("atom.global.gpu.acquire.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.acquire.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                    "{%5,%6,%7,%8};"
                    : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
                      "=f"(ret_val.w)
                    : "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
                      "f"(add_val.z), "f"(add_val.w)
                    : "memory");
     } else if (memory_order == int(cuda::memory_order_acq_rel) ||
                memory_order == int(cuda::memory_order_seq_cst)) {
-      asm volatile("atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.acq_rel.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                    "{%5,%6,%7,%8};"
🤖 Fix all issues with AI agents
In `@src/tl_templates/cuda/atomic.h`:
- Around line 600-629: The function AtomicAddx2Ret currently declares an unused
template parameter src_type; remove the template so the function becomes a plain
(non-templated) function. Specifically, change the declaration/definition from
template<typename src_type> TL_DEVICE float2 AtomicAddx2Ret(...) to TL_DEVICE
float2 AtomicAddx2Ret(float *ref, float2 val, int memory_order =
int(cuda::memory_order_relaxed)), and update any corresponding forward
declarations/overloads to match; ensure no remaining references to src_type
remain in the AtomicAddx2Ret implementation or its declarations.
♻️ Duplicate comments (1)
src/tl_templates/cuda/atomic.h (1)

754-789: Incorrect PTX instruction format (same issue as lines 725-742).

The PTX instructions at lines 766, 773, and 781 have the same incorrect ordering issue. The memory order qualifier should precede the scope and space.

Proposed fix
     if (memory_order == int(cuda::memory_order_release) ||
         memory_order == int(cuda::memory_order_consume)) {
       asm volatile(
-          "atom.global.gpu.release.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+          "atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
           "{%5,%6,%7,%8};"
           : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z), "=f"(ret_val.w)
           : "l"(ref_addr), "f"(val.x), "f"(val.y), "f"(val.z), "f"(val.w)
           : "memory");
     } else if (memory_order == int(cuda::memory_order_acquire)) {
       asm volatile(
-          "atom.global.gpu.acquire.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+          "atom.acquire.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
           "{%5,%6,%7,%8};"
           : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z), "=f"(ret_val.w)
           : "l"(ref_addr), "f"(val.x), "f"(val.y), "f"(val.z), "f"(val.w)
           : "memory");
     } else if (memory_order == int(cuda::memory_order_acq_rel) ||
                memory_order == int(cuda::memory_order_seq_cst)) {
       asm volatile(
-          "atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+          "atom.acq_rel.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
           "{%5,%6,%7,%8};"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 639e8e4 and 53f4a5d.

📒 Files selected for processing (1)
  • src/tl_templates/cuda/atomic.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (8)
src/tl_templates/cuda/atomic.h (8)

330-375: LGTM!

The templated AtomicAddx2 for half_t correctly uses the PTX instruction format (atom.release.gpu.global.add.noftz.v2.f16) and handles all memory ordering cases appropriately.


377-420: LGTM!

Correctly returns the old value from the atomic operation with proper PTX syntax.


422-505: LGTM!

The bfloat16 variants correctly implement vectorized atomic adds with proper PTX syntax for all memory orderings.


507-537: LGTM!

Correctly implements vectorized float2 atomic add with proper PTX instruction ordering.


539-565: LGTM!

Good addition of the by-value overload for float2, providing a cleaner API when the caller already has a vector value.


631-672: LGTM!

The dual-template AtomicAddx4 correctly supports cross-type vectorized operations with proper PTX syntax.


674-710: LGTM!

Good addition of the by-value float4 overload with correct PTX instruction ordering.


790-859: LGTM!

The fallback implementations for pre-SM90 architectures correctly decompose vectorized operations into scalar atomicAdd calls. The new templated overloads are consistent with the SM90+ versions.


…undancy

- Introduced utility functions ToFloat2 and ToFloat4 for type conversion, enhancing the flexibility of atomic add operations.
- Updated AtomicAddx2 and AtomicAddx4 functions to accept generic value types, streamlining the code and improving readability.
- Removed outdated function overloads and consolidated logic for atomic addition, ensuring consistency across implementations.
- Enhanced memory order handling in atomic operations for better performance and clarity.
- Deleted the atomicadd_vectorize.cc and atomicadd_vectorize.h files, which contained the automatic vectorization pass for atomic add operations.
- Removed references to the deleted files in atomic_add.cc and loop_partition.cc to clean up the codebase and eliminate unused dependencies.
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@testing/python/language/test_tilelang_language_atomic.py`:
- Around line 490-514: Add the missing runtime CUDA presence check to the test
by decorating test_atomic_add_complicated_parallel with
`@tilelang.testing.requires_cuda` in addition to the existing
`@tilelang.testing.requires_cuda_compute_version_ge(9, 0)`; locate the test
function named test_atomic_add_complicated_parallel (and the helper
run_atomic_add_complicated_parallel / atomic_add_complicated_parallel_program)
and place the requires_cuda decorator immediately above the function so the test
skips on machines without a CUDA driver.
♻️ Duplicate comments (1)
src/tl_templates/cuda/atomic.h (1)

636-659: Inconsistent PTX instruction format in AtomicAddx4Ret.

The PTX instructions here use atom.global.gpu.<order>.add.v4.f32 but the correct format (matching AtomicAddx4 above at lines 596-619) should be atom.<order>.gpu.global.add.v4.f32. The memory order should come before the scope.

Suggested fix
     if (memory_order == int(cuda::memory_order_release) ||
         memory_order == int(cuda::memory_order_consume)) {
-      asm volatile("atom.global.gpu.release.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                    "{%5,%6,%7,%8};"
                    : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
                      "=f"(ret_val.w)
                    : "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
                      "f"(add_val.z), "f"(add_val.w)
                    : "memory");
     } else if (memory_order == int(cuda::memory_order_acquire)) {
-      asm volatile("atom.global.gpu.acquire.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.acquire.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                    "{%5,%6,%7,%8};"
                    : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
                      "=f"(ret_val.w)
                    : "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
                      "f"(add_val.z), "f"(add_val.w)
                    : "memory");
     } else if (memory_order == int(cuda::memory_order_acq_rel) ||
                memory_order == int(cuda::memory_order_seq_cst)) {
-      asm volatile("atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.acq_rel.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                    "{%5,%6,%7,%8};"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53f4a5d and 173fe4d.

📒 Files selected for processing (6)
  • src/op/atomic_add.cc
  • src/tl_templates/cuda/atomic.h
  • src/transform/atomicadd_vectorize.cc
  • src/transform/atomicadd_vectorize.h
  • src/transform/loop_partition.cc
  • testing/python/language/test_tilelang_language_atomic.py
💤 Files with no reviewable changes (3)
  • src/op/atomic_add.cc
  • src/transform/atomicadd_vectorize.cc
  • src/transform/atomicadd_vectorize.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-01-06T05:20:51.649Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1606
File: testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py:30-30
Timestamp: 2026-01-06T05:20:51.649Z
Learning: In `testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py`, the test validates that the `hoist_broadcast_values` transformation pass correctly identifies and hoists broadcast operations by checking for patterns in the generated kernel source code. The specific literal values used (e.g., 430) are not important for the test's purpose, as it does not validate numerical precision or actual stored tensor values.

Applied to files:

  • testing/python/language/test_tilelang_language_atomic.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_atomic.py
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_atomic.py (3)
tilelang/language/dtypes.py (1)
  • float32 (310-310)
tilelang/testing/__init__.py (1)
  • requires_cuda_compute_version_ge (107-108)
tilelang/language/atomic.py (1)
  • atomic_add (194-301)
🔇 Additional comments (7)
testing/python/language/test_tilelang_language_atomic.py (1)

479-487: LGTM! Test validates auto-vectorization of atomic add.

The test correctly verifies that the kernel source contains AtomicAddx4, confirming the vectorization pass works as expected.

src/tl_templates/cuda/atomic.h (5)

508-512: LGTM! Clean helper functions for type conversion.

ToFloat2 overloads provide a uniform interface for both pointer and value arguments, enabling the templated AtomicAddx2 to accept diverse input types.
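
A minimal sketch of that pointer/value pattern, under the assumption that the real helpers are additionally templated so half/bfloat16 sources can be widened to float2:

#include <cuda_runtime.h>

#ifndef TL_DEVICE
#define TL_DEVICE __forceinline__ __device__ // assumed expansion of the macro
#endif

// Pointer overload: reinterpret two consecutive floats as one float2 value.
TL_DEVICE float2 ToFloat2Sketch(float *val) { return *reinterpret_cast<float2 *>(val); }

// Value overload: pass an existing float2 through, so callers that already
// hold a vector value avoid an address-of round trip.
TL_DEVICE float2 ToFloat2Sketch(float2 val) { return val; }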


576-580: LGTM! Analogous ToFloat4 helpers for x4 vectorization.

Consistent with the ToFloat2 pattern above.


582-622: LGTM! Templated AtomicAddx4 with proper PTX format.

The PTX instructions correctly follow the atom.<order>.<scope>.<space>.<op>.<type> format (e.g., atom.release.gpu.global.add.v4.f32).


665-675: LGTM! Non-SM90 fallback with consistent helper usage.

The fallback ToFloat2/ToFloat4 helpers and scalar atomicAdd calls provide correct behavior on older architectures.


698-721: LGTM! Fallback AtomicAddx4 for pre-SM90 devices.

Correctly decomposes the vectorized operation into four scalar atomicAdd calls.

src/transform/loop_partition.cc (1)

295-305: LGTM! Clean removal of dedicated atomic vectorization step.

The atomic add vectorization is now consolidated into the VectorizeLoop pass (Step 2), and step numbering is correctly updated. This simplifies the pipeline and improves maintainability as per PR objectives.


@LeiWang1999 (Member, Author) commented:

@regression-perf

@github-actions

Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21056734510

Results

File Original Latency Current Latency Speedup
example_tilelang_gemm_fp8_intrinsic 0.467449 0.933958 0.500504
example_warp_specialize_gemm_copy_1_gemm_0 0.038017 0.03936 0.965879
example_tilelang_gemm_fp8_2xAcc 0.188261 0.190438 0.988566
sparse_mla_fwd 0.130721 0.131943 0.990737
example_mha_fwd_varlen 0.0453265 0.0456827 0.992203
sparse_mla_fwd_pipelined 0.0948054 0.095391 0.993862
example_mha_bwd_bshd 0.0408453 0.0410761 0.994382
sparse_mla_bwd 0.384144 0.386225 0.994612
example_tilelang_gemm_fp8 0.320252 0.321796 0.9952
example_mha_bwd_bshd_wgmma_pipelined 0.0261581 0.0262814 0.995306
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.43094 3.44632 0.995539
example_dynamic 0.655216 0.657871 0.995964
example_mha_inference 0.080865 0.081186 0.996046
example_mha_sink_bwd_bhsd_sliding_window 0.044627 0.0447692 0.996825
example_warp_specialize_gemm_barrierpipe_stage2 0.039937 0.040064 0.99683
example_gemm_autotune 0.022272 0.022336 0.997135
example_mha_bwd_bhsd 0.0400738 0.0401751 0.99748
example_mha_sink_fwd_bhsd 0.0155872 0.015621 0.997839
example_dequant_gemm_bf16_mxfp4_hopper 0.509645 0.510728 0.997879
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0153991 0.0154283 0.998112
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0145984 0.0146255 0.99815
topk_selector 0.0539926 0.0540923 0.998156
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0154851 0.0155058 0.998664
example_convolution 1.33263 1.33414 0.998867
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0144566 0.0144729 0.998873
example_dequant_gemm_w4a8 5.39925 5.40217 0.999459
example_mla_decode 0.461066 0.461292 0.99951
tilelang_example_sparse_tensorcore 0.0150824 0.0150898 0.999512
example_linear_attn_fwd 0.0369878 0.0370028 0.999595
example_per_token_cast_to_fp8 0.00738024 0.0073831 0.999613
example_group_per_split_token_cast_to_fp8 0.0102857 0.0102896 0.999622
example_convolution_autotune 0.992859 0.99317 0.999687
example_gqa_sink_bwd_bhsd_sliding_window 0.0256221 0.0256288 0.999739
fp8_lighting_indexer 0.0359739 0.0359827 0.999758
example_gqa_bwd_tma_reduce_varlen 0.0523302 0.0523428 0.999759
example_tilelang_gemm_splitk_vectorize_atomicadd 1.42305 1.42338 0.99977
example_dequant_gemv_fp16xint4 0.0284187 0.0284251 0.999775
example_mha_sink_fwd_bhsd_sliding_window 0.015659 0.0156622 0.999797
example_gqa_sink_bwd_bhsd 0.0417043 0.0417126 0.999801
example_dequant_gemm_bf16_fp4_hopper 0.573614 0.573709 0.999834
example_tilelang_nsa_fwd 0.00700489 0.00700562 0.999896
example_topk 0.010912 0.010913 0.999908
example_fusedmoe_tilelang 0.141389 0.14139 0.999998
example_gemv 0.288839 0.288839 1
example_elementwise_add 0.297629 0.297614 1.00005
example_mha_sink_bwd_bhsd 0.0624642 0.062452 1.0002
block_sparse_attn_tilelang 0.010295 0.010292 1.00029
example_tilelang_nsa_decode 0.00675683 0.00675467 1.00032
example_tilelang_gemm_splitk 1.42191 1.42138 1.00037
example_gqa_bwd 0.049834 0.0498093 1.0005
example_gqa_bwd_wgmma_pipelined 0.0699156 0.0698553 1.00086
example_blocksparse_gemm 0.0227278 0.0227077 1.00089
example_gemm_schedule 0.0326165 0.0325799 1.00112
example_gemm 0.023009 0.022944 1.00283
example_gemm_intrinsics 0.035233 0.035105 1.00365
example_linear_attn_bwd 0.154741 0.153882 1.00558
example_gqa_decode 0.048641 0.048257 1.00796
example_warp_specialize_gemm_softpipe_stage2 0.039489 0.03888 1.01566
example_warp_specialize_gemm_copy_0_gemm_1 0.039329 0.038241 1.02845
example_dequant_gemm_fp4_hopper 1.07955 1.03363 1.04442

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999 (Member, Author) commented:

@regression-perf

@github-actions

Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21058280958

Results

File Original Latency Current Latency Speedup
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.54181 3.64796 0.970904
example_warp_specialize_gemm_copy_1_gemm_0 0.037825 0.038913 0.97204
example_dequant_gemm_fp4_hopper 1.05349 1.08235 0.973331
example_warp_specialize_gemm_copy_0_gemm_1 0.037857 0.038817 0.975269
example_warp_specialize_gemm_barrierpipe_stage2 0.03952 0.040065 0.986397
example_tilelang_gemm_fp8_2xAcc 0.184691 0.186931 0.988014
example_gemm_autotune 0.022175 0.022401 0.989911
example_mha_fwd_bshd 0.025729 0.025984 0.990186
example_mha_fwd_bhsd 0.010944 0.01104 0.991304
example_gqa_fwd_bshd 0.071586 0.071938 0.995107
example_mha_sink_bwd_bhsd_sliding_window 0.0447412 0.0449145 0.996142
example_warp_specialize_gemm_softpipe_stage2 0.039456 0.039552 0.997573
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0155677 0.0155883 0.998681
example_convolution 1.33369 1.3352 0.99887
example_dynamic 0.656363 0.657039 0.998971
example_dequant_gemm_bf16_fp4_hopper 0.573003 0.573548 0.99905
example_mha_bwd_bhsd 0.0401804 0.0402165 0.999102
example_gemm_schedule 0.0325683 0.0325971 0.999118
topk_selector 0.0540546 0.0541 0.999162
example_dequant_gemm_w4a8 5.39368 5.3974 0.999311
example_mha_sink_bwd_bhsd 0.0624407 0.0624813 0.99935
block_sparse_attn_tilelang 0.0102964 0.0103022 0.999445
fp8_lighting_indexer 0.0359701 0.0359874 0.999518
example_tilelang_gemm_fp8_intrinsic 0.934151 0.934581 0.999539
example_mha_bwd_bshd 0.041062 0.0410778 0.999615
sparse_mla_bwd 0.386233 0.386377 0.999627
example_gemv 0.288906 0.289007 0.999651
example_group_per_split_token_cast_to_fp8 0.0102885 0.0102908 0.999776
example_linear_attn_fwd 0.0370024 0.0370105 0.999782
example_per_token_cast_to_fp8 0.00738017 0.00738133 0.999842
example_tilelang_nsa_decode 0.00675654 0.00675739 0.999874
example_gqa_sink_bwd_bhsd 0.0416908 0.0416957 0.999884
example_topk 0.010944 0.010945 0.999909
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0144701 0.0144712 0.999918
example_blocksparse_gemm 0.0227054 0.0227072 0.999924
tilelang_example_sparse_tensorcore 0.0150847 0.0150856 0.999942
example_fusedmoe_tilelang 0.141397 0.141401 0.999974
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0146172 0.0146176 0.999978
sparse_mla_fwd 0.13196 0.131963 0.999983
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.015517 0.015517 0.999994
example_gqa_bwd 0.0498755 0.0498723 1.00006
example_mla_decode 0.461225 0.461192 1.00007
example_gqa_sink_bwd_bhsd_sliding_window 0.0256167 0.0256134 1.00013
example_dequant_gemv_fp16xint4 0.0284022 0.0283977 1.00016
example_tilelang_gemm_splitk_vectorize_atomicadd 1.42353 1.42303 1.00035
example_gqa_bwd_tma_reduce_varlen 0.0523734 0.0523551 1.00035
example_elementwise_add 0.297652 0.297524 1.00043
example_tilelang_nsa_fwd 0.00701808 0.00701263 1.00078
example_mha_fwd_varlen 0.0457716 0.0457353 1.00079
example_gqa_bwd_wgmma_pipelined 0.0700921 0.0700328 1.00085
example_mha_bwd_bshd_wgmma_pipelined 0.026301 0.0262786 1.00085
example_tilelang_gemm_splitk 1.42211 1.42055 1.0011
example_mha_sink_fwd_bhsd_sliding_window 0.015764 0.0157326 1.002
example_convolution_autotune 0.994996 0.992911 1.0021
sparse_mla_fwd_pipelined 0.0953981 0.095021 1.00397
example_gemm_intrinsics 0.035265 0.035104 1.00459
example_mha_inference 0.079425 0.079008 1.00528
example_linear_attn_bwd 0.154799 0.153889 1.00591
example_tilelang_gemm_fp8 0.323043 0.32095 1.00652
example_dequant_gemm_bf16_mxfp4_hopper 0.512527 0.509159 1.00661
example_mha_sink_fwd_bhsd 0.0157518 0.0156482 1.00662
example_mha_fwd_bshd_wgmma_pipelined 0.01424 0.014144 1.00679
example_gemm 0.022816 0.022657 1.00702
example_gqa_decode 0.048801 0.048417 1.00793
example_gqa_fwd_bshd_wgmma_pipelined 0.056353 0.055809 1.00975
example_mha_fwd_bhsd_wgmma_pipelined 0.014304 0.013952 1.02523

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999 LeiWang1999 merged commit 5feb225 into tile-ai:main Jan 16, 2026
6 checks passed
