[Refactor] Move AtomicAdd Vectorization to VectorizeLoop Pass #1677

LeiWang1999 merged 6 commits into tile-ai:main
Conversation
👋 Hi! Thank you for contributing to the TileLang project. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

The PR generalizes CUDA atomic-add APIs with templated, vectorized overloads (x2/x4/Ret variants and ToFloat helpers) and moves atomic-add vectorization into the TLVectorizer by adding helpers and expression handlers, while removing the separate atomicadd_vectorize pass and its invocation from loop partitioning.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant LoopPartition as LoopPartition
    participant TLVectorizer as TLVectorizer
    participant CUDAAtomics as CUDA_Atomic_Templates
    LoopPartition->>TLVectorizer: invoke vectorization on loop (no separate VectorizeAtomicAdd)
    TLVectorizer->>TLVectorizer: detect address_of / atomic_add_elem_op in exprs
    TLVectorizer->>CUDAAtomics: emit vectorized AtomicAddx2/AtomicAddx4 (templated ValType/dst)
    CUDAAtomics-->>TLVectorizer: vectorized atomic call form (codegen-ready)
    TLVectorizer-->>LoopPartition: return transformed loop containing atomic vector calls
```
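To ground the walkthrough in code, here is a small, hedged sketch of the generalized x2 entry point, assuming the ToFloat2 helpers and templated value parameter described in the review comments below. The body shown is only the pre-SM90 scalar-fallback shape, not the PTX fast path, and all names besides AtomicAddx2/ToFloat2 are illustrative:

```cuda
// Hedged sketch (not the repo's exact code): ToFloat2 normalizes pointer or
// value arguments to float2, so the templated entry point accepts both.
__device__ __forceinline__ float2 ToFloat2(float2 v) { return v; }
__device__ __forceinline__ float2 ToFloat2(float *p) {
  return *reinterpret_cast<float2 *>(p);
}

template <typename value_t>
__device__ void AtomicAddx2Sketch(float *ref, value_t val) {
  float2 v = ToFloat2(val);  // pointer or value form, per the review
  atomicAdd(ref + 0, v.x);   // scalar fallback; SM90+ uses a PTX atom path
  atomicAdd(ref + 1, v.y);
}
```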
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/transform/loop_partition.cc (1)
300-305: Remove unused atomicadd_vectorize.h include on line 32.

The removal of the VectorizeAtomicAdd call correctly aligns with the PR objective of consolidating vectorization into the unified vectorize_loop.cc pass. However, the corresponding include for atomicadd_vectorize.h on line 32 is now unused and should be removed; no symbols from that header are referenced elsewhere in this file.
🤖 Fix all issues with AI agents
In `@src/transform/vectorize_loop.cc`:
- Around lines 562-569: The code calls ExtractBufferLoadForAtomic(dst) and
immediately uses dst_buffer_load.value(), which can be empty; add a null check
(e.g., test dst_buffer_load.has_value() or equivalent) before accessing
.value(), and if it's empty follow the existing fallback (return
tvm::ffi::GetRef<PrimExpr>(op)); update the block around dst_buffer_load, Target
target, and max_vec_size to bail out early when no BufferLoad was extracted to
avoid undefined behavior.
🧹 Nitpick comments (1)
src/tl_templates/cuda/atomic.h (1)
746-765: Type safety relies on caller constraints.

The fallback implementations assume dst_dtype is float-compatible, since atomicAdd is called with float values from a float4. This works correctly because GetMaxAtomicVectorSize only enables vector size 4 for float32 types. Consider adding a static_assert for extra safety.

💡 Optional: Add compile-time type check

```diff
 template <typename src_dtype, typename dst_dtype>
 TL_DEVICE void AtomicAddx4(dst_dtype *ref, src_dtype *val,
                            int memory_order = int(cuda::memory_order_relaxed)) {
+  static_assert(std::is_same_v<dst_dtype, float>,
+                "AtomicAddx4 fallback only supports float destination type");
   (void)memory_order;
   float4 add_val = *reinterpret_cast<float4 *>(val);
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/tl_templates/cuda/atomic.h
- src/transform/loop_partition.cc
- src/transform/vectorize_loop.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
src/transform/vectorize_loop.cc
🧬 Code graph analysis (1)
src/transform/vectorize_loop.cc (1)
src/transform/atomicadd_vectorize.cc (4)
- dtype (64-73), dtype (64-64)
- GetVectorizedAtomicOp (39-48), GetVectorizedAtomicOp (39-39)
🔇 Additional comments (6)
src/transform/vectorize_loop.cc (4)
40-44: LGTM: Appropriate includes added for atomic vectorization support.

The new includes are correctly added to support the atomic vectorization functionality being moved into this file.
124-168: LGTM: Helper functions correctly mirror the original atomicadd_vectorize.cc implementations.

The logic for GetVectorizedAtomicOp and GetMaxAtomicVectorSize aligns with the existing implementations in atomicadd_vectorize.cc, ensuring consistent behavior as the vectorization is consolidated into this pass.
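For orientation, here is a minimal sketch of the dtype gating these helpers perform, inferred only from this review's comments (vector width 4 is enabled for float32, and per a later comment only on SM >= 90). The TargetIsSM90OrLater helper is hypothetical, not a repo API:

```cpp
#include <tvm/runtime/data_type.h>
#include <tvm/target/target.h>

using tvm::DataType;
using tvm::Target;

bool TargetIsSM90OrLater(const Target &target);  // hypothetical helper

// Hedged sketch, not the repo's code: pick the max atomic vector width.
int GetMaxAtomicVectorSizeSketch(DataType dtype, const Target &target) {
  bool is_fp32 = dtype.is_float() && dtype.bits() == 32;
  if (is_fp32 && TargetIsSM90OrLater(target))
    return 4;  // AtomicAddx4 path, per the review comments
  return 1;    // conservative: no vectorized atomic form
}
```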
501-526: LGTM: Address normalization logic is correct.

The substitution of var_ with 0 correctly removes the vectorized index component to yield the base address for atomic operations.
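As a rough illustration of that substitution, a sketch assuming TVM's tir::Substitute utility (the function and variable names here are illustrative):

```cpp
#include <tvm/tir/op.h>            // make_zero
#include <tvm/tir/stmt_functor.h>  // tvm::tir::Substitute

using namespace tvm;
using namespace tvm::tir;

// Hedged sketch: replacing the vector lane variable with 0 leaves the
// base element index, which becomes the address of the vectorized atomic.
PrimExpr BaseIndexSketch(const PrimExpr &index, const Var &lane_var) {
  Map<Var, PrimExpr> subst;
  subst.Set(lane_var, make_zero(lane_var.dtype()));
  return Substitute(index, subst);
}
```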
595-599: LGTM: Expression visitor correctly routes atomic operations.

The new dispatch cases properly integrate atomic add vectorization into the existing expression visitor pattern.
src/tl_templates/cuda/atomic.h (2)
567-608: Template generalization looks correct given usage constraints.

The template parameters src_dtype and dst_dtype are added for API flexibility, but the implementation still assumes float-compatible types via reinterpret_cast<float4*>. This is safe because GetMaxAtomicVectorSize only returns 4 for float32 types on SM >= 90, ensuring these functions are only called with float pointers.
610-646: New float4 value overload is a useful addition.

This overload simplifies call sites that already have the value in float4 form, avoiding an unnecessary address-of operation.
```cpp
auto dst_buffer_load = ExtractBufferLoadForAtomic(dst);
Target target = Target::Current(false);
int max_vec_size =
    GetMaxAtomicVectorSize(dst_buffer_load.value()->buffer->dtype, target);
if (vector_size > max_vec_size) {
  // Vector size not supported for this dtype, cannot vectorize
  return tvm::ffi::GetRef<PrimExpr>(op);
}
```
Missing null check on Optional<BufferLoad> before calling .value().
ExtractBufferLoadForAtomic(dst) returns an Optional<BufferLoad>. If dst is neither a BufferLoad nor an address_of wrapping a BufferLoad, the returned Optional will be empty, and calling .value() on line 565 will cause undefined behavior.
🐛 Proposed fix to add null check

```diff
 auto dst = VisitExpr(op->args[0]);
 auto src = VisitExpr(op->args[1]);
 // Check if dtype supports this vector size
 auto dst_buffer_load = ExtractBufferLoadForAtomic(dst);
+if (!dst_buffer_load.defined()) {
+  // Cannot extract buffer load, fall back to non-vectorized path
+  return tvm::ffi::GetRef<PrimExpr>(op);
+}
 Target target = Target::Current(false);
+if (!target.defined()) {
+  // No target available, fall back to non-vectorized path
+  return tvm::ffi::GetRef<PrimExpr>(op);
+}
 int max_vec_size =
     GetMaxAtomicVectorSize(dst_buffer_load.value()->buffer->dtype, target);
```

🤖 Prompt for AI Agents
In `@src/transform/vectorize_loop.cc` around lines 562-569: The code calls
ExtractBufferLoadForAtomic(dst) and immediately uses dst_buffer_load.value(),
which can be empty; add a null check (e.g., test dst_buffer_load.has_value() or
equivalent) before accessing .value(), and if it's empty follow the existing
fallback (return tvm::ffi::GetRef<PrimExpr>(op)); update the block around
dst_buffer_load, Target target, and max_vec_size to bail out early when no
BufferLoad was extracted to avoid undefined behavior.
Pull request overview
This PR refactors atomic add vectorization by moving the transformation logic from the separate atomicadd_vectorize.cc pass into the unified vectorize_loop.cc pass. The refactor consolidates vectorization transformations for better maintainability while preserving the same functionality.
Changes:
- Atomic add vectorization logic integrated into the TLVectorizer class in vectorize_loop.cc
- Removed the separate VectorizeAtomicAdd call from the loop partitioning pipeline
- Added template overloads in atomic.h to support a float4 value parameter
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/transform/vectorize_loop.cc | Adds atomic add vectorization methods (MutateAtomicAddExpr_, MutateAddressOfCall_) and helper functions to the TLVectorizer class |
| src/transform/loop_partition.cc | Removes the separate VectorizeAtomicAdd call and associated include, as vectorization is now handled by VectorizeLoop |
| src/tl_templates/cuda/atomic.h | Adds template overloads for AtomicAddx4 and AtomicAddx4Ret accepting float4 values directly |
Comments suppressed due to low confidence (1)
src/transform/loop_partition.cc:32
- The include for "atomicadd_vectorize.h" is still present but no longer used since VectorizeAtomicAdd has been removed from this file. This include should be removed.
#include "atomicadd_vectorize.h"
src/tl_templates/cuda/atomic.h
Outdated
```cpp
if (memory_order == int(cuda::memory_order_release) ||
    memory_order == int(cuda::memory_order_consume)) {
  asm volatile(
      "atom.global.gpu.release.add.v4.f32 {%0,%1,%2,%3}, [%4], "
```
The PTX instruction format is incorrect. The order should be "atom.release.gpu.global.add.v4.f32" not "atom.global.gpu.release.add.v4.f32". The memory order (release) should come before the scope (global), matching the format used in lines 624, 631, and 639 for the similar overload.
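For reference, a corrected release-ordered variant under the sem-before-scope rule the reviewer cites; the operand bindings mirror the surrounding snippets, and the variable names (ret_val, ref_addr, add_val) are taken from them:

```cuda
// PTX qualifier order is atom.sem.scope.space.op.type, so the memory-order
// qualifier ("release") precedes the scope ("gpu") and state space ("global").
asm volatile("atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
             "{%5,%6,%7,%8};"
             : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
               "=f"(ret_val.w)
             : "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
               "f"(add_val.z), "f"(add_val.w)
             : "memory");
```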
src/tl_templates/cuda/atomic.h
Outdated
| : "memory"); | ||
| } else if (memory_order == int(cuda::memory_order_acquire)) { | ||
| asm volatile( | ||
| "atom.global.gpu.acquire.add.v4.f32 {%0,%1,%2,%3}, [%4], " |
The PTX instruction format is incorrect. The order should be "atom.acquire.gpu.global.add.v4.f32" not "atom.global.gpu.acquire.add.v4.f32". The memory order (acquire) should come before the scope (global), matching the format used in line 630 for the similar overload.
src/tl_templates/cuda/atomic.h
Outdated
```cpp
} else if (memory_order == int(cuda::memory_order_acq_rel) ||
           memory_order == int(cuda::memory_order_seq_cst)) {
  asm volatile(
      "atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
```
The PTX instruction format is incorrect. The order should be "atom.acq_rel.gpu.global.add.v4.f32" not "atom.global.gpu.acq_rel.add.v4.f32". The memory order (acq_rel) should come before the scope (global), matching the format used in lines 638-639 for the similar overload.
| "atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], " | |
| "atom.acq_rel.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], " |
```cpp
auto dst = VisitExpr(op->args[0]);
auto src = VisitExpr(op->args[1]);
// Check if dtype supports this vector size
auto dst_buffer_load = ExtractBufferLoadForAtomic(dst);
```
Potential crash if dst_buffer_load is not defined. The code calls .value() without checking if the Optional has a value. Add a check: if (!dst_buffer_load.defined()) { return tvm::ffi::GetRef<PrimExpr>(op); } before line 563.
```diff
 auto dst_buffer_load = ExtractBufferLoadForAtomic(dst);
+if (!dst_buffer_load.defined()) {
+  return tvm::ffi::GetRef<PrimExpr>(op);
+}
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/tl_templates/cuda/atomic.h (1)

712-752: Incorrect PTX instruction format in AtomicAddx4Ret.

The PTX instructions at lines 725, 733, and 742 have incorrect ordering. The correct PTX syntax is atom.sem.scope.space.op.type, but these instructions have atom.space.scope.sem.op.type. Compare with the correct format used in AtomicAddx4 at lines 646, 654, 663: atom.release.gpu.global.add.v4.f32

Proposed fix

```diff
   if (memory_order == int(cuda::memory_order_release) ||
       memory_order == int(cuda::memory_order_consume)) {
-    asm volatile("atom.global.gpu.release.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+    asm volatile("atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                  "{%5,%6,%7,%8};"
                  : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
                    "=f"(ret_val.w)
                  : "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
                    "f"(add_val.z), "f"(add_val.w)
                  : "memory");
   } else if (memory_order == int(cuda::memory_order_acquire)) {
-    asm volatile("atom.global.gpu.acquire.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+    asm volatile("atom.acquire.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                  "{%5,%6,%7,%8};"
                  : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
                    "=f"(ret_val.w)
                  : "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
                    "f"(add_val.z), "f"(add_val.w)
                  : "memory");
   } else if (memory_order == int(cuda::memory_order_acq_rel) ||
              memory_order == int(cuda::memory_order_seq_cst)) {
-    asm volatile("atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+    asm volatile("atom.acq_rel.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                  "{%5,%6,%7,%8};"
```
🤖 Fix all issues with AI agents
In `@src/tl_templates/cuda/atomic.h`:
- Around line 600-629: The function AtomicAddx2Ret currently declares an unused
template parameter src_type; remove the template so the function becomes a plain
(non-templated) function. Specifically, change the declaration/definition from
template<typename src_type> TL_DEVICE float2 AtomicAddx2Ret(...) to TL_DEVICE
float2 AtomicAddx2Ret(float *ref, float2 val, int memory_order =
int(cuda::memory_order_relaxed)), and update any corresponding forward
declarations/overloads to match; ensure no remaining references to src_type
remain in the AtomicAddx2Ret implementation or its declarations.
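Sketched out, the requested end state is just the non-templated declaration already spelled out in the prompt above:

```cpp
// Per the review: drop the unused src_type template parameter; the
// signature is otherwise unchanged.
TL_DEVICE float2 AtomicAddx2Ret(float *ref, float2 val,
                                int memory_order = int(cuda::memory_order_relaxed));
```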
♻️ Duplicate comments (1)
src/tl_templates/cuda/atomic.h (1)

754-789: Incorrect PTX instruction format (same issue as lines 725-742).

The PTX instructions at lines 766, 773, and 781 have the same incorrect ordering issue. The memory order qualifier should precede the scope and space.

Proposed fix

```diff
   if (memory_order == int(cuda::memory_order_release) ||
       memory_order == int(cuda::memory_order_consume)) {
     asm volatile(
-        "atom.global.gpu.release.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+        "atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
         "{%5,%6,%7,%8};"
         : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z), "=f"(ret_val.w)
         : "l"(ref_addr), "f"(val.x), "f"(val.y), "f"(val.z), "f"(val.w)
         : "memory");
   } else if (memory_order == int(cuda::memory_order_acquire)) {
     asm volatile(
-        "atom.global.gpu.acquire.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+        "atom.acquire.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
         "{%5,%6,%7,%8};"
         : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z), "=f"(ret_val.w)
         : "l"(ref_addr), "f"(val.x), "f"(val.y), "f"(val.z), "f"(val.w)
         : "memory");
   } else if (memory_order == int(cuda::memory_order_acq_rel) ||
              memory_order == int(cuda::memory_order_seq_cst)) {
     asm volatile(
-        "atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+        "atom.acq_rel.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
         "{%5,%6,%7,%8};"
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/tl_templates/cuda/atomic.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (8)
src/tl_templates/cuda/atomic.h (8)
330-375: LGTM!

The templated AtomicAddx2 for half_t correctly uses the PTX instruction format (atom.release.gpu.global.add.noftz.v2.f16) and handles all memory ordering cases appropriately.

377-420: LGTM!

Correctly returns the old value from the atomic operation with proper PTX syntax.

422-505: LGTM!

The bfloat16 variants correctly implement vectorized atomic adds with proper PTX syntax for all memory orderings.

507-537: LGTM!

Correctly implements the vectorized float2 atomic add with proper PTX instruction ordering.

539-565: LGTM!

Good addition of the by-value overload for float2, providing a cleaner API when the caller already has a vector value.

631-672: LGTM!

The dual-template AtomicAddx4 correctly supports cross-type vectorized operations with proper PTX syntax.

674-710: LGTM!

Good addition of the by-value float4 overload with correct PTX instruction ordering.

790-859: LGTM!

The fallback implementations for pre-SM90 architectures correctly decompose vectorized operations into scalar atomicAdd calls. The new templated overloads are consistent with the SM90+ versions.
…undancy
- Introduced utility functions ToFloat2 and ToFloat4 for type conversion, enhancing the flexibility of atomic add operations.
- Updated AtomicAddx2 and AtomicAddx4 functions to accept generic value types, streamlining the code and improving readability.
- Removed outdated function overloads and consolidated logic for atomic addition, ensuring consistency across implementations.
- Enhanced memory order handling in atomic operations for better performance and clarity.

- Deleted the atomicadd_vectorize.cc and atomicadd_vectorize.h files, which contained the automatic vectorization pass for atomic add operations.
- Removed references to the deleted files in atomic_add.cc and loop_partition.cc to clean up the codebase and eliminate unused dependencies.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@testing/python/language/test_tilelang_language_atomic.py`:
- Around lines 490-514: Add the missing runtime CUDA presence check to the test by decorating test_atomic_add_complicated_parallel with `@tilelang.testing.requires_cuda` in addition to the existing `@tilelang.testing.requires_cuda_compute_version_ge(9, 0)`; locate the test function named test_atomic_add_complicated_parallel (and the helper run_atomic_add_complicated_parallel / atomic_add_complicated_parallel_program) and place the requires_cuda decorator immediately above the function so the test skips on machines without a CUDA driver.
♻️ Duplicate comments (1)
src/tl_templates/cuda/atomic.h (1)
636-659: Inconsistent PTX instruction format in AtomicAddx4Ret.

The PTX instructions here use atom.global.gpu.<order>.add.v4.f32, but the correct format (matching AtomicAddx4 above at lines 596-619) should be atom.<order>.gpu.global.add.v4.f32. The memory order should come before the scope.

Suggested fix

```diff
   if (memory_order == int(cuda::memory_order_release) ||
       memory_order == int(cuda::memory_order_consume)) {
-    asm volatile("atom.global.gpu.release.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+    asm volatile("atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                  "{%5,%6,%7,%8};"
                  : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
                    "=f"(ret_val.w)
                  : "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
                    "f"(add_val.z), "f"(add_val.w)
                  : "memory");
   } else if (memory_order == int(cuda::memory_order_acquire)) {
-    asm volatile("atom.global.gpu.acquire.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+    asm volatile("atom.acquire.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                  "{%5,%6,%7,%8};"
                  : "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
                    "=f"(ret_val.w)
                  : "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
                    "f"(add_val.z), "f"(add_val.w)
                  : "memory");
   } else if (memory_order == int(cuda::memory_order_acq_rel) ||
              memory_order == int(cuda::memory_order_seq_cst)) {
-    asm volatile("atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+    asm volatile("atom.acq_rel.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
                  "{%5,%6,%7,%8};"
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- src/op/atomic_add.cc
- src/tl_templates/cuda/atomic.h
- src/transform/atomicadd_vectorize.cc
- src/transform/atomicadd_vectorize.h
- src/transform/loop_partition.cc
- testing/python/language/test_tilelang_language_atomic.py
💤 Files with no reviewable changes (3)
- src/op/atomic_add.cc
- src/transform/atomicadd_vectorize.cc
- src/transform/atomicadd_vectorize.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-01-06T05:20:51.649Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1606
File: testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py:30-30
Timestamp: 2026-01-06T05:20:51.649Z
Learning: In `testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py`, the test validates that the `hoist_broadcast_values` transformation pass correctly identifies and hoists broadcast operations by checking for patterns in the generated kernel source code. The specific literal values used (e.g., 430) are not important for the test's purpose, as it does not validate numerical precision or actual stored tensor values.
Applied to files:
testing/python/language/test_tilelang_language_atomic.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/language/test_tilelang_language_atomic.py
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_atomic.py (3)
tilelang/language/dtypes.py (1)
- float32 (310-310)

tilelang/testing/__init__.py (1)
- requires_cuda_compute_version_ge (107-108)

tilelang/language/atomic.py (1)
- atomic_add (194-301)
🔇 Additional comments (7)
testing/python/language/test_tilelang_language_atomic.py (1)
479-487: LGTM! Test validates auto-vectorization of atomic add.

The test correctly verifies that the kernel source contains AtomicAddx4, confirming the vectorization pass works as expected.

src/tl_templates/cuda/atomic.h (5)
508-512: LGTM! Clean helper functions for type conversion.

ToFloat2 overloads provide a uniform interface for both pointer and value arguments, enabling the templated AtomicAddx2 to accept diverse input types.

576-580: LGTM! Analogous ToFloat4 helpers for x4 vectorization.

Consistent with the ToFloat2 pattern above.

582-622: LGTM! Templated AtomicAddx4 with proper PTX format.

The PTX instructions correctly follow the atom.<order>.<scope>.<space>.<op>.<type> format (e.g., atom.release.gpu.global.add.v4.f32).

665-675: LGTM! Non-SM90 fallback with consistent helper usage.

The fallback's ToFloat2/ToFloat4 helpers and scalar atomicAdd calls provide correct behavior on older architectures.

698-721: LGTM! Fallback AtomicAddx4 for pre-SM90 devices.

Correctly decomposes the vectorized operation into four scalar atomicAdd calls.
src/transform/loop_partition.cc (1)

295-305: LGTM! Clean removal of the dedicated atomic vectorization step.

The atomic add vectorization is now consolidated into the VectorizeLoop pass (Step 2), and the step numbering is correctly updated. This simplifies the pipeline and improves maintainability, per the PR objectives.
@regression-perf

Performance Regression Test Report

Triggered by: @LeiWang1999

Results

Artifacts

@regression-perf

Performance Regression Test Report

Triggered by: @LeiWang1999

Results

Artifacts
Summary
- Moved atomic add vectorization from atomicadd_vectorize.cc to the unified vectorize_loop.cc pass
- atomicadd_vectorize.cc now focuses only on atomic-specific rewrites while general vectorization is handled by vectorize_loop.cc

Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Refactor
Tests