
[Feature] Add memory_order PTX for vectorized atomic add #1112

Merged
LeiWang1999 merged 10 commits into tile-ai:main from tzj-fxz:atomic1022 on Oct 25, 2025

Conversation

@tzj-fxz
Contributor

@tzj-fxz tzj-fxz commented Oct 23, 2025

Summary by CodeRabbit

  • Chores
    • Improved atomic operation handling on CUDA: vectorized atomics now respect explicit memory-order semantics and use safe fallbacks for non-relaxed orders, preserving prior fast paths for relaxed operations.
  • Refactor
    • Modernized Python typing and annotation handling; added a special-case for boolean shared allocation to address a dtype workflow edge case.

@coderabbitai
Contributor

coderabbitai bot commented Oct 23, 2025

Walkthrough

The PR makes the CUDA atomic templates memory-order-aware: the relaxed fast paths are kept for half/bf16 and vectorized adds, while non-relaxed orders fall back to cuda::atomic_ref or PTX inline assembly. It also updates tilelang allocation: postponed annotations, a bool special case for shared scope, and a refined alloc_var annotation.
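To make the split concrete, here is a minimal sketch of the dispatch shape described above (the function and parameter names are illustrative, and the ordered branch uses the reviewer-suggested per-element cuda::atomic_ref fallback rather than the PR's PTX path; availability of atomic_ref::fetch_add for half is assumed):

```cpp
// Sketch only: relaxed orders keep the native vectorized fast path; stronger
// orders fall back to per-element ordered atomics. Not the PR's literal code.
#include <cuda/atomic>
#include <cuda_fp16.h>

__device__ __forceinline__ void atomic_addx2_sketch(half *ref, half *val,
                                                    int memory_order) {
  if (memory_order == int(cuda::memory_order_relaxed)) {
    // Fast path: hardware half2 atomic add; no ordering requested.
    atomicAdd(reinterpret_cast<half2 *>(ref), *reinterpret_cast<half2 *>(val));
  } else {
    // Ordered fallback (the PR instead emits vectorized PTX with
    // .release/.acquire/.acq_rel qualifiers for these orders).
    cuda::atomic_ref<half, cuda::thread_scope_device> a0(ref[0]);
    cuda::atomic_ref<half, cuda::thread_scope_device> a1(ref[1]);
    a0.fetch_add(val[0], cuda::memory_order(memory_order));
    a1.fetch_add(val[1], cuda::memory_order(memory_order));
  }
}
```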

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| CUDA atomic templates<br>`src/tl_templates/cuda/atomic.h` | Adds `#include <cuda_fp16.h>`; limits the half/bf16 fast paths for AtomicMax/AtomicMin variants to memory_order::relaxed (otherwise uses cuda::atomic_ref); extends AtomicAddx2/AtomicAddx2Ret/AtomicAddx4/AtomicAddx4Ret and the bfloat16/half vectorized variants with non-relaxed fallbacks implemented via PTX inline assembly, branching on release/acquire/acq_rel/seq_cst while keeping the original relaxed fast paths. |
| TileLang allocation utilities<br>`tilelang/language/allocate.py` | Adds `from __future__ import annotations`; special-cases alloc_shared to set scope="shared" when dtype == "bool"; updates the alloc_var type annotation from `Union[PrimExpr]` to `PrimExpr \| None`. |

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant AtomicFunc as Atomic<Op>xN
  participant FastPath
  participant PTXFallback
  participant Memory

  Caller->>AtomicFunc: call AtomicAddxN(value, memory_order)
  alt memory_order == relaxed
    AtomicFunc->>FastPath: use relaxed reinterpret_cast fast-path
    FastPath->>Memory: atomic add (fast)
    FastPath-->>AtomicFunc: result
  else memory_order != relaxed
    AtomicFunc->>PTXFallback: emit PTX inline-atomic with ordering
    PTXFallback->>Memory: atomic add (PTX, ordered)
    PTXFallback-->>AtomicFunc: result
  end
  AtomicFunc-->>Caller: return result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

🐰 A rabbit hops where atomics play,
Fast paths sprint when orders say "relax",
PTX steps in when rules are stern,
Bits and halves in careful tracks,
A tiny hop, then all syncs back. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The PR title "[Feature] Add memory_order PTX for vectorized atomic add" accurately describes the primary change: the modifications to src/tl_templates/cuda/atomic.h that add memory-order-aware, PTX-based inline atomics for vectorized atomic adds. The secondary changes to tilelang/language/allocate.py (postponed annotations, bool dtype handling, a type-annotation update) are tangential, so the title appropriately focuses on the main feature. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which meets the required threshold of 80.00%. |


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@tzj-fxz tzj-fxz self-assigned this Oct 23, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
src/tl_templates/cuda/atomic.h (1)

152-157: Prefer intrinsic bit-casts over reinterpret_cast through unrelated types.

Using reinterpret_cast<unsigned short *>(&add_val.x) and similar violates strict aliasing. Prefer CUDA intrinsics or safe bit_cast.

  • For half: use __half_as_ushort / __ushort_as_half, or std::bit_cast<unsigned short>(add_val.x) if available in your device C++20 mode.
  • For bf16: use the bf16 raw‑bit helpers in cuda_bf16.h (or memcpy as a portable fallback).

Also applies to: 194-199, 240-245, 280-285, 321-324, 353-354, 386-388, 426-428
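For the half paths, a sketch of the intrinsic-based reinterpretation this suggests (the helper names are hypothetical; the bf16 analogues in cuda_bf16.h, or a memcpy, would cover the bfloat16 paths):

```cpp
// Sketch: take the 16-bit pattern of a half (and back) via cuda_fp16.h
// intrinsics instead of reinterpret_cast through unsigned short*.
#include <cuda_fp16.h>

__device__ __forceinline__ unsigned short half_bits(__half h) {
  return __half_as_ushort(h); // bit pattern, no pointer punning
}

__device__ __forceinline__ __half half_from_bits(unsigned short u) {
  return __ushort_as_half(u);
}
```

The asm input operands can then be built as, e.g., `half_bits(add_val.x)`, and the results converted back with `half_from_bits`.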

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86c8bb4 and c65c9d3.

📒 Files selected for processing (1)
  • src/tl_templates/cuda/atomic.h (6 hunks)
🔇 Additional comments (4)
src/tl_templates/cuda/atomic.h (4)

49-56: Fast-path gating for min/max looks good; minor note on availability.

Conditionally using native atomics for half/__nv_bfloat16 only on relaxed order is sound. Ensure the native min/max overloads exist for your targeted toolchain; otherwise, keep the atomic_ref fallback.

Also applies to: 64-73, 81-88, 96-105
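A sketch of that gating shape, shown for unsigned int to stay toolchain-neutral (whether fetch_max/fetch_min exist for half/bf16 in the targeted libcu++ is exactly what the note above asks to confirm):

```cpp
// Sketch: native atomic for the relaxed order, cuda::atomic_ref otherwise.
#include <cuda/atomic>

__device__ __forceinline__ void atomic_max_sketch(unsigned int *ref,
                                                  unsigned int val,
                                                  int memory_order) {
  if (memory_order == int(cuda::memory_order_relaxed)) {
    atomicMax(ref, val); // native fast path (relaxed semantics)
  } else {
    cuda::atomic_ref<unsigned int, cuda::thread_scope_device> a(*ref);
    a.fetch_max(val, cuda::memory_order(memory_order));
  }
}
```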


314-343: Float v2/v4 PTX paths: consistent and well-structured; mirror fixes from other comments.

The float vectorized paths look consistent. Apply the same consume mapping fix and maintain the .sem .scope .space order (already correct here). Also keep alignment/arch notes in mind.

Also applies to: 345-375, 377-417, 419-458


461-471: AtomicLoad/Store via cuda::atomic_ref — OK.

Looks correct and matches the scalar paths’ semantics.


7-12: Includes and type normalization — OK; minor portability note.

<cuda_fp16.h> addition and normalize_atomic_type specializations look fine. Confirm __CUDA_ARCH_LIST__ usage matches your build system; some environments gate on __CUDA_ARCH__ only.

Also applies to: 34-42

Comment on lines +143 to +184
  if (memory_order == int(cuda::memory_order_relaxed)) {
    atomicAdd(reinterpret_cast<half2 *>(ref),
              static_cast<half2>(*reinterpret_cast<half2 *>(val)));
  } else {
    // Since atomicAdd does not support memory order and atomic_ref does not
    // support vectorized atomic operations, we can only inline PTX code here
    // Note: Vectorized atomic operations only support global space
    // Note: for 16-bit value, we need to reinterpret_cast the value to unsigned
    // short and use "h" register in assembly
    __half2 add_val = *reinterpret_cast<__half2 *>(val);
    unsigned short add_val_x_cast =
        *reinterpret_cast<unsigned short *>(&add_val.x);
    unsigned short add_val_y_cast =
        *reinterpret_cast<unsigned short *>(&add_val.y);
    unsigned long long ref_addr = reinterpret_cast<unsigned long long>(ref);
    __half ret_val_x, ret_val_y;
    unsigned short ret_val_x_cast =
        *reinterpret_cast<unsigned short *>(&ret_val_x);
    unsigned short ret_val_y_cast =
        *reinterpret_cast<unsigned short *>(&ret_val_y);
    if (memory_order == int(cuda::memory_order_release) ||
        memory_order == int(cuda::memory_order_consume)) {
      asm volatile(
          "atom.release.gpu.global.add.noftz.v2.f16 {%0,%1}, [%2], {%3,%4};"
          : "=h"(ret_val_x_cast), "=h"(ret_val_y_cast)
          : "l"(ref_addr), "h"(add_val_x_cast), "h"(add_val_y_cast)
          : "memory");
    } else if (memory_order == int(cuda::memory_order_acquire)) {
      asm volatile(
          "atom.acquire.gpu.global.add.noftz.v2.f16 {%0,%1}, [%2], {%3,%4};"
          : "=h"(ret_val_x_cast), "=h"(ret_val_y_cast)
          : "l"(ref_addr), "h"(add_val_x_cast), "h"(add_val_y_cast)
          : "memory");
    } else if (memory_order == int(cuda::memory_order_acq_rel) ||
               memory_order == int(cuda::memory_order_seq_cst)) {
      asm volatile(
          "atom.acq_rel.gpu.global.add.noftz.v2.f16 {%0,%1}, [%2], {%3,%4};"
          : "=h"(ret_val_x_cast), "=h"(ret_val_y_cast)
          : "l"(ref_addr), "h"(add_val_x_cast), "h"(add_val_y_cast)
          : "memory");
    }
  }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Guard non-relaxed PTX paths for half/bf16 with arch checks like float (>= sm_90).

Vectorized atomics with memory semantics require recent PTX/SM. Currently only float v2/v4 are guarded. Without guards, compilation may fail for lower arch lists even if the branch is not taken at runtime.

Wrap the non‑relaxed inline PTX under the same (or correct) guard you used for float:

+#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ >= 900))
   // existing non-relaxed inline PTX...
-  } else {
+  } else {
     ...
   }
+#else
+  } else {
+    // Fallback: no vectorized non‑relaxed support on this arch
+    cuda::atomic_ref<half, cuda::thread_scope_device> ax(ref[0]);
+    cuda::atomic_ref<half, cuda::thread_scope_device> ay(ref[1]);
+    ax.fetch_add(val[0], cuda::memory_order(memory_order));
+    ay.fetch_add(val[1], cuda::memory_order(memory_order));
+  }
+#endif

Please verify the minimum SM/PTX that supports atom.*.v2.f16/v2.bf16.

Also applies to: 190-229, 234-268, 274-311


I'll verify the claims in this review comment by examining the actual code structure and confirming the SM/PTX requirements.


🏁 Script executed:

#!/bin/bash
# Inspect the file to see actual guards and structure for float vs half/bf16
fd -t f "atomic.h" | head -5

Length of output: 30


🏁 Script executed:

#!/bin/bash
# Get file content around the specified lines to see guards
if [ -f "src/tl_templates/cuda/atomic.h" ]; then
  echo "=== Lines 130-190 (float section) ===" 
  sed -n '130,190p' "src/tl_templates/cuda/atomic.h"
  echo ""
  echo "=== Lines 180-240 (half section start) ==="
  sed -n '180,240p' "src/tl_templates/cuda/atomic.h"
fi

Length of output: 5670


🏁 Script executed:

#!/bin/bash
# Search for #if directives and arch checks throughout the file
if [ -f "src/tl_templates/cuda/atomic.h" ]; then
  echo "=== Architecture guards in file ==="
  rg -n "#if|__CUDA_ARCH|sm_90" "src/tl_templates/cuda/atomic.h" -C 2
fi

Length of output: 944


🌐 Web query:

CUDA sm_90 atom.release.gpu.global.add.v2.f16 minimum architecture requirements

💡 Result:

Short answer: atom.release.gpu.global.add.v2.f16 requires NVIDIA SM 90 (compute capability 9.0 / Hopper) or newer. Sources: NVIDIA PTX/Parallel-Thread-Execution docs and Hopper compatibility guide. [1][2][3]

Sources

  • NVIDIA PTX / PTX ISA notes (atomic/vector support and SM requirements). [1]
  • PTX Parallel-Thread-Execution doc showing vector atomics (atom.global.v2.f16.add) and SM requirements. [2]
  • Hopper compatibility / CUDA docs (SM 90 / compute capability 9.0 = Hopper). [3]

I need to verify the additional sections mentioned ("also applies to: 190-229, 234-268, 274-311"):


🏁 Script executed:

#!/bin/bash
# Check lines 234-311 to verify guard status on remaining sections
if [ -f "src/tl_templates/cuda/atomic.h" ]; then
  echo "=== Lines 230-270 (bf16 AtomicAddx2Ret and next section) ==="
  sed -n '230,270p' "src/tl_templates/cuda/atomic.h"
  echo ""
  echo "=== Lines 270-312 (remainder) ==="
  sed -n '270,312p' "src/tl_templates/cuda/atomic.h"
fi

Length of output: 4320


Add SM 90 arch guards to half and bf16 non-relaxed PTX paths; currently unguarded (half) or insufficiently guarded (bf16) while float is correctly guarded.

All three sections require fixes:

  1. Lines 143-184 (AtomicAddx2 half): Completely unguarded; contains atom.release/acquire/acq_rel.gpu.global.add.noftz.v2.f16 which requires SM 90.

  2. Lines 190-229 (AtomicAddx2Ret half): Completely unguarded; same PTX instruction dependency.

  3. Lines 231-269 & 270-312 (AtomicAddx2 and AtomicAddx2Ret bf16): Guarded with #if (__CUDA_ARCH_LIST__ > 750) (line 231), but requires >= 900 for atom.*.v2.bf16 support. The #endif at line 312 closes the bf16 block.

Float (lines 314+) correctly uses #if (__CUDA_ARCH_LIST__ >= 900). Half sections must use identical guards; bf16 guard must be tightened to >= 900. Without these, compilation fails for architectures below SM 90 even when the non-relaxed branch is unreachable.

🤖 Prompt for AI Agents
In src/tl_templates/cuda/atomic.h around lines 143-184, 190-229, and 231-312,
the PTX vectorized atomic instructions for half and bf16 (atom.*.v2.f16 and
atom.*.v2.bf16) are currently unguarded or guarded only for >750, which allows
compilation on architectures that lack SM 90 support; wrap the entire
non-relaxed PTX branches that emit these instructions with the same architecture
guard used for float (i.e., #if (__CUDA_ARCH_LIST__ >= 900)) and change the
existing bf16 guard from >750 to >=900, making sure each #if has the matching
#endif around the PTX asm blocks so the code only emits these instructions for
SM 90+.

Comment on lines +144 to +146
atomicAdd(reinterpret_cast<half2 *>(ref),
static_cast<half2>(*reinterpret_cast<half2 *>(val)));
} else {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Alignment preconditions for vectorized operations are not enforced.

Reinterpreting half_t*/float* as half2*/float2*/float4* requires 4/8/16‑byte alignment. Misalignment is UB and can fault on hardware.

  • Document and assert alignment (static_assert on types is insufficient; add runtime assert(((uintptr_t)ref % align) == 0) in debug).
  • Alternatively, load/add/store scalars when non‑relaxed and alignment isn’t guaranteed (see previous guard fallback snippet).

Also applies to: 191-193, 235-238, 275-278, 318-320, 349-351, 380-382, 423-425





---

🏁 Script executed:

```shell
#!/bin/bash
# Locate the file and display the full content around the mentioned line ranges
fd "atomic.h" --type f | head -5

Length of output: 30


🏁 Script executed:

#!/bin/bash
# Find the file in the repository
find . -name "atomic.h" -path "*/tl_templates/cuda/*" 2>/dev/null | head -1

Length of output: 32


🏁 Script executed:

#!/bin/bash
# Search for reinterpret_cast patterns with half2, float2, float4 in the atomic operations
rg -n "reinterpret_cast.*half2|reinterpret_cast.*float2|reinterpret_cast.*float4" --type cpp -A2 -B2

Length of output: 4830


🏁 Script executed:

#!/bin/bash
# Check for any existing alignment checks or assertions in the file
rg -n "assert.*align|alignment|uintptr_t.*%|__align|assume_aligned" --type cpp

Length of output: 9436


🏁 Script executed:

#!/bin/bash
# Get a broader view of the atomic.h file structure and line counts
wc -l src/tl_templates/cuda/atomic.h 2>/dev/null || find . -name "atomic.h" -path "*/tl_templates/cuda/*" -exec wc -l {} \;

Length of output: 34


🏁 Script executed:

#!/bin/bash
# Display specific line ranges mentioned in the review
if [ -f "src/tl_templates/cuda/atomic.h" ]; then
  echo "=== Lines 144-146 ==="; sed -n '144,146p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 191-193 ==="; sed -n '191,193p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 235-238 ==="; sed -n '235,238p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 275-278 ==="; sed -n '275,278p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 318-320 ==="; sed -n '318,320p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 349-351 ==="; sed -n '349,351p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 380-382 ==="; sed -n '380,382p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 423-425 ==="; sed -n '423,425p' src/tl_templates/cuda/atomic.h
fi

Length of output: 1296



---

🏁 Script executed:

```shell
#!/bin/bash
# Get the function signatures and broader context for the operations
sed -n '140,160p' src/tl_templates/cuda/atomic.h
sed -n '230,250p' src/tl_templates/cuda/atomic.h
sed -n '313,330p' src/tl_templates/cuda/atomic.h
sed -n '375,395p' src/tl_templates/cuda/atomic.h
```

Length of output: 4234


🏁 Script executed:

#!/bin/bash
# Search for function parameter types and any documentation about alignment
rg -n "void.*atomic_ref|ref.*half_t|ref.*float\*|val.*half_t" --type cpp -B5 -A1 src/tl_templates/cuda/atomic.h | head -60

Length of output: 469


🏁 Script executed:

#!/bin/bash
# Look for any static_asserts in the file
grep -n "static_assert\|_Static_assert" src/tl_templates/cuda/atomic.h

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if there's any documentation or comments about alignment preconditions
sed -n '100,150p' src/tl_templates/cuda/atomic.h | grep -n -C3 "align\|precond\|require"

Length of output: 0








🏁 Script executed:

#!/bin/bash
# Search for callers of AtomicAddx2 and AtomicAddx4 functions
rg -n "AtomicAddx2|AtomicAddx4" --type cpp -g '!atomic.h' -B2 -A2 | head -100

Length of output: 562


🏁 Script executed:

#!/bin/bash
# Check if there are any type or memory layout constraints in related headers
find src/tl_templates/cuda -name "*.h" -type f -exec grep -l "AtomicAdd\|alignment\|align" {} \;

Length of output: 235


🌐 Web query:

CUDA atomicAdd vectorized float2 float4 alignment requirements

💡 Result:

Short answer

  • atomicAdd(float2*) requires 8-byte alignment; atomicAdd(float4*) requires 16-byte alignment. Use these only on properly aligned global-memory addresses. [1][2]

Details (important)

  • CUDA Programming Guide: float2/float4 vector atomicAdd are supported for global memory on devices with compute capability 9.x+ and guarantee atomicity per float element (not as a single 64- or 128-bit atomic access). [1]
  • Built-in vector-type alignment table: float2 (8-byte alignment), float4 (16-byte alignment) — do not cast from a 4-byte-aligned float* to float2*/float4* unless you can guarantee the stricter alignment. [2]

If you want, tell me your CUDA version / target GPU compute capability and I’ll show a minimal safe implementation (with alignment attributes or memcpy-based workaround).


Add runtime alignment assertions to prevent undefined behavior from misaligned vectorized atomic operations.

The concern is valid. CUDA vectorized atomicAdd operations require strict alignment: float2 requires 8-byte alignment, float4 requires 16-byte alignment. The code reinterpret_casts raw pointers to vectorized types without enforcing these preconditions, which causes undefined behavior and potential hardware faults.

All 8 locations (lines 144–146, 191–193, 235–238, 275–278, 318–320, 349–351, 380–382, 423–425) lack guards. Add runtime assert(((uintptr_t)ref % alignment) == 0) checks, or implement a fallback scalar path when alignment cannot be guaranteed.
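A minimal sketch of the debug-time check suggested here (the helper name is hypothetical):

```cpp
// Sketch: assert the vector type's alignment before reinterpreting a scalar
// pointer for a vectorized atomic (half2: 4 bytes, float2: 8, float4: 16).
#include <cassert>
#include <cstdint>

template <typename Vec, typename Scalar>
__device__ __forceinline__ Vec *as_vec_checked(Scalar *p) {
  assert(reinterpret_cast<std::uintptr_t>(p) % alignof(Vec) == 0 &&
         "vectorized atomic requires a Vec-aligned address");
  return reinterpret_cast<Vec *>(p);
}
```

In release builds, where the assert compiles out, a scalar per-element path could be taken instead whenever alignment cannot be guaranteed.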

Comment on lines +158 to +163
__half ret_val_x, ret_val_y;
unsigned short ret_val_x_cast =
*reinterpret_cast<unsigned short *>(&ret_val_x);
unsigned short ret_val_y_cast =
*reinterpret_cast<unsigned short *>(&ret_val_y);
if (memory_order == int(cuda::memory_order_release) ||

⚠️ Potential issue | 🔴 Critical

Uninitialized read into ret_val_*_cast before asm.

ret_val_x_cast and ret_val_y_cast are initialized by reading the bytes of uninitialized ret_val_*. That’s UB. Make them plain locals; asm outputs will write them.

-    __half ret_val_x, ret_val_y;
-    unsigned short ret_val_x_cast =
-        *reinterpret_cast<unsigned short *>(&ret_val_x);
-    unsigned short ret_val_y_cast =
-        *reinterpret_cast<unsigned short *>(&ret_val_y);
+    unsigned short ret_val_x_cast;  // written by asm
+    unsigned short ret_val_y_cast;  // written by asm

And similarly for the bf16 paths:

-    __nv_bfloat162 ret_val;
-    unsigned short ret_val_x_cast =
-        *reinterpret_cast<unsigned short *>(&ret_val.x);
-    unsigned short ret_val_y_cast =
-        *reinterpret_cast<unsigned short *>(&ret_val.y);
+    unsigned short ret_val_x_cast;  // written by asm
+    unsigned short ret_val_y_cast;  // written by asm

Also applies to: 200-205, 246-250, 286-290

🤖 Prompt for AI Agents
In src/tl_templates/cuda/atomic.h around lines 158-163 (and similarly at
200-205, 246-250, 286-290), the code creates unsigned short variables by
reinterpret_casting uninitialized __half/__nv_bfloat16 locals which causes an
uninitialized read; instead, remove the reinterpret_cast reads and declare the
casted variables as plain unsigned short locals (uninitialized) so the inline
asm outputs can populate them, and for all bf16/hf16 codepaths ensure the asm()
uses the unsigned short output operands directly rather than reading from
uninitialized floating-point temporaries.

Comment on lines +163 to +169
if (memory_order == int(cuda::memory_order_release) ||
memory_order == int(cuda::memory_order_consume)) {
asm volatile(
"atom.release.gpu.global.add.noftz.v2.f16 {%0,%1}, [%2], {%3,%4};"
: "=h"(ret_val_x_cast), "=h"(ret_val_y_cast)
: "l"(ref_addr), "h"(add_val_x_cast), "h"(add_val_y_cast)
: "memory");

⚠️ Potential issue | 🟠 Major

memory_order_consume mapped to release — use acquire (or acq_rel for RMW).

On RMW atomics, consume should not be weaker than the load side; common practice maps consume to acquire. Mapping it to release loses required ordering for dependent reads.

-    if (memory_order == int(cuda::memory_order_release) ||
-        memory_order == int(cuda::memory_order_consume)) {
+    if (memory_order == int(cuda::memory_order_release)) {
       ...
-    } else if (memory_order == int(cuda::memory_order_acquire)) {
+    } else if (memory_order == int(cuda::memory_order_acquire) ||
+               memory_order == int(cuda::memory_order_consume)) {
       ...

Do this consistently in all x2/x4 paths. Also consider whether seq_cst should be stronger than acq_rel (e.g., add an extra fence) per your memory model.

Also applies to: 205-212, 250-256, 290-296, 324-330, 356-361, 389-397, 429-437

🤖 Prompt for AI Agents
In src/tl_templates/cuda/atomic.h around lines 163-169 (and similarly at
205-212, 250-256, 290-296, 324-330, 356-361, 389-397, 429-437), the branch
incorrectly treats memory_order_consume as release; change the emitted asm for
consume to use the acquire (or acq_rel for read-modify-write) variant rather
than the release variant, and make the mapping consistent across all x2/x4 code
paths; also review seq_cst cases and, if required by the memory model, add the
stronger fence (or use seq_cst-specific asm) so seq_cst is stronger than
acq_rel.
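For reference, a sketch of the single order-to-qualifier mapping this implies, with consume folded into acquire and seq_cst shown as acq_rel (any extra fence is left as a policy decision):

```cpp
// Sketch: one place to keep the C++ memory order -> PTX .sem mapping
// consistent across the x2/x4 paths.
#include <cuda/atomic>

__host__ __device__ constexpr const char *ptx_sem(cuda::memory_order mo) {
  switch (mo) {
  case cuda::memory_order_relaxed:
    return "relaxed";
  case cuda::memory_order_consume: // dependency ordering: not weaker than acquire
  case cuda::memory_order_acquire:
    return "acquire";
  case cuda::memory_order_release:
    return "release";
  case cuda::memory_order_acq_rel:
    return "acq_rel";
  case cuda::memory_order_seq_cst:
    return "acq_rel"; // plus an extra fence if the memory model requires it
  }
  return "acq_rel";
}
```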

Comment on lines +226 to +227
return half2(*reinterpret_cast<__half *>(&ret_val_x_cast),
*reinterpret_cast<__half *>(&ret_val_y_cast));

⚠️ Potential issue | 🟠 Major

Fix return construction for x2 returns (half2/bf16x2).

half2(__half, __half) and __nv_bfloat162(__nv_bfloat16, __nv_bfloat16) are not valid constructors; also avoid aliasing through reinterpret_cast. Use intrinsics or brace initialization with proper bit-casts.

If available in your CUDA version, prefer:

-    return half2(*reinterpret_cast<__half *>(&ret_val_x_cast),
-                 *reinterpret_cast<__half *>(&ret_val_y_cast));
+    return __halves2half2(__ushort_as_half(ret_val_x_cast),
+                          __ushort_as_half(ret_val_y_cast));

For bf16, use the bf16 equivalents (please confirm intrinsic names in your toolchain):

-    return __nv_bfloat162(*reinterpret_cast<__nv_bfloat16 *>(&ret_val_x_cast),
-                          *reinterpret_cast<__nv_bfloat16 *>(&ret_val_y_cast));
+    __nv_bfloat16 hx, hy;
+    memcpy(&hx, &ret_val_x_cast, sizeof(hx));  // avoids aliasing UB
+    memcpy(&hy, &ret_val_y_cast, sizeof(hy));
+    return __nv_bfloat162{hx, hy};

If you prefer no memcpy, introduce a small constexpr bit_cast helper for device code.

Also applies to: 308-310
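Following the "small constexpr bit_cast helper" idea, a memcpy-based sketch (a truly constexpr version would additionally need __builtin_bit_cast or C++20 std::bit_cast support on the device toolchain; the helper name is hypothetical):

```cpp
// Sketch: aliasing-safe bit reinterpretation for device code; the small
// fixed-size memcpy is typically optimized down to a register move.
#include <cstring>

template <typename To, typename From>
__device__ __forceinline__ To bit_cast_sketch(const From &src) {
  static_assert(sizeof(To) == sizeof(From), "bit_cast_sketch: size mismatch");
  To dst;
  memcpy(&dst, &src, sizeof(To));
  return dst;
}
```

The x2 return could then be built as, e.g., `__halves2half2(bit_cast_sketch<__half>(ret_val_x_cast), bit_cast_sketch<__half>(ret_val_y_cast))`.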

Comment on lines +389 to +397
if (memory_order == int(cuda::memory_order_release) ||
memory_order == int(cuda::memory_order_consume)) {
asm volatile("atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
"{%5,%6,%7,%8};"
: "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
"=f"(ret_val.w)
: "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
"f"(add_val.z), "f"(add_val.w)
: "memory");

⚠️ Potential issue | 🔴 Critical

PTX mnemonic order bug: uses atom.global.gpu.release...; should be atom.release.gpu.global....

You use atom.release.gpu.global... elsewhere, but in AtomicAddx4Ret the order is atom.global.gpu.release..., which will fail to assemble. Fix all three mnemonics in AtomicAddx4Ret to keep .semantics .scope .space order.

Apply:

-      asm volatile("atom.global.gpu.release.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
...
-      asm volatile("atom.global.gpu.acquire.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.acquire.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
...
-      asm volatile("atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.acq_rel.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "

Also applies to: 431-455

🤖 Prompt for AI Agents
In src/tl_templates/cuda/atomic.h around lines 389-397 (and also apply the same
fix to 431-455), the PTX mnemonic components are in the wrong order in
AtomicAddx4Ret: currently using the `.space` and `.scope` parts before
`.semantics` (e.g., `atom.global.gpu.release...`) which fails to assemble;
change the three affected mnemonics in AtomicAddx4Ret to use the correct
`.semantics .scope .space` order (e.g., `atom.release.gpu.global...`) so all
three atom instructions match the correct PTX ordering used elsewhere.

@tzj-fxz tzj-fxz requested a review from LeiWang1999 October 23, 2025 06:11
@tzj-fxz tzj-fxz removed their assignment Oct 23, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tilelang/language/allocate.py (1)

36-40: Consider clarifying the bool type workaround comment.

Verification confirms no existing code is broken by the unconditional scope override—the only alloc_shared call with dtype="bool" and explicit scope at examples/deepseek_v32/sparse_mla_fwd_pipelined.py:100 passes scope="shared", which matches the override.

However, the design itself silently overrides the scope parameter for bool types. While this doesn't cause current breakage, consider:

  1. Formalizing the comment from informal style ("lei: This is a hack...") to professional documentation
  2. Adding a note in the docstring that bool types always allocate to "shared" scope due to merge smem pass limitations
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c65c9d3 and e275347.

📒 Files selected for processing (1)
  • tilelang/language/allocate.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/language/allocate.py (3)
src/transform/lower_opaque_block.cc (2)
  • annotations (208-248)
  • annotations (209-212)
testing/python/language/test_tilelang_language_alloc.py (1)
  • alloc_var (4-23)
src/transform/storage_rewrite.cc (4)
  • dtype (712-718)
  • dtype (712-712)
  • scope (678-683)
  • scope (678-678)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (2)
tilelang/language/allocate.py (2)

17-18: LGTM: Modern Python typing enabled.

The from __future__ import annotations import enables postponed evaluation of type annotations and allows the use of PEP 604 union syntax (|), which is used later in the file. This is a clean improvement aligned with modern Python practices.


71-71: LGTM: Type annotation improved.

The type annotation change from Union[PrimExpr] to PrimExpr | None is both a correctness improvement and modernization. The new annotation correctly indicates that init is optional and uses the cleaner PEP 604 union syntax.

@tzj-fxz tzj-fxz requested a review from Rachmanino October 23, 2025 07:25
LeiWang1999 previously approved these changes Oct 25, 2025
RubiaCx pushed a commit to RubiaCx/tilelang that referenced this pull request Nov 24, 2025
* [Feature] Add memory_order PTX for vectorized (2x) atomic add

* [Feature] Add memory_order PTX for all vectorized atomic add

* [Lint]

* test

* [BugFix] FIx init optional argument in alloc_var

* bug fix

* bug fix

* lint fix

* lint fix

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
