
[Feature] Add memory_order PTX for vectorized atomic add #1112

Merged
LeiWang1999 merged 10 commits into tile-ai:main from tzj-fxz:atomic1022 on Oct 25, 2025

Conversation

@tzj-fxz
Contributor

@tzj-fxz tzj-fxz commented Oct 23, 2025

Summary by CodeRabbit

  • Chores
    • Improved atomic operation handling on CUDA: vectorized atomics now respect explicit memory-order semantics and use safe fallbacks for non-relaxed orders, preserving prior fast paths for relaxed operations.
  • Refactor
    • Modernized Python typing and annotation handling; added a special-case for boolean shared allocation to address a dtype workflow edge case.

@coderabbitai
Contributor

coderabbitai bot commented Oct 23, 2025

Walkthrough

The PR makes the CUDA atomic templates memory-order-aware: the relaxed fast paths are kept for half/bf16 and vectorized adds, while non-relaxed orders fall back to cuda::atomic_ref or PTX inline assembly. It also updates tilelang allocation: postponed annotations, a bool special case for shared scope, and a refined alloc_var annotation.
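To make the split concrete, here is a minimal sketch of the dispatch shape described above (the function and parameter names are illustrative, and the ordered branch uses the reviewer-suggested per-element cuda::atomic_ref fallback rather than the PR's PTX path; availability of atomic_ref::fetch_add for half is assumed):

```cpp
// Sketch only: relaxed orders keep the native vectorized fast path; stronger
// orders fall back to per-element ordered atomics. Not the PR's literal code.
#include <cuda/atomic>
#include <cuda_fp16.h>

__device__ __forceinline__ void atomic_addx2_sketch(half *ref, half *val,
                                                    int memory_order) {
  if (memory_order == int(cuda::memory_order_relaxed)) {
    // Fast path: hardware half2 atomic add; no ordering requested.
    atomicAdd(reinterpret_cast<half2 *>(ref), *reinterpret_cast<half2 *>(val));
  } else {
    // Ordered fallback (the PR instead emits vectorized PTX with
    // .release/.acquire/.acq_rel qualifiers for these orders).
    cuda::atomic_ref<half, cuda::thread_scope_device> a0(ref[0]);
    cuda::atomic_ref<half, cuda::thread_scope_device> a1(ref[1]);
    a0.fetch_add(val[0], cuda::memory_order(memory_order));
    a1.fetch_add(val[1], cuda::memory_order(memory_order));
  }
}
```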

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| CUDA atomic templates<br>`src/tl_templates/cuda/atomic.h` | Adds `#include <cuda_fp16.h>`; limits the half/bf16 fast paths for AtomicMax/AtomicMin variants to memory_order::relaxed (otherwise uses cuda::atomic_ref); extends AtomicAddx2/AtomicAddx2Ret/AtomicAddx4/AtomicAddx4Ret and the bfloat16/half vectorized variants with non-relaxed fallbacks implemented via PTX inline assembly, branching on release/acquire/acq_rel/seq_cst while keeping the original relaxed fast paths. |
| TileLang allocation utilities<br>`tilelang/language/allocate.py` | Adds `from __future__ import annotations`; special-cases alloc_shared to set scope="shared" when dtype == "bool"; updates the alloc_var type annotation from `Union[PrimExpr]` to `PrimExpr \| None`. |

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant AtomicFunc as Atomic<Op>xN
  participant FastPath
  participant PTXFallback
  participant Memory

  Caller->>AtomicFunc: call AtomicAddxN(value, memory_order)
  alt memory_order == relaxed
    AtomicFunc->>FastPath: use relaxed reinterpret_cast fast-path
    FastPath->>Memory: atomic add (fast)
    FastPath-->>AtomicFunc: result
  else memory_order != relaxed
    AtomicFunc->>PTXFallback: emit PTX inline-atomic with ordering
    PTXFallback->>Memory: atomic add (PTX, ordered)
    PTXFallback-->>AtomicFunc: result
  end
  AtomicFunc-->>Caller: return result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

🐰 A rabbit hops where atomics play,
Fast paths sprint when orders say "relax",
PTX steps in when rules are stern,
Bits and halves in careful tracks,
A tiny hop, then all syncs back. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The PR title "[Feature] Add memory_order PTX for vectorized atomic add" accurately describes the primary change: the modifications to src/tl_templates/cuda/atomic.h that add memory-order-aware, PTX-based inline atomics for vectorized atomic adds. The secondary changes to tilelang/language/allocate.py (postponed annotations, bool dtype handling, a type-annotation update) are tangential, so the title appropriately focuses on the main feature. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which meets the required threshold of 80.00%. |


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@tzj-fxz tzj-fxz self-assigned this Oct 23, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
src/tl_templates/cuda/atomic.h (1)

152-157: Prefer intrinsic bit-casts over reinterpret_cast through unrelated types.

Using reinterpret_cast<unsigned short *>(&add_val.x) and similar violates strict aliasing. Prefer CUDA intrinsics or safe bit_cast.

  • For half: use __half_as_ushort / __ushort_as_half, or std::bit_cast<unsigned short>(add_val.x) if available in your device C++20 mode.
  • For bf16: use the bf16 raw‑bit helpers in cuda_bf16.h (or memcpy as a portable fallback).

Also applies to: 194-199, 240-245, 280-285, 321-324, 353-354, 386-388, 426-428
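For the half paths, a sketch of the intrinsic-based reinterpretation this suggests (the helper names are hypothetical; the bf16 analogues in cuda_bf16.h, or a memcpy, would cover the bfloat16 paths):

```cpp
// Sketch: take the 16-bit pattern of a half (and back) via cuda_fp16.h
// intrinsics instead of reinterpret_cast through unsigned short*.
#include <cuda_fp16.h>

__device__ __forceinline__ unsigned short half_bits(__half h) {
  return __half_as_ushort(h); // bit pattern, no pointer punning
}

__device__ __forceinline__ __half half_from_bits(unsigned short u) {
  return __ushort_as_half(u);
}
```

The asm input operands can then be built as, e.g., `half_bits(add_val.x)`, and the results converted back with `half_from_bits`.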

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86c8bb4 and c65c9d3.

📒 Files selected for processing (1)
  • src/tl_templates/cuda/atomic.h (6 hunks)
🔇 Additional comments (4)
src/tl_templates/cuda/atomic.h (4)

49-56: Fast-path gating for min/max looks good; minor note on availability.

Conditionally using native atomics for half/__nv_bfloat16 only on relaxed order is sound. Ensure the native min/max overloads exist for your targeted toolchain; otherwise, keep the atomic_ref fallback.

Also applies to: 64-73, 81-88, 96-105
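A sketch of that gating shape, shown for unsigned int to stay toolchain-neutral (whether fetch_max/fetch_min exist for half/bf16 in the targeted libcu++ is exactly what the note above asks to confirm):

```cpp
// Sketch: native atomic for the relaxed order, cuda::atomic_ref otherwise.
#include <cuda/atomic>

__device__ __forceinline__ void atomic_max_sketch(unsigned int *ref,
                                                  unsigned int val,
                                                  int memory_order) {
  if (memory_order == int(cuda::memory_order_relaxed)) {
    atomicMax(ref, val); // native fast path (relaxed semantics)
  } else {
    cuda::atomic_ref<unsigned int, cuda::thread_scope_device> a(*ref);
    a.fetch_max(val, cuda::memory_order(memory_order));
  }
}
```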


314-343: Float v2/v4 PTX paths: consistent and well-structured; mirror fixes from other comments.

The float vectorized paths look consistent. Apply the same consume mapping fix and maintain the .sem .scope .space order (already correct here). Also keep alignment/arch notes in mind.

Also applies to: 345-375, 377-417, 419-458


461-471: AtomicLoad/Store via cuda::atomic_ref — OK.

Looks correct and matches the scalar paths’ semantics.


7-12: Includes and type normalization — OK; minor portability note.

<cuda_fp16.h> addition and normalize_atomic_type specializations look fine. Confirm __CUDA_ARCH_LIST__ usage matches your build system; some environments gate on __CUDA_ARCH__ only.

Also applies to: 34-42

Comment on lines +143 to +184
  if (memory_order == int(cuda::memory_order_relaxed)) {
    atomicAdd(reinterpret_cast<half2 *>(ref),
              static_cast<half2>(*reinterpret_cast<half2 *>(val)));
  } else {
    // Since atomicAdd does not support memory order and atomic_ref does not
    // support vectorized atomic operations, we can only inline PTX code here
    // Note: Vectorized atomic operations only support global space
    // Note: for 16-bit value, we need to reinterpret_cast the value to unsigned
    // short and use "h" register in assembly
    __half2 add_val = *reinterpret_cast<__half2 *>(val);
    unsigned short add_val_x_cast =
        *reinterpret_cast<unsigned short *>(&add_val.x);
    unsigned short add_val_y_cast =
        *reinterpret_cast<unsigned short *>(&add_val.y);
    unsigned long long ref_addr = reinterpret_cast<unsigned long long>(ref);
    __half ret_val_x, ret_val_y;
    unsigned short ret_val_x_cast =
        *reinterpret_cast<unsigned short *>(&ret_val_x);
    unsigned short ret_val_y_cast =
        *reinterpret_cast<unsigned short *>(&ret_val_y);
    if (memory_order == int(cuda::memory_order_release) ||
        memory_order == int(cuda::memory_order_consume)) {
      asm volatile(
          "atom.release.gpu.global.add.noftz.v2.f16 {%0,%1}, [%2], {%3,%4};"
          : "=h"(ret_val_x_cast), "=h"(ret_val_y_cast)
          : "l"(ref_addr), "h"(add_val_x_cast), "h"(add_val_y_cast)
          : "memory");
    } else if (memory_order == int(cuda::memory_order_acquire)) {
      asm volatile(
          "atom.acquire.gpu.global.add.noftz.v2.f16 {%0,%1}, [%2], {%3,%4};"
          : "=h"(ret_val_x_cast), "=h"(ret_val_y_cast)
          : "l"(ref_addr), "h"(add_val_x_cast), "h"(add_val_y_cast)
          : "memory");
    } else if (memory_order == int(cuda::memory_order_acq_rel) ||
               memory_order == int(cuda::memory_order_seq_cst)) {
      asm volatile(
          "atom.acq_rel.gpu.global.add.noftz.v2.f16 {%0,%1}, [%2], {%3,%4};"
          : "=h"(ret_val_x_cast), "=h"(ret_val_y_cast)
          : "l"(ref_addr), "h"(add_val_x_cast), "h"(add_val_y_cast)
          : "memory");
    }
  }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Guard non-relaxed PTX paths for half/bf16 with arch checks like float (>= sm_90).

Vectorized atomics with memory semantics require recent PTX/SM. Currently only float v2/v4 are guarded. Without guards, compilation may fail for lower arch lists even if the branch is not taken at runtime.

Wrap the non‑relaxed inline PTX under the same (or correct) guard you used for float:

+#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ >= 900))
   // existing non-relaxed inline PTX...
-  } else {
+  } else {
     ...
   }
+#else
+  } else {
+    // Fallback: no vectorized non‑relaxed support on this arch
+    cuda::atomic_ref<half, cuda::thread_scope_device> ax(ref[0]);
+    cuda::atomic_ref<half, cuda::thread_scope_device> ay(ref[1]);
+    ax.fetch_add(val[0], cuda::memory_order(memory_order));
+    ay.fetch_add(val[1], cuda::memory_order(memory_order));
+  }
+#endif

Please verify the minimum SM/PTX that supports atom.*.v2.f16/v2.bf16.

Also applies to: 190-229, 234-268, 274-311


I'll verify the claims in this review comment by examining the actual code structure and confirming the SM/PTX requirements.


🏁 Script executed:

#!/bin/bash
# Inspect the file to see actual guards and structure for float vs half/bf16
fd -t f "atomic.h" | head -5

Length of output: 30


🏁 Script executed:

#!/bin/bash
# Get file content around the specified lines to see guards
if [ -f "src/tl_templates/cuda/atomic.h" ]; then
  echo "=== Lines 130-190 (float section) ===" 
  sed -n '130,190p' "src/tl_templates/cuda/atomic.h"
  echo ""
  echo "=== Lines 180-240 (half section start) ==="
  sed -n '180,240p' "src/tl_templates/cuda/atomic.h"
fi

Length of output: 5670


🏁 Script executed:

#!/bin/bash
# Search for #if directives and arch checks throughout the file
if [ -f "src/tl_templates/cuda/atomic.h" ]; then
  echo "=== Architecture guards in file ==="
  rg -n "#if|__CUDA_ARCH|sm_90" "src/tl_templates/cuda/atomic.h" -C 2
fi

Length of output: 944


🌐 Web query:

CUDA sm_90 atom.release.gpu.global.add.v2.f16 minimum architecture requirements

💡 Result:

Short answer: atom.release.gpu.global.add.v2.f16 requires NVIDIA SM 90 (compute capability 9.0 / Hopper) or newer. Sources: NVIDIA PTX/Parallel-Thread-Execution docs and Hopper compatibility guide. [1][2][3]

Sources

  • NVIDIA PTX / PTX ISA notes (atomic/vector support and SM requirements). [1]
  • PTX Parallel-Thread-Execution doc showing vector atomics (atom.global.v2.f16.add) and SM requirements. [2]
  • Hopper compatibility / CUDA docs (SM 90 / compute capability 9.0 = Hopper). [3]

I need to verify the additional sections mentioned ("also applies to: 190-229, 234-268, 274-311"):


🏁 Script executed:

#!/bin/bash
# Check lines 234-311 to verify guard status on remaining sections
if [ -f "src/tl_templates/cuda/atomic.h" ]; then
  echo "=== Lines 230-270 (bf16 AtomicAddx2Ret and next section) ==="
  sed -n '230,270p' "src/tl_templates/cuda/atomic.h"
  echo ""
  echo "=== Lines 270-312 (remainder) ==="
  sed -n '270,312p' "src/tl_templates/cuda/atomic.h"
fi

Length of output: 4320


Add SM 90 arch guards to half and bf16 non-relaxed PTX paths; currently unguarded (half) or insufficiently guarded (bf16) while float is correctly guarded.

All three sections require fixes:

  1. Lines 143-184 (AtomicAddx2 half): Completely unguarded; contains atom.release/acquire/acq_rel.gpu.global.add.noftz.v2.f16 which requires SM 90.

  2. Lines 190-229 (AtomicAddx2Ret half): Completely unguarded; same PTX instruction dependency.

  3. Lines 231-269 & 270-312 (AtomicAddx2 and AtomicAddx2Ret bf16): Guarded with #if (__CUDA_ARCH_LIST__ > 750) (line 231), but requires >= 900 for atom.*.v2.bf16 support. The #endif at line 312 closes the bf16 block.

Float (lines 314+) correctly uses #if (__CUDA_ARCH_LIST__ >= 900). Half sections must use identical guards; bf16 guard must be tightened to >= 900. Without these, compilation fails for architectures below SM 90 even when the non-relaxed branch is unreachable.

🤖 Prompt for AI Agents
In src/tl_templates/cuda/atomic.h around lines 143-184, 190-229, and 231-312,
the PTX vectorized atomic instructions for half and bf16 (atom.*.v2.f16 and
atom.*.v2.bf16) are currently unguarded or guarded only for >750, which allows
compilation on architectures that lack SM 90 support; wrap the entire
non-relaxed PTX branches that emit these instructions with the same architecture
guard used for float (i.e., #if (__CUDA_ARCH_LIST__ >= 900)) and change the
existing bf16 guard from >750 to >=900, making sure each #if has the matching
#endif around the PTX asm blocks so the code only emits these instructions for
SM 90+.

Comment on lines +144 to +146
atomicAdd(reinterpret_cast<half2 *>(ref),
static_cast<half2>(*reinterpret_cast<half2 *>(val)));
} else {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Alignment preconditions for vectorized operations are not enforced.

Reinterpreting half_t*/float* as half2*/float2*/float4* requires 4/8/16‑byte alignment. Misalignment is UB and can fault on hardware.

  • Document and assert alignment (static_assert on types is insufficient; add runtime assert(((uintptr_t)ref % align) == 0) in debug).
  • Alternatively, load/add/store scalars when non‑relaxed and alignment isn’t guaranteed (see previous guard fallback snippet).

Also applies to: 191-193, 235-238, 275-278, 318-320, 349-351, 380-382, 423-425





---

🏁 Script executed:

```shell
#!/bin/bash
# Locate the file and display the full content around the mentioned line ranges
fd "atomic.h" --type f | head -5

Length of output: 30


🏁 Script executed:

#!/bin/bash
# Find the file in the repository
find . -name "atomic.h" -path "*/tl_templates/cuda/*" 2>/dev/null | head -1

Length of output: 32


🏁 Script executed:

#!/bin/bash
# Search for reinterpret_cast patterns with half2, float2, float4 in the atomic operations
rg -n "reinterpret_cast.*half2|reinterpret_cast.*float2|reinterpret_cast.*float4" --type cpp -A2 -B2

Length of output: 4830


🏁 Script executed:

#!/bin/bash
# Check for any existing alignment checks or assertions in the file
rg -n "assert.*align|alignment|uintptr_t.*%|__align|assume_aligned" --type cpp

Length of output: 9436


🏁 Script executed:

#!/bin/bash
# Get a broader view of the atomic.h file structure and line counts
wc -l src/tl_templates/cuda/atomic.h 2>/dev/null || find . -name "atomic.h" -path "*/tl_templates/cuda/*" -exec wc -l {} \;

Length of output: 34


🏁 Script executed:

#!/bin/bash
# Display specific line ranges mentioned in the review
if [ -f "src/tl_templates/cuda/atomic.h" ]; then
  echo "=== Lines 144-146 ==="; sed -n '144,146p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 191-193 ==="; sed -n '191,193p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 235-238 ==="; sed -n '235,238p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 275-278 ==="; sed -n '275,278p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 318-320 ==="; sed -n '318,320p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 349-351 ==="; sed -n '349,351p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 380-382 ==="; sed -n '380,382p' src/tl_templates/cuda/atomic.h
  echo "=== Lines 423-425 ==="; sed -n '423,425p' src/tl_templates/cuda/atomic.h
fi

Length of output: 1296



---

🏁 Script executed:

```shell
#!/bin/bash
# Get the function signatures and broader context for the operations
sed -n '140,160p' src/tl_templates/cuda/atomic.h
sed -n '230,250p' src/tl_templates/cuda/atomic.h
sed -n '313,330p' src/tl_templates/cuda/atomic.h
sed -n '375,395p' src/tl_templates/cuda/atomic.h
```

Length of output: 4234


🏁 Script executed:

#!/bin/bash
# Search for function parameter types and any documentation about alignment
rg -n "void.*atomic_ref|ref.*half_t|ref.*float\*|val.*half_t" --type cpp -B5 -A1 src/tl_templates/cuda/atomic.h | head -60

Length of output: 469


🏁 Script executed:

#!/bin/bash
# Look for any static_asserts in the file
grep -n "static_assert\|_Static_assert" src/tl_templates/cuda/atomic.h

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if there's any documentation or comments about alignment preconditions
sed -n '100,150p' src/tl_templates/cuda/atomic.h | grep -n -C3 "align\|precond\|require"

Length of output: 0








🏁 Script executed:

#!/bin/bash
# Search for callers of AtomicAddx2 and AtomicAddx4 functions
rg -n "AtomicAddx2|AtomicAddx4" --type cpp -g '!atomic.h' -B2 -A2 | head -100

Length of output: 562


🏁 Script executed:

#!/bin/bash
# Check if there are any type or memory layout constraints in related headers
find src/tl_templates/cuda -name "*.h" -type f -exec grep -l "AtomicAdd\|alignment\|align" {} \;

Length of output: 235


🌐 Web query:

CUDA atomicAdd vectorized float2 float4 alignment requirements

💡 Result:

Short answer

  • atomicAdd(float2*) requires 8-byte alignment; atomicAdd(float4*) requires 16-byte alignment. Use these only on properly aligned global-memory addresses. [1][2]

Details (important)

  • CUDA Programming Guide: float2/float4 vector atomicAdd are supported for global memory on devices with compute capability 9.x+ and guarantee atomicity per float element (not as a single 64- or 128-bit atomic access). [1]
  • Built-in vector-type alignment table: float2 (8-byte alignment), float4 (16-byte alignment) — do not cast from a 4-byte-aligned float* to float2*/float4* unless you can guarantee the stricter alignment. [2]

If you want, tell me your CUDA version / target GPU compute capability and I’ll show a minimal safe implementation (with alignment attributes or memcpy-based workaround).


Add runtime alignment assertions to prevent undefined behavior from misaligned vectorized atomic operations.

The concern is valid. CUDA vectorized atomicAdd operations require strict alignment: float2 requires 8-byte alignment, float4 requires 16-byte alignment. The code reinterpret_casts raw pointers to vectorized types without enforcing these preconditions, which causes undefined behavior and potential hardware faults.

All 8 locations (lines 144–146, 191–193, 235–238, 275–278, 318–320, 349–351, 380–382, 423–425) lack guards. Add runtime assert(((uintptr_t)ref % alignment) == 0) checks, or implement a fallback scalar path when alignment cannot be guaranteed.
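A minimal sketch of the debug-time check suggested here (the helper name is hypothetical):

```cpp
// Sketch: assert the vector type's alignment before reinterpreting a scalar
// pointer for a vectorized atomic (half2: 4 bytes, float2: 8, float4: 16).
#include <cassert>
#include <cstdint>

template <typename Vec, typename Scalar>
__device__ __forceinline__ Vec *as_vec_checked(Scalar *p) {
  assert(reinterpret_cast<std::uintptr_t>(p) % alignof(Vec) == 0 &&
         "vectorized atomic requires a Vec-aligned address");
  return reinterpret_cast<Vec *>(p);
}
```

In release builds, where the assert compiles out, a scalar per-element path could be taken instead whenever alignment cannot be guaranteed.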

Comment on lines +158 to +163
__half ret_val_x, ret_val_y;
unsigned short ret_val_x_cast =
*reinterpret_cast<unsigned short *>(&ret_val_x);
unsigned short ret_val_y_cast =
*reinterpret_cast<unsigned short *>(&ret_val_y);
if (memory_order == int(cuda::memory_order_release) ||

⚠️ Potential issue | 🔴 Critical

Uninitialized read into ret_val_*_cast before asm.

ret_val_x_cast and ret_val_y_cast are initialized by reading the bytes of uninitialized ret_val_*. That’s UB. Make them plain locals; asm outputs will write them.

-    __half ret_val_x, ret_val_y;
-    unsigned short ret_val_x_cast =
-        *reinterpret_cast<unsigned short *>(&ret_val_x);
-    unsigned short ret_val_y_cast =
-        *reinterpret_cast<unsigned short *>(&ret_val_y);
+    unsigned short ret_val_x_cast;  // written by asm
+    unsigned short ret_val_y_cast;  // written by asm

And similarly for the bf16 paths:

-    __nv_bfloat162 ret_val;
-    unsigned short ret_val_x_cast =
-        *reinterpret_cast<unsigned short *>(&ret_val.x);
-    unsigned short ret_val_y_cast =
-        *reinterpret_cast<unsigned short *>(&ret_val.y);
+    unsigned short ret_val_x_cast;  // written by asm
+    unsigned short ret_val_y_cast;  // written by asm

Also applies to: 200-205, 246-250, 286-290

🤖 Prompt for AI Agents
In src/tl_templates/cuda/atomic.h around lines 158-163 (and similarly at
200-205, 246-250, 286-290), the code creates unsigned short variables by
reinterpret_casting uninitialized __half/__nv_bfloat16 locals which causes an
uninitialized read; instead, remove the reinterpret_cast reads and declare the
casted variables as plain unsigned short locals (uninitialized) so the inline
asm outputs can populate them, and for all bf16/hf16 codepaths ensure the asm()
uses the unsigned short output operands directly rather than reading from
uninitialized floating-point temporaries.

Comment on lines +163 to +169
if (memory_order == int(cuda::memory_order_release) ||
memory_order == int(cuda::memory_order_consume)) {
asm volatile(
"atom.release.gpu.global.add.noftz.v2.f16 {%0,%1}, [%2], {%3,%4};"
: "=h"(ret_val_x_cast), "=h"(ret_val_y_cast)
: "l"(ref_addr), "h"(add_val_x_cast), "h"(add_val_y_cast)
: "memory");

⚠️ Potential issue | 🟠 Major

memory_order_consume mapped to release — use acquire (or acq_rel for RMW).

On RMW atomics, consume should not be weaker than the load side; common practice maps consume to acquire. Mapping it to release loses required ordering for dependent reads.

-    if (memory_order == int(cuda::memory_order_release) ||
-        memory_order == int(cuda::memory_order_consume)) {
+    if (memory_order == int(cuda::memory_order_release)) {
       ...
-    } else if (memory_order == int(cuda::memory_order_acquire)) {
+    } else if (memory_order == int(cuda::memory_order_acquire) ||
+               memory_order == int(cuda::memory_order_consume)) {
       ...

Do this consistently in all x2/x4 paths. Also consider whether seq_cst should be stronger than acq_rel (e.g., add an extra fence) per your memory model.

Also applies to: 205-212, 250-256, 290-296, 324-330, 356-361, 389-397, 429-437

🤖 Prompt for AI Agents
In src/tl_templates/cuda/atomic.h around lines 163-169 (and similarly at
205-212, 250-256, 290-296, 324-330, 356-361, 389-397, 429-437), the branch
incorrectly treats memory_order_consume as release; change the emitted asm for
consume to use the acquire (or acq_rel for read-modify-write) variant rather
than the release variant, and make the mapping consistent across all x2/x4 code
paths; also review seq_cst cases and, if required by the memory model, add the
stronger fence (or use seq_cst-specific asm) so seq_cst is stronger than
acq_rel.
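For reference, a sketch of the single order-to-qualifier mapping this implies, with consume folded into acquire and seq_cst shown as acq_rel (any extra fence is left as a policy decision):

```cpp
// Sketch: one place to keep the C++ memory order -> PTX .sem mapping
// consistent across the x2/x4 paths.
#include <cuda/atomic>

__host__ __device__ constexpr const char *ptx_sem(cuda::memory_order mo) {
  switch (mo) {
  case cuda::memory_order_relaxed:
    return "relaxed";
  case cuda::memory_order_consume: // dependency ordering: not weaker than acquire
  case cuda::memory_order_acquire:
    return "acquire";
  case cuda::memory_order_release:
    return "release";
  case cuda::memory_order_acq_rel:
    return "acq_rel";
  case cuda::memory_order_seq_cst:
    return "acq_rel"; // plus an extra fence if the memory model requires it
  }
  return "acq_rel";
}
```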

Comment on lines +226 to +227
return half2(*reinterpret_cast<__half *>(&ret_val_x_cast),
*reinterpret_cast<__half *>(&ret_val_y_cast));

⚠️ Potential issue | 🟠 Major

Fix return construction for x2 returns (half2/bf16x2).

half2(__half, __half) and __nv_bfloat162(__nv_bfloat16, __nv_bfloat16) are not valid constructors; also avoid aliasing through reinterpret_cast. Use intrinsics or brace initialization with proper bit-casts.

If available in your CUDA version, prefer:

-    return half2(*reinterpret_cast<__half *>(&ret_val_x_cast),
-                 *reinterpret_cast<__half *>(&ret_val_y_cast));
+    return __halves2half2(__ushort_as_half(ret_val_x_cast),
+                          __ushort_as_half(ret_val_y_cast));

For bf16, use the bf16 equivalents (please confirm intrinsic names in your toolchain):

-    return __nv_bfloat162(*reinterpret_cast<__nv_bfloat16 *>(&ret_val_x_cast),
-                          *reinterpret_cast<__nv_bfloat16 *>(&ret_val_y_cast));
+    __nv_bfloat16 hx, hy;
+    memcpy(&hx, &ret_val_x_cast, sizeof(hx));  // avoids aliasing UB
+    memcpy(&hy, &ret_val_y_cast, sizeof(hy));
+    return __nv_bfloat162{hx, hy};

If you prefer no memcpy, introduce a small constexpr bit_cast helper for device code.

Also applies to: 308-310
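Following the "small constexpr bit_cast helper" idea, a memcpy-based sketch (a truly constexpr version would additionally need __builtin_bit_cast or C++20 std::bit_cast support on the device toolchain; the helper name is hypothetical):

```cpp
// Sketch: aliasing-safe bit reinterpretation for device code; the small
// fixed-size memcpy is typically optimized down to a register move.
#include <cstring>

template <typename To, typename From>
__device__ __forceinline__ To bit_cast_sketch(const From &src) {
  static_assert(sizeof(To) == sizeof(From), "bit_cast_sketch: size mismatch");
  To dst;
  memcpy(&dst, &src, sizeof(To));
  return dst;
}
```

The x2 return could then be built as, e.g., `__halves2half2(bit_cast_sketch<__half>(ret_val_x_cast), bit_cast_sketch<__half>(ret_val_y_cast))`.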

Comment on lines +389 to +397
if (memory_order == int(cuda::memory_order_release) ||
memory_order == int(cuda::memory_order_consume)) {
asm volatile("atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
"{%5,%6,%7,%8};"
: "=f"(ret_val.x), "=f"(ret_val.y), "=f"(ret_val.z),
"=f"(ret_val.w)
: "l"(ref_addr), "f"(add_val.x), "f"(add_val.y),
"f"(add_val.z), "f"(add_val.w)
: "memory");

⚠️ Potential issue | 🔴 Critical

PTX mnemonic order bug: uses atom.global.gpu.release...; should be atom.release.gpu.global....

You use atom.release.gpu.global... elsewhere, but in AtomicAddx4Ret the order is atom.global.gpu.release..., which will fail to assemble. Fix all three mnemonics in AtomicAddx4Ret to keep .semantics .scope .space order.

Apply:

-      asm volatile("atom.global.gpu.release.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.release.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
...
-      asm volatile("atom.global.gpu.acquire.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.acquire.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "
...
-      asm volatile("atom.global.gpu.acq_rel.add.v4.f32 {%0,%1,%2,%3}, [%4], "
+      asm volatile("atom.acq_rel.gpu.global.add.v4.f32 {%0,%1,%2,%3}, [%4], "

Also applies to: 431-455

🤖 Prompt for AI Agents
In src/tl_templates/cuda/atomic.h around lines 389-397 (and also apply the same
fix to 431-455), the PTX mnemonic components are in the wrong order in
AtomicAddx4Ret: currently using the `.space` and `.scope` parts before
`.semantics` (e.g., `atom.global.gpu.release...`) which fails to assemble;
change the three affected mnemonics in AtomicAddx4Ret to use the correct
`.semantics .scope .space` order (e.g., `atom.release.gpu.global...`) so all
three atom instructions match the correct PTX ordering used elsewhere.

@tzj-fxz tzj-fxz requested a review from LeiWang1999 October 23, 2025 06:11
@tzj-fxz tzj-fxz removed their assignment Oct 23, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tilelang/language/allocate.py (1)

36-40: Consider clarifying the bool type workaround comment.

Verification confirms no existing code is broken by the unconditional scope override—the only alloc_shared call with dtype="bool" and explicit scope at examples/deepseek_v32/sparse_mla_fwd_pipelined.py:100 passes scope="shared", which matches the override.

However, the design itself silently overrides the scope parameter for bool types. While this doesn't cause current breakage, consider:

  1. Formalizing the comment from informal style ("lei: This is a hack...") to professional documentation
  2. Adding a note in the docstring that bool types always allocate to "shared" scope due to merge smem pass limitations
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c65c9d3 and e275347.

📒 Files selected for processing (1)
  • tilelang/language/allocate.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/language/allocate.py (3)
src/transform/lower_opaque_block.cc (2)
  • annotations (208-248)
  • annotations (209-212)
testing/python/language/test_tilelang_language_alloc.py (1)
  • alloc_var (4-23)
src/transform/storage_rewrite.cc (4)
  • dtype (712-718)
  • dtype (712-712)
  • scope (678-683)
  • scope (678-678)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (2)
tilelang/language/allocate.py (2)

17-18: LGTM: Modern Python typing enabled.

The from __future__ import annotations import enables postponed evaluation of type annotations and allows the use of PEP 604 union syntax (|), which is used later in the file. This is a clean improvement aligned with modern Python practices.


71-71: LGTM: Type annotation improved.

The type annotation change from Union[PrimExpr] to PrimExpr | None is both a correctness improvement and modernization. The new annotation correctly indicates that init is optional and uses the cleaner PEP 604 union syntax.

@tzj-fxz tzj-fxz requested a review from Rachmanino October 23, 2025 07:25
LeiWang1999 previously approved these changes Oct 25, 2025
RubiaCx pushed a commit to RubiaCx/tilelang that referenced this pull request Nov 24, 2025
* [Feature] Add memory_order PTX for vectorized (2x) atomic add

* [Feature] Add memory_order PTX for all vectorized atomic add

* [Lint]

* test

* [BugFix] FIx init optional argument in alloc_var

* bug fix

* bug fix

* lint fix

* lint fix

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
