[BugFix] Remove memory_order in atomic constexpr and fix NSA bwd #1260
LeiWang1999 merged 4 commits into tile-ai:main from
Conversation
Walkthrough
Renamed loop indices in an NSA backward example to avoid shadowing, adjusted CUDA atomic selection to always use CAS/primitive paths for half and __nv_bfloat16 (regardless of memory_order), and promoted several atomic tests from JIT-decorated wrappers to plain pytest functions with expanded dtype coverage.
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Caller
participant atomic.h
rect rgb(240,248,255)
Note right of atomic.h: Old selection logic
Caller->>atomic.h: atomic_op(type T, order M)
atomic.h-->>atomic.h: if T is half/bf16 AND M == memory_order_relaxed
atomic.h-->>atomic.h: -> use primitive/CAS fast-path
atomic.h-->>atomic.h: else
atomic.h-->>Caller: use cuda::atomic_ref with order M
end
rect rgb(245,255,240)
Note right of atomic.h: New selection logic
Caller->>atomic.h: atomic_op(type T, order M)
atomic.h-->>atomic.h: if T is half/bf16
atomic.h-->>atomic.h: -> use primitive/CAS fast-path (regardless of M)
atomic.h-->>atomic.h: else
atomic.h-->>Caller: use cuda::atomic_ref with order M
end
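To make the new selection rule concrete, here is a minimal sketch of the dispatch. This is illustrative only: `atomic_add_dispatch` is a hypothetical name rather than the actual `atomic.h` API, and the native half/bf16 `atomicAdd` overloads assume sm_70+/sm_80+ respectively.

```cuda
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <cuda/atomic>
#include <cuda/std/type_traits>

// Sketch of the new selection rule: 16-bit float types always take the
// primitive path, because cuda::atomic_ref cannot wrap 2-byte types; the
// requested memory order is honored only for types atomic_ref supports.
template <typename T>
__device__ void atomic_add_dispatch(T *ptr, T value,
                                    cuda::std::memory_order order) {
  if constexpr (cuda::std::is_same_v<T, __half> ||
                cuda::std::is_same_v<T, __nv_bfloat16>) {
    atomicAdd(ptr, value);  // hardware primitive (or CAS loop); order ignored
  } else {
    cuda::atomic_ref<T, cuda::thread_scope_device> ref(*ptr);
    ref.fetch_add(value, order);
  }
}
```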
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
@KevinZeng08 Thank you for your valuable suggestion. It would be more reliable to directly inline PTX for atomic operations with different memory orders, like what has already been done for vectorized dtypes. I will check it and fix it later.
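For context, an inline-PTX atomic add with an explicit memory order could look roughly like the sketch below for a plain 32-bit float. This is hypothetical and heavily simplified; the qualifiers such as the `.gpu` scope and `.global` state space are assumptions, not necessarily what the follow-up change uses.

```cuda
// Hypothetical sketch: fire-and-forget float atomic add with release
// semantics via the PTX "red" reduction instruction (no return value).
// The .gpu scope and .global state space are assumptions for illustration.
__device__ void atomic_add_f32_release(float *addr, float val) {
  asm volatile("red.release.gpu.global.add.f32 [%0], %1;"
               :
               : "l"(addr), "f"(val)
               : "memory");
}
```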
Thanks for your kind reply. Looking forward to your implementation.
@KevinZeng08 This new PR (#1264) was created based on your modification and suggestion. For the AtomicAdd operator, we inline PTX to handle different memory orders. For the AtomicMin/Max operators, we implement an atomicCAS loop to emulate the operation, because CUDA does not natively support a single 16-bit AtomicMin/Max. Could you check it if you are interested? :)
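For readers following along, a CAS-loop atomicMax for `__half` might look roughly like the sketch below. It is illustrative only, not the code in #1264, and it assumes sm_70+ for the unsigned short `atomicCAS` overload and native `__half` comparison.

```cuda
#include <cuda_fp16.h>

// Sketch of a 16-bit atomicMax emulated with atomicCAS (assumes sm_70+).
__device__ __half atomic_max_half_cas(__half *addr, __half val) {
  unsigned short *addr_us = reinterpret_cast<unsigned short *>(addr);
  unsigned short old_bits = *addr_us;
  while (true) {
    __half old_val = __ushort_as_half(old_bits);
    if (!(val > old_val)) {
      return old_val;  // current value is already >= val (or NaN is involved)
    }
    unsigned short assumed = old_bits;
    old_bits = atomicCAS(addr_us, assumed, __half_as_ushort(val));
    if (old_bits == assumed) {
      return old_val;  // CAS succeeded, val is now stored
    }
    // Another thread changed the word; retry with the freshly observed bits.
  }
}
```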
Sure, let me have a check |
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/tl_templates/cuda/atomic.h (1)
162-189: Clarify comment and verify unused variable handling.
Line 162 states "Since atomic ref do not support memory order," but this is misleading. Looking at lines 66-67 and 220, cuda::atomic_ref does support memory_order for other types. The actual reason for inline PTX is that atomic_ref doesn't support 1- or 2-byte types (half/bf16), as noted in the PR objectives.
Additionally, ret_val is declared at line 166 but never used in the void-returning AtomicAdd function. While the PTX assembly likely doesn't generate unnecessary code, this may be confusing for readers.
Consider:
- Updating the comment to clarify that PTX is needed because atomic_ref doesn't support half/bf16 types
- Removing the unused ret_val variable or adding a comment explaining why it's present (if required by PTX syntax)
testing/python/language/test_tilelang_language_atomic_add.py (1)
275-277: Consider expanding memory order coverage.
The memory orders were changed from relaxed/acquire/release to release/relaxed/relaxed. While this exercises the new PTX paths for atomic_add with release ordering, it reduces coverage for atomic_max and atomic_min, which now only test relaxed ordering.
Consider adding additional test cases to cover non-relaxed memory orders for atomic_max and atomic_min to ensure the CAS loop works correctly across all memory ordering semantics, even though the current CAS implementation doesn't distinguish between memory orders.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/tl_templates/cuda/atomic.h (6 hunks)
- testing/python/language/test_tilelang_language_atomic_add.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/language/test_tilelang_language_atomic_add.py
🧬 Code graph analysis (2)
src/tl_templates/cuda/atomic.h (1)
src/tl_templates/cpp/half.hpp (22)
half (2476-2479), half (2476-2476), half (2557-2557), half (2562-2562), half (3001-3001), half (3006-3008), half (3235-3237), half (3243-3243), half (3417-3425), half (3432-3440), half (5249-5251), int (760-765), int (772-779), int (787-797), int (804-811), int (816-821), int (827-832), int (838-843), int (855-864), int (872-879), int (892-914), int (5267-5273)
testing/python/language/test_tilelang_language_atomic_add.py (1)
tilelang/language/atomic.py (3)
atomic_add (116-235), atomic_max (22-65), atomic_min (68-113)
🔇 Additional comments (4)
src/tl_templates/cuda/atomic.h (1)
49-64: Verify NaN handling in float comparison within CAS loop.
The CAS loop compares floating-point values using val > *reinterpret_cast<T1 *>(&old_val_ushort). For half and bfloat16 types, this comparison may not handle NaN values correctly, as NaN comparisons always return false. This could lead to incorrect behavior when NaN values are involved in atomic max operations.
Consider the semantics for NaN handling in your atomic max operations. If IEEE 754 NaN propagation is expected, you may need to add explicit NaN checks.
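If NaN propagation is the desired semantics, one option is to special-case NaN before the ordinary comparison, roughly like the sketch below. This is a sketch under that assumption, not the current atomic.h behavior; the helper name is hypothetical.

```cuda
#include <cuda_fp16.h>

// Sketch: decide whether the CAS loop should install `val` as the new max.
// A plain `val > old_val` is false when either operand is NaN, so a NaN
// input would never be stored; checking __hisnan(val) first propagates it.
__device__ bool should_store_as_max(__half val, __half old_val) {
  return __hisnan(val) || (val > old_val);
}
```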
testing/python/language/test_tilelang_language_atomic_add.py (3)
239-261: LGTM: Test function refactoring.
The explicit test functions improve test discoverability and remove the need for JIT-decorated wrappers. This is a cleaner approach.
263-263: LGTM: Debug path configuration.
Adding debug_root_path to the decorator enables a custom debug output location for this specific test program.
364-366: Excellent dtype coverage expansion.
Testing across float, float16, and bfloat16 directly validates the new CAS-loop and PTX inline implementations for half-precision types. This comprehensive coverage is essential for the changes in this PR.
Note: bfloat16 testing requires CUDA compute capability > 7.5 (newer than Turing), as per the preprocessor guards in atomic.h lines 24-28.
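The guard being referred to presumably looks something like the following; this is an assumption, and the exact macro and architecture threshold in atomic.h may differ.

```cuda
#include <cuda_bf16.h>

// Hypothetical guard: only provide the native bf16 atomicAdd path on
// architectures that support it (the >= 800 threshold is an assumption).
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
__device__ void atomic_add_bf16(__nv_bfloat16 *addr, __nv_bfloat16 val) {
  atomicAdd(addr, val);  // native __nv_bfloat16 atomicAdd (sm_80+)
}
#endif
```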
…e-ai#1260)
* fix nsa bwd and atomic
* [Lint]
* [BugFix]
  - New implementation for atomicMax and atomicMin using atomicCAS
  - PTX version atomicAdd for single 16-bit data
  - Modify the test cases
* [Lint]
---------
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
What does this PR do?
TODO in future
Summary by CodeRabbit
Refactor
Performance
Tests